diff options
Diffstat (limited to 'Objects/stringlib/stringlib_find_two_way_notes.txt')
| -rw-r--r-- | Objects/stringlib/stringlib_find_two_way_notes.txt | 431 |
1 files changed, 431 insertions, 0 deletions
diff --git a/Objects/stringlib/stringlib_find_two_way_notes.txt b/Objects/stringlib/stringlib_find_two_way_notes.txt new file mode 100644 index 0000000000..afe45431a7 --- /dev/null +++ b/Objects/stringlib/stringlib_find_two_way_notes.txt @@ -0,0 +1,431 @@ +This document explains Crochemore and Perrin's Two-Way string matching +algorithm, in which a smaller string (the "pattern" or "needle") +is searched for in a longer string (the "text" or "haystack"), +determining whether the needle is a substring of the haystack, and if +so, at what index(es). It is to be used by Python's string +(and bytes-like) objects when calling `find`, `index`, `__contains__`, +or implicitly in methods like `replace` or `partition`. + +This is essentially a re-telling of the paper + + Crochemore M., Perrin D., 1991, Two-way string-matching, + Journal of the ACM 38(3):651-675. + +focused more on understanding and examples than on rigor. See also +the code sample here: + + http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 + +The algorithm runs in O(len(needle) + len(haystack)) time and with +O(1) space. However, since there is a larger preprocessing cost than +simpler algorithms, this Two-Way algorithm is to be used only when the +needle and haystack lengths meet certain thresholds. + + +These are the basic steps of the algorithm: + + * "Very carefully" cut the needle in two. + * For each alignment attempted: + 1. match the right part + * On failure, jump by the amount matched + 1 + 2. then match the left part. + * On failure jump by max(len(left), len(right)) + 1 + * If the needle is periodic, don't re-do comparisons; maintain + a "memory" of how many characters you already know match. + + +-------- Matching the right part -------- + +We first scan the right part of the needle to check if it matches the +the aligned characters in the haystack. We scan left-to-right, +and if a mismatch occurs, we jump ahead by the amount matched plus 1. + +Example: + + text: ........EFGX................... + pattern: ....abcdEFGH.... + cut: <<<<>>>> + +Matched 3, so jump ahead by 4: + + text: ........EFGX................... + pattern: ....abcdEFGH.... + cut: <<<<>>>> + +Why are we allowed to do this? Because we cut the needle very +carefully, in such a way that if the cut is ...abcd + EFGH... then +we have + + d != E + cd != EF + bcd != EFG + abcd != EFGH + ... and so on. + +If this is true for every pair of equal-length substrings around the +cut, then the following alignments do not work, so we can skip them: + + text: ........EFG.................... + pattern: ....abcdEFGH.... + ^ (Bad because d != E) + text: ........EFG.................... + pattern: ....abcdEFGH.... + ^^ (Bad because cd != EF) + text: ........EFG.................... + pattern: ....abcdEFGH.... + ^^^ (Bad because bcd != EFG) + +Skip 3 alignments => increment alignment by 4. + + +-------- If len(left_part) < len(right_part) -------- + +Above is the core idea, and it begins to suggest how the algorithm can +be linear-time. There is one bit of subtlety involving what to do +around the end of the needle: if the left half is shorter than the +right, then we could run into something like this: + + text: .....EFG...... + pattern: cdEFGH + +The same argument holds that we can skip ahead by 4, so long as + + d != E + cd != EF + ?cd != EFG + ??cd != EFGH + etc. + +The question marks represent "wildcards" that always match; they're +outside the limits of the needle, so there's no way for them to +invalidate a match. To ensure that the inequalities above are always +true, we need them to be true for all possible '?' values. We thus +need cd != FG and cd != GH, etc. + + +-------- Matching the left part -------- + +Once we have ensured the right part matches, we scan the left part +(order doesn't matter, but traditionally right-to-left), and if we +find a mismatch, we jump ahead by +max(len(left_part), len(right_part)) + 1. That we can jump by +at least len(right_part) + 1 we have already seen: + + text: .....EFG..... + pattern: abcdEFG + Matched 3, so jump by 4, + using the fact that d != E, cd != EF, and bcd != EFG. + +But we can also jump by at least len(left_part) + 1: + + text: ....cdEF..... + pattern: abcdEF + Jump by len('abcd') + 1 = 5. + + Skip the alignments: + text: ....cdEF..... + pattern: abcdEF + text: ....cdEF..... + pattern: abcdEF + text: ....cdEF..... + pattern: abcdEF + text: ....cdEF..... + pattern: abcdEF + +This requires the following facts: + d != E + cd != EF + bcd != EF? + abcd != EF?? + etc., for all values of ?s, as above. + +If we have both sets of inequalities, then we can indeed jump by +max(len(left_part), len(right_part)) + 1. Under the assumption of such +a nice splitting of the needle, we now have enough to prove linear +time for the search: consider the forward-progress/comparisons ratio +at each alignment position. If a mismatch occurs in the right part, +the ratio is 1 position forward per comparison. On the other hand, +if a mismatch occurs in the left half, we advance by more than +len(needle)//2 positions for at most len(needle) comparisons, +so this ratio is more than 1/2. This average "movement speed" is +bounded below by the constant "1 position per 2 comparisons", so we +have linear time. + + +-------- The periodic case -------- + +The sets of inequalities listed so far seem too good to be true in +the general case. Indeed, they fail when a needle is periodic: +there's no way to split 'AAbAAbAAbA' in two such that + + (the stuff n characters to the left of the split) + cannot equal + (the stuff n characters to the right of the split) + for all n. + +This is because no matter how you cut it, you'll get +s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the +needle in two so that n can be as big as possible. If we were to +split it as + + AAbA + AbAAbA + +then A == A at the split, so this is bad (we failed at length 1), but +if we split it as + + AA + bAAbAAbA + +we at least have A != b and AA != bA, and we fail at length 3 +since ?AA == bAA. We already knew that a cut to make length-3 +mismatch was impossible due to the period, but we now see that the +bound is sharp; we can get length-1 and length-2 to mismatch. + +This is exactly the content of the *critical factorization theorem*: +that no matter the period of the original needle, you can cut it in +such a way that (with the appropriate question marks), +needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period. + +Even "non-periodic" strings are periodic with a period equal to +their length, so for such needles, the CFT already guarantees that +the algorithm described so far will work, since we can cut the needle +so that the length-k chunks on either side of the cut mismatch for all +k < len(needle). Looking closer at the algorithm, we only actually +require that k go up to max(len(left_part), len(right_part)). +So long as the period exceeds that, we're good. + +The more general shorter-period case is a bit harder. The essentials +are the same, except we use the periodicity to our advantage by +"remembering" periods that we've already compared. In our running +example, say we're computing + + "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA". + +We cut as AA + bAAbAAbA, and then the algorithm runs as follows: + + First alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^^X + - Mismatch at third position, so jump by 3. + - This requires that A!=b and AA != bA. + + Second alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^^^^^^^^ + X + - Matched entire right part + - Mismatch at left part. + - Jump forward a period, remembering the existing comparisons + + Third alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + mmmmmmm^^X + - There's "memory": a bunch of characters were already matched. + - Two more characters match beyond that. + - The 8th character of the right part mismatched, so jump by 8 + - The above rule is more complicated than usual: we don't have + the right inequalities for lengths 1 through 7, but we do have + shifted copies of the length-1 and length-2 inequalities, + along with knowledge of the mismatch. We can skip all of these + alignments at once: + + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ~ A != b at the cut + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ~~ AA != bA at the cut + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^^^^X 7-3=4 match, and the 5th misses. + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ~ A != b at the cut + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ~~ AA != bA at the cut + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^X 7-3-3=1 match and the 2nd misses. + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ~ A != b at the cut + + Fourth alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^X + - Second character mismatches, so jump by 2. + + Fifth alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + ^^^^^^^^ + X + - Right half matches, so use memory and skip ahead by period=3 + + Sixth alignment: + bbbAbbAAbAAbAAbbbAAbAAbAAbAA + AAbAAbAAbA + mmmmmmmm^^ + - Right part matches, left part is remembered, found a match! + +The one tricky skip by 8 here generalizes: if we have a period of p, +then the CFT says we can ensure the cut has the inequality property +for lengths 1 through p-1, and jumping by p would line up the +matching characters and mismatched character one period earlier. +Inductively, this proves that we can skip by the number of characters +matched in the right half, plus 1, just as in the original algorithm. + +To make it explicit, the memory is set whenever the entire right part +is matched and is then used as a starting point in the next alignment. +In such a case, the alignment jumps forward one period, and the right +half matches all except possibly the last period. Additionally, +if we cut so that the left part has a length strictly less than the +period (we always can!), then we can know that the left part already +matches. The memory is reset to 0 whenever there is a mismatch in the +right part. + +To prove linearity for the periodic case, note that if a right-part +character mismatches, then we advance forward 1 unit per comparison. +On the other hand, if the entire right part matches, then the skipping +forward by one period "defers" some of the comparisons to the next +alignment, where they will then be spent at the usual rate of +one comparison per step forward. Even if left-half comparisons +are always "wasted", they constitute less than half of all +comparisons, so the average rate is certainly at least 1 move forward +per 2 comparisons. + + +-------- When to choose the periodic algorithm --------- + +The periodic algorithm is always valid but has an overhead of one +more "memory" register and some memory computation steps, so the +here-described-first non-periodic/long-period algorithm -- skipping by +max(len(left_part), len(right_part)) + 1 rather than the period -- +should be preferred when possible. + +Interestingly, the long-period algorithm does not require an exact +computation of the period; it works even with some long-period, but +undeniably "periodic" needles: + + Cut: AbcdefAbc == Abcde + fAbc + +This cut gives these inequalities: + + e != f + de != fA + cde != fAb + bcde != fAbc + Abcde != fAbc? + The first failure is a period long, per the CFT: + ?Abcde == fAbc?? + +A sufficient condition for using the long-period algorithm is having +the period of the needle be greater than +max(len(left_part), len(right_part)). This way, after choosing a good +split, we get all of the max(len(left_part), len(right_part)) +inequalities around the cut that were required in the long-period +version of the algorithm. + +With all of this in mind, here's how we choose: + + (1) Choose a "critical factorization" of the needle -- a cut + where we have period minus 1 inequalities in a row. + More specifically, choose a cut so that the left_part + is less than one period long. + (2) Determine the period P_r of the right_part. + (3) Check if the left part is just an extension of the pattern of + the right part, so that the whole needle has period P_r. + Explicitly, check if + needle[0:cut] == needle[0+P_r:cut+P_r] + If so, we use the periodic algorithm. If not equal, we use the + long-period algorithm. + +Note that if equality holds in (3), then the period of the whole +string is P_r. On the other hand, suppose equality does not hold. +The period of the needle is then strictly greater than P_r. Here's +a general fact: + + If p is a substring of s and p has period r, then the period + of s is either equal to r or greater than len(p). + +We know that needle_period != P_r, +and therefore needle_period > len(right_part). +Additionally, we'll choose the cut (see below) +so that len(left_part) < needle_period. + +Thus, in the case where equality does not hold, we have that +needle_period >= max(len(left_part), len(right_part)) + 1, +so the long-period algorithm works, but otherwise, we know the period +of the needle. + +Note that this decision process doesn't always require an exact +computation of the period -- we can get away with only computing P_r! + + +-------- Computing the cut -------- + +Our remaining tasks are now to compute a cut of the needle with as +many inequalities as possible, ensuring that cut < needle_period. +Meanwhile, we must also compute the period P_r of the right_part. + +The computation is relatively simple, essentially doing this: + + suffix1 = max(needle[i:] for i in range(len(needle))) + suffix2 = ... # the same as above, but invert the alphabet + cut1 = len(needle) - len(suffix1) + cut2 = len(needle) - len(suffix2) + cut = max(cut1, cut2) # the later cut + +For cut2, "invert the alphabet" is different than saying min(...), +since in lexicographic order, we still put "py" < "python", even +if the alphabet is inverted. Computing these, along with the method +of computing the period of the right half, is easiest to read directly +from the source code in fastsearch.h, in which these are computed +in linear time. + +Crochemore & Perrin's Theorem 3.1 give that "cut" above is a +critical factorization less than the period, but a very brief sketch +of their proof goes something like this (this is far from complete): + + * If this cut splits the needle as some + needle == (a + w) + (w + b), meaning there's a bad equality + w == w, it's impossible for w + b to be bigger than both + b and w + w + b, so this can't happen. We thus have all of + the ineuqalities with no question marks. + * By maximality, the right part is not a substring of the left + part. Thus, we have all of the inequalities involving no + left-side question marks. + * If you have all of the inequalities without right-side question + marks, we have a critical factorization. + * If one such inequality fails, then there's a smaller period, + but the factorization is nonetheless critical. Here's where + you need the redundancy coming from computing both cuts and + choosing the later one. + + +-------- Some more Bells and Whistles -------- + +Beyond Crochemore & Perrin's original algorithm, we can use a couple +more tricks for speed in fastsearch.h: + + 1. Even though C&P has a best-case O(n/m) time, this doesn't occur + very often, so we add a Boyer-Moore bad character table to + achieve sublinear time in more cases. + + 2. The prework of computing the cut/period is expensive per + needle character, so we shouldn't do it if it won't pay off. + For this reason, if the needle and haystack are long enough, + only automatically start with two-way if the needle's length + is a small percentage of the length of the haystack. + + 3. In cases where the needle and haystack are large but the needle + makes up a significant percentage of the length of the + haystack, don't pay the expensive two-way preprocessing cost + if you don't need to. Instead, keep track of how many + character comparisons are equal, and if that exceeds + O(len(needle)), then pay that cost, since the simpler algorithm + isn't doing very well. |
