diff options
Diffstat (limited to 'src')
-rw-r--r-- | src/math/big/natdiv.go | 684 |
1 files changed, 611 insertions, 73 deletions
diff --git a/src/math/big/natdiv.go b/src/math/big/natdiv.go index 1330990c2c..882bb6d3ba 100644 --- a/src/math/big/natdiv.go +++ b/src/math/big/natdiv.go @@ -2,10 +2,506 @@ // Use of this source code is governed by a BSD-style // license that can be found in the LICENSE file. +/* + +Multi-precision division. Here be dragons. + +Given u and v, where u is n+m digits, and v is n digits (with no leading zeros), +the goal is to return quo, rem such that u = quo*v + rem, where 0 ≤ rem < v. +That is, quo = ⌊u/v⌋ where ⌊x⌋ denotes the floor (truncation to integer) of x, +and rem = u - quo·v. + + +Long Division + +Division in a computer proceeds the same as long division in elementary school, +but computers are not as good as schoolchildren at following vague directions, +so we have to be much more precise about the actual steps and what can happen. + +We work from most to least significant digit of the quotient, doing: + + • Guess a digit q, the number of v to subtract from the current + section of u to zero out the topmost digit. + • Check the guess by multiplying q·v and comparing it against + the current section of u, adjusting the guess as needed. + • Subtract q·v from the current section of u. + • Add q to the corresponding section of the result quo. + +When all digits have been processed, the final remainder is left in u +and returned as rem. + +For example, here is a sketch of dividing 5 digits by 3 digits (n=3, m=2). + + q₂ q₁ q₀ + _________________ + v₂ v₁ v₀ ) u₄ u₃ u₂ u₁ u₀ + ↓ ↓ ↓ | | + [u₄ u₃ u₂]| | + - [ q₂·v ]| | + ----------- ↓ | + [ rem | u₁]| + - [ q₁·v ]| + ----------- ↓ + [ rem | u₀] + - [ q₀·v ] + ------------ + [ rem ] + +Instead of creating new storage for the remainders and copying digits from u +as indicated by the arrows, we use u's storage directly as both the source +and destination of the subtractions, so that the remainders overwrite +successive overlapping sections of u as the division proceeds, using a slice +of u to identify the current section. This avoids all the copying as well as +shifting of remainders. + +Division of u with n+m digits by v with n digits (in base B) can in general +produce at most m+1 digits, because: + + • u < B^(n+m) [B^(n+m) has n+m+1 digits] + • v ≥ B^(n-1) [B^(n-1) is the smallest n-digit number] + • u/v < B^(n+m) / B^(n-1) [divide bounds for u, v] + • u/v < B^(m+1) [simplify] + +The first step is special: it takes the top n digits of u and divides them by +the n digits of v, producing the first quotient digit and an n-digit remainder. +In the example, q₂ = ⌊u₄u₃u₂ / v⌋. + +The first step divides n digits by n digits to ensure that it produces only a +single digit. + +Each subsequent step appends the next digit from u to the remainder and divides +those n+1 digits by the n digits of v, producing another quotient digit and a +new n-digit remainder. + +Subsequent steps divide n+1 digits by n digits, an operation that in general +might produce two digits. However, as used in the algorithm, that division is +guaranteed to produce only a single digit. The dividend is of the form +rem·B + d, where rem is a remainder from the previous step and d is a single +digit, so: + + • rem ≤ v - 1 [rem is a remainder from dividing by v] + • rem·B ≤ v·B - B [multiply by B] + • d ≤ B - 1 [d is a single digit] + • rem·B + d ≤ v·B - 1 [add] + • rem·B + d < v·B [change ≤ to <] + • (rem·B + d)/v < B [divide by v] + + +Guess and Check + +At each step we need to divide n+1 digits by n digits, but this is for the +implementation of division by n digits, so we can't just invoke a division +routine: we _are_ the division routine. Instead, we guess at the answer and +then check it using multiplication. If the guess is wrong, we correct it. + +How can this guessing possibly be efficient? It turns out that the following +statement (let's call it the Good Guess Guarantee) is true. + +If + + • q = ⌊u/v⌋ where u is n+1 digits and v is n digits, + • q < B, and + • the topmost digit of v = vₙ₋₁ ≥ B/2, + +then q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ satisfies q ≤ q̂ ≤ q+2. (Proof below.) + +That is, if we know the answer has only a single digit and we guess an answer +by ignoring the bottom n-1 digits of u and v, using a 2-by-1-digit division, +then that guess is at least as large as the correct answer. It is also not +too much larger: it is off by at most two from the correct answer. + +Note that in the first step of the overall division, which is an n-by-n-digit +division, the 2-by-1 guess uses an implicit uₙ = 0. + +Note that using a 2-by-1-digit division here does not mean calling ourselves +recursively. Instead, we use an efficient direct hardware implementation of +that operation. + +Note that because q is u/v rounded down, q·v must not exceed u: u ≥ q·v. +If a guess q̂ is too big, it will not satisfy this test. Viewed a different way, +the remainder r̂ for a given q̂ is u - q̂·v, which must be positive. If it is +negative, then the guess q̂ is too big. + +This gives us a way to compute q. First compute q̂ with 2-by-1-digit division. +Then, while u < q̂·v, decrement q̂; this loop executes at most twice, because +q̂ ≤ q+2. + + +Scaling Inputs + +The Good Guess Guarantee requires that the top digit of v (vₙ₋₁) be at least B/2. +For example in base 10, ⌊172/19⌋ = 9, but ⌊18/1⌋ = 18: the guess is wildly off +because the first digit 1 is smaller than B/2 = 5. + +We can ensure that v has a large top digit by multiplying both u and v by the +right amount. Continuing the example, if we multiply both 172 and 19 by 3, we +now have ⌊516/57⌋, the leading digit of v is now ≥ 5, and sure enough +⌊51/5⌋ = 10 is much closer to the correct answer 9. It would be easier here +to multiply by 4, because that can be done with a shift. Specifically, we can +always count the number of leading zeros i in the first digit of v and then +shift both u and v left by i bits. + +Having scaled u and v, the value ⌊u/v⌋ is unchanged, but the remainder will +be scaled: 172 mod 19 is 1, but 516 mod 57 is 3. We have to divide the remainder +by the scaling factor (shifting right i bits) when we finish. + +Note that these shifts happen before and after the entire division algorithm, +not at each step in the per-digit iteration. + +Note the effect of scaling inputs on the size of the possible quotient. +In the scaled u/v, u can gain a digit from scaling; v never does, because we +pick the scaling factor to make v's top digit larger but without overflowing. +If u and v have n+m and n digits after scaling, then: + + • u < B^(n+m) [B^(n+m) has n+m+1 digits] + • v ≥ B^n / 2 [vₙ₋₁ ≥ B/2, so vₙ₋₁·B^(n-1) ≥ B^n/2] + • u/v < B^(n+m) / (B^n / 2) [divide bounds for u, v] + • u/v < 2 B^m [simplify] + +The quotient can still have m+1 significant digits, but if so the top digit +must be a 1. This provides a different way to handle the first digit of the +result: compare the top n digits of u against v and fill in either a 0 or a 1. + + +Refining Guesses + +Before we check whether u < q̂·v, we can adjust our guess to change it from +q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ into the refined guess ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋. +Although not mentioned above, the Good Guess Guarantee also promises that this +3-by-2-digit division guess is more precise and at most one away from the real +answer q. The improvement from the 2-by-1 to the 3-by-2 guess can also be done +without n-digit math. + +If we have a guess q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ and we want to see if it also equal to +⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, we can use the same check we would for the full division: +if uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂, then the guess is too large and should be reduced. + +Checking uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ < 0, +and + + uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ = (uₙuₙ₋₁·B + uₙ₋₂) - q̂·(vₙ₋₁·B + vₙ₋₂) + [splitting off the bottom digit] + = (uₙuₙ₋₁ - q̂·vₙ₋₁)·B + uₙ₋₂ - q̂·vₙ₋₂ + [regrouping] + +The expression (uₙuₙ₋₁ - q̂·vₙ₋₁) is the remainder of uₙuₙ₋₁ / vₙ₋₁. +If the initial guess returns both q̂ and its remainder r̂, then checking +whether uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as checking r̂·B + uₙ₋₂ < q̂·vₙ₋₂. + +If we find that r̂·B + uₙ₋₂ < q̂·vₙ₋₂, then we can adjust the guess by +decrementing q̂ and adding vₙ₋₁ to r̂. We repeat until r̂·B + uₙ₋₂ ≥ q̂·vₙ₋₂. +(As before, this fixup is only needed at most twice.) + +Now that q̂ = ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, as mentioned above it is at most one +away from the correct q, and we've avoided doing any n-digit math. +(If we need the new remainder, it can be computed as r̂·B + uₙ₋₂ - q̂·vₙ₋₂.) + +The final check u < q̂·v and the possible fixup must be done at full precision. +For random inputs, a fixup at this step is exceedingly rare: the 3-by-2 guess +is not often wrong at all. But still we must do the check. Note that since the +3-by-2 guess is off by at most 1, it can be convenient to perform the final +u < q̂·v as part of the computation of the remainder r = u - q̂·v. If the +subtraction underflows, decremeting q̂ and adding one v back to r is enough to +arrive at the final q, r. + +That's the entirety of long division: scale the inputs, and then loop over +each output position, guessing, checking, and correcting the next output digit. + +For a 2n-digit number divided by an n-digit number (the worst size-n case for +division complexity), this algorithm uses n+1 iterations, each of which must do +at least the 1-by-n-digit multiplication q̂·v. That's O(n) iterations of +O(n) time each, so O(n²) time overall. + + +Recursive Division + +For very large inputs, it is possible to improve on the O(n²) algorithm. +Let's call a group of n/2 real digits a (very) “wide digit”. We can run the +standard long division algorithm explained above over the wide digits instead of +the actual digits. This will result in many fewer steps, but the math involved in +each step is more work. + +Where basic long division uses a 2-by-1-digit division to guess the initial q̂, +the new algorithm must use a 2-by-1-wide-digit division, which is of course +really an n-by-n/2-digit division. That's OK: if we implement n-digit division +in terms of n/2-digit division, the recursion will terminate when the divisor +becomes small enough to handle with standard long division or even with the +2-by-1 hardware instruction. + +For example, here is a sketch of dividing 10 digits by 4, proceeding with +wide digits corresponding to two regular digits. The first step, still special, +must leave off a (regular) digit, dividing 5 by 4 and producing a 4-digit +remainder less than v. The middle steps divide 6 digits by 4, guaranteed to +produce two output digits each (one wide digit) with 4-digit remainders. +The final step must use what it has: the 4-digit remainder plus one more, +5 digits to divide by 4. + + q₆ q₅ q₄ q₃ q₂ q₁ q₀ + _______________________________ + v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀ + ↓ ↓ ↓ ↓ ↓ | | | | | + [u₉ u₈ u₇ u₆ u₅]| | | | | + - [ q₆q₅·v ]| | | | | + ----------------- ↓ ↓ | | | + [ rem |u₄ u₃]| | | + - [ q₄q₃·v ]| | | + -------------------- ↓ ↓ | + [ rem |u₂ u₁]| + - [ q₂q₁·v ]| + -------------------- ↓ + [ rem |u₀] + - [ q₀·v ] + ------------------ + [ rem ] + +An alternative would be to look ahead to how well n/2 divides into n+m and +adjust the first step to use fewer digits as needed, making the first step +more special to make the last step not special at all. For example, using the +same input, we could choose to use only 4 digits in the first step, leaving +a full wide digit for the last step: + + q₆ q₅ q₄ q₃ q₂ q₁ q₀ + _______________________________ + v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀ + ↓ ↓ ↓ ↓ | | | | | | + [u₉ u₈ u₇ u₆]| | | | | | + - [ q₆·v ]| | | | | | + -------------- ↓ ↓ | | | | + [ rem |u₅ u₄]| | | | + - [ q₅q₄·v ]| | | | + -------------------- ↓ ↓ | | + [ rem |u₃ u₂]| | + - [ q₃q₂·v ]| | + -------------------- ↓ ↓ + [ rem |u₁ u₀] + - [ q₁q₀·v ] + --------------------- + [ rem ] + +Today, the code in divRecursiveStep works like the first example. Perhaps in +the future we will make it work like the alternative, to avoid a special case +in the final iteration. + +Either way, each step is a 3-by-2-wide-digit division approximated first by +a 2-by-1-wide-digit division, just as we did for regular digits in long division. +Because the actual answer we want is a 3-by-2-wide-digit division, instead of +multiplying q̂·v directly during the fixup, we can use the quick refinement +from long division (an n/2-by-n/2 multiply) to correct q to its actual value +and also compute the remainder (as mentioned above), and then stop after that, +never doing a full n-by-n multiply. + +Instead of using an n-by-n/2-digit division to produce n/2 digits, we can add +(not discard) one more real digit, doing an (n+1)-by-(n/2+1)-digit division that +produces n/2+1 digits. That single extra digit tightens the Good Guess Guarantee +to q ≤ q̂ ≤ q+1 and lets us drop long division's special treatment of the first +digit. These benefits are discussed more after the Good Guess Guarantee proof +below. + + +How Fast is Recursive Division? + +For a 2n-by-n-digit division, this algorithm runs a 4-by-2 long division over +wide digits, producing two wide digits plus a possible leading regular digit 1, +which can be handled without a recursive call. That is, the algorithm uses two +full iterations, each using an n-by-n/2-digit division and an n/2-by-n/2-digit +multiplication, along with a few n-digit additions and subtractions. The standard +n-by-n-digit multiplication algorithm requires O(n²) time, making the overall +algorithm require time T(n) where + + T(n) = 2T(n/2) + O(n) + O(n²) + +which, by the Bentley-Haken-Saxe theorem, ends up reducing to T(n) = O(n²). +This is not an improvement over regular long division. + +When the number of digits n becomes large enough, Karatsuba's algorithm for +multiplication can be used instead, which takes O(n^log₂3) = O(n^1.6) time. +(Karatsuba multiplication is implemented in func karatsuba in nat.go.) +That makes the overall recursive division algorithm take O(n^1.6) time as well, +which is an improvement, but again only for large enough numbers. + +It is not critical to make sure that every recursion does only two recursive +calls. While in general the number of recursive calls can change the time +analysis, in this case doing three calls does not change the analysis: + + T(n) = 3T(n/2) + O(n) + O(n^log₂3) + +ends up being T(n) = O(n^log₂3). Because the Karatsuba multiplication taking +time O(n^log₂3) is itself doing 3 half-sized recursions, doing three for the +division does not hurt the asymptotic performance. Of course, it is likely +still faster in practice to do two. + + +Proof of the Good Guess Guarantee + +Given numbers x, y, let us break them into the quotients and remainders when +divided by some scaling factor S, with the added constraints that the quotient +x/y and the high part of y are both less than some limit T, and that the high +part of y is at least half as big as T. + + x₁ = ⌊x/S⌋ y₁ = ⌊y/S⌋ + x₀ = x mod S y₀ = y mod S + + x = x₁·S + x₀ 0 ≤ x₀ < S x/y < T + y = y₁·S + y₀ 0 ≤ y₀ < S T/2 ≤ y₁ < T + +And consider the two truncated quotients: + + q = ⌊x/y⌋ + q̂ = ⌊x₁/y₁⌋ + +We will prove that q ≤ q̂ ≤ q+2. + +The guarantee makes no real demands on the scaling factor S: it is simply the +magnitude of the digits cut from both x and y to produce x₁ and y₁. +The guarantee makes only limited demands on T: it must be large enough to hold +the quotient x/y, and y₁ must have roughly the same size. + +To apply to the earlier discussion of 2-by-1 guesses in long division, +we would choose: + + S = Bⁿ⁻¹ + T = B + x = u + x₁ = uₙuₙ₋₁ + x₀ = uₙ₋₂...u₀ + y = v + y₁ = vₙ₋₁ + y₀ = vₙ₋₂...u₀ + +These simpler variables avoid repeating those longer expressions in the proof. + +Note also that, by definition, truncating division ⌊x/y⌋ satisfies + + x/y - 1 < ⌊x/y⌋ ≤ x/y. + +This fact will be used a few times in the proofs. + +Proof that q ≤ q̂: + + q̂·y₁ = ⌊x₁/y₁⌋·y₁ [by definition, q̂ = ⌊x₁/y₁⌋] + > (x₁/y₁ - 1)·y₁ [x₁/y₁ - 1 < ⌊x₁/y₁⌋] + = x₁ - y₁ [distribute y₁] + + So q̂·y₁ > x₁ - y₁. + Since q̂·y₁ is an integer, q̂·y₁ ≥ x₁ - y₁ + 1. + + q̂ - q = q̂ - ⌊x/y⌋ [by definition, q = ⌊x/y⌋] + ≥ q̂ - x/y [⌊x/y⌋ < x/y] + = (1/y)·(q̂·y - x) [factor out 1/y] + ≥ (1/y)·(q̂·y₁·S - x) [y = y₁·S + y₀ ≥ y₁·S] + ≥ (1/y)·((x₁ - y₁ + 1)·S - x) [above: q̂·y₁ ≥ x₁ - y₁ + 1] + = (1/y)·(x₁·S - y₁·S + S - x) [distribute S] + = (1/y)·(S - x₀ - y₁·S) [-x = -x₁·S - x₀] + > -y₁·S / y [x₀ < S, so S - x₀ < 0; drop it] + ≥ -1 [y₁·S ≤ y] + + So q̂ - q > -1. + Since q̂ - q is an integer, q̂ - q ≥ 0, or equivalently q ≤ q̂. + +Proof that q̂ ≤ q+2: + + x₁/y₁ - x/y = x₁·S/y₁·S - x/y [multiply left term by S/S] + ≤ x/y₁·S - x/y [x₁S ≤ x] + = (x/y)·(y/y₁·S - 1) [factor out x/y] + = (x/y)·((y - y₁·S)/y₁·S) [move -1 into y/y₁·S fraction] + = (x/y)·(y₀/y₁·S) [y - y₁·S = y₀] + = (x/y)·(1/y₁)·(y₀/S) [factor out 1/y₁] + < (x/y)·(1/y₁) [y₀ < S, so y₀/S < 1] + ≤ (x/y)·(2/T) [y₁ ≥ T/2, so 1/y₁ ≤ 2/T] + < T·(2/T) [x/y < T] + = 2 [T·(2/T) = 2] + + So x₁/y₁ - x/y < 2. + + q̂ - q = ⌊x₁/y₁⌋ - q [by definition, q̂ = ⌊x₁/y₁⌋] + = ⌊x₁/y₁⌋ - ⌊x/y⌋ [by definition, q = ⌊x/y⌋] + ≤ x₁/y₁ - ⌊x/y⌋ [⌊x₁/y₁⌋ ≤ x₁/y₁] + < x₁/y₁ - (x/y - 1) [⌊x/y⌋ > x/y - 1] + = (x₁/y₁ - x/y) + 1 [regrouping] + < 2 + 1 [above: x₁/y₁ - x/y < 2] + = 3 + + So q̂ - q < 3. + Since q̂ - q is an integer, q̂ - q ≤ 2. + +Note that when x/y < T/2, the bounds tighten to x₁/y₁ - x/y < 1 and therefore +q̂ - q ≤ 1. + +Note also that in the general case 2n-by-n division where we don't know that +x/y < T, we do know that x/y < 2T, yielding the bound q̂ - q ≤ 4. So we could +remove the special case first step of long division as long as we allow the +first fixup loop to run up to four times. (Using a simple comparison to decide +whether the first digit is 0 or 1 is still more efficient, though.) + +Finally, note that when dividing three leading base-B digits by two (scaled), +we have T = B² and x/y < B = T/B, a much tighter bound than x/y < T. +This in turn yields the much tighter bound x₁/y₁ - x/y < 2/B. This means that +⌊x₁/y₁⌋ and ⌊x/y⌋ can only differ when x/y is less than 2/B greater than an +integer. For random x and y, the chance of this is 2/B, or, for large B, +approximately zero. This means that after we produce the 3-by-2 guess in the +long division algorithm, the fixup loop essentially never runs. + +In the recursive algorithm, the extra digit in (2·⌊n/2⌋+1)-by-(⌊n/2⌋+1)-digit +division has exactly the same effect: the probability of needing a fixup is the +same 2/B. Even better, we can allow the general case x/y < 2T and the fixup +probability only grows to 4/B, still essentially zero. + + +References + +There are no great references for implementing long division; thus this comment. +Here are some notes about what to expect from the obvious references. + +Knuth Volume 2 (Seminumerical Algorithms) section 4.3.1 is the usual canonical +reference for long division, but that entire series is highly compressed, never +repeating a necessary fact and leaving important insights to the exercises. +For example, no rationale whatsoever is given for the calculation that extends +q̂ from a 2-by-1 to a 3-by-2 guess, nor why it reduces the error bound. +The proof that the calculation even has the desired effect is left to exercises. +The solutions to those exercises provided at the back of the book are entirely +calculations, still with no explanation as to what is going on or how you would +arrive at the idea of doing those exact calculations. Nowhere is it mentioned +that this test extends the 2-by-1 guess into a 3-by-2 guess. The proof of the +Good Guess Guarantee is only for the 2-by-1 guess and argues by contradiction, +making it difficult to understand how modifications like adding another digit +or adjusting the quotient range affects the overall bound. + +All that said, Knuth remains the canonical reference. It is dense but packed +full of information and references, and the proofs are simpler than many other +presentations. The proofs above are reworkings of Knuth's to remove the +arguments by contradiction and add explanations or steps that Knuth omitted. +But beware of errors in older printings. Take the published errata with you. + +Brinch Hansen's “Multiple-length Division Revisited: a Tour of the Minefield” +starts with a blunt critique of Knuth's presentation (among others) and then +presents a more detailed and easier to follow treatment of long division, +including an implementation in Pascal. But the algorithm and implementation +work entirely in terms of 3-by-2 division, which is much less useful on modern +hardware than an algorithm using 2-by-1 division. The proofs are a bit too +focused on digit counting and seem needlessly complex, especially compared to +the ones given above. + +Burnikel and Ziegler's “Fast Recursive Division” introduced the key insight of +implementing division by an n-digit divisor using recursive calls to division +by an n/2-digit divisor, relying on Karatsuba multiplication to yield a +sub-quadratic run time. However, the presentation decisions are made almost +entirely for the purpose of simplifying the run-time analysis, rather than +simplifying the presentation. Instead of a single algorithm that loops over +quotient digits, the paper presents two mutually-recursive algorithms, for +2n-by-n and 3n-by-2n. The paper also does not present any general (n+m)-by-n +algorithm. + +The proofs in the paper are remarkably complex, especially considering that +the algorithm is at its core just long division on wide digits, so that the +usual long division proofs apply essentially unaltered. +*/ + package big import "math/bits" +// div returns q, r such that q = ⌊u/v⌋ and r = u%v = u - q·v. +// It uses z and z2 as the storage for q and r. func (z nat) div(z2, u, v nat) (q, r nat) { if len(v) == 0 { panic("division by zero") @@ -18,6 +514,8 @@ func (z nat) div(z2, u, v nat) (q, r nat) { } if len(v) == 1 { + // Short division: long optimized for a single-word divisor. + // In that case, the 2-by-1 guess is all we need at each step. var r2 Word q, r2 = z.divW(u, v[0]) r = z2.setWord(r2) @@ -28,7 +526,9 @@ func (z nat) div(z2, u, v nat) (q, r nat) { return } -// q = (x-r)/y, with 0 <= r < y +// divW returns q, r such that q = ⌊x/y⌋ and r = x%y = x - q·y. +// It uses z as the storage for q. +// Note that y is a single digit (Word), not a big number. func (z nat) divW(x nat, y Word) (q nat, r Word) { m := len(x) switch { @@ -56,6 +556,8 @@ func (x nat) modW(d Word) (r Word) { return divWVW(q, 0, x, d) } +// divWVW overwrites z with ⌊x/y⌋, returning the remainder r. +// The caller must ensure that len(z) = len(x). func divWVW(z []Word, xn Word, x []Word, y Word) (r Word) { r = xn if len(x) == 1 { @@ -70,34 +572,33 @@ func divWVW(z []Word, xn Word, x []Word, y Word) (r Word) { return r } -// q = (uIn-r)/vIn, with 0 <= r < vIn -// Uses z as storage for q, and u as storage for r if possible. -// See Knuth, Volume 2, section 4.3.1, Algorithm D. -// Preconditions: -// len(vIn) >= 2 -// len(uIn) >= len(vIn) -// u must not alias z +// div returns q, r such that q = ⌊uIn/vIn⌋ and r = uIn%vIn = uIn - q·vIn. +// It uses z and u as the storage for q and r. +// The caller must ensure that len(vIn) ≥ 2 (use divW otherwise) +// and that len(uIn) ≥ len(vIn) (the answer is 0, uIn otherwise). func (z nat) divLarge(u, uIn, vIn nat) (q, r nat) { n := len(vIn) m := len(uIn) - n - // D1. + // Scale the inputs so vIn's top bit is 1 (see “Scaling Inputs” above). + // vIn is treated as a read-only input (it may be in use by another + // goroutine), so we must make a copy. + // uIn is copied to u. shift := nlz(vIn[n-1]) - // do not modify vIn, it may be used by another goroutine simultaneously vp := getNat(n) v := *vp shlVU(v, vIn, shift) - - // u may safely alias uIn or vIn, the value of uIn is used to set u and vIn was already used u = u.make(len(uIn) + 1) u[len(uIn)] = shlVU(u[0:len(uIn)], uIn, shift) - // z may safely alias uIn or vIn, both values were used already + // The caller should not pass aliased z and u, since those are + // the two different outputs, but correct just in case. if alias(z, u) { - z = nil // z is an alias for u - cannot reuse + z = nil } q = z.make(m + 1) + // Use basic or recursive long division depending on size. if n < divRecursiveThreshold { q.divBasic(u, v) } else { @@ -106,19 +607,17 @@ func (z nat) divLarge(u, uIn, vIn nat) (q, r nat) { putNat(vp) q = q.norm() + + // Undo scaling of remainder. shrVU(u, u, shift) r = u.norm() return q, r } -// divBasic performs word-by-word division of u by v. -// The quotient is written in pre-allocated q. -// The remainder overwrites input u. -// -// Precondition: -// - q is large enough to hold the quotient u / v -// which has a maximum length of len(u)-len(v)+1. +// divBasic implements long division as described above. +// It overwrites q with ⌊u/v⌋ and overwrites u with the remainder r. +// q must be large enough to hold ⌊u/v⌋. func (q nat) divBasic(u, v nat) { n := len(v) m := len(u) - n @@ -126,45 +625,56 @@ func (q nat) divBasic(u, v nat) { qhatvp := getNat(n + 1) qhatv := *qhatvp - // D2. + // Set up for divWW below, precomputing reciprocal argument. vn1 := v[n-1] rec := reciprocalWord(vn1) + + // Compute each digit of quotient. for j := m; j >= 0; j-- { - // D3. + // Compute the 2-by-1 guess q̂. + // The first iteration must invent a leading 0 for u. qhat := Word(_M) var ujn Word if j+n < len(u) { ujn = u[j+n] } + + // ujn ≤ vn1, or else q̂ would be more than one digit. + // For ujn == vn1, we set q̂ to the max digit M above. + // Otherwise, we compute the 2-by-1 guess. if ujn != vn1 { var rhat Word qhat, rhat = divWW(ujn, u[j+n-1], vn1, rec) - // x1 | x2 = q̂v_{n-2} + // Refine q̂ to a 3-by-2 guess. See “Refining Guesses” above. vn2 := v[n-2] x1, x2 := mulWW(qhat, vn2) - // test if q̂v_{n-2} > br̂ + u_{j+n-2} ujn2 := u[j+n-2] - for greaterThan(x1, x2, rhat, ujn2) { + for greaterThan(x1, x2, rhat, ujn2) { // x1x2 > r̂ u[j+n-2] qhat-- prevRhat := rhat rhat += vn1 - // v[n-1] >= 0, so this tests for overflow. + // If r̂ overflows, then + // r̂ u[j+n-2]v[n-1] is now definitely > x1 x2. if rhat < prevRhat { break } + // TODO(rsc): No need for a full mulWW. + // x2 += vn2; if x2 overflows, x1++ x1, x2 = mulWW(qhat, vn2) } } - // D4. - // Compute the remainder u - (q̂*v) << (_W*j). - // The subtraction may overflow if q̂ estimate was off by one. + // Compute q̂·v. qhatv[n] = mulAddVWW(qhatv[0:n], v, qhat, 0) qhl := len(qhatv) if j+qhl > len(u) && qhatv[n] == 0 { qhl-- } + + // Subtract q̂·v from the current section of u. + // If it underflows, q̂·v > u, which we fix up + // by decrementing q̂ and adding v back. c := subVV(u[j:j+qhl], u[j:], qhatv) if c != 0 { c := addVV(u[j:j+n], u[j:], v) @@ -176,6 +686,8 @@ func (q nat) divBasic(u, v nat) { qhat-- } + // Save quotient digit. + // Caller may know the top digit is zero and not leave room for it. if j == m && m == len(q) && qhat == 0 { continue } @@ -185,30 +697,34 @@ func (q nat) divBasic(u, v nat) { putNat(qhatvp) } -// greaterThan reports whether (x1<<_W + x2) > (y1<<_W + y2) +// greaterThan reports whether the two digit numbers x1 x2 > y1 y2. +// TODO(rsc): In contradiction to most of this file, x1 is the high +// digit and x2 is the low digit. This should be fixed. func greaterThan(x1, x2, y1, y2 Word) bool { return x1 > y1 || x1 == y1 && x2 > y2 } +// divRecursiveThreshold is the number of divisor digits +// at which point divRecursive is faster than divBasic. const divRecursiveThreshold = 100 -// divRecursive performs word-by-word division of u by v. -// The quotient is written in pre-allocated z. -// The remainder overwrites input u. -// -// Precondition: -// - len(z) >= len(u)-len(v) -// -// See Burnikel, Ziegler, "Fast Recursive Division", Algorithm 1 and 2. +// divRecursive implements recursive division as described above. +// It overwrites z with ⌊u/v⌋ and overwrites u with the remainder r. +// z must be large enough to hold ⌊u/v⌋. +// This function is just for allocating and freeing temporaries +// around divRecursiveStep, the real implementation. func (z nat) divRecursive(u, v nat) { - // Recursion depth is less than 2 log2(len(v)) - // Allocate a slice of temporaries to be reused across recursion. + // Recursion depth is (much) less than 2 log₂(len(v)). + // Allocate a slice of temporaries to be reused across recursion, + // plus one extra temporary not live across the recursion. recDepth := 2 * bits.Len(uint(len(v))) - // large enough to perform Karatsuba on operands as large as v tmp := getNat(3 * len(v)) temps := make([]*nat, recDepth) + z.clear() z.divRecursiveStep(u, v, 0, tmp, temps) + + // Free temporaries. for _, n := range temps { if n != nil { putNat(n) @@ -217,72 +733,92 @@ func (z nat) divRecursive(u, v nat) { putNat(tmp) } -// divRecursiveStep computes the division of u by v. -// - z must be large enough to hold the quotient -// - the quotient will overwrite z -// - the remainder will overwrite u +// divRecursiveStep is the actual implementation of recursive division. +// It adds ⌊u/v⌋ to z and overwrites u with the remainder r. +// z must be large enough to hold ⌊u/v⌋. +// It uses temps[depth] (allocating if needed) as a temporary live across +// the recursive call. It also uses tmp, but not live across the recursion. func (z nat) divRecursiveStep(u, v nat, depth int, tmp *nat, temps []*nat) { + // u is a subsection of the original and may have leading zeros. + // TODO(rsc): The v = v.norm() is useless and should be removed. + // We know (and require) that v's top digit is ≥ B/2. u = u.norm() v = v.norm() - if len(u) == 0 { z.clear() return } + + // Fall back to basic division if the problem is now small enough. n := len(v) if n < divRecursiveThreshold { z.divBasic(u, v) return } + + // Nothing to do if u is shorter than v (implies u < v). m := len(u) - n if m < 0 { return } - // Produce the quotient by blocks of B words. - // Division by v (length n) is done using a length n/2 division - // and a length n/2 multiplication for each block. The final - // complexity is driven by multiplication complexity. + // We consider B digits in a row as a single wide digit. + // (See “Recursive Division” above.) + // + // TODO(rsc): rename B to Wide, to avoid confusion with _B, + // which is something entirely different. + // TODO(rsc): Look into whether using ⌈n/2⌉ is better than ⌊n/2⌋. B := n / 2 // Allocate a nat for qhat below. if temps[depth] == nil { - temps[depth] = getNat(n) + temps[depth] = getNat(n) // TODO(rsc): Can be just B+1. } else { *temps[depth] = temps[depth].make(B + 1) } + // Compute each wide digit of the quotient. + // + // TODO(rsc): Change the loop to be + // for j := (m+B-1)/B*B; j > 0; j -= B { + // which will make the final step a regular step, letting us + // delete what amounts to an extra copy of the loop body below. j := m for j > B { - // Divide u[j-B:j+n] by vIn. Keep remainder in u - // for next block. - // - // The following property will be used (Lemma 2): - // if u = u1 << s + u0 - // v = v1 << s + v0 - // then floor(u1/v1) >= floor(u/v) + // Divide u[j-B:j+n] (3 wide digits) by v (2 wide digits). + // First make the 2-by-1-wide-digit guess using a recursive call. + // Then extend the guess to the full 3-by-2 (see “Refining Guesses”). // - // Moreover, the difference is at most 2 if len(v1) >= len(u/v) - // We choose s = B-1 since len(v)-s >= B+1 >= len(u/v) + // For the 2-by-1-wide-digit guess, instead of doing 2B-by-B-digit, + // we use a (2B+1)-by-(B+1) digit, which handles the possibility that + // the result has an extra leading 1 digit as well as guaranteeing + // that the computed q̂ will be off by at most 1 instead of 2. + + // s is the number of digits to drop from the 3B- and 2B-digit chunks. + // We drop B-1 to be left with 2B+1 and B+1. s := (B - 1) - // Except for the first step, the top bits are always - // a division remainder, so the quotient length is <= n. + + // uu is the up-to-3B-digit section of u we are working on. uu := u[j-B:] + // Compute the 2-by-1 guess q̂, leaving r̂ in uu[s:B+n]. qhat := *temps[depth] qhat.clear() qhat.divRecursiveStep(uu[s:B+n], v[s:], depth+1, tmp, temps) qhat = qhat.norm() - // Adjust the quotient: - // u = u_h << s + u_l - // v = v_h << s + v_l - // u_h = q̂ v_h + rh - // u = q̂ (v - v_l) + rh << s + u_l - // After the above step, u contains a remainder: - // u = rh << s + u_l - // and we need to subtract q̂ v_l - // - // But it may be a bit too large, in which case q̂ needs to be smaller. + + // Extend to a 3-by-2 quotient and remainder. + // Because divRecursiveStep overwrote the top part of uu with + // the remainder r̂, the full uu already contains the equivalent + // of r̂·B + uₙ₋₂ from the “Refining Guesses” discussion. + // Subtracting q̂·vₙ₋₂ from it will compute the full-length remainder. + // If that subtraction underflows, q̂·v > u, which we fix up + // by decrementing q̂ and adding v back, same as in long division. + + // TODO(rsc): Instead of subtract and fix-up, this code is computing + // q̂·vₙ₋₂ and decrementing q̂ until that product is ≤ u. + // But we can do the subtraction directly, as in the comment above + // and in long division, because we know that q̂ is wrong by at most one. qhatv := tmp.make(3 * n) qhatv.clear() qhatv = qhatv.mul(qhat, v[:s]) @@ -309,6 +845,8 @@ func (z nat) divRecursiveStep(u, v nat, depth int, tmp *nat, temps []*nat) { j -= B } + // TODO(rsc): Rewrite loop as described above and delete all this code. + // Now u < (v<<B), compute lower bits in the same way. // Choose shift = B-1 again. s := B - 1 |