1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
|
Implementation of mpfr_sum: sum of n mpfr_t values
══════════════════════════════════════════════════
Introduction
────────────
The original mpfr_sum algorithm, used up to GNU MPFR 3.1.x, was very
inefficient on some inputs and could take all the memory, making the
whole program crash (and possibly other programs on the system too,
due to the lack of memory at some critical point). It was based on
the article:
James Demmel and Yozo Hida, Fast and accurate floating-point summation
with application to computational geometry, Numerical Algorithms,
volume 37, number 1-4, pages 101--112, 2004.
which consists in adding the inputs one by one in decreasing magnitude.
But in MPFR, each number has its own precision and the exponent range is
large, so that this algorithm doesn't work well. The computation of the
sum was done with mpfr_add's in some fixed precision, in a conventional
Ziv loop. The inefficiency (with possible crash) was seen when a large
precision was needed due to the Table Maker's Dilemma. A huge precision
could be needed in case of internal cancellations and numbers of very
different order of magnitudes. This problem was not easily fixable to
handle all the possible cases.
Moreover it was decided to drop Demmel's algorithm because it is based
on floating-point operations while the basic blocks of MPFR are integer
operations, and they could be used directly to compute a sum. Another
drawback of Demmel's algorithm for the particular case of MPFR is that
there is no good way to avoid the Ziv loop:
* One could consider the largest precision among all the numbers
(each input and the output), but this could mean computations in
an internal precision much larger than actually necessary.
* Split the inputs so that all the values have the same precision
(suggestion from Demmel), but, in particular in small precision,
this introduces an important overhead.
The time/memory complexity must no longer depend on the value of the
exponents of the inputs, i.e. their order of magnitude.
Instead of being high level (based on mpfr_add), the algorithm and
implementation should be low level, based on integer operations,
equivalently seen as fixed-point operations. They must be efficient
both for the "general case" and for particular cases, with global
cancellations (the result being much smaller than some of the inputs)
and/or the Table Maker's Dilemma. So, we will use functions from the
mpn layer of GMP only, involving virtual splitting of MPFR numbers.
As a bonus, this will also solve overflow, underflow and normalization
issues, since everything is done in fixed point and the output exponent
will be considered only at the end (early overflow detection could also
be done, but this would probably not be very useful in practice).
The idea is the following. After handling special cases (NaN, infinities,
only zeros, and fewer than 3 regular inputs), we apply the generic case,
which more or less consists in a fixed-point accumulation by blocks: we
take into account the bits of the inputs whose weight is in some window
⟦minexp,maxexp⟦, and if this is not sufficient due to cancellation, we
reiterate with some less significant window. Once we have obtained an
accurate sum, if one still cannot round correctly because the result is
too close to a rounding boundary (i.e. a machine number or the middle
of two consecutive machine numbers), which is the problem known as the
Table Maker's Dilemma (TMD), then this problem is solved by determining
the sign of the remainder by using the same method in a low precision.
Specification
─────────────
The prototype:
int
mpfr_sum (mpfr_ptr sum, mpfr_ptr *const x, unsigned long n, mpfr_rnd_t rnd)
where sum will contain the result, x is an array of pointers to the
inputs, n is the length of this array, and rnd is the rounding mode.
The return value of type int will be the usual ternary value.
FIXME: It is currently assumed that no input pointers point to sum,
i.e. that the output sum is not shared with an input.
If n = 0, then the result is +0, whatever the rounding mode. This is
equivalent to mpfr_set_ui and mpfr_set_si on the integer 0, and this
choice is consistent with IEEE 754-2008's sum reduction operation of
vector length 0.
Otherwise the result must be the same as the one that would have been
obtained with
* if n = 1, mpfr_set;
* if n > 1, a succession of additions (mpfr_add) done in infinite
precision, then rounded (the order of these additions doesn't
matter).
This is equivalent to apply the following steps:
1. If an input is NaN, then the result is NaN.
2. If there are at least a +inf and a −inf, then the result is NaN.
3. If there is at least an infinity (in which case all the infinities
have the same sign), then the result is this infinity.
4. If the result is an exact zero:
* if all the inputs have the same sign (i.e. all +0 or all −0),
then the result has the same sign as the inputs;
* otherwise, either because all inputs are zeros with at least
a +0 and a −0, or because some inputs are non-zero (but they
globally cancel), the result is +0, except for the MPFR_RNDD
rounding mode, where it is −0.
5. Otherwise the exact result is a non-zero real number, and the
conventional rounding function is applied.
Concerning the exceptions, one generates a NaN exception in cases (1)
and (2). Case (5) potentially yields inexact / underflow / overflow
exceptions, as specified by the rounding rules.
See also:
https://sympa.inria.fr/sympa/arc/mpfr/2014-03/msg00000.html
Preliminaries About the Representation of MPFR Numbers
──────────────────────────────────────────────────────
In MPFR, exponents are stored in a signed integer type mpfr_exp_t.
If this type has N value bits, i.e. the maximum value is 2^N−1, then
the maximum exponent range is defined so that any valid exponent fits
in N−1 bits (sign bit excluded), i.e. it is ⟦1−2^(N−1),2^(N−1)−1⟧.
This implies a huge gap between the minimum value MPFR_EXP_MIN of
the type and the minimum valid exponent MPFR_EMIN_MIN.
We assume that each represented bit of a regular MPFR number has an
exponent value representable in the mpfr_exp_t type, i.e. a value
that is ≥ MPFR_EXP_MIN. MPFR code of some other functions have similar
requirements (often not documented). However, unless the precision of
a MPFR number is huge, such requirements are satisfied. In practice,
they are always satisfied on 64-bit machines due to memory limits. On
32-bit machines, users would probably reach other system limits first
(mainly due to the limit on the address space).
Algorithm and Implementation: Preliminary Steps
───────────────────────────────────────────────
We start by detecting the special cases. The mpfr_sum function does the
following.
1. If n ≤ 2, we can use existing MPFR functions and macros, mainly for
better efficiency since the algorithm described below can work with
any number of inputs (only minor changes would be needed).
* If n = 0, return +0 (by using MPFR macros).
* If n = 1, use mpfr_set.
* If n = 2, use mpfr_add.
2. Iterate on the n input numbers to:
(A) detect singular values (NaN, infinity, zero);
(B) among the regular values, get the maximum exponent.
Such information can be retrieved very quickly and this does not need
to look at the significand. Moreover, in the current internal number
representation, the kind of a singular value is represented as special
values of the exponent field, so that (B) does not need to fetch more
data in memory after doing (A).
In detail, during this iteration, 4 variables will be set, but the
loop will terminate earlier if one can determine that the result
will be NaN, either because of a NaN input or because of infinity
inputs of opposite signs:
* maxexp, which will contain the maximum exponent of the inputs.
Thus it is initialized to MPFR_EXP_MIN.
* rn, which will contain the number n' of regular inputs, i.e.
those which are non-zero finite numbers.
* sign_inf, which will contain the sign of an infinity result. It
is initialized to 0, meaning no infinities yet. When the first
infinity is encountered, this value is changed to the sign of
this infinity (+1 or −1). When a new infinity is encountered,
either it has the same sign of sign_inf, in which case nothing
changes, or it has the opposite sign, in which case the loop
terminates immediately and a NaN result is returned.
* sign_zero, which will contain the sign of the zero result in
the case where all the inputs are zeros. Thanks to the IEEE 754
rules, this can be tracked with this variable alone: There is a
weak sign (−1, except for MPFR_RNDD, where it is +1), which can
be obtained only when all the inputs are zeros of this sign, and
a strong sign (+1, except for MPFR_RNDD, where it is −1), which
is obtained in all the other cases, i.e. when there is at least
a zero of this sign. One could have initialized the value of
sign_zero to the weak sign. But we have chosen to initialize it
to 0, which means that the sign is currently unknown, and do an
additional test in the loop. In practice, one should not see the
difference; this second solution was chosen just because it was
implemented first, and on a test, it appears that it makes the
code slightly shorter.
When the loop has terminated "normally", the result cannot be NaN.
One does in the following order:
1. If sign_inf ≠ 0, then the result is an infinity of this sign.
2. If rn = 0, then all the inputs are zeros, so that the result
is a zero whose sign is given by sign_zero.
3. If rn ≤ 2, then one can use mpfr_set (if rn = 1) or mpfr_add
(if rn = 2) as an optimization, similarly to what was done for
n ≤ 2. We reiterate in order to find the concerned input(s).
4. Otherwise one calls a function sum_aux implementing the generic
case, more precisely:
sum_aux (sum, x, n, rnd, maxexp, rn);
i.e. in addition to the parameters of mpfr_sum, we pass:
* the maximum exponent;
* the number of regular inputs, i.e. the number of non-zero
inputs. This number will be used instead of n to determine
bounds on the sum (to avoid internal overflows) and error
bounds.
Algorithm and Implementation: The Generic Case
──────────────────────────────────────────────
Let us define logn = ceil(log2(n')).
The basic idea is to compute a truncated sum in the two's complement
representation, by using a fixed-point accumulator stored in a fixed
memory area.
Two's complement is preferred to the sign + magnitude representation
because the signs of the temporary (and final) results are not known
in advanced, and the computations (additions and subtractions) in
two's complement are more natural in this context. There will be a
conversion to sign + magnitude (representation used by MPFR numbers)
at the end, but this should not take much time compared to the other
calculations.
The precision of the accumulator needs to be a bit larger than the
output precision, denoted sq, for two reasons:
* We need some additional bits on the side of the most significant
part due to the accumulation of n' values, which can make the sum
grow and overflow without enough extra bits. The absolute value of
the sum is less than n' * 2^maxexp, taking up to logn extra bits,
and one needs one more bit to be able to determine the sign due to
two's complement, so that a total of cq = logn + 1 extra bits will
be necessary.
* We need some additional bits on the side of the least significant
part to take into account the accumulation of the truncation errors.
The choice of this number dq of bits is quite arbitrary: the larger
this value is, the longer an iteration will take, but conversely,
the less likely a costly new iteration (due to cancellations and/or
the Table Maker's Dilemma) will be needed. In order to make the
implementation simpler, the precision of the accumulator will be a
multiple of the word size (GMP_NUMB_BITS). Moreover the algorithm
will need at least 4 bits. The final choice should be done after
testing various applications. In the current implementation, we
choose the smallest value ≥ logn + 2 such that the precision of the
accumulator is a multiple of GMP_NUMB_BITS. Since logn ≥ 2, we have
dq ≥ 4 as wanted.
Accumulator: [--------]-----------------------------------]
cq └─ maxexp sq + dq minexp ─┘
Accumulator precision: wq = cq + maxexp − minexp = cq + sq + dq.
This idea is not sufficient because if the Table Maker's Dilemma occurs,
the needed precision for the truncated sum will grow. In particular, one
can easily reach a huge precision with a few small-precision inputs. For
instance, in directed rounding modes, sum(2^E,2^F) with F much smaller
than E. We want to avoid increasing the precision of the accumulator.
This will be done by detecting the Table Maker's Dilemma, and when it
occurs, solving it consists in determining the sign of some error term.
This will be done by computing an approximation to the error term in a
small precision. The algorithm to compute this error term is the same
as the one to compute an approximation to the sum, the main difference
being that we just need a small precision here. In the implementation,
we will use a function sum_raw for both computations.
Algorithm and Implementation: The sum_raw Function
──────────────────────────────────────────────────
In summary:
The sum_raw function will work in a fixed-point accumulator, having a
fixed precision (= size in limbs * GMP_NUMB_BITS). It will accumulate
the bits of the inputs whose exponents are in ⟦minexp,maxexp⟦, where
maxexp − minexp is less that the precision of the accumulator: as said
above, we need some additional bits in order to avoid overflows during
the accumulation. On the entry, the accumulator may already contain a
value from previous computations (it is the caller that clears it if
need be).
During the accumulation, the bits of an input x[i] whose exponents are
less than minexp form the "tail" of this input. When the tail of x[i]
is not empty, its exponent is defined as e[i] = min(E(x[i]),minexp).
The absolute value of each tail is strictly less than 2^e[i]. This will
give an error bound on the computed sum at each iteration.
If the computed result is 0 (meaning full cancellation), set maxexp to
the maximum exponent of the tails, set minexp so that it corresponds to
the least significant bit of the accumulator, and reiterate. Otherwise,
let e and err denote the exponents of the computed result and of the
error bound respectively. While e − err is less than some given bound
denoted prec, shift the accumulator, update maxexp and minexp, and
reiterate. For the caller, this bound must be large enough in order to
reach some wanted accuracy. However it cannot be too large since the
accumulator has a limited precision: we will need to make sure that
if a reiteration is needed, then the cause is a partial cancellation,
so that the shift count is non-zero, otherwise minexp would not change
and one would get an infinite loop. Details and formal definitions are
given later.
Basically, this can be regarded as a sum by block, where each block
has the form ⟦minexp,maxexp⟦.
Notes:
* The reiterations will end when there are no more tails, but in the
code, this is detected only when needed.
* The definition of the tails allows one to skip holes between inputs
in case of full cancellation (e.g. 1 + (−1) + 2^tiny_exponent).
* We choose not to include maxexp in the exponent interval in order
to match the convention chosen to represent floating-point numbers
in MPFR, where the significand is in [1/2,1[, i.e. the exponent of
a floating-point number is the one of the most significant bit + 1.
This also means that the current minexp will be maxexp at the next
iteration, unless there is a hole between the inputs, as explained
above.
Inputs of the sum_raw function (see the code for the types):
* wp: pointer to the accumulator (least significant limb first).
* ws: size of the accumulator (in limbs).
* wq: precision of the accumulator (ws * GMP_NUMB_BITS).
* x: array of the input numbers.
* n: size of this array (number of inputs, regular or not).
* minexp: exponent of the least significant bit of the first block.
* maxexp: exponent of the first block (exponent of its MSB + 1).
* tp: pointer to a temporary area (pre-allocated).
* ts: size of this temporary area.
* logn: ceil(log2(rn)), where rn is the number of regular inputs.
* prec: lower bound for e − err (as described above).
* ep: pointer to mpfr_exp_t (see below), or a null pointer.
* minexpp: pointer to mpfr_exp_t (see below).
* maxexpp: pointer to mpfr_exp_t (see below).
We require as preconditions (explanations are given later):
* prec ≥ 1
* wq ≥ logn + prec + 2
This function returns 0 if the accumulator is 0 (which implies that
the exact sum for this sum_raw invocation is 0), otherwise the number
of cancelled bits (>= 1), defined as the number of identical bits on
the most significant part of the accumulator. In the latter case, it
also returns the following data in variables passed by reference:
* in ep: the exponent e of the computed result (unless ep is NULL);
* in minexpp: the last value of minexp;
* in maxexpp: the new value of maxexp (for the next iteration after
the first invocation of sum_raw in the main code).
Notes:
* All the bits of the accumulator are meaningful; minexp is also
the exponent of the least significant bit of the accumulator.
* The temporary area must be large enough to hold a shifted input
block, and the value of ts is used only when the full assertions
are checked (i.e. with the --enable-assert configure option), to
check that a buffer overflow doesn't occur.
* Contrary to the returned value of minexp (the value in the last
iteration), the returned value of maxexp is the one for the next
iteration (= maxexp2 of the last iteration, see below).
Details about the sum_raw implementation are given below. Here is some
notation:
E(v): the exponent of a MPFR number v.
P(v): the precision of a MPFR number v.
Q(v) = E(v) − P(v): the exponent of the ulp of a MPFR number v.
A maxexp2 variable will contain the maximum exponent of the tails.
Thus it is initialized to the minimum value of the exponent type:
MPFR_EXP_MIN; this choice means that at the end of the loop below,
maxexp2 will be MPFR_EXP_MIN iff there are no more tails (this case
implies that the truncated sum is exact). If a new iteration is needed,
then maxexp2 will be the maxexp of this new iteration.
Then one has a loop over the inputs x[i]. Each iteration consists in
the following steps:
1. If the input is not regular, skip it (go to the next iteration).
Note: if there are many non-regular inputs, it may be more efficient
to have an array pointing to the regular inputs only, but such a
case is assumed to be rare, and the number on iterations is also
limited by the relatively small number of regular inputs.
2. If E(x[i]) ≤ minexp, then no bits of x[i] need to be considered here.
We set maxexp2 ← max(maxexp2,E(x[i])), then go to the next iteration.
3. Now, we have: E(x[i]) > minexp. If the tail of x[i] is not empty,
i.e. if Q(x[i]) < minexp, then we set maxexp2 ← minexp.
4. We prepare the input for the accumulation. First, this means that if
its significand is not aligned with the accumulator, then we need to
align it by shifting a part of the significand (containing bits that
will be accumulated at this iteration), storing the result to the
temporary area tp. In both cases, we compute:
* the address and size of the destination, corresponding either to
the full accumulator or to some most significant part (which can
occur if Q(x[i]) > minexp);
* the address and size of the block (of the input or its shifted
version) to accumulate to the destination;
* the number of the most significant bits of the block that need
to be masked. Masking may be needed if the input was not shifted
because in such as case, this is regarded as a read-only area.
5. If x[i] is positive: an addition is done with mpn_add_n; if the most
significant limb needs to be masked, it isn't taken into account in
the addition, but the masked part is just added to the carry; carry
propagation is done with mpn_add_1 if the size of the destination is
larger than the size of the block.
Note: There may be still be a carry out, but it is just ignored. This
occurs when a negative value in the accumulator becomes non-negative,
and this fact is part of the usual two's complement arithmetic.
If x[i] is negative, we do similar computations by using mpn_sub_n
for the subtraction and mpn_sub_1 to propagate borrows.
Note: The steps 2, 3, and 4 above are currently done by distinguishing
two cases:
A. Q(x[i]) < minexp, covering cases like:
[-+- accumulator ---]
[---│----- x[i] ------│--]
│ [----- x[i] --│--]
│ │[----- x[i] -----]
│ │ [----- x[i] -----]
maxexp minexp
B. Q(x[i]) ≥ minexp, covering cases like:
[-+- accumulator ---]
[- x[i] -] │ │
[---│-- x[i] ------] │
[------│-- x[i] ---------]
│ [- x[i] -] │
maxexp minexp
It might be possible to merge these cases in a future version of the
code.
After the loop over the inputs, we need to see whether the accuracy
of the truncated sum is sufficient. We first determine the number of
cancelled bits, defined as the number of consecutive identical bits
starting with the most significant one in the accumulator. At the
same time, we can determine whether the truncated sum is 0 (all the
bits are identical and their value is 0). If it is 0, we have two
cases: if maxexp2 is equal to MPFR_EXP_MIN (meaning no more tails),
then we return 0, otherwise we reiterate with maxexp = maxexp2 and
minexp = cq + maxexp − wq (at the beginning of sum_raw).
We can now assume that the truncated sum is not 0.
Let us note that our computation of the number of cancelled bits
(variable cancel) was limited to the accumulator representation,
while from a mathematical point of view, the binary expansion is
unlimited and the bits of exponent less than minexp are regarded
as 0's. So, we need to check that the value of cancel matches this
mathematical point of view:
* If the cancelled bits are 0's, then, since the truncated sum is
not 0, there must be at least a bit 1 in the accumulator.
* If the cancelled bits are 1's, then this sequence entirely fits
in the accumulator, as the first non-represented bit is a 0.
The analysis below maps the truncated sum to the destination without
considering rounding yet.
Let us denote:
* e = maxexp + cq − cancel = minexp + wq − cancel,
* err = maxexp2 + logn.
Then e is the exponent of the least significant cancelled bit, thus the
absolute value of the truncated sum is in [2^(e−1),2^e] (binade closed
on both ends due to two's complement). And since there are at most logn
regular inputs and the absolute value of each tail is < 2^maxexp2, the
absolute value of the error is strictly less than 2^err. If maxexp2 is
equal to MPFR_EXP_MIN (meaning no more tails), then the error is 0.
Note: The C implementation requires prec ≥ 0 due to the use of unsigned
integer arithmetic in the ≥ test between e − err and prec in order to
avoid a potential integer overflow. This is not an issue as to be able
to get the sign of the result, one needs prec ≥ 1. Actually, for this
reason, the C code uses prec ≥ 1 as a precondition.
If e − err ≥ prec, then the sum_raw function returns as described above.
Otherwise, due to cancellation, we need to reiterate after shifting the
value of the accumulator to the left and updating minexp and maxexp. Let
shiftq denote the shift count, which must satisfy: 0 < shiftq < cancel.
The left inequality must be strict to ensure termination, and the right
inequality ensures that the value of the accumulator will not change
with the updated minexp: shiftq is subtracted from minexp at the same
time. The reiteration is done with maxexp = maxexp2, as said above.
Example. If there is a additional iteration with maxexp2 = minexp − 4
and a shift of 26 bits (due to cancellation), here is the accumulator
before and after the shift:
Before shift: [--------]-----------------------------------]
└─ maxexp minexp ─┘
<------- 26 zeros ------->
After shift: [-------------------00000000000000000000000000
This iteration: minexp ─┘ ├─ maxexp2 │
Next iteration: └─ maxexp minexp ─┘
We now need to determine the value of shiftq. We prefer it to be as
large as possible: this is some form of normalization. Moreover, it
must satisfy the above double inequality and be such that:
(1) the new value of minexp is smaller than the new value of maxexp,
i.e. minexp − shiftq < maxexp2, i.e. shiftq > minexp − maxexp2;
(2) overflows will still be impossible in the new iteration.
Note that since maxexp2 ≤ minexp, (1) will imply shiftq > 0. And (2)
is an extension of shiftq < cancel. Thus the double inequality above
is a weak form of what is actually required.
Since we prefer shiftq to be maximum, we focus on (2) first. The
absolute value of the accumulator at the end of the next iteration
will be strictly bounded by 2^e + 2^err ≤ 2^(e + 1 + max(0, err − e)).
This means that if we do not shift the value in the accumulator, then
at the end of the next iteration, the accumulator will contain at least
cancel − 1 − max(0, err − e) identical bits on its most significant
part. Only the last of these bits is needed (which gives the sign) and
the other ones are redundant. Therefore, in order to satisfy (2), we
can choose:
shiftq = cancel − 2 − max(0, err − e).
Now, let us prove that for this value, (1) is satisfied.
* If err − e ≥ 0, then:
shiftq = cancel − 2 − (err − e)
= cancel − 2 − (maxexp2 + logn) + (minexp + wq − cancel)
= minexp − maxexp2 + (wq − logn − 2) > minexp − maxexp2
since wq ≥ logn + prec + 2 ≥ logn + 3 on input.
* If err − e < 0, then this is the case where the error can be
potentially small: to be able to prove the inequality, we need
to use the fact that the stop condition was not satisfied, i.e.
e − err < prec. Thus
(minexp + wq − cancel) − (maxexp2 + logn) < prec
and
shiftq − (minexp − maxexp2) = cancel − 2 − (minexp − maxexp2)
> wq − logn − prec − 2 ≥ 0
since wq ≥ logn + prec + 2 on input.
QED.
Note: It is expected that in general, the cancellation is not very
large, so that the new additions in the new iteration will occur
only in a small part of the accumulator, except in case of long
carry propagation (see below).
Old discussion about the choice of the shift count (not much useful
any longer, but it can help to understand, with its examples):
Concerning the choice of the shift count shiftq, letting only one
identical bit in the most significant part may not be sufficient;
for instance, if
a. maxexp2 = minexp;
b. the accumulator contains:
0000000011111111111111111111111111110101
This iteration: minexp ─┘
c. the accumulator is shifted to:
0111111111111111111111111111101010000000
Next iteration: └─ maxexp
d. there are at least 12 numbers to add at the next iteration;
then one could end up with something like:
1000000000000000000000000000000000010001
└─ maxexp
i.e. an overflow in two's complement. But leaving at least
2 + max(0, err - e) itentical bits in the most significant part,
such as
0011111111111111111111111111110101000000
└─ maxexp
is sufficient. The second term of the max covers cases like:
┌─ err = maxexp2 + logn
0000000000000000000000000000000000000111
e ─┘
(2^err being a bound on the error term, thus a bound on the value
that will be added to the accumulator at the next iteration), for
which the accumulator can be shifted to:
┌─ err
0000000111000000000000000000000000000000
e ─┘
without triggering an overflow at the next iteration, but
┌─ err
0000001110000000000000000000000000000000
e ─┘
is incorrect as a 1 could appear in the MSB (making the accumulator
value negative) just with additions of positive numbers.
Said otherwise, leaving at least i identical bits allows one to
represent numbers in [-2^(e+i-1),2^(e+i-1)[. The current sum is in
[-2^e,2^e[, and taking into account the error terms, one wants to
be able to represent arbitrary numbers in ]-2^e-2^err,2^e+2^err[.
So, i must be chosen in such a way that 2^e + 2^err ≤ 2^(e+i-1),
i.e. 2^0 + 2^(err-e) ≤ 2^(i-1). The smallest power of two larger
than or equal to 2^0 + 2^(err-e) has exponent 1 + max(0, err - e).
Hence the choice i ≥ 2 + max(0, err - e).
Algorithm and Implementation: Back to the Generic Case
──────────────────────────────────────────────────────
Let us recall the form of the accumulator for the summation itself,
i.e. the first sum_raw invocation:
Accumulator: [--------]-----------------------------------]
cq └─ maxexp sq + dq minexp ─┘
where cq = logn + 1, sq is the target precision, and dq ≥ logn + 2.
Memory is allocated both for the accumulator and for the temporary area
needed by sum_raw. For performance reasons, the allocation is done in
the stack if the size is small enough (see MPFR_TMP_LIMBS_ALLOC macro).
No other memory allocation will be needed (except for auto variables).
The accumulator is zeroed and the sum_raw is invoked to compute an
accurate approximation of the sum. Among its parameters, maxexp was
computed during the preliminary steps, minexp = maxexp − (wq − cq),
and prec = sq + 3, which satisfies the wq ≥ logn + prec + 2 sum_raw
precondition:
wq = cq + sq + dq ≥ logn + 1 + sq + 4 = logn + prec + 2.
If the sum_raw function returns 0, then the exact sum is 0, so that we
just set the MPFR number sum to 0 with the correct sign according to the
IEEE 754 rules (positive, except for MPFR_RNDD, where it is negative),
and return with ternary value = 0.
Now, the accumulator contains the significand of a good approximation
to the non-zero exact sum. The corresponding exponent is e and the sign
is determined from one of the cancelled bits. The exponent of the ulp
for the target precision is u = e − sq. The absolute value of the error
is strictly less than ulp(computed value) × 2^(−3) = 2^(u−3).
Here is a representation of the accumulator and the cancelled bits,
with the two cases depending on the sign of the truncated sum, where
the x's correspond to the sq−1 represented bits following the initial
value bit (1 if positive sum, 0 if negative), r is the rounding bit,
and the bits f are the following bits:
][------------------- accumulator -------------------]
][---- cancel ----]----------------------------------]
]0000000000000000001xxxxxxxxxxxxxxxxxxxxxrffffffffffff0000...
]1111111111111111110xxxxxxxxxxxxxxxxxxxxxrffffffffffff0000...
└─ e u ─┘ minexp ─┘
Note that the iterations in sum_raw could have stopped even in case
of important cancellation: it suffices that the error term be small
enough (err ≤ e − prec), i.e. maxexp2 be much smaller than minexp.
In such a case, the bit r and some of the least significant bits x
may fall outside of the accumulator, in which case they are regarded
as 0's (still due to truncation). In the following, we will make sure
that we do not try to read non-represented bits.
When maxexp (value returned by sum_raw) is different from MPFR_EXP_MIN,
i.e. when some bits of the inputs still not have been considered, we
will need to determine whether the TMD occurs. Let d = u − err, which
is ≥ 3, and can be very large if maxexp is very small; however d is
representable in a mpfr_exp_t since u ≤ emax, err ≥ emin, and this
type can hold the sum of two valid exponents. The TMD occurs when the
sum is close enough to a breakpoint, which is either a machine number
(i.e. a number whose significand fits on sq bits) or a midpoint between
two consecutive machine numbers, depending on the rounding mode:
┌───────────────┬────────────────┐
│ Rounding mode │ Breakpoint │
├───────────────┼────────────────┤
│ to nearest │ midpoint │
│ to nearest │ machine number │
│ directed │ machine number │
└───────────────┴────────────────┘
(in the second case, the correctly rounded sum can be determined, but
not the ternary value, and this is why the TMD occurs). More precisely,
the TMD occurs when:
* directed rounding modes: the d bits following the ulp bit are
identical;
* round-to-nearest mode: the d−1 bits following the rounding bit are
identical.
Several things need to be considered, in arbitrary order:
* the copy to the destination;
* a shift (for the alignment), if the shift count is non-zero
(this is the most probable case);
* a negation/complement if the value is negative (cancelled bits = 1),
since the significand of MPFR numbers uses the conventional
sign + absolute value representation;
* rounding (the TMD needs to be resolved first if it occurs).
Some of these operations will be merged, i.e. done at the same time,
for performance reasons, and this depends on the operations provided
by the mpn layer of GMP. Ideally, all these operations should be
merged together, but this is not possible with GMP.
For practical reasons, the shift should be done before the rounding,
so that all the bits are represented for the rounding. The copy itself
should be done together with the shift or the negation, because this
is where most of the limbs are changed in general. We chose to do it
with the shift as it is assumed that the proportion of non-zero shift
counts is higher than the proportion of negations.
Moreover, when the value is negative, the difference between negation
and complement is similar to the difference between rounding directions
(these operations are identical on the real numbers, i.e. in infinite
precision), so that negation/complement and rounding can naturally be
merged, as detailed later.
We start by doing the first two operations at the same time: the bits
of exponents in ⟦max(u,minexp),e⟦ are copied with a possible shift to
the most significant part of the destination, and the least significant
limbs (if any) are zeroed.
Note: By definition of e, the most significant bit that is copied is
the opposite of the value of the cancelled bits. A variable pos will
contain its value, i.e. pos = 1 for a positive number, pos = 0 for a
negative number.
The values of 3 variables are also determined:
* inex: 0 if the final sum is known to be exact, otherwise 1.
* rbit: the rounding bit (0 or 1) of the truncated sum, corrected
in some halfway cases (see below).
* tmd: three possible values:
0: the TMD does not occur.
1: the TMD occurs on a machine number.
2: the TMD occurs on a midpoint.
All this is done by considering two cases:
* u > minexp. One just has a shift (mpn_lshift) or a copy (MPN_COPY).
There are no least significant limbs to zero. In this case, the
rounding bit, which is represented, is read. Then there are two
subcases:
* maxexp = MPFR_EXP_MIN. The sum in the accumulator is exact.
Thus inex will be the value of rounding bit || sticky bit.
In round to nearest, rbit = 1 will mean that the value is
to be rounded toward +inf; therefore we need to detect and
handle halfway cases now (information on the trailing bits
will not be kept). More precisely we need to determine the
sticky bit if
* rbit = 0, or
* rbit = 1 and rnd = MPFR_RNDN and the least significant bit
of the truncated sq-bit significand is 0. In this case, if
the sticky bit is 0, this halfway value will have to be
rounded toward −infinity; so, rbit is changed to 0.
And tmd is set to 0 because one can round correctly, knowing
the exact sum.
* maxexp ≠ MPFR_EXP_MIN. We do not know whether the final sum is
exact, so that we set inex to 1. We also determine the value of
tmd as briefly described above (the code is quite complex since
we need to take into account the fact that not all the bits are
represented).
* u ≤ minexp. One just has a shift (bit shifted in are 0's, which is
what we want) or a copy, and if there are remaining low significant
limbs in the destination, they are zeroed with MPN_ZERO. We set
rbit to 0 because the rounding bit is not represented, thus its
value is 0. If maxexp = MPFR_EXP_MIN, then both inex and tmd are
set to 0; otherwise they are set to 1 (the bits following the ulp
bit are not represented, thus are all 0's, implying that the TMD
occurs on a machine number).
Now we seek to determine how the value will be rounded, more precisely,
what correction will be done to the significand that have been copied
just above. We currently have a significand, a trailing term t in the
accumulator such that 0 ≤ t < 1 ulp, and an error on the trailing term
bounded by some value t' in absolute value, so that the error e on the
significand satisfies −t' ≤ e < 1 ulp + t'. Thus one has 4 correction
cases, denoted by a corr integer value between −1 and 2, which depends
on e, the sign of the significand, rbit, and the rounding mode:
−1: equivalent to nextbelow;
0: no correction;
+1: equivalent to nextabove.
+2: equivalent to 2 consecutive nextabove.
We also determine the ternary value inex that will be provided as the
second argument of mpfr_check_range.
We distinguish two cases:
* tmd = 0. The TMD does not occur, so that t' has no influence on
the rounding and the ternary value (one can assume t' = 0). In
the directed rounding modes, inex = 0 iff t = 0. Therefore:
* for MPFR_RNDD, corr ← 0;
* for MPFR_RNDU, corr ← inex;
* for MPFR_RNDZ, corr ← inex && !pos;
* for MPFR_RNDA, corr ← inex && pos;
* for MPFR_RNDN, corr ← rbit.
If inex ≠ 0 and corr = 0, we set inex to −1. The variable inex now
contains the ternary value.
* tmd ≠ 0. The TMD occurs and will be solved by determining the sign
(−1, 0 or +1) of a secondary term with a second sum_raw invocation
with a small-precision accumulator. Since the truncated sum in the
accumulator was already copied to the destination, we can reuse the
memory of the accumulator, but its precision is now set to cq + dq
rounded up to the next multiple of the word size (GMP_NUMB_BITS).
There may remain some bits following the truncated sum, but they
will be taken into account as described below.
Let us recall that the d−1 bits from exponent u−2 to u−d (= err)
are identical. We distinguish two subcases:
* err ≥ minexp. We keep the last 2 over the d−1 identical bits and
the following bits, i.e. the bits from err+1 to minexp.
* err < minexp. Here at least one of the identical bits is not
represented, meaning that it is 0 and all these bits are 0's.
Thus the accumulator set to 0. The new minexp is determined
from maxexp, with cq bits reserved to avoid an overflow.
Once the accumulator is set up, sum_raw is called with prec = 1,
satisfying the first sum_raw precondition. And we have:
wq ≥ cq + dq ≥ logn + 3 = logn + prec + 2,
corresponding to the second sum_raw precondition.
The sign of the secondary term (−1, 0, or +1), corrected for the
halfway cases, is stored in a variable sst. In details: If the
value returned by sum_raw (the number of cancelled bits) is not 0,
then the secondary term is not 0, and its sign is obtained from the
most significant bit (MSB) of the accumulator: positive if MSB = 0,
negative if MSB = 1. Otherwise the secondary term is 0, and so is
its sign; however, for the halfway cases (tmd = 2), we want to
eliminate the ambiguity of their rounding due to the even-rounding
rule by choosing a non-zero value for the sign: −1 if the truncated
significand is even, +1 if it is odd.
Then, from the values of the variables rnd (rounding mode), tmd,
rbit (rounding bit), sst (sign of the secondary term, corrected
for the halfway cases), and pos (sign of the sum), we determine:
* the correction case (between −1 and +2);
* the ternary value (negative, zero, or positive).
The different cases are summarized in the table below. The two lines
corresponding to halfway cases are not possible since sst has been
changed to an equivalent non-zero value as said above. The rounding
modes MPFR_RNDZ and MPFR_RNDA are not in this table since they are
handled like MPFR_RNDD and MPFR_RNDU depending on the value of pos
(see the MPFR_IS_LIKE_RNDD and MPFR_IS_LIKE_RNDU macros).
rnd tmd rbit sst correction ternary
───────────────────────────────────────────────
N 1 0 − 0 +
N 1 0 0 0 0
N 1 0 + 0 −
N 1 1 − +1 +
N 1 1 0 +1 0
N 1 1 + +1 −
───────────────────────────────────────────────
N 2 0 − 0 −
N 2 0 0 n/a n/a (halfway case)
N 2 0 + +1 +
N 2 1 − 0 −
N 2 1 0 n/a n/a (halfway case)
N 2 1 + +1 +
───────────────────────────────────────────────
D 1 0 − −1 −
D 1 0 0 0 0
D 1 0 + 0 −
D 1 1 − 0 −
D 1 1 0 +1 0
D 1 1 + +1 −
───────────────────────────────────────────────
U 1 0 − 0 +
U 1 0 0 0 0
U 1 0 + +1 +
U 1 1 − +1 +
U 1 1 0 +1 0
U 1 1 + +2 +
───────────────────────────────────────────────
At this point, the variable inex contains the correct ternary value
(before the overflow/underflow detection) and we know the correction
that needs to be applied to the significand. So, the next step is to
change the representation of the significand, i.e. do a negation or
complement if the sum is negative, and apply the correction at the
same time. We distinguish two cases:
* pos = 1 (positive sum). We first set the sign of the MPFR number
to positive. Since the significand can contain garbage in the
trailing bits (present when the precision is not a multiple of
GMP_NUMB_BITS), we set these trailing bits to 0 as required by
the format of MPFR numbers. If corr > 0, we need to add corr to
the significand (we can see that this works even if corr = 2 and
the significand contains all 1's, which was not obvious). This is
done with mpn_add_1, but corr must be shifted by sd bits to the
left, where sd is the number of trailing bits. If corr = 2 and
sd = GMP_NUMB_BITS − 1, the mathematical result of the shift does
not hold in the variable; in this case, the value 1 is added with
mpn_add_1 starting at the second limb, which necessarily exists,
otherwise this would mean that the precision of the MPFR number
is 1, and this is not possible (MPFR_PREC_MIN = 2). In case of
carry out, meaning a change of binade, the most significant bit
of the significand is set to 1 without touching the other bits
(this is important because if corr = 2 and the significand has
only one limb, the least significant non-trailing bit may be 1),
and e is incremented. If corr < 0, then it is −1, so that we
subtract 1 from the significand with mpn_sub_1. In case of
borrow out, meaning a change of binade, the most significant bit
of the significand is set to 1 so that all the (non-trailing)
bits of the significand are 1's, and e is decremented.
* pos = 0 (negative sum). We first set the sign of the MPFR number
to negative. Contrary to the positive case, where we could do
MPN number + limb (with mpn_add_1) or MPN number − limb (with
mpn_sub_1), GMP does not provide a function to do an operation
like: limb − MPN number. However, we will show that this can be
emulated with mpn_neg, which does a negation, and mpn_com, which
does a complement. This allows us to avoid separate mpn_com (or
mpn_neg) and mpn_add_1 (or mpn_sub_1) operations, as they could
yield two loops in some particular cases involving a long sequence
of 0's in the low significant bits.
Let us focus on the negation and complement operations and what
happens at the bit level. For the complement operation, all the
bits are inverted and there is no dependency between them. The
negation of an integer is equivalent to its complement + 1:
neg(x) = com(x) + 1. Said otherwise, after an initial carry
propagation on the least significant sequence of 1's in com(x),
the bits are just inverted, i.e. one has a complement operation
on the remaining bits. This is why we will regard complement as
the core operation in the following.
Now, we want to compute:
abs(x + corr) = neg(x + corr)
= neg(x) − corr
= com(x) + (1 − corr)
where −1 ≤ 1 − corr ≤ 2. We consider two subcases, leading to a
non-negative case for the correction, and a negative case:
* corr ≤ 1, i.e. 1 − corr ≥ 0.
We first compute the least significant limb by setting the
trailing bits to 1, complementing the limb, and adding the
correction term 1 − corr properly shifted. This can generate
a carry. In the case where corr = −1 (so that 1 − corr = 2)
and the shift count is GMP_NUMB_BITS − 1, the shift of the
correction term overflows, but this is equivalent to have a
correction term equal to 0 and a carry.
* If there is a carry, we apply mpn_neg on the next limbs
(if the significand has more than one limb). If there is
still a carry, i.e. if the significand has exactly one
limb or if there is no borrow out of the mpn_neg, then
we handle the change of binade just like in the pos = 1
case for corr > 0.
* If there is no carry, we apply mpn_com on the next limbs
(if the significand has more than one limb). There cannot
be a change of binade in this case since a complement
cannot have a carry out.
* corr = 2, i.e. 1 − corr = −1.
Here we want to compute com(x) − 1, but GMP does not provide
an operation for that. The fact is that a sequence of low
significant bits 1 is invariant, and we need to do the loop
ourselves in C instead of uses an optimized assembly version
from GMP. However this may not be a problem in practice, as
the difference is probably not noticeable. When a limb with
a zero is reached (there is at least one since the most
significant bit of the significand is a 0), we compute its
complement − 1 (the "− 1" corresponds to a borrow in). If
there are remaining limbs, we complement them and a change
of binade is not possible. Otherwise the complement − 1 on
the most significant limb can lead to a change of binade;
more precisely, this happens on the significand 01111…111,
whose complement is 10000…000 and com(x) − 1 is 01111…111.
The change of binade is handled like in the pos = 1 case for
corr < 0.
Finally, we set the exponent of the MPFR number to e (not with
MPFR_SET_EXP because this exponent may not be in the current exponent
range) and call mpfr_check_range as usual to check whether e is in
the current exponent range and handle a possible overflow/underflow.
────────────────────────────────────────────────────────────────────────
An example:
[1] [2] A [3]
u0 = ***** | | . |
u1 = ******|** | . |
u2 = ********|***** | . |
u3 = | *****| . |
u4 = | | ****|**
u5 = | | ***|*
At iteration 1, minexp is determined to be [1]; thus u0, a part of u1,
and a part of u2 are taken into account for the truncated sum. Then it
appears that an important cancellation occurred, and another iteration
is needed. Since u1 was truncated, the new maxexp will be minexp, i.e.
[1]. At iteration 2, minexp is determined to be [2]; thus a part of u1,
a part of u2, and u3 are taken into account for the truncated sum. Now,
assume that on this example, the error is small enough, but its sign is
unknown: the TMD occurs. Since no numbers were truncated at the previous
iteration, maxexp is the maximum exponent of the remaining numbers, here
the one of u4, and minexp is determined to be [3]. Assume that the sign
of the error can be determined now, so that we can return the rounded
result with the ternary value.
Note 1: Data locality can be improved after the first iteration if the
shifted values are stored at the end of the temporary area instead of
the beginning. The reason is that only the least significant part of
the accumulator will be used once a good approximate value of the sum
is known, and the accumulator lies just after the temporary area. But
the gain would probably not be noticeable in practice.
Note 2: It was considered to determine the number of cancelled limbs
instead of the number of cancelled bits in order to avoid a non-trivial
shift between iterations. Moreover, on platforms with a MMU, thus most
platforms where GNU MPFR would be used, we could arrange to reserve a
specific address space for the accumulator and choose the shift count
to be a multiple of a memory block size (or actually not to do any
shift/copy at all, just free the most significant part and allocate
a new part when needed), so that the shift (copy) of the accumulator
contents could just be MMU related changes, without any physical memory
copy. However, in practice, only large precisions with specific inputs
would benefit from such a choice, which would require non portable code,
and the relative speedup would remain small since the sum itself is
expected to take most of the time.
*** To be considered in future versions ***
It seems that carry propagation (mpn_add_1 & mpn_sub_1 in the code) is
most often limited. But consider the following cases, where all inputs
have the minimal precision 2, and the output precision is p:
u0 = 1
u_i = (−1)^i × 2^(−p) for i > 0
Here long carry propagation will occur for each addition of the initial
iteration, so that the complexity will be O(n·p) instead of O(n+p) if we
choose to delay carry propagation (however such a choice may slower the
average case and take more memory, such as around 3p instead of 2p).
When a new iteration is needed due to cancellation, a second accumulator
was considered in some early version of the algorithm: the temporary
results of the computations during the new iteration would be stored in
this second accumulator, which would generally be small, thus limiting
carry propagation; this method is actually equivalent to delaying carry
propagation. It could help in some cases, such as:
u0 = 2^q with some q > 0
u1 = 1
u2 = −2^q
u_i = (−1)^i × 2^(−p) for i > 2
but such examples are very specific cases, and as seen with the first
example, a better way must be chosen if avoiding long carry propagation
is regarded as important (in all cases). Moreover, while the use of two
accumulators does not take much more memory (since both accumulators can
occupy the same area, with a flexible limit between them), it probably
makes the code more complex, and noticeably slower if n is small.
|