<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>
GMP Itemized Development Tasks
</title>
</head>
<body bgcolor=lightgreen>
<center>
<h1>
GMP Itemized Development Tasks
</h1>
</center>
<comment>
An up-to-date html version of this file is available at
<a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
</comment>
<p> This file lists itemized GMP development tasks.  Not all the tasks listed
here are suitable for volunteers, but many of them are.  Please see the
file <code>projects</code>, or <a href="projects.html">
http://www.swox.com/gmp/projects.html</a>, for more sizeable projects.
<h4>Correctness and Completeness</h4>
<ul>
<li> The HPUX 10.20 assembler requires a `.LEVEL 1.1' directive before it
     accepts the new instructions.  Unfortunately, HPUX 9 and earlier
     assemblers reject that directive.  How very clever of HP.  We will have
     to pass assembler options, and make sure it works with new and old
     systems and with the GNU assembler.
<li> The various reuse.c tests need to force reallocation by calling
<code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.
<li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit
<code>mp_exp_t</code>, the precision of
<code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
<code>mpf_get_str</code> aborts. Detect and compensate.
<li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
machines that use <code>long long</code> for storing limbs.)
<li> Make the string reading functions allow the `0x' prefix when the base is
explicitly 16. They currently only allow that prefix when the base is
unspecified.
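A rough C sketch of the desired behavior follows; the function name and the exact prefix scan are illustrative, not GMP's actual parser:

```c
/* Illustrative sketch, not GMP's actual parser: determine the effective
   base, skipping a leading "0x"/"0X" both when the base is unspecified
   (0) and when it is explicitly 16.  Advances *sp past any prefix.  */
int
effective_base (const char **sp, int base)
{
  const char *s = *sp;
  if ((base == 0 || base == 16)
      && s[0] == '0' && (s[1] == 'x' || s[1] == 'X'))
    {
      *sp = s + 2;
      return 16;
    }
  if (base == 0 && s[0] == '0' && s[1] != '\0')
    {
      *sp = s + 1;
      return 8;
    }
  return base == 0 ? 10 : base;
}
```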
<li> The probabilistic prime function(s) should never return a different result
     merely because of the state of an independent random number generator.
     Therefore, make them use the new random number functions (see the
     accompanying projects file).
<li> In the development sources, we return abs(a%b) in the
     <code>mpz_*_ui</code> division routines.  Perhaps make them return the
     real remainder instead?  That would change the return type to
     <code>signed long int</code>.
<li> libmp/mout calls strlen with incompatible type.
<li> libmp/mtox returns from incompatible type.
<li> <code>mpf_eq</code> is not always correct when one operand is
     1000000000... and the other is 0111111111..., i.e., when the operands are
     extremely close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpn_divrem</code> cannot throw away the quotient when <code>XSIZE !=
0</code>. Document or fix.
<li> Install Alpha assembly changes (prec/gmp-alpha-patches).
<li> NeXT has problems with newlines in asm strings in longlong.h. Also,
<code>__builtin_constant_p</code> is unavailable? Same problem with MacOS
X.
<li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
mp?/tests/*.c.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
</ul>
<h4>Machine Independent Optimization</h4>
<ul>
<li> In hundreds of places in the code, we invoke count_leading_zeros and then
check if the returned count is zero. Instead check the most significant
bit of the operand, and avoid invoking <code>count_leading_zeros</code> if
the bit is set. This is an optimization on all machines, and significant
on machines with slow <code>count_leading_zeros</code>.
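As a rough C illustration of this check (the limb type and the fallback count below are portable stand-ins, not the longlong.h definitions):

```c
typedef unsigned long mp_limb_t;              /* stand-in limb type */
#define LIMB_BITS ((int) (sizeof (mp_limb_t) * 8))
#define LIMB_HIGHBIT ((mp_limb_t) 1 << (LIMB_BITS - 1))

/* Portable stand-in for count_leading_zeros; x must be nonzero. */
static int
leading_zeros (mp_limb_t x)
{
  int c = 0;
  while ((x & LIMB_HIGHBIT) == 0)
    {
      x <<= 1;
      c++;
    }
  return c;
}

/* The proposed pattern: test the most significant bit first, and only
   fall back to the (possibly slow) count when that bit is clear.  */
int
norm_shift (mp_limb_t x)
{
  if (x & LIMB_HIGHBIT)
    return 0;
  return leading_zeros (x);
}
```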
<li> In a couple of places <code>count_trailing_zeros</code> is used
on more or less uniformly distributed numbers. For some CPUs
<code>count_trailing_zeros</code> is slow and it's probably worth
handling the frequently occurring 0 to 2 trailing zeros cases specially.
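One way to sketch that special-casing in C (the fallback loop stands in for the machine-specific count):

```c
/* Illustrative sketch: special-case 0, 1, and 2 trailing zeros, which
   cover 7/8 of uniformly distributed inputs, before falling back to a
   full count.  x must be nonzero.  */
static int
trailing_zeros_fallback (unsigned long x)
{
  int c = 0;
  while ((x & 1) == 0)
    {
      x >>= 1;
      c++;
    }
  return c;
}

int
trailing_zeros (unsigned long x)
{
  if (x & 1)
    return 0;
  if (x & 2)
    return 1;
  if (x & 4)
    return 2;
  return trailing_zeros_fallback (x);
}
```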
<li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
instead use <code>mpn_invert_limb</code>.
<li> Reorganize longlong.h so that we can inline the operations even for the
system compiler. When there is no such compiler feature, make calls to
stub functions. Write such stub functions for as many machines as
possible.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
most significant limb (if <code>BITS_PER_MP_LIMB</code> <= 52 bits).
(Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
     sub-O(n^2) time, using a divide-and-conquer approach, preferably
     without using division.
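The divide-and-conquer structure can be sketched on machine words (the real functions operate on limb arrays; this toy version only shows the recursion):

```c
#include <stddef.h>

/* Toy divide-and-conquer digit-string conversion on machine words.
   A real sub-O(n^2) mpn_set_str would perform the same split on limb
   arrays, with the multiply delegated to a fast mpn_mul.  */
static unsigned long
set_str_dc (const char *s, size_t n, int base)
{
  if (n <= 4)
    {
      /* Basecase: simple left-to-right accumulation. */
      unsigned long v = 0;
      for (size_t i = 0; i < n; i++)
        v = v * (unsigned long) base + (unsigned long) (s[i] - '0');
      return v;
    }
  size_t lo = n / 2;
  size_t hi = n - lo;
  unsigned long pow = 1;
  for (size_t i = 0; i < lo; i++)   /* base^lo; cached in practice */
    pow *= (unsigned long) base;
  /* value = high_digits * base^lo + low_digits */
  return set_str_dc (s, hi, base) * pow + set_str_dc (s + hi, lo, base);
}
```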
<li> Copy tricky code for converting a limb from development version of
<code>mpn_get_str</code> to mpf/get_str. (Talk to Torbjörn about this.)
<li> Consider inlining these functions: <code>mpz_size</code>,
<code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
<code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
<code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
<code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
<code>mpz_popcount</code>, <code>mpf_size</code>,
<code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
<code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
<code>mpf_clear</code>, <code>mpf_set_si</code>.
</ul>
<h4>Machine Dependent Optimization</h4>
<ul>
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21264. On 21264, they should run at 4, 3,
and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
Torbjörn for his xm.s and xam.s skeleton files.)
<li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code> for the 21164. This should use both integer
multiplies and floating-point multiplies. For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
chunks.
<li> UltraSPARC: Rewrite 64-bit <code>mpn_add_n</code> and
<code>mpn_sub_n</code>. The current sparc64 code uses <code>MOVcc</code>
instructions, which take about 6 cycles on UltraSPARC. The correct
approach is probably to use conditional branching. That should lead to
loops that run at 4 cycles/limb. (Torbjörn has code that just needs to be
finished.)
<li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,
<code>mpn_submul_1</code>, and <code>mpn_mul_1</code>. Should use
floating-point operations, and split the invariant single-limb multiplier
into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
will become very deep. (Torbjörn has C code that is useful as a starting
point.)
<li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.
Should give 2 cycles/limb. (Torbjörn has code that just needs to be
finished.)
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code>. The current development code runs at 11
cycles/limb, which is already very good. But it should be possible to
saturate the cache, which will happen at 7.5 cycles/limb.
<li> UltraSPARC: Write <code>umul_ppmm</code>. Important in particular for
<code>mpn_sqr_basecase</code>.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
for important machines.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1. (See Pentium code.)
<li> Alpha: Optimize <code>count_leading_zeros</code>.
<li> Alpha: Optimize <code>udiv_qrnnd</code>. (Ask Torbjörn for the file
test-udiv-preinv.c as a starting point.)
<li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.
     They should require just 3 cycles/limb, but the current code propagates
     carry poorly.  The trick is to add the carry-in later than we do now,
     decreasing the number of operations used to generate the carry-out from
     4 to 3.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
is unused. (Ask Torbjörn for a copy of the current code.)
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>. Should
run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method" to avoid carry propagation, splitting one of the operands into
     11-bit chunks.
<li> Cray: Vectorize main functions, perhaps in assembly language.
<li> Cray: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>. Same comment applies to this as to the
same functions for Fujitsu VPP.
<li> Integrate Paul Zimmermann's FFT multiply code. (If you want to work on
this, ask Torbjörn for the code.)
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
<pre>
if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/2)) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2; }
if ((x >> (W_TYPE_SIZE - W_TYPE_SIZE/4)) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4; }
... </pre>
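Filled out for a 64-bit word, that binary search might look like this (the name clz64 is illustrative):

```c
#include <stdint.h>

#define W_TYPE_SIZE 64

/* Binary-search count of leading zeros; x must be nonzero. */
int
clz64 (uint64_t x)
{
  int cnt = 0;
  for (int w = W_TYPE_SIZE / 2; w > 0; w /= 2)
    {
      /* If the top w bits are all zero, shift them out and count them. */
      if ((x >> (W_TYPE_SIZE - w)) == 0)
        {
          x <<= w;
          cnt += w;
        }
    }
  return cnt;
}
```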
</ul>
<h4>New Functionality</h4>
<ul>
<li> <code>mpz_get_nth_ui</code>. Return the nth word (not necessarily the nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
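For two coprime moduli, the reconstruction could look like this machine-word sketch (an actual mpz_crr would use mpz_invert and friends on mpz_t operands; overflow is ignored here):

```c
/* Modular inverse via extended Euclid; assumes gcd(a,m) == 1, m > 1. */
static long
inv_mod (long a, long m)
{
  long t = 0, newt = 1, r = m, newr = a % m;
  while (newr != 0)
    {
      long q = r / newr;
      long tmp = t - q * newt;  t = newt;  newt = tmp;
      tmp = r - q * newr;  r = newr;  newr = tmp;
    }
  return t < 0 ? t + m : t;
}

/* Chinese Remainder Reconstruction for two coprime moduli:
   returns x with x == r1 (mod m1) and x == r2 (mod m2). */
long
crr (long r1, long m1, long r2, long m2)
{
  long d = ((r2 - r1) % m2 + m2) % m2;
  return r1 + m1 * ((d * inv_mod (m1, m2)) % m2);
}
```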
<li> Let `0b' and `0B' mean binary input everywhere.
<li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>
(cf. <code>mpq_set_d</code>).
<li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy
allocation. Set <code>ALLOC(var)</code> to 0, and have
<code>mpz_realloc</code> special-handle that case. Update functions that
rely on a single limb (like <code>mpz_set_ui</code>,
<code>mpz_[tfc]div_r_ui</code>, and others).
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>. Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
<li> Handle numeric exceptions: Call an error handler, and/or set
<code>gmp_errno</code>.
<li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and
<code>gmp_snprintf</code>.
<li> Implement some <code>mpq</code> input and output functions.
<li> Implement mixed precision Kronecker/Jacobi symbol, extend the
normal <code>mpz_jacobi</code> to a full range of inputs.
Kevin Ryde is working on this.
</ul>
<h4>Miscellaneous</h4>
<ul>
<li> Work on the way we build the library.  We currently build
     convenience libraries, but then list all the object files a
     second time in the top-level Makefile.am.
<li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to
     mp[zq]/add.c that decides whether it adds or subtracts.  This will
     decrease redundancy.  The same applies in many other places.
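The intended pattern is roughly the following (OPERATION_SUB and the function names are illustrative; the real sources would key on a similar compile-time define):

```c
/* One source file compiled twice: plain compilation yields add(),
   while compiling with -DOPERATION_SUB yields subtract() instead.  */
#ifdef OPERATION_SUB
#define FUNCTION  subtract
#define OPERATOR  -
#else
#define FUNCTION  add
#define OPERATOR  +
#endif

long
FUNCTION (long a, long b)
{
  return a OPERATOR b;
}
```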
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
analogous to <code>mpz_mod</code>. Document, and list as an
incompatibility.
<li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new
mpn/generic/pow_ui.
<li> Make mpz_invert call mpn_gcdext directly.
<li> Make an option for stack-alloc.c to call <code>malloc</code>
separately for each <code>TMP_ALLOC</code> block, so a redzoning
malloc debugger could be used during development.
</ul>
<h4>Documentation</h4>
<ul>
<li> Document conventions, such as that <code>unsigned long int</code> is used
     for bit counts/ranges and <code>mp_size_t</code> for limb counts.
<li> The documentation for <code>mpz_inp_str</code> (etc.) doesn't say when it
     stops reading digits.
</ul>
<hr>
<table width="100%">
<tr>
<td>
<font size=2>
Please send comments about this page to
<a href="mailto:tege@swox.com">tege@swox.com</a>.<br>
Copyright (C) 1999 Torbjörn Granlund.
</font>
</td>
<td align=right>
</td>
</tr>
</table>
</body>
</html>