GMP Itemized Development Tasks

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with the GNU MP Library; see the file COPYING.LIB. If not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

This file current as of 29 Nov 2001. An up-to-date version is available at http://www.swox.com/gmp/tasks.html. Please send comments about this page to bug-gmp@gnu.org.

These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.

Correctness and Completeness

mpn/generic/get_str.c stores mp_size_t data in several int variables. (It is also really slow. Perhaps just call mpn_divrem_1. See also optimization item below.)
The various reuse.c tests need to force reallocation by calling _mpz_realloc with a small (1 limb) size.
One reuse case is missing from mpX/tests/reuse.c: mpz_XXX(a,a,a).
When printing mpf_t numbers with exponents >2^53 on machines with 64-bit mp_exp_t, the precision of __mp_bases[base].chars_per_bit_exactly is insufficient and mpf_get_str aborts. Detect and compensate. Alternately, think seriously about using some sort of fixed-point integer value. Avoiding unnecessary floating point is probably a good thing in general, and it might be faster on some CPUs.
Make the string reading functions allow the `0x' prefix when the base is explicitly 16. They currently only allow that prefix when the base is unspecified (zero).
mpf_eq is not always correct, when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq.
mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs.
_mpz_realloc will reallocate to less than the current size of an mpz. This is ok internally when a destination size is yet to be updated, but from user code it should probably refuse to go below the size of the current value. Could switch internal uses across to a new function, and tighten up _mpz_realloc.
mpf_set_str doesn't validate it's exponent, for instance garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow of a long is not detected.
mpf_add doesn't check for a carry from truncated portions of the inputs, and in that respect doesn't implement the "infinite precision followed by truncate" specified in the manual.
mpf_div of x/x doesn't always give 1, reported by Peter Moulder. Perhaps it suffices to put +1 on the effective divisor prec, so that data bits rather than zeros are shifted in when normalizing. Would prefer to switch to mpn_tdiv_qr, where all shifting should disappear.
Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global variables with pointers to mpz_add etc, which doesn't work when those routines are coming from a DLL (because they're effectively function pointer global variables themselves). Need to rearrange perhaps to a set of calls to a test function rather than iterating over an array.

Machine Independent Optimization

mpn_gcdext, mpz_get_d, mpf_get_str: Don't test count_leading_zeros for zero, instead check the high bit of the operand and avoid invoking count_leading_zeros. This is an optimization on all machines, and significant on machines with slow count_leading_zeros, though it's possible an already normalized operand might not be encountered very often.
Rewrite umul_ppmm to use floating-point for generating the most significant limb (if BITS_PER_MP_LIMB <= 52 bits). (Peter Montgomery has some ideas on this subject.)
Improve the default umul_ppmm code in longlong.h: Add partial products with fewer operations.
Write new mpn_get_str and mpn_set_str running in the sub O(n^2) range, using some divide-and-conquer approach, preferably without using division.
mpn_get_str should use a native mpn_divrem_1 when available (athlon, p6mmx). A new mpn_preinv_divrem_1 entrypoint would suit, since the inverse and shift are known from mp_bases. Don't do an mpn_lshift in mpn_get_str, leave it up to mpn_divrem_1 to either make that call or do it on-the-fly.
Copy tricky code for converting a limb from development version of mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
Consider inlining mpz_set_ui. This would be both small and fast, especially for compile-time constants, but would make application binaries depend on having 1 limb allocated to an mpz_t, preventing the "lazy" allocation scheme below.
Consider inlining mpz_[cft]div_ui and maybe mpz_[cft]div_r_ui. A __gmp_divide_by_zero would be needed for the divide by zero test, unless that could be left to mpn_mod_1 (not sure currently whether all the risc chips provoke the right exception there if using mul-by-inverse).
Consider inlining: mpz_fits_s*_p. The setups for LONG_MAX etc would need to go into gmp.h, and on Cray it might, unfortunately, be necessary to forcibly include <limits.h> since there's no apparent way to get SHRT_MAX with an expression (since short and unsigned short can be different sizes).
mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
mpz_powm and mpz_powm_ui want better algorithm selection, and the latter should use REDC. Both could change to use an mpn_powm and mpn_redc.
mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
mpz_powm REDC should handle even moduli if possible. Maybe this would mean for m=n*2^k doing mod n using REDC and an auxiliary calculation mod 2^k, then putting them together at the end.
mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
Toom3 USE_MORE_MPN could use a low to high cache localized evaluate and interpolate. The necessary mpn_divexact_by3c exists.
mpn_mul_basecase on NxM with big N but small M could try for better cache locality by taking N piece by piece. The current code could be left available for CPUs without caching. Depending how karatsuba etc is applied to unequal size operands it might be possible to assume M is always smallish.
mpn_perfect_square_p on small operands might be better off skipping the residue tests and just taking a square root.
mpz_perfect_power_p could be improved in a number of ways. Test for Nth power residues modulo small primes like mpn_perfect_square_p does. Use p-adic arithmetic to find possible roots. Divisibility by other primes should be tested by grouping into a limb like PP.
mpz_perfect_power_p might like to use mpn_gcd_1 instead of a private GCD routine. The use it's put to isn't time-critical, and it might help be ensure correctness to use the main GCD routine.
mpz_perfect_power_p could use mpz_divisible_ui_p instead of mpz_tdiv_ui for divisibility testing, the former is faster on a number of systems. (But all that prime test stuff is going to be rewritten some time.)
Change PP/PP_INVERTED into an array of such pairs, listing several hundred primes. Perhaps actually make the products larger than one limb each.
PP can have factors of 2 introduced in order to get the high bit set and therefore a PP_INVERTED existing. The factors of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used, further remainders r%x, r%y, etc, are the same since x, y, etc are odd. The advantage of this is that mpn_preinv_mod_1 can then be used if it's faster than plain mpn_mod_1. This would be a change only for 16-bit limbs, all the rest already have PP in the right form.
PP could have extra factors of 3 or 5 or whatever introduced if they fit, and final remainders mod 9 or 25 or whatever used, thereby making more efficient use of the mpn_mod_1 done. On a 16-bit limb it looks like PP could take an extra factor of 3.
mpz_probab_prime_p, mpn_perfect_square_p and mpz_perfect_power_p could take a remainder mod 2^24-1 to quickly get remainders mod 3, 5, 7, 13 and 17 (factors of 2^24-1). A remainder mod 2^24-1 can be taken at adder throughput speed. This could either replace the PP division currently done, or allow PP to do larger primes, depending how many residue tests seem worthwhile before launching into full root extractions or Miller-Rabin etc.
mpz_probab_prime_p (and maybe others) could code the divisibility tests like n%7 == 0 in the form
```
#define MP_LIMB_DIVISIBLE_7_P(n) \
  ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
```
This would help compilers which don't know how to optimize divisions by constants, and would help current gcc (3.0) too since gcc forms a whole remainder rather than using a modular inverse and comparing. This technique works for any odd modulus, and with some tweaks for even moduli too. See Granlund and Montgomery "Division By Invariant Integers" section 9.
mpz_probab_prime_p and mpz_nextprime could offer certainty for primes up to 2^32 by using a one limb miller-rabin test to base 2, combined with a table of actual strong pseudoprimes in that range (2314 of them). If that table is too big then both base 2 and base 3 tests could be done, leaving a table of 104. The test could use REDC and therefore be a modlimb_invert a remainder (maybe) then two multiplies per bit (successively dependent). Processors with pipelined multipliers could do base 2 and 3 in parallel. Vector systems could do a whole bunch of bases in parallel, and perhaps offer near certainty up to 64-bits (certainty might depend on an exhaustive search of pseudoprimes up to that limit). Obviously 2^32 is not a big number, but an efficient and certain calculation is attractive. It might find other uses internally, and could even be offered as a one limb prime test mpn_probab_prime_1_p or gmp_probab_prime_ui_p perhaps.
mpz_probab_prime_p doesn't need to make a copy of n when the input is negative, it can setup an mpz_t alias, same data pointer but a positive size. With no need to clear before returning, the recursive function call could be dispensed with too.
mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, eg. 0.5 in decimal. These could be stripped to save work in later operations.
mpz_and, mpz_ior and mpz_xor should use mpn_and_n etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
mpf_add: Don't do a copy to avoid overlapping operands unless it's really necessary (currently only sizes are tested, not whether r really is u or v).
mpf_add: Under the check for v having no effect on the result, perhaps test for r==u and do nothing in that case, rather than currently it looks like an MPN_COPY_INCR will be done to reduce prec+1 limbs to prec.
mpn_divrem_2 could usefully accept unnormalized divisors and shift the dividend on-the-fly, since this should cost nothing on superscalar processors and avoid the need for temporary copying in mpn_tdiv_qr.
mpf_sqrt_ui calculates prec+1 limbs, whereas just prec would satisfy the application requested precision. It should suffice to simply reduce the rsize temporary to 2*prec-1 limbs. mpf_sqrt might be similar.
invert_limb generic C: The division could use dividend b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current (b-d):0, where b=2^BITS_PER_MP_LIMB and d=divisor. The former is per the original paper and is used in the x86 code, the advantage is that the current special case for 0x80..00 could be dropped. The two should be equivalent, but a little check of that would be wanted.
mpq_cmp_ui could form the num1*den2 and num2*den1 products limb-by-limb from high to low and look at each step for values differing by more than the possible carry bit from the uncalculated portion.
mpq_cmp could do the same high-to-low progressive multiply and compare. The benefits of karatsuba and higher multiplication algorithms are lost, but if it's assumed only a few high limbs will be needed to determine an order then that's fine.
mpn_add_1, mpn_sub_1, mpn_add, mpn_sub: Internally use __GMPN_ADD_1 etc instead of the functions, so they get inlined on all compilers, not just gcc and others with inline recognised in gmp.h. __GMPN_ADD_1 etc are meant mostly to support application inline mpn_add_1 etc and if they don't come out good for internal uses then special forms can be introduced, for instance many internal uses are in-place. Sometimes a block of code is executed based on the carry-out, rather than using it arithmetically, and those places might want to do their own loops entirely.
__gmp_extract_double on 64-bit systems could use just one bitfield for the mantissa extraction, not two, when endianness permits. Might depend on the compiler allowing long long bit fields when that's the only actual 64-bit type.
mpf_get_d could be more like mpz_get_d and do more in integers and give the float conversion as such a chance to round in its preferred direction. Some code sharing ought to be possible. Or if nothing else then for consistency the two ought to give identical results on integer operands (not clear if this is so right now).
usqr_ppm or some such could do a widening square in the style of umul_ppmm. This would help 68000, and be a small improvement for the generic C (which is used on UltraSPARC/64 for instance). GCC recognises the generic C ul*vh and vl*uh are identical, but does two separate additions to the rest of the result.
tal-notreent.c could keep a block of memory permanently allocated. Currently the last nested TMP_FREE releases all memory, so there's an allocate and free every time a top-level function using TMP is called. Would need mp_set_memory_functions to tell tal-notreent.c to release any cached memory when changing allocation functions though.

Machine Dependent Optimization

Run the `tune' utility for more compiler/CPU combinations. We would like to have gmp-mparam.h files in practically every implementation specific mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system compiler. See the `tune' top-level directory for more information.
```
	#ifdef (__GNUC__)
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 7
	...
	#endif
	#if __GNUC__ == 2 && __GNUC_MINOR__ == 8
	...
	#endif
	#ifndef KARATSUBA_MUL_THRESHOLD
	/* Default GNUC values */
	...
	#endif
	#else /* system compiler */
	...
	#endif	
```
invert_limb on various processors might benefit from the little Newton iteration done for alpha and ia64.
Alpha 21264: Improve feed-in code for mpn_addmul_1 and mpn_submul_1. Integrate new mpn_mul_1 code.
Alpha 21164: Rewrite mpn_addmul_1, mpn_submul_1, and mpn_mul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better in four 16-bit chunks. Probably possible to reach 9 cycles/limb.
Alpha 21264 ev67: Use ctlz and cttz for count_leading_zeros andcount_trailing_zeros. Use inline for gcc, probably want asm files for elsewhere.
Itanium: mpn_divexact_by3 isn't particularly important, but the generic C runs at about 27 c/l, whereas with the multiplies off the dependent chain about 3 c/l ought to be possible.
Itanium: mpn_hamdist could be put together based on the current mpn_popcount.
Itanium: popc_limb in gmp-impl.h could use the popcnt insn.
UltraSPARC/64: Rewrite mpn_mul_1, mpn_addmul_1, and mpn_submul_1. Should use floating-point operations, and split the invariant single-limb multiplier into 16-bit chunks. Will give 14 cycles/limb, but with a deep pipeline.
UltraSPARC/64: Write umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
UltraSPARC/64: Write mpn_sqr_diagonal. Since we're squaring each limb, it seems best to split limbs into one 22-bit chunk (for most significant part) and two 21 bit chunks. Things magically align to 64-bits borders with that splitting. Will need 19 memory operations, and won't therefore run faster than at about 20 cycles/limb. (The current umul_ppmm loop of mpn_sqr_basecase runs at around 60 cycles/limb.) If we use VIS fand for splitting operands, just 13 memory are needed.
UltraSPARC/64: mpn_divrem_1, mpn_mod_1, mpn_divexact_1 and mpn_modexact_1_odd could process 32 bits at a time when the divisor fits 32-bits. This will need only 4 mulx's per limb instead of 8 in the general case.
UltraSPARC/32: Rewrite mpn_lshift, mpn_rshift. Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
UltraSPARC/32: Use mulx for umul_ppmm if possible (see commented out code in longlong.h).
UltraSPARC/32: On Solaris gcc doesn't give us __sparc_v9__ or anything to indicate V9 support when -mcpu=v9 is selected. See gcc/config/sol2-sld-64.h. Will need to pass something through from ./configure to select the right code in longlong.h. (Currently nothing is lost because mulx for multiplying is commented out.)
UltraSPARC: modlimb_invert might save a few cycles from masking down to just the useful bits at each point in the calculation, since mulx speed depends on the highest bit set. Either explicit masks or small types like short and int ought to work.
Sparc64 HAL R1: mpn_popcount and mpn_hamdist could use popc currently commented out in gmp-impl.h. This chip reputedly implements popc properly (see gcc sparc.md), would need to recognise the chip as sparchalr1 or something in configure / config.sub / config.guess.
PA64: Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
PPC630: Rewrite mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
PPC630: mpn_rshift could do the same sort of unrolled loop as mpn_lshift. Some judicious use of m4 might let the two share source code, or with a register to control the loop direction perhaps even share object code.
Implement mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
POWER2/POWER2SC: Schedule mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
X86: Optimize non-MMX mpn_lshift for shifts by 1. (See Pentium code.)
X86: Good authority has it that in the past an inline rep movs would upset GCC register allocation for the whole function. Is this still true in GCC 3? It uses rep movs itself for __builtin_memcpy. Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable, it'd be both smaller and faster.
Pentium P54: mpn_lshift and mpn_rshift can come down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after shrdl and shldl, see mpn/x86/pentium/README.
Pentium P55 MMX: mpn_divexact_1 and mpn_modexact_1_odd on 16-bit divisors could use MMX multiplies and run at around 16 cycles (down from 23).
Pentium P55 MMX: mpn_lshift and mpn_rshift might benefit from some destination prefetching.
PentiumPro: mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
P6: mpn_add_n and mpn_sub_n should be able to go faster than the generic x86 code at 3.5 c/l. The athlon code for instance runs at about 2.7.
K7: mpn_lshift and mpn_rshift might be able to do something branch-free for unaligned startups, and shaving one insn from the loop with alternative indexing might save a cycle.
PPC32: Try using fewer registers in the current mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
PPC32: Write mpn_rshift based on new mpn_lshift.
PPC32: Rewrite mpn_add_n and mpn_sub_n. Should run at just 3.25 cycles/limb.
Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
Fujitsu VPP: Write mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method", to avoid carry propagation. splitting one of the operands in 11-bit chunks.
68k, Pentium: mpn_lshift by 31 should use the special rshift by 1 code, and vice versa mpn_rshift by 31 should use the special lshift by 1. This would be best as a jump across to the other routine, could let both live in lshift.asm and omit rshift.asm on finding mpn_rshift already provided.
Cray T3E: Experiment with optimization options. In particular, -hpipeline3 seems promising. We should at least up -O to -O2 or -O3.
Cray: mpn_com_n and mpn_and_n etc very probably wants a pragma like MPN_COPY_INCR.
Cray vector systems: mpn_lshift, mpn_rshift, mpn_popcount and mpn_hamdist are nice and small and could be inlined to avoid function calls.
Cray: Variable length arrays seem to be faster than the tal-notreent.c scheme. Not sure why, maybe they merely give the compiler more information about aliasing (or the lack thereof). Would like to modify TMP_ALLOC to use them, or introduce a new scheme. Memory blocks wanted unconditionally are easy enough, those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray don't offer an alloca.
68000: mpn_mul_1 could check for a 16-bit multiplier and use two multiplies per limb, not four. Ditto mpn_addmul_1 and mpn_submul_1.
68000: mpn_lshift and mpn_rshift could use a roll and mask instead of lsrl and lsll. This promises to be a speedup, effectively trading a 6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles Meyrignac.

Improve count_leading_zeros for 64-bit machines:

	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ...

IRIX 6 MIPSpro compiler has an __inline which could perhaps be used in __GMP_EXTERN_INLINE. What would be the right way to identify suitable versions of that compiler?
VAX D and G format double floats are straightforward and could perhaps be handled directly in __gmp_extract_double and maybe in mpz_get_d, rather than falling back on the generic code. GCC defines __GFLOAT when -mg has selected G format (which would be possible via a user CFLAGS).

New Functionality

Add in-memory versions of mp?_out_raw and mp?_inp_raw.
mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
Maybe add mpz_crr (Chinese Remainder Reconstruction).
Let `0b' and `0B' mean binary input everywhere.
mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0 to indicate nothing allocated, and let _mpz_realloc do the initial alloc. Set z->_mp_d to a dummy that mpz_get_ui and similar can unconditionally fetch from. Niels Möller has had a go at this.
The advantages of the lazy scheme would be:
- Initial allocate would be the size required for the first value stored, rather than getting 1 limb in mpz_init and then more or less immediately reallocating.
- mpz_init would only store magic values in the mpz_t fields, and could be inlined.
- A fixed initializer could even be used by applications, like mpz_t z = MPZ_INITIALIZER;, which might be convenient for globals.
The advantages of the current scheme are:
- mpz_set_ui and other similar routines needn't check the size allocated and can just store unconditionally.
- mpz_set_ui and perhaps others like mpz_tdiv_r_ui and a prospective mpz_set_ull could be inlined.
Add mpf_out_raw and mpf_inp_raw. Make sure format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines.
Handle numeric exceptions: Call an error handler, and/or set gmp_errno. It should be possible to detect errors at the start of documented entrypoints and invoke an exception handler with a nice clean state, giving it the chance to recover with longjmp or a C++ throw or whatever.
Handle out-of-memory exceptions: It'd be good to offer the custom allocation functions a way to longjmp or throw a C++ exception on being out of memory (or out of the memory they care to offer GMP). This would need GMP to ensure all variables are in a sensible state whenever attempting an alloc or realloc, and would need some way to record other temporary memory or temporary variables in use which should be freed. The latter might be difficult, perhaps the C++ destructor mechanism would be necessary.
mpn_and_n ... mpn_copyd: Perhaps make the mpn logops and copys available in gmp.h, either as library functions or inlines, with the availability of library functions instantiated in the generated gmp.h at build time.
mpz_set_str etc variants taking string lengths rather than null-terminators.
Consider changing the thresholds to apply the simpler algorithm when "<=" rather than "<", so a threshold can be set to MP_SIZE_T_MAX to get only the simpler code (the compiler will know size <= MP_SIZE_T_MAX is always true). Alternately it looks like the ABOVE_THRESHOLD and BELOW_THRESHOLD macros can do this adequately, and also pick up cases where a threshold of zero should mean only the second algorithm.
mpz_nthprime.
Perhaps mpz_init2, initializing and making initial room for N bits. The actual size would be rounded up to a limb, and perhaps an extra limb added since so many mpz routines need that on their destination.
mpz_andn, mpz_iorn, mpz_nand, mpz_nior, mpz_xnor might be useful additions, if they could share code with the current such functions (which should be possible).
mpf_set_str and mpf_inp_str could usefully accept 0x, 0b etc when base==0. Perhaps the exponent could default to decimal in this case, with a further 0x, 0b etc allowed there. Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but probably something like "0.123" ought not mean octal.
GMP_LONG_LONG_LIMB or some such could become a documented feature of gmp.h, so applications could know whether to printf a limb using %lu or %Lu.
PRIdMP_LIMB and similar defines following C99 <inttypes.h> might be of use to applications printing limbs. Perhaps they should be defined only if specifically requested, the way <inttypes.h> does.
mpf_get_ld and mpf_set_ld converting mpf_t to and from long double. Other long double routines would be desirable too, but these would be a start. Often long double is the same as double, which is easy but pretty pointless. Should recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on sparc, hppa and power. Might like an ABI sub-option or something when it's a compiler option for 64-bit or 128-bit long double.

Configuration

Floating-point format: Determine this with a feature test. Get rid of the #ifdef mess in gmp-impl.h. This is simple when doing a native compile, but needs a reliable way to examine object files when cross-compiling. Falling back on a run-time test would be reasonable, if build time tests fail.
Alpha ev67 and ev68: Recognise these in config.guess, the same as recent configfsf.guess does. (Not that we've got anything specific for them right now.)
Alpha: On OSF the nm format is not recognised by GMP_ASM_LSYM_PREFIX. We end up falling back on L which is probably right, but it might be desirable to recognise the format, or try to jam it into BSD mode or whatever.
ARM: umul_ppmm in longlong.h always uses umull, but is that available only for M series chips or some such? Perhaps it should be configured in some way.
HPPA: rename `hppa' => `pa32'.
HPPA: Combine the pa64 and pa64w directories.
HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though perhaps just as a build-time choice. In any case, figure out how to identify a suitable gcc or put it in the right mode, for the GMP compiler choices.
Mips: Rename `mips2' => `mips32' and `mips3' => `mips64'.
Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc. "hinv -c processor" gives lots of information on Irix. Standard config.guess etc append "el" to indicate endianness. GMP probably doesn't care about endianness for normal limb operations, but might if attempting to optimize bytes<->limbs conversions. Perhaps mipsr3000el and mipsr3000eb should be accepted, either that or see if there's a good way to ask the compiler if it knows the endianness.
Mips: IRIX 6.5 "as" doesn't like the .byte used in GMP_ASM_LSYM_PREFIX because it's data being emitted in the text section. It also complains about lacking .ent/.end directives. The fallback L is probably right, but perhaps the test should be improved.
Power: config.guess might be able to use AIX 4 _system_configuration.implementation for more accurate CPU determination.
PowerPC-32: gmp-mparam.h comes out quite different for a 750 than a 604e, it'd be good to select the right one, probably by having CPU types powerpc604, powerpc750 etc.
Sparc: config.guess should say supersparc, microsparc, ultrasparc1, ultrasparc2, etc. "prtconf -vp" gives lots of information about a Solaris system.
Sparc: recognise sparclite and sparclet (which configfsf.sub accepts). These have umul but not udiv, or something like that. Check the mpn/sparc32/v8 code is suitable, and add -mcpu= options for gcc.
Sparc32: floating point or integer udiv should be selected according to the CPU target. Currently floating point ends up being used on all sparcs, which is probably not right for generic V7 and V8.
m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies on system information for 030, 040 and 060. Can they be identified by running some code?
demos/pexpr.c: is becoming the usual *nix mess of nested ifdefs. Instead use the results of configure tests. Depending on config.h doesn't seem like a good idea, since it might encourage applications to use that file. A stand-alone pexpr.c could be generated from a pexpr.in template using AC_SUBSTs setup from configure. That would junk up the Makefiles a bit, but using an extra AM_CONFIG_HEADER type file didn't work very well last time it was tried.
Some CPUs have umul and udiv code not being used. Check all such for bit rot and then put umul and udiv in $gmp_mpn_functions_optional as "standard optional" objects.
In particular Sparc and SparcV8 on non-gcc should benefit from umul.asm enabled; the generic umul is suspected to be making sqr_basecase slower than mul_basecase.
HPPA mpn_umul_ppmm and mpn_udiv_qrnnd have a different parameter order than those functions on other CPUs. It might avoid confusion to have them under different names, maybe mpn_umul_ppmm_r or some such. Prototypes then wouldn't be conditionalized, and the appropriate form could be selected with the HAVE_NATIVE scheme if/when the code switches to use a PROLOGUE style.
DItype: The setup in gmp-impl.h for non-GCC could use an autoconf test to determine whether long long is available.
m88k: Make the assembler code work on non-underscore systems. Conversion to .asm would be desirable. Ought to be easy, but would want to be tested.
z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in mpn/z8000 could be an ABI choice. But this chip is obsolete and nothing is likely to be done unless someone is actively using it.
config.m4 is generated only by the configure script, it won't be regenerated by config.status. Creating it as an AC_OUTPUT would work, but it might upset "make" to have things like L$ get into the Makefiles through AC_SUBST. AC_CONFIG_COMMANDS would be the alternative. With some careful m4 quoting the changequote calls might not be needed, which might free up the order in which things had to be output.

Random Numbers

_gmp_rand is not particularly fast on the linear congruential algorithm and could stand various improvements.
- Make a second seed area within gmp_randstate_t (or _mp_algdata rather) to save some copying.
- Make a special case for a single limb 2exp modulus, to avoid mpn_mul calls. Perhaps the same for two limbs.
- Inline the lc code, to avoid a function call and TMP_ALLOC for every chunk.
- The special case for seedn==0 will be very rarely used, and on that basis seems unnecessary.
- Perhaps the 2exp and general LC cases should be split, for clarity (if the general case is retained).
gmp_randinit_mm (named after Mitchell and Moore) for the 55-element delayed Fibonacci generator from Knuth vol 2. Being additive it should be fast, and might be random enough for GMP test program purposes, if nothing else. Niels Möller has started on this.
gmp_randinit_lc: Finish or remove. Doing a division for every every step won't be very fast, so check whether the usefulness of this algorithm can be justified.
Blum-Blum-Shub: Finish or remove. A separate gmp_randinit_bbs would be wanted, not the currently commented out case in gmp_randinit.
_gmp_rand could be done as a function pointer within gmp_randstate_t (or rather in the _mp_algdata part), instead of switching on a gmp_randalg_t. Likewise gmp_randclear, and perhaps gmp_randseed if it became algorithm-specific. This would be more modular, and would ensure only code for the desired algorithms is dragged into the link.
mpz_urandomm should do something for n<=0 (but what?), and the use of mpz_mod looks doubtful (does it suffice to generate numbers of nbits until getting one <n?)
gmp_randstate_t used for parameters perhaps should become gmp_randstate_ptr the same as other types.
Some of the empirical randomness tests could be included in a "make check". They ought to work everywhere, for a given seed at least.

Miscellaneous

Make mpz_div and mpz_divmod use rounding analogous to mpz_mod. Document, and list as an incompatibility.
mpz_gcdext and mpn_gcdext ought to document what range of values the generated cofactors can take, and preferably ensure the definition uniquely specifies the cofactors for given inputs. A basic extended Euclidean algorithm or multi-step variant leads to |x|<|b| and |y|<|a| or something like that, but there's probably two solutions under just those restrictions.
mpz_invert should call mpn_gcdext directly.
Merge mpn/pa64 and pa64w.
demos/factorize.c should use the GMP random functions when restarting Pollard-Rho, not random / mrand48.
The various test programs use quite a bit of the main libgmp. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where it's not possible some attention could be paid to the order of the tests, so a libgmp routine is only used for tests once it seems to be good.
mpf_set_q is very similar to mpf_div, it'd be good for the two to share code. Perhaps mpf_set_q should make some mpf_t aliases for its numerator and denominator and just call mpf_div. Both would be simplified a good deal by switching to mpn_tdiv_qr perhaps making them small enough not to bother with sharing (especially since mpf_set_q wouldn't need to watch out for overlaps).

Aids to Development

Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check they've been mp?_inited. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else then consistency checks like size<=alloc, ptr not NULL and ptr+size not wrapping around the address space, would be possible.
tune/time.c could try to determine at runtime whether getrusage and gettimeofday are reliable. Currently we pretend in configure that the dodgy m68k netbsd 1.4.1 getrusage doesn't exist. If a test might take a long time to run then perhaps cache the result in a file somewhere.

Documentation

Document conventions, like that unsigned long int is used for bit counts/ranges, and that mp_size_t is used for limb counts.
mpz_inp_str (etc) doesn't say when it stops reading digits.

Bright Ideas

The following may or may not be feasible, and aren't likely to get done in the near future, but are at least worth thinking about.

Reorganize longlong.h so that we can inline the operations even for the system compiler. When there is no such compiler feature, make calls to stub functions. Write such stub functions for as many machines as possible.
longlong.h could declare when it's using, or would like to use, a subroutine version of some macro, and the corresponding .asm file could be included in libgmp only in that case, the same as done for __clz_tab. But this would be a very small space saving, so perhaps not worth the complexity.
longlong.h could be built at configure time by concatenating or #including fragments from each directory in the mpn path. This would select CPU specific macros the same way as CPU specific assembler code. Code used would no longer depend on cpp predefines, and the current nested conditionals could be flattened out.
mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
mpz_*_ui division routines currently return abs(a%b). Perhaps make them return the real remainder instead? Return type would be signed long int. But this would be an incompatible change, so it might have to be under newly named functions.
mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
m68k: configure could accept m68020fp or similar to select 68881 floating point. config.guess could try to detect that too. This would only be to add -m68881 to gcc, there's no GMP asm code using float, so perhaps it's just as easily left to the user to set CFLAGS.
mpq functions could perhaps check for numerator or denominator equal to 1, on the assumption that integers or denominator-only values might be expected to occur reasonably often.
count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
mpf_t might like to let the exponent be undefined when size==0, instead of requiring it 0 as now. It should be possible to do size==0 tests before paying attention to the exponent. The advantage is not needing to set exp in the various places a zero result can arise, which avoids some tedium but is otherwise perhaps not too important. Currently mpz_set_f and mpf_cmp_ui depend on exp==0, maybe elsewhere too.
__gmp_allocate_func: Could use GCC __attribute__ ((malloc)) on this, though don't know if it'd do much. GCC 3.0 allows that attribute on functions, but not function pointers (see info node "Attribute Syntax"), so would need a new autoconf test. This can wait until there's a GCC that supports it.
mpn_add_ui contains two __GMPN_COPYs, one from mpn_add_1 and one from mpn_sub_1. If those two routines were opened up a bit maybe that code could be shared. When a copy needs to be done there's no carry to append for the add, and if the copy is non-empty no high zero for the sub.
An alternative would be to do a copy at the start and then an in-place add or sub. Obviously that duplicates the fetches and stores for carry propagation, but that's normally only one or two limbs. The same applies to mpz_add when one operand is longer than the other, and to mpz_com since it's just -(x+1).
restrict'ed pointers: Does the C99 definition of restrict (one writer many readers, or whatever it is) suit the GMP style "same or separate" function parameters? If so, judicious use might improve the code generated a bit. Do any compilers have their own flavour of restrict as "completely unaliased", and is that still usable?
68000: A 16-bit limb might suit 68000 better than 32-bits, since the native multiply is only 16x16. Could have this as an ABI option, selecting _SHORT_LIMB in gmp.h. Naturally a new set of asm subroutines would be necessary. Would need new mpz_set_ui etc since the current code assumes limb>=long, but 2-limb operand forms would find a use for long long on other processors too.
throw() could be added to the C function prototypes to declare that they don't throw exceptions. This might let the C++ compiler avoid generating frame info in application code (when it knows no called function throws an exception). But this would prevent any future scheme for allowing out-of-memory or divide-by-zero exceptions. Though such things may or may not be feasible, it seems wisest not to close the door on them yet.
Nx1 remainders can be taken at multiplier throughput speed by pre-calculating an array "p[i] = 2^(i*BITS_PER_MP_LIMB) mod m", then for the input limbs x calculating an inner product "sum p[i]*x[i]", and a final 3x1 limb remainder mod m. If those powers take roughly N divide steps to calculate then there'd be an advantage any time the same m is used three or more times. Suggested by Victor Shoup in connection with chinese-remainder style decompositions, but perhaps with other uses.