*** empty log message ***

author: tege <tege@gmplib.org> 2000-06-30 10:35:05 +0200
committer: tege <tege@gmplib.org> 2000-06-30 10:35:05 +0200
commit: 8d20777e6911a1c027ec9f3e507ef374573c8c2b (patch)
tree: 7ae533339df0deb57cfb1281a52854a5b41a0c82 /doc/assembly_code
parent: a782c8d5ecfeca95b05c8c24040a2ac24283d531 (diff)
download: gmp-8d20777e6911a1c027ec9f3e507ef374573c8c2b.tar.gz
1 files changed, 57 insertions, 0 deletions
diff --git a/doc/assembly_code b/doc/assembly_code
new file mode 100644
index 000000000..f2b8abb37
--- /dev/null
+++ b/doc/assembly_code
@@ -0,0 +1,57 @@
+Most mpn subdirectories contain machine-dependent code, written in
+assembly or C.  The `generic' subdirectory contains default code, used
+when there is no machine-dependent replacement for a particular
+machine.
+
+There is one subdirectory for each ISA family.  Note that e.g., 32-bit SPARC
+and 64-bit SPARC are very different ISA's, and thus cannot share any code.
+
+A particular compile will only use code from one subdirectory, and the
+`generic' subdirectory.  The ISA-specific subdirectories contain hierachies of
+directories for various architecture variants and implementations; the
+top-most level contains code that runs correctly on all variants.
+
+HOW TO WRITE FAST ASSEMBLY CODE FOR GMP
+
+[This should ultimately be made into a chapter of the GMP manual.]
+
+The most basic techinques are software pipelining and loop unrolling.
+
+Software pipelining is the technique of scheduling instructions around
+the branch point in a loop, so that consecutive iterations overlap.
+It is very much like juggling.
+
+Unrolling is useful when software pipelining does not get us close
+enough to the peek performance of a processor's pipeline.  Unrolling
+decreases the loop overhead, but also often allows a more even load on
+a processor's functional units.
+
+For processors with very few registers, software pipelining is not
+feasible as it increases register pressure.
+
+For superscalar machines, it is often the case that all available
+execution capabilites are not used.  Scheduling some instructions
+for these otherwise unused resources will never cost us anything.
+
+Try to determine the alternative instructions that can be used for a
+particular processor.  For GMP, the problem that presents most
+challenges is rpopagating carry from one iteration to the next.
+Explore the different possibilities for doing that with the available
+instructions!
+
+For wide superscalar processors, the performance might be completely
+determined by the number of dependent instruction requied from
+accepting carry-in from the previous iteration until producing
+carry-out for the next iteration.  This is particularly true for
+simple operations like mpn_add_n and mpn_sub_n.  Some carry
+propagation schemes require 4 instructions, translating to at least
+four cycles per iterations.  Other schemes can propagate carry in two
+cycles or even just one cycle.
+
+Therefore, for wide superscalar processors, finding methods with
+"shallow" carry propagation given an instruction set is often the
+central problem we need to address.  The rest is just is hard coding
+work.
+
+[Describe: First find issue maps with desired performance
+	   Then schedule for latency]
author	tege <tege@gmplib.org>	2000-06-30 10:35:05 +0200
committer	tege <tege@gmplib.org>	2000-06-30 10:35:05 +0200
commit	8d20777e6911a1c027ec9f3e507ef374573c8c2b (patch)
tree	7ae533339df0deb57cfb1281a52854a5b41a0c82 /doc/assembly_code
parent	a782c8d5ecfeca95b05c8c24040a2ac24283d531 (diff)
download	gmp-8d20777e6911a1c027ec9f3e507ef374573c8c2b.tar.gz