From 547f67ec985fe829e11ecac4957025acad433ae0 Mon Sep 17 00:00:00 2001 From: Loic Dachary Date: Wed, 14 Jan 2015 16:15:27 +0100 Subject: manual: convert from PDF to HTML It makes it easier to update. Signed-off-by: Loic Dachary --- Manual.pdf | Bin 501334 -> 0 bytes manual/gf-complete.html | 3484 +++++++++++++++++++++++++++++++++++++++++++++++ manual/image1.png | Bin 0 -> 32752 bytes manual/image2.png | Bin 0 -> 28317 bytes manual/image3.png | Bin 0 -> 42222 bytes manual/image4.png | Bin 0 -> 26886 bytes manual/image5.png | Bin 0 -> 51457 bytes manual/image6.png | Bin 0 -> 47306 bytes manual/image7.png | Bin 0 -> 21312 bytes manual/style.css | 404 ++++++ 10 files changed, 3888 insertions(+) delete mode 100644 Manual.pdf create mode 100644 manual/gf-complete.html create mode 100644 manual/image1.png create mode 100644 manual/image2.png create mode 100644 manual/image3.png create mode 100644 manual/image4.png create mode 100644 manual/image5.png create mode 100644 manual/image6.png create mode 100644 manual/image7.png create mode 100644 manual/style.css diff --git a/Manual.pdf b/Manual.pdf deleted file mode 100644 index 59968bb..0000000 Binary files a/Manual.pdf and /dev/null differ diff --git a/manual/gf-complete.html b/manual/gf-complete.html new file mode 100644 index 0000000..86de277 --- /dev/null +++ b/manual/gf-complete.html @@ -0,0 +1,3484 @@ + + + + + + + + + + +
+ +

+GF-Complete: A Comprehensive Open Source Library for Galois
+Field Arithmetic +

+ +

Version 1.02

+ +

James S. Plank*        Ethan L. Miller +Kevin M. Greenan        Benjamin A. Arnold
+John A. Burnum        Adam W. Disney        +Allen C. McBride + +


+ + + + + +https://bitbucket.org/jimplank/gf-complete + +

+ +http://web.eecs.utk.edu/~plank/plank/papers/GF-Complete-Manual-1.02.pdf + + +

+ + + + + + + +
+ + +
+ +This is a user's manual for GF-Complete, version 1.02. This release supersedes version 0.1 and represents the first +major release of GF-Complete. To our knowledge, this library implements every Galois Field multiplication technique +applicable to erasure coding for storage, which is why we named it GF-Complete. The primary goal of this library is +to allow storage system researchers and implementors to utilize very fast Galois Field arithmetic for Reed-Solomon +coding and the like in their storage installations. The secondary goal is to allow those who want to explore different +ways to perform Galois Field arithmetic to be able to do so effectively. + + +

+If you wish to cite GF-Complete, please cite technical report UT-CS-13-716: [PMG+13]. + +

+ + +

If You Use This Library or Document

+ + + +Please send me an email to let me know how it goes. Or send me an email just to let me know you are using the +library. One of the ways in which we are evaluated both internally and externally is by the impact of our work, and if +you have found this library and/or this document useful, we would like to be able to document it. Please send mail to +plank@cs.utk.edu. Please send bug reports to that address as well. + + + +

+The library itself is protected by the New BSD License. It is free to use and modify within the bounds of this +license. To the authors' knowledge, none of the techniques implemented in this library have been patented, and the +authors are not pursing patents.


+ +
+ + +Finding the Code +

+This code is actively maintained on bitbucket: https://bitbucket.org/jimplank/gf-complete. There are +previous versions on my UTK site as a technical report; however, that it too hard to maintain, so the main version is +on bitbucket.

+ + +Two Related Papers

+ +This software acccompanies a large paper that describes these implementation techniques in detail [PGM13a]. We +will refer to this as "The Paper." You do not have to read The Paper to use the software. However, if you want to +start exploring the various implementations, then The Paper is where you'll want to go to learn about the techniques +in detail. + + + +

This library implements the techniques described in the paper "Screaming Fast Galois Field Arithmetic Using Intel +SIMD Instructions," [PGM13b]. The Paper describes all of those techniques as well. +



+ +If You Would Like HelpWith the Software

+ +Please contact the first author of this manual.

+ +Changes from Revision 1.01 +

+The major change is that we are using autoconf to aid with compilation, thus obviating the need for the old flag_tester +code. Additionally, we have added a quick timing tool, and we have modified gf_methods so that it may be used to +run the timing tool and the unit tester. + + + + + + + + + + + + + + + + + + +
+CONTENT 3 +

Contents

+
+1 Introduction 5 +

+2 Files in the Library 6
+ +
+2.1 Header files in the directory "include" . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
+2.2 Source files in the "src" directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
+2.3 Library tools files in the "tools" directory . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
+2.4 The unit tester in the "test" directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
+2.5 Example programs in the "examples" directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 + +
+
+
+ +3 Compilation 8

+4 Some Tools and Examples to Get You Started 8

+ + + +
+4.1 Three Simple Command Line Tools: gf mult, gf div and gf add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
+4.2 Quick Starting Example #1: Simple multiplication and division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
+4.3 Quick Starting Example #2: Multiplying a region by a constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
+4.4 Quick Starting Example #3: Using w = 64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
+4.5 Quick Starting Example #4: Using w = 128. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 +
+
+ + +
+5 Important Information on Alignment when Multiplying Regions 12

+ +6 The Defaults 13
+ +
+ +
+6.1 Changing the Defaults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
+ + +
    +
  • 6.1.1 Changing the Components of a Galois Field with create_gf_from_argv() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
    +
  • +
  • +6.1.2 Changing the Polynomial. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
    +
  • +
  • +6.1.3 Changing the Multiplication Technique. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 +
  • + + +
  • +6.1.4 Changing the Division Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 +
  • + + +
  • +6.1.5 Changing the Region Technique. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 19 +
  • +
+6.2 Determining Supported Techniques with gf_methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
+ +6.3 Testing with gf_unit, gf_time, and time_tool.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 + +
    +
  • +6.3.1 time_tool.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 22 +
  • + +
  • +6.3.2 An example of gf_methods and time_tool.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . .. . . 23 +
  • + +
+ +6.4 Calling gf_init_hard() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . . . . 24
+ +6.5 gf_size() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . . . .. . 26

+
+ + +
+8 Further Information on Options and Algorithms 26


+
+7.1 Inlining Single Multiplication and Division for Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
+7.2 Using different techniques for single and region multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
+7.3 General w . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
+ +7.4 Arguments to "SPLIT" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
+7.5 Arguments to "GROUP" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
+7.6 Considerations with "COMPOSITE" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
+7.7 "CARRY FREE" and the Primitive Polynomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
+7.8 More on Primitive Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . 31
+ + +
    +
  • +7.8.1 Primitive Polynomials that are not Primitive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
    + +
  • +
  • 7.8.2 Default Polynomials for Composite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    + +
  • +
+ +
+ + + + + + + + + + + +
+CONTENT 4 + +
+
    +
  • 7.8.3 The Program gf_poly for Verifying Irreducibility of Polynomials 33 +
  • +
+ + +7.9"ALTMAP" considerations and extract_word() 34 +
    +
  • + +7.9.1 Alternate mappings with "SPLIT" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    +
  • +
  • +7.9.2 Alternate mappings with "COMPOSITE" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
    +
  • +
  • +7.9.3 The mapping of "CAUCHY" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . .. . . . . . . . . . . .. 37
    +
  • +
+
+ + +8 Thread Safety 37

+ +9 Listing of Procedures 37

+ +10 Troubleshooting 38

+11 Timings 41

+ +
+11.1 Multiply() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . . . . . . . . . .. . . . 42
+11.2 Divide() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . . . . .. . . . . 42
+11.3 Multiply Region() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .. . . . . . . . .. . . . . . . . . . . . . . . . . . 43
+
+ + + + + + +
+INTRODUCTION 5 + + +

1 Introduction

+ +Galois Field arithmetic forms the backbone of erasure-coded storage systems, most famously the Reed-Solomon +erasure code. A Galois Field is defined over w-bit words and is termed GF(2w). As such, the elements of a Galois +Field are the integers 0, 1, . . ., 2w - 1. Galois Field arithmetic defines addition and multiplication over these closed +sets of integers in such a way that they work as you would hope they would work. Specifically, every number has a +unique multiplicative inverse. Moreover, there is a value, typically the value 2, which has the property that you can +enumerate all of the non-zero elements of the field by taking that value to successively higher powers. + + +

Addition in a Galois Field is equal to the bitwise exclusive-or operation. That's nice and convenient. Multiplication +is a little more complex, and there are many, many ways to implement it. The Paper describes them all, and the +following references providemore supporting material: [Anv09, GMS08, LHy08, LD00, LBOX12, Pla97]. The intent +of this library is to implement all of the techniques. That way, their performancemay be compared, and their tradeoffs +may be analyzed.

+ + + + +

    + +When used for erasure codes, there are typically five important operations:
    +
  1. Adding two numbers in GF(2w). That's bitwise exclusive-or.
  2. +
  3. Multiplying two numbers in GF(2w). Erasure codes are usually based on matrices in GF(2w), and constructing +these matrices requires both addition and multiplication.
  4. +
  5. Dividing two numbers in GF(2w). Sometimes you need to divide to construct matrices (for example, Cauchy +Reed-Solomon codes [BKK+95, Rab89]). More often, though, you use division to invert matrices for decoding. +Sometimes it is easier to find a number's inverse than it is to divide. In that case, you can divide by multiplying +by an inverse.
  6. + +
  7. adding two regions of numbers in GF(2w), which will be explained along with...
  8. +
  9. Mutiplying a region of numbers in GF(2w) by a constant in GF(2w). Erasure coding typically boils down +to performing dot products in GF(2w). For example, you may define a coding disk using the equation:

  10. + + + + +
    c0= d0 + 2d1 + 4d2 + 8d3.

    + +That looks like three multiplications and three additions However, the way ' implemented in a disk system +looks as in Figure 1. Large regions of disks are partitioned into w-bit words in GF(2w). In the example, let us +suppose that w = 8, and therefore that words are bytes. Then the regions pictured are 1 KB from each disk. +The bytes on disk Di are labeled di,0, di,1, . . . , di,1023, and the equation above is replicated 1024 times. For +0 ≤ j < 1024: +

    +
    c0,j = d0,j + 2d1,j + 4d2,j + 8d3,j .
    +
    + + +While it's possible to implement each of these 1024 equations independently, using the single multiplication +and addition operations above, it is often much more efficient to aggregate. For example, most computer architectures +support bitwise exclusive-or of 64 and 128 bit words. Thus, it makes much more sense to add regions +of numbers in 64 or 128 bit chunks rather than as words in GF(2w). Multiplying a region by a constant can +leverage similar optimizations.
+ + +

GF-Complete supports multiplication and division of single values for all values of w ≤ 32, plus w = 64 and w = +128. It also supports adding two regions of memory (for any value of w, since addition equals XOR), and multiplying +a region by a constant in GF(24), GF(28), GF(216), GF(232), GF(264) and GF(2128). These values are chosen +because words in GF(2w) fit into machine words with these values of w. Other values of w don't lend themselves +to efficient multiplication of regions by constants (although see the "CAUCHY" option in section 6.1.5 for a way to +multiply regions for other values of w).

+ + + + + + +
+ +2     FILES IN THE LIBRARY 6


+ + + +



+ +Figure 1: An example of adding two regions of numbers, and multiplying a region of numbers by a constant +in GF(2w) . In this example, w = 8, and each disk is holding a 1KB region. The same coding equation - +c0,j = d0,j + ad1,j + a2d2,j + a3d3,j is applied 1024 times. However, rather than executing this equation 1024 +times, it is more efficient to implement this with three region-constant multiplications and three region-region additions. + +

2     Files in the Library

+This section provides an overview of the files that compose GF-Complete. They are partitioned among multiple +directories. + +

2.1     Header files in the directory "include"

+ +The following header files are part of GF-Complete. + + + + + + + + +
+ +2     FILES IN THE LIBRARY 7


+ + +

2.2     Source files in the "src" directory"

+ + +

2.3     Library tools files in the "tools" directory

+ + + + + + + + + + +
+ +3     COMPILATION 8


+ + +

2.4     The unit tester in the "test" directory

+ +The test directory contains the proram gf_unit.c, which performs a battery of unit tests on GF-Complete. This is +explained in more detail in section 6.3. + + +

2.5    Example programs in the "examples" directory

+ +There are seven example programs to help you understand various facets of GF-Complete. They are in the files +gf_example x.c in the examples directory. They are explained in sections 4.2 through 4.5, and section 7.9.

+ +

3     Compilation

+ +From revision 1.02 forward, we are using autoconf. The old "flag tester" directory is now gone, as it is no longer in +use.

+To compile and install, you should do the standard operations that you do with most open source Unix code:

+ +UNIX> ./configure
+...
+UNIX> make
+...
+UNIX> sudo make install

+ + +

If you perform the install, then the header, source, tool, and library files will be moved to system locations. In +particular, you may then compile the library by linking with the flag -lgf_complete, and you may use the tools from a +global executable directory (like /usr/local/bin).

+ +

+If you don't perform the install, then the header and tool files will be in their respective directories, and the library +will be in src/libgf_complete.la.

+

+If your system supports the various Intel SIMD instructions, the compiler will find them, and GF-Complete will +use them by default.

+ + + +

4     Some Tools and Examples to Get You Started

+

4.1 Three Simple Command Line Tools: gf_mult, gf_div and gf_add

+ + +Before delving into the library, it may be helpful to explore Galois Field arithmetic with the command line tools: +gf_mult, gf_div and gf_add. These perform multiplication, division and addition on elements in GF(2w). If these are +not installed on your system, then you may find them in the tools directory. Their syntax is: + + + + + + + +
+ +6     THE DEFAULTS 14


+ + + +

Table 1 shows the default methods used for each power-of-two word size, their alignment parameters s and t, their +memory consumption and their rough performance. The performance tests are on an Intel Core i7-3770 running at +3.40 GHz, and are included solely to give a flavor of performance on a standard microprocessor. Some processors +will be faster with some techniques and others will be slower, so we only put numbers in so that you can ballpark it. +For other values of w between 1 and 31, we use table lookup when w ≤ 8, discrete logarithms when w ≤ 16 and +"Bytwop" for w ≤ 32.

+

+
With SSE +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
w Memory
Usage
multiply()
Implementation
Performance
(Mega Ops / s)
multiply region()
Implementation
s t Performance
(MB/s)
4 <1K Table501Table16 16 11,659
8 136K Table501Split Table (8,4)16 16 11,824
16 896K Log260Split Table (16,4)32 16 7,749
32 <1K Carry-Free48Split Table (32,4)64 16 5,011
64 2K Carry-Free84Split Table (64,4)128 16 2,402
128 64K Carry-Free48Split Table (128,4)16 16 833
+ + +
+
Without SE
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +
w Memory
Usage
multiply()
Implementation
Performance
(Mega Ops / s)
multiply region()
Implementation
s t Performance
(MB/s)
4 4K Table501Double Table16 16 11,659
8 128K Table501Table1 1 1,397
16 896K Log266Split Table (16,8)32 16 2,135
32 4K Bytwop19Split Table (32,4)4 4 1,149
64 16K Bytwop9Split Table (64,4)8 8 987
128 64K Bytwop1.4Split Table (128,4)16 8 833
+
+
+

+Table 1: The default implementations, memory consumption and rough performance when w is a power of two. The +variables s and t are alignment variables described in Section 5. +

+A few comments on Table 1 are in order. First, with SSE, the performance of multiply() is faster when w = 64 +than when w = 32. That is because the primitive polynomial for w = 32, that has historically been used in Galois +Field implementations, is sub-ideal for using carry-free multiplication (PCLMUL). You can change this polynomial +(see section 7.7) so that the performance matches w = 64.

+

+The region operations for w = 4 and w = 8 without SSE have been selected to have a low memory footprint. There +are better options that consume more memory, or that only work on large memory regions (see section 6.1.5). +

+ +There are times that you may want to stray from the defaults. For example: + + + + + + + + + + + +
+ +6     THE DEFAULTS 15


+ + + + +

+Our command line tools allow you to deviate from the defaults, and we have two C functions -gf_init_hard() +and create_gf_from_argv() that can be called from application code to override the default methods. There are six +command-line tools that can be used to explore the many techniques implemented in GF-Complete:

+ + + + +

To change the default behavior in application code, you need to call gf_init_hard() rather than gf_init_easy(). +Alternatively, you can use create_g_from_argv(), included from gf_method.h, which uses an argv-style array of +strings to specify the options that you want. The procedure in gf_method.c parses the array and makes the proper +gf_init_hard() procedure call. This is the technique used to parse the command line in gf_mult, gf_div, gf_unit et al.

+ + +

6.1.1 Changing the Components of a Galois Field with create gf_from_argv()

+There are five main components to every Galois Field instance: + + +

The procedures gf_init_hard() and create_gf_from_argv() allow you to specify these parameters when you create +your Galois Field instance. We focus first on create_gf_from_argv(), because that is how the tools allow you to specify +the components. The prototype of create_gf_from_argv() is as follows:


+ +
+int create_gf_from_argv(gf_t *gf, int w, int argc, char **argv, int starting);

+ +You pass it a pointer to a gf_t, which it will initialize. You specify the word size with the parameter w, and then you +pass it an argc/argv pair as in any C or C++ program. You also specify a starting argument, which is where in argv +the specifications begin. If it successfully parses argc and argv, then it creates the gf_t using gf_init_hard() (described +below in section 6.4). It returns one past the last index of argv that it considered when creating the gf_t. If it fails, then +it returns zero, and the gf_t is unmodified. + + + +

For example, gf_mult.c calls create gf_from_argv() by simply passing argc and argv from its main() declaration, +and setting starting to 4.

+ + + + + + + + +
+ +6     THE DEFAULTS 16


+ +

+To choose defaults, argv[starting] should equal "-". Otherwise, you specify the component that you are changing +with "-m" for multiplication technique, "-d" for division technique, "-r" for region technique, and "-p" for the +polynomial. You may change multiple components. You end your specification with a single dash. For example, the +following call multiplies 6 and 5 in GF(24) with polynomial 0x19 using the "SHIFT" technique for multiplication +(we'll explain these parameters later): +



+ +
+UNIX> ./gf_mult 6 5 4 -p 0x19 -m SHIFT -
+7
+UNIX>

+
+ +

If create_gf_from_argv() fails, then you can call the procedure gf_error(), which prints out the reason why create_ +gf_from_argv() failed.

+ + +

6.1.2 Changing the Polynomial

+ +Galois Fields are typically implemented by representing numbers as polynomials with binary coefficients, and then +using the properties of polynomials to define addition and multiplication. You do not need to understand any of that to +use this library. However, if you want to learn more about polynomial representations and how they construct fields, +please refer to The Paper. + +

Multiplication is based on a special polynomial that we will refer to here as the "defining polynomial." This +polynomial has binary coefficients and is of degree w. You may change the polynomial with "-p" and then a number +in hexadecimal (the leading "0x" is optional). It is assumed that the w-th bit of the polynomial is set - you may include +it or omit it. For example, if you wish to set the polynomial for GF(216) to x16 + x5 + x3 + x2 + 1, rather than its +default of x16 + x12 + x3 + x + 1, you may say "-p 0x1002d," "-p 1002d," "-p 0x2d" or "-p 2d." +We discuss changing the polynomial for three reasons in other sections:

+ + +

+Some words about nomenclature with respect to the polynomial. A Galois Field requires the polynomial to be +irreducible .. That means that it cannot be factored. For example, when the coefficients are binary, the polynomial x5+ +x4+x+1 may be factored as (x4+1)(x+1). Therefore it is not irreducible and cannot be used to define a Galois Field. +It may, however, be used to define a ring. Please see section 7.8.1 for a discussion of ring support in GF-Complete.

+

+There is a subset of irreducible polynomials called primitive. These have an important property that one may enumerate +all of the elements of the field by raising 2 to successive posers. All of the default polynomials in GF-Complete +are primitive. However, so long as a polynomial is irreducible, it defines a Galois Field. Please see section 7.7 for a +further discussion of the polynomial.

+ +

+One thing that we want to stress here is that changing the polynomial changes the field, so fields with different +polynomialsmay not be used interchangeably. So long as the polynomial is irreducible, it generates a Galois Field that +is isomorphic to all other Galois Fields; however the multiplication and division of elements will differ. For example, +the polynomials 0x13 (the default) and 0x19 in GF(24) are both irreducible, so both generate valid Galois Fields. +However, their multiplication differs:


+ +
+UNIX> gf_mult 8 2 4 -p 0x13 -
+3
+UNIX> gf_mult 8 2 4 -p 0x19 -
+9
+
+ + + + + + + + + +
+ +6     THE DEFAULTS 17


+ +
+UNIX> gf_div 3 8 4 -p 0x13 -
+2
+UNIX> gf_div 9 8 4 -p 0x19 -
+2
+UNIX>
+ +
+ + +

6.1.3     Changing the Multiplication Technique

+The following list describes the multiplication techinques that may be changed with "-m". We keep the description +here brief. Please refer to The Paper for detailed descriptions of these techniques.

+ + +
  • "TABLE:" Multiplication and division are implemented with tables. The tables consume quite a bit of memory +(2w × 2 w × w/ +8 bytes), so they are most useful when w is small. Please see "SSE," "LAZY," "DOUBLE" and + +"QUAD" under region techniques below for further modifications to "TABLE" to perform multiply_region()

  • + + +
  • "LOG:" This employs discrete (or "Zeph") logarithm tables to implement multiplication and division. The +memory usage is roughly (3 × 2w × w / +8 bytes), so they are most useful when w is small, but they tolerate +larger w than "TABLE." If the polynomial is not primitive (see section 6.1.2), then you cannot use "LOG" as +an implementation. In that case, gf_init_hard() or create_gf_from_argv() will fail

  • + + +
  • "LOG ZERO:" Discrete logarithm tables which include extra room for zero entries. This more than doubles +the memory consumption to remove an if statement (please see [GMS08] or The Paper for more description). It +doesn’t really make a huge deal of difference in performance

  • + +
  • "LOG ZERO EXT:" This expends even more memory to remove another if statement. Again, please see The +Paper for an explanation. As with "LOG ZERO," the performance difference is negligible

  • + +
  • "SHIFT:" Implementation straight from the definition of Galois Field multiplication, by shifting and XOR-ing, +then reducing the product using the polynomial. This is slooooooooow, so we don’t recommend you use it

  • + + +
  • "CARRY FREE:" This is identical to "SHIFT," however it leverages the SSE instruction PCLMUL to perform +carry-freemultiplications in single instructions. As such, it is the fastest way to perform multiplication for large +values of w when that instruction is available. Its performance depends on the polynomial used. See The Paper +for details, and see section 7.7 below for the speedups available when w = 16 and w = 32 if you use a different +polynomial than the default one

  • + + +
  • "BYTWO p:" This implements multiplication by successively multiplying the product by two and selectively +XOR-ing the multiplicand. See The Paper for more detail. It can leverage Anvin’s optimization that multiplies +64 and 128 bits of numbers in GF(2w) by two with just a few instructions. The SSE version requires SSE2

  • + + +
  • "BYTWO b:" This implements multiplication by successively multiplying the multiplicand by two and selectively +XOR-ing it into the product. It can also leverage Anvin's optimization, and it has the feature that when +you're multiplying a region by a very small constant (like 2), it can terminate the multiplication early. As such, +if you are multiplying regions of bytes by two (as in the Linux RAID-6 Reed-Solomon code [Anv09]), this is +the fastest of the techniques, regardless of the value of w. The SSE version requires SSE2

  • + + +
  • "SPLIT:" Split multiplication tables (like the LR tables in [GMS08], or the SIMD tables for w ≤ 8 in [LHy08, +Anv09, PGM13b]). This argument must be followed by two more arguments, wa and wb, which are the index +sizes of the sub-tables. This implementation reduces the size of the table from "TABLE," but requires multiple +

  • + + + + + + +
    + +6     THE DEFAULTS 18


    + +

    +With the exception of "COMPOSITE", only one multiplication technique can be provided for a given Galois +Field instance. Composite fields may use composite fields as their base fields, in which case the specification will be +recursive.

    + + + + + + + + +
    + +6     THE DEFAULTS 19


    + +

    6.1.4       Changing the Division Technique

    + +There are two techniques for division that may be set with "-d". If "-d" is not specified, then appropriate defaults +are employed. For example, when the multiplication technique is "TABLE," a table is created for division as well as +multiplication. When "LOG" is specified, the logarithm tables are used for division. With "COMPOSITE," a special +variant of Euclid's algorithm is employed that performs division using multiplication and division in the base field. +Otherwise, Euclid's algorithm is used. Please see The Paper for a description of Euclid's algorithm applied to Galois +Fields. + +

    If you use "-d", you must also specify the multiplication technique with "-m."

    +

    To force Euclid's algorithm instead of the defaults, you may specify it with "-d EUCLID." If instead, you would +rather convert elements of a Galois Field to a binary matrix and find an element's inverse by inverting the matrix, +then specify "-d MATRIX." In all of our tests, "MATRIX" is slower than "EUCLID." "MATRIX" is also not defined +for w > 32. +

    + + +

    6.1.5     Changing the Region Technique

    +The following are the region multiplication options ("-r"): + + + + + + + + + + + + +
    + +6     THE DEFAULTS 20


    + + + + +

    It is possible to combine region multiplication options. This is fully supported as long as gf_methods has the combination +listed. If multiple region options are required, they should be specified independently (as flags for gf_init_hard() +and independent options for command-line tools and create_gf_from_argv()).

    + + +

    6.2    Determining Supported Techniques with gf methods

    + + +The program gf_methods prints a list of supported methods on standard output. It is called as follows:

    +
    +
    ./gf methods w -BADC -LUMDRB

    + +The first argument is w , which may be any legal value of w . The second argument has the following flags:

    + +

    +You may specify multiple of these as the second argument. If you include both "B" and "A," then it uses the last +one specified.

    +

    +The last argument determines the output format of gf_methods. If it is "L," then it simply lists methods. If it +is "U," then the output contains gf_unit commands for each of the methods. For the others, the output contains +gf_time_tool.sh commands for M ultiplication,Division,Region multiplications with multiple buffer sizes, and the +Best region multiplication.

    +

    +gf_methods enumerates combinations of flags, and calls create_gf_from_argv() to see if the combinations are +supported. Although it enumerates a large number of combinations, it doesn't enumerate all possible parameters for +"SPLIT," "GROUP" or "COMPOSITE."

    + +

    Some examples of calling gf_methods are shown below in section 6.3.2.

    + + + + + + + +
    + +6     THE DEFAULTS 21


    + + +

    6.3 Testing with gf_unit , gf_time , and time_tool.sh

    + + + +gf_unit and gf_time may be used to verify that a combination of arguments works correctly and efficiently on your +platform. If you plan to stray from the defaults, it is probably best to run both tools to ensure there are no issues with +your environment. gf_unit will run a set of unit tests based on the arguments provided to the tool, and gf_time will +time Galois Field methods based on the provided arguments.
    +The usage of gf_ unit is:

    +
    +gf_unit w tests seed method

    +The usage of gf_ time is:

    +
    +gf_time w tests seed buffer-size iterations method

    +
    + + + +The seed is an integer- negative one uses the current time. The tests are specified by a listing of characters. The +following tests are supported (All are supported by gf_time. Only ', 'S' and 'R' are supported by gf_unit):

    + + + + + + + +

    Here are some examples of calling gf_unit and gf_time to verify that "-m SPLIT 32 4 -r ALTMAP -" works +in GF(232), and to get a feel for its performance. First, we go to the test directory and call gf_unit:



    + + +
    +UNIX> cd test
    +UNIX> ./gf_unit 32 A -1 -m SPLIT 32 4 -r ALTMAP -
    +Args: 32 A -1 -m SPLIT 32 4 -r ALTMAP - / size (bytes): 684
    +UNIX>

    +
    + +gf_unit reports on the arguments and how may bytes the gf_t consumes. If it discovers any problems or inconsistencies +with multiplication, division or region multiplication, it will report them. Here, there are no problems. +Next, we move to the tools directory and run performance tests on a 10K buffer, with 10,000 iterations of each test:

    + + +UNIX> cd ../tools
    +UNIX> ./gf_time 32 A -1 10240 10000 -m SPLIT 32 4 -r ALTMAP -
    +Seed: 1388435794
    +
    + + + + + + + + + + + + +
    Multiply: 4.090548 s Mops: 24.414 5.968 Mega-ops/s
    Divide: 37.794962 s Mops: 24.414 0.646 Mega-ops/s
    Inverse: 33.709875 s Mops: 24.414 0.724 Mega-ops/s
    Region-Random: XOR: 0 0.035210 s MB: 97.656 2773.527 MB/s
    Region-Random: XOR: 1 0.036081 s MB: 97.656 2706.578 MB/s
    Region-By-Zero:XOR: 0 0.003199 s MB: 97.656 30523.884 MB/s
    Region-By-Zero: XOR: 1 0.000626 s MB: 97.656 156038.095 MB/s
    +
    + + + + + + + + + + +
    + +6     THE DEFAULTS 22


    + +
    + + + + + + + +
    Region-By-One: XOR: 0 0.003810 s MB: 97.656 25628.832 MB/s
    Region-By-One: XOR: 1 0.008363 s MB: 97.656 11677.500 MB/s
    Region-By-Two: XOR: 0 0.032942 s MB: 97.656 2964.486 MB/s
    Region-By-Two: XOR: 1 0.033488 s MB: 97.656 2916.153 MB/s
    +
    +UNIX>

    + +

    The first column of output displays the name of the test performed. Region tests will test with and without the XOR +flag being set (see Section 4.3 for an example). The second column displays the total time the test took to complete +measured in seconds (s). The third column displays the size of the test measured in millions of operations (Mops) for +single tests and in Megabytes (MB) for the region tests. The final column displays the speed of the tests calculated +from the second and third columns, and is where you should look to get an idea of a method's performance.

    +

    +If the output of gf_unit and gf_time are to your satisfaction, you can incorporate the method into application code +using create gf_from_argv() or gf_init hard().

    +

    +The performance of "Region-By-Zero" and "Region-By-One" will not change from test to test, as all methods make +the same calls for these. "Region-By-Zero" with "XOR: 1" does nothing except set up the tests. Therefore, you may +use it as a control.

    + +

    6.3.1       time tool.sh

    + +Finally, the shell script time_tool.sh makes a bunch of calls to gf_time to give a rough estimate of performance. It is +called as follows:

    +usage sh time_tool.sh M|D|R|B w method

    + + +

    The values for the first argument are MDRB, for Multiplication, Division,Region multiplications with multiple +buffer sizes, and the Best region multiplication. For the example above, let's call time_tool.sh to get a rough idea of +performance:



    + +
    +UNIX> sh time_tool.sh M 32 -m SPLIT 32 4 -r ALTMAP -
    +M speed (MB/s): 6.03 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    +UNIX> sh time_tool.sh D 32 -m SPLIT 32 4 -r ALTMAP -
    +D speed (MB/s): 0.65 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    +UNIX> sh time_tool.sh R 32 -m SPLIT 32 4 -r ALTMAP -
    + + + + + + + + + + + + + +
    Region Buffer-Size: 16K (MB/s): 3082.91 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 32K (MB/s): 3529.07 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 64K (MB/s): 3749.94 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 128K (MB/s): 3861.27 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 512K (MB/s): 3820.82 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 1M (MB/s): 3737.41 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 2M (MB/s): 3002.90 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 4M (MB/s): 2760.77 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Best (MB/s): 3861.27 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    + +UNIX> sh time_tool.sh B 32 -m SPLIT 32 4 -r ALTMAP -
    +Region Best (MB/s): 3929.09 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    +UNIX>

    +
    +

    +We say that time_tool.sh is "rough" because it tries to limit each test to 5 ms or less. Thus, the time granularity +is fine, which means that the numbers may not be as precise as they could be were the time granularity to be course. +When in doubt, you should make your own calls to gf_time with a lot of iterations, so that startup costs and roundoff +error may be minimized.

    + + + + + + + + +
    + +6     THE DEFAULTS 23


    + +

    6.3.2       An example of gf methods and time tool.sh



    +Let's give an example of how some of these components fit together. Suppose we want to explore the basic techniques +in GF(232). First, let's take a look at what gf_methods suggests as "basic" methods:

    +
    +UNIX> gf_methods 32 -B -L
    +w=32: -
    +w=32: -m GROUP 4 8 -
    +w=32: -m SPLIT 32 4 -
    +w=32: -m SPLIT 32 4 -r ALTMAP -
    +w=32: -m SPLIT 32 8 -
    +w=32: -m SPLIT 8 8 -
    +w=32: -m COMPOSITE 2 - -
    +w=32: -m COMPOSITE 2 - -r ALTMAP -
    +UNIX>

    +
    + + +

    + +You'll note, this is on my old Macbook Pro, which doesn't support (PCLMUL), so "CARRY FREE" is not included +as an option. Now, let's run the unit tester on these to make sure they work, and to see their memory consumption:



    + +
    +UNIX> gf_methods 32 -B -U
    +../test/gf_unit 32 A -1 -
    +../test/gf_unit 32 A -1 -m GROUP 4 8 -
    +../test/gf_unit 32 A -1 -m SPLIT 32 4 -
    +../test/gf_unit 32 A -1 -m SPLIT 32 4 -r ALTMAP -
    +../test/gf_unit 32 A -1 -m SPLIT 32 8 -
    +../test/gf_unit 32 A -1 -m SPLIT 8 8 -
    +../test/gf_unit 32 A -1 -m COMPOSITE 2 - -
    +../test/gf_unit 32 A -1 -m COMPOSITE 2 - -r ALTMAP -
    +UNIX> gf_methods 32 -B -U | sh
    +Args: 32 A -1 - / size (bytes): 684
    +Args: 32 A -1 -m GROUP 4 8 - / size (bytes): 1296
    +Args: 32 A -1 -m SPLIT 32 4 - / size (bytes): 684
    +Args: 32 A -1 -m SPLIT 32 4 -r ALTMAP - / size (bytes): 684
    +Args: 32 A -1 -m SPLIT 32 8 - / size (bytes): 4268
    +Args: 32 A -1 -m SPLIT 8 8 - / size (bytes): 1839276
    +Args: 32 A -1 -m COMPOSITE 2 - - / size (bytes): 524648
    +Args: 32 A -1 -m COMPOSITE 2 - -r ALTMAP - / size (bytes): 524648
    +UNIX>

    +
    +

    +As anticipated, "SPLIT 8 8" consumes quite a bit of memory! Now, let's see how well they perform with both +single multiplications and region multiplications:



    +
    +UNIX> gf_methods 32 -B -M
    +sh time_tool.sh M 32 -
    +sh time_tool.sh M 32 -m GROUP 4 8 -
    +sh time_tool.sh M 32 -m SPLIT 32 4 -
    +sh time_tool.sh M 32 -m SPLIT 32 4 -r ALTMAP -
    +sh time_tool.sh M 32 -m SPLIT 32 8 -
    +sh time_tool.sh M 32 -m SPLIT 8 8 -
    + +
    + + + + + + + + +
    + +6     THE DEFAULTS 24


    + +
    +sh time_tool.sh M 32 -m COMPOSITE 2 -
    +sh time_tool.sh M 32 -m COMPOSITE 2 - -r ALTMAP
    +UNIX> gf_methods 32 -B -M | sh +M speed (MB/s): 5.90 W-Method: 32
    +M speed (MB/s): 14.09 W-Method: 32 -m GROUP 4 8
    +M speed (MB/s): 5.60 W-Method: 32 -m SPLIT 32 4
    +M speed (MB/s): 5.19 W-Method: 32 -m SPLIT 32 4 -r ALTMAP
    +M speed (MB/s): 5.98 W-Method: 32 -m SPLIT 32 8
    +M speed (MB/s): 22.10 W-Method: 32 -m SPLIT 8 8
    +M speed (MB/s): 34.98 W-Method: 32 -m COMPOSITE 2 -
    +M speed (MB/s): 34.16 W-Method: 32 -m COMPOSITE 2 - -r ALTMAP
    +UNIX> gf_methods 32 -B -B | sh +Region Best (MB/s): 2746.76 W-Method: 32
    +Region Best (MB/s): 177.06 W-Method: 32 -m GROUP 4 8
    +Region Best (MB/s): 2818.75 W-Method: 32 -m SPLIT 32 4
    +Region Best (MB/s): 3818.21 W-Method: 32 -m SPLIT 32 4 -r ALTMAP
    +Region Best (MB/s): 728.68 W-Method: 32 -m SPLIT 32 8
    +Region Best (MB/s): 730.97 W-Method: 32 -m SPLIT 8 8
    +Region Best (MB/s): 190.20 W-Method: 32 -m COMPOSITE 2 -
    +Region Best (MB/s): 1837.99 W-Method: 32 -m COMPOSITE 2 - -r ALTMAP
    +UNIX> +
    +

    +The default is quite a bit slower than the best performing methods for both single and region multiplication. So +why are the defaults the way that they are? As detailed at the beginning of this chapter, we strive for lower memory +consumption, so we don't use "SPLIT 8 8," which consumes 1.75MB.We don't implement alternate fields by default, +which is why we don't use "COMPOSITE." Finally, we don't implement alternate mappings of memory by default, +which is why we don't use "-m SPLIT 32 4 -r ALTMAP -."

    + +

    Of course, you may change these defaults if you please.

    +

    +Test question: Given the numbers above, it would appear that "COMPOSITE" yields the fastest performance of +single multiplication, while "SPLIT 32 4" yields the fastest performance of region multiplication. Should I use two +gf t's in my application – one for single multiplication that uses "COMPOSITE," and one for region multiplication +that uses "SPLIT 32 4?"

    +

    +The answer to this is "no." Why? Because composite fields are different from the "standard" fields, and if you mix +these two gf_t's, then you are using different fields for single multiplication and region multiplication. Please read +section 7.2 for a little more information on this.

    + +

    6.4      Calling gf_init_hard()

    + +We recommend that you use create_gf_from_argv() instead of gf_init_hard(). However, there are extra things that +you can do with gf_init_hard(). Here's the prototype:

    +
    +int gf_init_hard(gf_t *gf
    +
    +int w
    +int mult_type
    +int region_type
    +int divide_type
    +uint64_t prim_poly
    +int arg1
    +int arg2
    +
    +
    + + + + + + + + +
    + +6     THE DEFAULTS 25


    +
    +
    +GFP base_gf,
    +void *scratch_memory);


    + + +The arguments mult type, region type and divide type allow for the same specifications as above, except the +types are integer constants defined in gf complete.h:

    +typedef enum {GF_MULT_DEFAULT,
    +
    +GF_MULT_SHIFT
    +GF_MULT_CARRY_FREE
    +GF_MULT_GROUP
    +GF_MULT_BYTWO_p
    +GF_MULT_BYTWO_b
    +GF_MULT_TABLE
    +GF_MULT_LOG_TABLE
    +GF_MULT_LOG_ZERO
    +GF_MULT_LOG_ZERO_EXT
    +GF_MULT_SPLIT_TABLE
    +GF_MULT_COMPOSITE } gf_mult_type_t;

    + +
    + +#define GF_REGION_DEFAULT (0x0)
    +#define GF_REGION_DOUBLE_TABLE (0x1)
    +#define GF_REGION_QUAD_TABLE (0x2)
    +#define GF_REGION_LAZY (0x4)
    +#define GF_REGION_SSE (0x8)
    +#define GF_REGION_NOSSE (0x10)
    +#define GF_REGION_ALTMAP (0x20)
    +#define GF_REGION_CAUCHY (0x40)

    +typedef enum { GF_DIVIDE_DEFAULT
    +
    GF_DIVIDE_MATRIX
    +GF_DIVIDE_EUCLID } gf_division_type_t;

    +
    +
    +

    +You can mix the region types with bitwise or. The arguments to GF_MULT_GROUP,GF_MULT_SPLIT_TABLE +and GF_MULT_COMPOSITE are specified in arg1 and arg2. GF_MULT_COMPOSITE also takes a base field +in base_gf. The base field is itself a gf_t, which should have been created previously with create_gf_fro_argv(), +gf_init_easy() or gf_init_hard(). Note that this base_gf has its own base_gf member and can be a composite field +itself.

    +

    +You can specify an alternate polynomial in prim_poly. For w ≤ 32, the leftmost one (the one in bit position w) is +optional. If you omit it, it will be added for you. For w = 64, there's no room for that one, so you have to leave it off. +For w = 128, your polynomial can only use the bottom-most 64 bits. Fortunately, the standard polynomial only uses +those bits. If you set prim_poly to zero, the library selects the "standard" polynomial. +

    +

    +Finally, scratch_memory is there in case you don't want gf_init_hard() to call malloc(). Youmay call gf_scratch_size() +to find out how much extra memory each technique uses, and then you may pass it a pointer for it to use in scratc_memory. +If you set scratch memory to NULL, then the extra memory is allocated for you with malloc(). If you use gf_init_easy() +or create_gf_from_argv(), or you use gf_init_hard() and set scratch_memory to NULL, then you should call gf_free() +to free memory. If you use gf_init_hard() and use your own scratch_memory you can still call gf_free(), and it will +not do anything.

    +

    +Both gf_init_hard() and gf_scratch_size() return zero if the arguments don't specify a valid gf_t. When that happens, +you can call gf_error() to print why the call failed.

    + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 26


    + + +

    We'll give you one example of calling gf_ init_hard(). Suppose you want to make a gf_ init_hard() call to be +equivalent to "-m SPLIT 16 4 -r SSE -r ALTMAP -" and you want to allocate the scratch space yourself. Then you'd +do the following:



    + +
    +gf_t gf;
    +void *scratch;
    +int size;
    +size = gf_scratch_size(16, GF_MULT_SPLIT_TABLE,
    +GF_REGION_SSE | GF_REGION_ALTMAP,
    +GF_DIVIDE_DEFAULT,
    +16, 4);
    +if (size == 0) { gf_error(); exit(1); } /* It failed. That shouldn’t happen */
    +scratch = (void *) malloc(size);
    +if (scratch == NULL) { perror("malloc"); exit(1); }
    +if (!gf_init_hard(&gf, 16, GF_MULT_SPLIT_TABLE,
    +GF_REGION_SSE | GF_REGION_ALTMAP,
    +GF_DIVIDE_DEFAULT,
    +0, 16, 4, NULL, scratch)) {
    +gf_error();
    +exit(1);
    +}
    + +
    + + +

    6.5     gf_size()

    + +You can call gf_size(gf_t *gf) to learn the memory consumption of the gf_t. It returns all memory consumed by the +gf_t, including the gf_t itself, any scratch memory required by the gf_ t, and the memory consumed by the sub-field +if the field is "COMPOSITE." If you provided your own memory to gf_init_hard(), it does not report the size of +this memory, but what the size should be, as determined by gf_scratch size(). gf_ unit() prints out the return value of +gf_size() on the given field. + + +

    7   Further Information on Options and Algorithms

    +

    +7.1   Inlining Single Multiplication and Division for Speed

    + +Obviously, procedure calls are more expensive than single instructions, and the mechanics of multiplication in "TABLE" +and "LOG" are pretty simple. For that reason, we support inlining for "TABLE" when w = 4 and w = 8, and +for "LOG" when w = 16. We elaborate below. +

    +When w = 4, you may inline multiplication and division as follows. The following procedures return pointers to +the multiplication and division tables respectively:



    + +
    +uint8_t *gf_w4_get_mult_table(gf_t * gf);
    +uint8_t *gf_w4_get_div_table(gf_t * gf);

    +
    +

    The macro Gf_W4_INLINE_MULTDIV (table, a, b) then multiplies or divides a by b using the given table. This +of course only works if the multiplication technique is "TABLE," which is the default for w = 4. If the multiplication +technique is not "TABLE," then gf_w4_get_mult_table() will return NULL.

    + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 27


    + + + + +

    When w = 8, the procedures gf_w8_et_mult_table() and gf_ w8_get_div_table(), and the macro

    + +GF_W8_INLINE_MULTDIV (table, a, b) work identically to the w = 4 case. + +

    When w = 16, the following procedures return pointers to the logarithm table, and the two inverse logarithm tables +respectively:


    + +
    +uint16_t *gf_w16_get_log_table(gf_t * gf);
    +uint16_t *gf_w16_get_mult_alog_table(gf_t * gf);
    +uint16_t *gf_w16_get_div_alog_table(gf_t * gf);
    + +
    +
    +

    +The first inverse logarithm table works for multiplication, and the second works for division. They actually point +to the same table, but to different places in the table. You may then use the macro GF_W16_INLINE_MULT(log, +alog, a, b ) to multiply a and b, and the macro GF_W16_INLINE_DIV (log, alog, a, b ) to divide a and b. Make +sure you use the alog table returned by gf_w16_get_mult_alog_table() for multiplication and the one returned by +gf_w16_get_div_alog_table() for division. Here are some timings:



    + + +UNIX> gf_time 4 M 0 10240 10240 -
    +Seed: 0
    +Multiply: 0.228860 s Mops: 100.000 436.949 Mega-ops/s
    +UNIX> gf_inline_time 4 0 10240 10240
    +Seed: 0
    +Inline mult: 0.096859 s Mops: 100.000 1032.424 Mega-ops/s
    +UNIX> gf_time 8 M 0 10240 10240 -
    +Seed: 0
    +Multiply: 0.228931 s Mops: 100.000 436.812 Mega-ops/s
    +UNIX> gf_inline_time 8 0 10240 10240
    +Seed: 0
    +Inline mult: 0.114300 s Mops: 100.000 874.889 Mega-ops/s
    +UNIX> gf_time 16 M 0 10240 10240 -
    +Seed: 0
    +Multiply: 0.193626 s Mops: 50.000 258.229 Mega-ops/s
    +UNIX> gf_inline_time 16 0 10240 10240
    +Seed: 0
    +Inline mult: 0.310229 s Mops: 100.000 322.342 Mega-ops/s
    +UNIX>

    + +

    +7.2     Using different techniques for single and region multiplication

    + + +You may want to "mix and match" the techniques. For example, suppose you'd like to use "-m SPLIT 8 8" for +multiply() in GF(232), because it's fast, and you don't mind consuming all of that space for tables. However, for +multiply_region(), you'd like to use "-m SPLIT 32 4 -r ALTMAP," because that's the fastest way to implement +multiply_region(). Unfortunately, There is no way to create a gf_t that does this combination. In this case, you should +simply create two gf_t's, and use one for multiply() and the other for multiply_region(). All of the implementations +may be used interchangably with the following exceptions: + + + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 28


    + + + + + +

    7.5    Arguments to "GROUP"

    + +The "GROUP" multiplication option takes tw o arguments, gs and gr. It implements multiplication in the same manner +as "SHIFT," except it uses a table of size 2gs to perform gs shifts at a time, and a table of size 2gr to perform gr +reductions at at time. The program gf_methods only prints the options 4 4 and 4 8 as arguments for "GROUP." +However, other values of gs and gr are legal and sometimes desirable:

    + +
      +
    1. + For w ≤ 32 and w = 64, any values of gs and gr may be used, so long as they are less than or equal to w and so +long as the tables fit into memory. There are four exceptions to this, listed below .

    2. +
    3. For w = 4, "GROUP" is not supported.

    4. +
    5. For w = 8, "GROUP" is not supported.

    6. +
    7. For w = 16, "GROUP" is only supported for gs = gr = 4.

    8. +
    9. For w = 128 "GROUP" only supports gs = 4 and gr ∈ {4, 8, 16}.

    10. +
    +

    +The way that gs and gr impact performance is as follows. The "SHIFT" implementation works by performing a +carry-free multiplication in w steps, and then performing reduction in w steps. In "GROUP," the carry-free multiplication +is reduced to w /gssteps, and the reduction is reduced to w /gr + +. Both require tables. The table for the carry-free +multiplication must be created at the beginning of each multiply() or multiply_region(), while the table for reduction +is created when the gf_t is initialized. For that reason, it makes sense for gr to be bigger than gs.

    + +

    +To give a flavor for the impact of these arguments, Figure 3 show s the performance of varying gs and gr for +single multiplication and region multiplication respectively, in GF(232) and GF(264). As the graphs demonstrate, +multiply() performs better w ith smaller values of gs, w hile multiply region() amortizes the creation of the shifting +table, and can tolerate larger values of gs. w hen gs equals gr, there are some optimizations that we hand-encode. +These can be seen clearly in the multiply_region() graphs. +

    + + + + + + + + +
    +7     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 30 + + +
    + +
    + +
    +
    +Figure 3: The performance of multiply() and multiply_region() using "GROUP," and varying the arguments
    gs +and gr. All graphs are heat maps with black equaling zero. The region size is 100KB. + +

    7.6  Considerations with "COMPOSITE"

    + + +As mentioned above, using "ALTMAP" with "COMPOSITE" allows multiply_region() to recursively call multiply_ +region(), rather than simply calling multiply() on every word in the region. The difference can be pronounced:

    + +
    + + + + + + + + + +
    +gf time 32 G 0 10240 10240 -m COMPOSITE 2 - - +Speed = 322 MB/s
    gf time 32 G 0 10240 10240 -m COMPOSITE 2 - -r ALTMAP - +Speed = 3,368 MB/s
    +gf time 32 G 0 10240 10240 -m COMPOSITE 2 -m SPLIT 16 4 -r ALTMAP - -r ALTMAP - +Speed = 3,925 MB/s
    +
    + + +

    +

    +There is support for performing multiply() inline for the "TABLE" implementations for w ∈ {4, 8} and for the +"LOG" implementation for w = 16 (see section 7.1). These are leveraged by multiply() in "COMPOSITE," and +by multiply_region() if you are not using "ALTMAP." To demonstrate this, in the table below, you can see that the +performance of multiply() with "SPLIT 8 4" is 88 percent as fast than the default in w = 8 (which is "TABLE"). +When you use each as a base field for "COMPOSITE" with w = 16, the one with "SPLIT 8 4" is now just 37 percent +as fast. The difference is the inlining of multiplication in the base field when "TABLE" is employed:



    + +
    + + + + + + + + +
    gf time 8 M 0 1048576 100 - Speed = 501 Mega-ops/s
    gf time 8 M 0 1048576 100 -m SPLIT 8 4 - Speed = 439 Mega-ops/s
    gf time 8 M 0 1048576 100 -m COMPOSITE 2 - - Speed = 207 Mega-ops/s
    gf time 8 M 0 1048576 100 -m COMPOSITE 2 -m SPLIT 8 4 - - Speed = 77 Mega-ops/s
    +
    +

    +
    + +You can keep making recursive definitions of composites field if you want. For example, this one's not too slow for +region operations (641 MB/s): + + + + + + + + +
    +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 31


    + +
    +
    +gf time 128 G 0 1048576 100 -m COMPOSITE 2 -m COMPOSITE 2 -m COMPOSITE 2
    +-m SPLIT 16 4 -r ALTMAP - -r ALTMAP - -r ALTMAP - -r ALTMAP - +
    +

    + +

    Please see section 7.8.1 for a discussion of polynomials in composite fields.

    + +

    7.7       "CARRY FREE" and the Primitive Polynomial

    + + +If your machine supports the PCLMUL instruction, then we leverage that in "CARRY FREE." This implementation +first performs a carry free multiplication of two w-bit numbers, which yields a 2w-bit number. It does this with +one PCLMUL instruction. To reduce the 2w-bit number back to a w-bit number requires some manipulation of the +polynomial. As it turns out, if the polynomial has a lot of contiguous zeroes following its leftmost one, the number of +reduction steps may be minimized. For example, with w = 32, we employ the polynomial 0x100400007, because that +is what other libraries employ. This only has 9 contiguous zeros following the one, which means that the reduction +takes four steps. If we instead use 0x1000000c5, which has 24 contiguous zeros, the reduction takes just two steps. +You can see the difference in performance: +

    +
    +
    + + + + + + + + +
    gf time 32 M 0 1048576 100 -m CARRY FREE - Speed = 48 Mega-ops/s
    gf time 32 M 0 1048576 100 -m CARRY FREE -p 0xc5 - Speed = 81 Mega-ops/s
    + +

    + +

    +This is relevant for w = 16 and w = 32, where the "standard" polynomials are sub-optimal with respect to +"CARRY FREE." For w = 16, the polynomial 0x1002d has the desired property. It’s less important, of course, +with w = 16, because "LOG" is so much faster than CARRY FREE.

    + +

    7.8   More on Primitive Polynomials

    + +

    7.8.1   Primitive Polynomials that are not Primitive

    + +The library is willing to work with most polynomials, even if they are not primitive or irreducible. For example, the +polynomial x4 + x3 +x2 +x+1 is irreducible, and therefore generates a valid Galois Field for GF(24). However, it +is not primitive, because 25 = 1. For that reason, if you use this polynomial, you cannot use the "LOG" method. The +other methods will work fine:

    + +
    + +UNIX> gf_mult 2 2 4 -p 0xf -
    +4
    +UNIX> gf_mult 4 2 4 -p 0xf -
    +8
    +UNIX> gf_mult 8 2 4 -p 0xf -
    +15
    +UNIX> gf_mult 15 2 4 -p 0xf -
    +1
    +UNIX> gf_div 1 15 4 -p 0xf -
    +2
    +UNIX> gf_div 1 15 4 -p 0xf -m LOG -
    +usage: gf_div a b w [method] - does division of a and b in GF(2ˆw)
    +Bad Method Specification: Cannot use Log tables because the polynomial is not primitive.
    +UNIX>
    +
    +

    +If a polynomial is reducible, then it does not define a Galois Field, but instead a ring. GF-Complete attempts to +work here where it can; however certain parts of the library will not work: +

    + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 32


    +
      +
    1. +Division is a best effort service. The problemis that often quotients are not unique. If divide() returns a non-zero +number, then that number will be a valid quotient, but it may be one of many. If the multiplication technique is +"TABLE," then if a quotient exists, one is returned. Otherwise, zero is returned. Here are some examples - the +polynomial x4 + 1 is reducible, and therefore produces a ring. Below, we see that with this polynomal, 1*6 = 6 +and 14*6 = 6. Therefore, 6/6 has two valid quotients: 1 and 14. GF-Complete returns 14 as the quotient:

    2. + +
      +UNIX> gf_mult 1 6 4 -p 0x1 -
      +6
      +UNIX> gf_mult 14 6 4 -p 0x1 -
      +6
      +UNIX> gf_div 6 6 4 -p 0x1 -
      +14
      +UNIX>

      +
      + + +
    3. When "EUCLID" is employed for division, it uses the extended Euclidean algorithm for GCD to find a number's +inverse, and then it multiplies by the inverse. The problem is that not all numbers in a ring have inverses. For +example, in the above ring, there is no number a such that 6a = 1. Thus, 6 has no inverse. This means that even +though 6/6 has quotients in this ring, "EUCLID" will fail on it because it is unable to find the inverse of 6. It will +return 0: +

    4. +
      +UNIX> gf_div 6 6 4 -p 0x1 -m TABLE -d EUCLID -
      +0
      +UNIX>
      +

      + +
    5. Inverses only work if a number has an inverse. Inverses may not be unique.

    6. + +
    7. "LOG" will not work. In cases where the default would be "LOG," "SHIFT" is used instead.
    8. +
    + +

    +Due to problems with division, gf_unit may fail on a reducible polynomial. If you are determined to use such a +polynomial, don't let this error discourage you. +

    + +

    7.8.2 Default Polynomials for Composite Fields

    + +GF-Complete will successfully select a default polynomial in the following composite fields: + + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 33


    + + +

    7.8.3 The Program gf poly for Verifying Irreducibility of Polynomials

    + +The program gf_poly uses the Ben-Or algorithm[GP97] to determine whether a polynomial with coefficients in GF(2w ) +is reducible. Its syntax is:

    +
    +gf_poly w method power:coef power:coef ... +
    + +
    +

    You can use it to test for irreducible polynomials with binary coefficients by specifying w = 1. For example, from +the discussion above, we know that x4 +x+1 and x4 +x3 +x2 +x+1 are both irreducible, but x4 +1 is reducible. +gf_poly confirms:


    + +

    +UNIX> gf_poly 1 - 4:1 1:1 0:1
    +Poly: xˆ4 + x + 1
    +Irreducible.
    +UNIX> gf_poly 1 - 4:1 3:1 2:1 1:1 0:1 +Poly: xˆ4 + xˆ3 + xˆ2 + x + 1
    +Irreducible.
    +UNIX> gf_poly 1 - 4:1 0:1 r
    +Poly: xˆ4 + 1
    +Reducible.
    +UNIX>
    + +
    + + +

    +For composite fields GF((2l)2), we are looking for a value s such that x2 + sx + 1 is irreducible. That value +depends on the base field. For example, for the default field GF(232), a value of s = 2 makes the polynomial +irreducible. However, if the polynomial 0xc5 is used (so that PCLMUL is fast - see section 7.7), then s = 2 yields a +reducible polynomial, but s = 3 yields an irreducible one. You can use gf_poly to help verify these things, and to help +define s if you need to stray from the defaults:


    + +
    +UNIX> gf_poly 32 - 2:1 1:2 0:1
    +Poly: xˆ2 + (0x2)x + 1
    +Irreducible.
    +UNIX> gf_poly 32 -p 0xc5 - 2:1 1:2 0:1
    +Poly: xˆ2 + (0x2)x + 1
    +Reducible.
    +UNIX> gf_poly 32 -p 0xc5 - 2:1 1:3 0:1
    +Poly: xˆ2 + (0x3)x + 1
    +Irreducible.
    +UNIX>
    +
    + +

    +gf_unit does random sampling to test for problems. In particular, it chooses a random a and a random b, multiplies +them, and then tests the result by dividing it by a and b. When w is large, this sampling does not come close to +providing complete coverage to check for problems. In particular, if the polynomial is reducible, there is a good +chance that gf_unit won't discover any problems. For example, the following gf_unit call does not flag any problems, +even though the polynomial is reducible.

    +
    +
    +UNIX> gf_unit 64 A 0 -m COMPOSITE 2 -p 0xc5 - -p 2 -
    +UNIX> +
    + +

    +How can we demonstrate that this particular field has a problem? Well, when the polynomial is 0xc5, we can factor +x2 + 2x + 1 as (x + 0x7f6f95f9)(x + 0x7f6f95fb). Thus, in the composite field, when we multiply 0x17f6f95f9 by +0x17f6f95fb, we get zero. That's the problem: +

    + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 34


    + +
    + +UNIX> gf_mult 7f6f95f9 7f6f95fb 32h -p 0xc5 -
    +1
    +UNIX> gf_mult 17f6f95f9 17f6f95fb 64h -m COMPOSITE 2 -p 0xc5 - -p 2 -
    +0
    +UNIX>
    + +
    + +

    7.9 "ALTMAP" considerations and extract_word()

    + +There are two times when you may employ alternate memory mappings: +
      +
    1. When using "SPLIT" and wb = 4.
    2. +
    3. When using "COMPOSITE."
    4. +
    + +Additionally, by default, the "CAUCHY" region option also employs an alternate memory mapping. + +

    When you use alternate memory mappings, the exact mapping of words in GF(2w ) to memory depends on the +situation, the size of the region, and the alignment of the pointers. To help you figure things out, we have included the +procedures extract_word.wxx() as part of the gf_t struct. This procedure takes four parameters:

    + + +

    +It then returns the n-th word in memory. When the standard mapping is employed, this simply returns the n- +th contiguous word in memory. With alternate mappings, each word may be split over several memory regions, so +extract_word() grabs the relevant parts of each memory region to extract the word. Below, we go over each of the +above situations in detail. Please refer to Figure 2 in Section 5 for reference.

    + + +

    7.9.1 Alternate mappings with "SPLIT"

    + +The alternate mapping with "SPLIT" is employed so that we can best leverage mm_shuffle_epi8(). Please read [PGM13b] +for details as to why. Consider an example when w = 16. In the main region of memory (the middle region in Figure +2), multiplication proceeds in units of 32 bytes, which are each broken into two 16-byte regions. The first region +holds the high bytes of each word in GF(216), and the second region holds the low bytes. +Let's look at a very detailed example, from gf_example_5.c. This program makes the following call, where gf has + +been initialized for w = 16, using "SPLIT" and "ALTMAP:"

    +
    +gf.multiply_region.w32(&gf, a, b, 0x1234, 30*2, 0); +

    + + +

    In other words, it is multiplying a region a of 60 bytes (30 words) by the constant 0x1234 in GF(216), and placing +the result into b. The pointers a and b have been set up so that they are not multiples of 16. The first line of output +prints a and b:


    + +a: 0x10010008c b: 0x10010015c

    + +As described in Section 5, the regions of memory are split into three parts: + + + + + + + + +
    + + +6     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 35


    + + +
      +
    1. 4 bytes starting at 0x1001008c / 0x10010015c.
    2. +
    3. 32 bytes starting at 0x10010090 / 0x100100160.
    4. +
    5. 24 bytes starting at 0x100100b0 / 0x100100180.
    6. + +
    + + +

    In the first and third parts, the bytes are laid out according to the standard mapping. However, the second part is +split into two 16-byte regions- one that holds the high bytes of each word and one that holds the low bytes. To help +illustrate, the remainder of the output prints the 30 words of a and b as they appear in memory, and then the 30 return +values of extract_word.w32():


    + +
    + + + + + + + + + + + +
    1 2 3 4 5 6 7 8 9
    a: 640b 07e5 2fba ce5d f1f9 3ab8 c518 1d97 45a7 0160
    b: 1ba3 644e 84f8 be3c 4318 4905 b2fb 46eb ef01 a503
    +

    + + + + + + + + + + + +
    10 11 12 13 14 15 16 17 18 19
    a: 3759 b107 9660 3fde b3ea 8a53 75ff 46dc c504 72c2
    b: da27 e166 a0d2 b3a2 1699 3a3e 47fb 39af 1314 8e76
    + + +

    + + + + + + + + + +
    20 21 22 23 24 25 26 27 28 29
    a: b469 1b97 e91d 1dbc 131e 47e0 c11a 7f07 76e0 fe86
    b: 937c a5db 01b7 7f5f 8974 05e1 cff3 a09c de3c 4ac0
    +

    + + + + + + + + + + + + + + + + + + + +
    Word 0: 0x640b * 0x1234 = 0x1ba3 Word 15: 0x4575 * 0x1234 = 0xef47
    Word 1: 0x07e5 * 0x1234 = 0x644e Word 16: 0x60dc * 0x1234 = 0x03af
    Word 2: 0xba59 * 0x1234 = 0xf827 Word 17: 0x0146 * 0x1234 = 0xa539
    Word 3: 0x2f37 * 0x1234 = 0x84da Word 18: 0xc504 * 0x1234 = 0x1314
    Word 4: 0x5d07 * 0x1234 = 0x3c66 Word 19: 0x72c2 * 0x1234 = 0x8e76
    Word 5: 0xceb1 * 0x1234 = 0xbee1 Word 20: 0xb469 * 0x1234 = 0x937c
    Word 6: 0xf960 * 0x1234 = 0x18d2 Word 21: 0x1b97 * 0x1234 = 0xa5db
    Word 7: 0xf196 * 0x1234 = 0x43a0 Word 22: 0xe91d * 0x1234 = 0x01b7
    Word 8: 0xb8de * 0x1234 = 0x05a2 Word 23: 0x1dbc * 0x1234 = 0x7f5f
    Word 9: 0x3a3f * 0x1234 = 0x49b3 Word 24: 0x131e * 0x1234 = 0x8974
    Word 10: 0x18ea * 0x1234 = 0xfb99 Word 25: 0x47e0 * 0x1234 = 0x05e1
    Word 11: 0xc5b3 * 0x1234 = 0xb216 Word 26: 0xc11a * 0x1234 = 0xcff3
    Word 12: 0x9753 * 0x1234 = 0xeb3e Word 27: 0x7f07 * 0x1234 = 0xa09c
    Word 13: 0x1d8a * 0x1234 = 0x463a Word 28: 0x76e0 * 0x1234 = 0xde3c
    Word 14: 0xa7ff * 0x1234 = 0x01fb Word 29: 0xfe86 * 0x1234 = 0x4ac0
    +
    +
    +In the first region are words 0 and 1, which are identical to how they appear in memory: 0x640b and 0x07e5. In +the second region are words 2 through 17. These words are split among the two sixteen-byte regions. For example, +word 2, which extract_word() reports is 0xba59, is constructed from the low byte in word 2 (0xba) and the low byte +in word 10 (0x59). Since 0xba59 * 0x1234 = 0xf827, we see that the low byte in word 2 of b is 0xf8, and the low byte +in word 10 is 0x27. +

    When we reach word 22, we are in the third region of memory, and words are once again identical to how they +appear in memory.

    + +

    While this is confusing, we stress that that so long as you call multiply_region() with pointers of the same alignment +and regions of the same size, your results with ALTMAP will be consistent. If you call it with pointers of

    + + + + + + +
    + + +7     FURTHER INFORMATION ON OPTIONS AND ALGORITHMS 36


    + +different alignments, or with different region sizes, then the results will not be consistent. To reiterate, if you don't use +ALTMAP, you don't have to worry about any of this - words will always be laid out contiguously in memory. +

    +When w = 32, the middle region is a multiple of 64, and each word in the middle region is broken into bytes, each +of which is in a different 16-byte region. When w = 64, the middle region is a multiple of 128, and each word is +stored in eight 16-byte regions. And finally, whenw = 128, the middle region is a multiple of 128, and each word is +stored in 16 16-byte regions.


    + +

    7.9.2   Alternate mappings with "COMPOSITE"

    + +With "COMPOSITE," the alternate mapping divides the middle region in half. The lower half of each word is stored +in the first half of the middle region, and the higher half is stored in the second half. To illustrate, gf example 6 +performs the same example as gf example 5, except it is using "COMPOSITE" in GF((216)2), and it is multiplying +a region of 120 bytes rather than 60. As before, the pointers are not aligned on 16-bit quantities, so the region is broken +into three regions of 4 bytes, 96 bytes, and 20 bytes. In the first and third region, each consecutive four byte word is a +word in GF(232). For example, word 0 is 0x562c640b, and word 25 is 0x46bc47e0. In the middle region, the low two +bytes of each word come from the first half, and the high two bytes come from the second half. For example, word 1 +as reported by extract_word() is composed of the lower two bytes of word 1 of memory (0x07e5), and the lower two +bytes of word 13 (0x3fde). The product of 0x3fde07e5 and 0x12345678 is 0x211c880d, which is stored in the lower +two bytes of words 1 and 13 of b.

    + +a: 0x10010011c b: 0x1001001ec + +

    + +
    + + + + + + + + + + + + + + +
    1 2 3 4 5 6 7 8 9
    a: 562c640b 959407e5 56592fba cbadce5d 1d1cf1f9 35d73ab8 6493c518 b37c1d97 8e4545a7 c0d80160
    b: f589f36c f146880d 74f7b349 7ea7c5c6 34827c1a 93cc3746 bfd9288b 763941d1 bcd33a5d da695e64
    + + +

    + + + + + + + + + + + + + +
    10 11 12 13 14 15 16 17 18 19
    a: 965b3759 cb3eb107 1b129660 95a33fde 95a7b3ea d16c8a53 153375ff f74646dc 35aac504 98f972c2
    b: fd70f125 3274fa8f d9dd34ee c01a211c d4402403 8b55c08b da45f0ad 90992e18 b65e0902 d91069b5
    + + + +

    + + + + + + + + + + + +
    20 21 22 23 24 25 26 27 28 29
    a: 5509b469 7f8a1b97 3472e91d 9ee71dbc de4e131e 46bc47e0 5bc9c11a 931d7f07 c85cfe86 fe86
    b: fc92b8f5 edd59668 b4bc0d90 a679e4ce 1a98f7d0 6038765f b2ff333f e7937e49 fa5a5867 79c00ea2
    +

    + + + + + + + + + + + + + + + + + + + +
    Word 0: 0x562c640b * 0x12345678 = 0xf589f36c Word 15: 0xb46945a7 * 0x12345678 = 0xb8f53a5d
    Word 1: 0x3fde07e5 * 0x12345678 = 0x211c880d Word 16: 0x55098e45 * 0x12345678 = 0xfc92bcd3
    Word 2: 0x95a39594 * 0x12345678 = 0xc01af146 Word 17: 0x1b970160 * 0x12345678 = 0x96685e64
    Word 3: 0xb3ea2fba * 0x12345678 = 0x2403b349 Word 18: 0x7f8ac0d8 * 0x12345678 = 0xedd5da69
    Word 4: 0x95a75659 * 0x12345678 = 0xd44074f7 Word 19: 0xe91d3759 * 0x12345678 = 0x0d90f125
    Word 5: 0x8a53ce5d * 0x12345678 = 0xc08bc5c6 Word 20: 0x3472965b * 0x12345678 = 0xb4bcfd70
    Word 6: 0xd16ccbad * 0x12345678 = 0x8b557ea7 Word 21: 0x1dbcb107 * 0x12345678 = 0xe4cefa8f
    Word 7: 0x75fff1f9 * 0x12345678 = 0xf0ad7c1a Word 22: 0x9ee7cb3e * 0x12345678 = 0xa6793274
    Word 8: 0x15331d1c * 0x12345678 = 0xda453482 Word 23: 0x131e9660 * 0x12345678 = 0xf7d034ee
    Word 9: 0x46dc3ab8 * 0x12345678 = 0x2e183746 Word 24: 0xde4e1b12 * 0x12345678 = 0x1a98d9dd
    Word 10: 0xf74635d7 * 0x12345678 = 0x909993cc Word 25: 0x46bc47e0 * 0x12345678 = 0x6038765f
    Word 11: 0xc504c518 * 0x12345678 = 0x0902288b Word 26: 0x5bc9c11a * 0x12345678 = 0xb2ff333f
    Word 12: 0x35aa6493 * 0x12345678 = 0xb65ebfd9 Word 27: 0x931d7f07 * 0x12345678 = 0xe7937e49
    +
    + + + + + + + + +
    + + +8     THREAD SAFETY 37


    +
    + + + + + + + + + +
    Word 13: 0x72c21d97 * 0x12345678 = 0x69b541d1 Word 28: 0xd40676e0 * 0x12345678 = 0xfa5a5867
    Word 14: 0x98f9b37c * 0x12345678 = 0xd9107639 Word 29: 0xc85cfe86 * 0x12345678 = 0x79c00ea2
    +

    + + +

    +As with "SPLIT," using multiply_region() with "COMPOSITE" and "ALTMAP" will be consistent only if the +alignment of pointers and region sizes are identical.

    + + +

    7.9.3 The mapping of "CAUCHY"

    + +With "CAUCHY," the region is partitioned into w subregions, and each word in the region is broken into w bits, +each of which is stored in a different subregion. To illustrate, gf_example_7 multiplies a region of three bytes by 5 +in GF(23) using "CAUCHY:"

    + +
    + +UNIX> gf_example_7
    +a: 0x100100190 b: 0x1001001a0

    +a: 0x0b 0xe5 0xba
    +b: 0xee 0xba 0x0b

    +a bits: 00001011 11100101 10111010
    +b bits: 11101110 10111010 00001011

    +Word 0: 3 * 5 = 4
    +Word 1: 5 * 5 = 7
    +Word 2: 2 * 5 = 1
    +Word 3: 5 * 5 = 7
    +Word 4: 4 * 5 = 2
    +Word 5: 6 * 5 = 3
    +Word 6: 2 * 5 = 1
    +Word 7: 6 * 5 = 3
    +UNIX>

    +

    + +The program prints the three bytes of a and b in hexadecimal and in binary. To see how words are broken up, +consider word 0, which is the lowest bit of each of the three bytes of a (and b). These are the bits 1, 1 and 0 in a, and +0, 0, and 1 in b. Accordingly, the word is 3 in a, and 3*5 = 4 in b. Similarly, word 7 is the high bit in each byte: 0, 1, 1 +(6) in a, and 1, 1, 0 (3) in b.

    +

    With "CAUCHY," multiply_region()may be implemented exclusively with XOR operations. Please see [BKK+95] +for more information on the motivation behind "CAUCHY."

    + +

    8   Thread Safety

    + +Once you initialize a gf_t, you may use it wontonly in multiple threads for all operations except for the ones below. +With the implementations listed below, the scratch space in the gf_t is used for temporary tables, and therefore you +cannot call region_multiply, and in some cases multiply from multiple threads because they will overwrite each +others' tables. In these cases, if you want to call the procedures from multiple threads, you should allocate a separate +gf_t for each thread: + + + + + + + + + + +
    + + +9     LISTING OF PROCEDURES 38


    + + +

    9  Listing of Procedures

    + +The following is an alphabetical listing of the procedures, data types and global variables for users to employ in +GF-complete.
    + +