GF-Complete: A Comprehensive Open Source Library for Galois
Field Arithmetic

Version 1.02

James S. Plank*        Ethan L. Miller Kevin M. Greenan        Benjamin A. Arnold
John A. Burnum        Adam W. Disney        Allen C. McBride


https://bitbucket.org/jimplank/gf-complete

http://web.eecs.utk.edu/~plank/plank/papers/GF-Complete-Manual-1.02.pdf

This is a user's manual for GF-Complete, version 1.02. This release supersedes version 0.1 and represents the first major release of GF-Complete. To our knowledge, this library implements every Galois Field multiplication technique applicable to erasure coding for storage, which is why we named it GF-Complete. The primary goal of this library is to allow storage system researchers and implementors to utilize very fast Galois Field arithmetic for Reed-Solomon coding and the like in their storage installations. The secondary goal is to allow those who want to explore different ways to perform Galois Field arithmetic to be able to do so effectively.

If you wish to cite GF-Complete, please cite technical report UT-CS-13-716: [PMG+13].

If You Use This Library or Document

Please send me an email to let me know how it goes. Or send me an email just to let me know you are using the library. One of the ways in which we are evaluated both internally and externally is by the impact of our work, and if you have found this library and/or this document useful, we would like to be able to document it. Please send mail to plank@cs.utk.edu. Please send bug reports to that address as well.

The library itself is protected by the New BSD License. It is free to use and modify within the bounds of this license. To the authors' knowledge, none of the techniques implemented in this library have been patented, and the authors are not pursuing patents.


Finding the Code

This code is actively maintained on bitbucket: https://bitbucket.org/jimplank/gf-complete. There are previous versions on my UTK site as a technical report; however, that is too hard to maintain, so the main version is on bitbucket.

Two Related Papers

This software accompanies a large paper that describes these implementation techniques in detail [PGM13a]. We will refer to this as "The Paper." You do not have to read The Paper to use the software. However, if you want to start exploring the various implementations, then The Paper is where you'll want to go to learn about the techniques in detail.

This library implements the techniques described in the paper "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions," [PGM13b]. The Paper describes all of those techniques as well.



If You Would Like Help With the Software

Please contact the first author of this manual.

Changes from Revision 1.01

The major change is that we are using autoconf to aid with compilation, thus obviating the need for the old flag_tester code. Additionally, we have added a quick timing tool, and we have modified gf_methods so that it may be used to run the timing tool and the unit tester.

Contents

1 Introduction

2 Files in the Library
2.1 Header files in the directory "include"
2.2 Source files in the "src" directory
2.3 Library tools files in the "tools" directory
2.4 The unit tester in the "test" directory
2.5 Example programs in the "examples" directory

3 Compilation

4 Some Tools and Examples to Get You Started
4.1 Three Simple Command Line Tools: gf_mult, gf_div and gf_add
4.2 Quick Starting Example #1: Simple multiplication and division
4.3 Quick Starting Example #2: Multiplying a region by a constant
4.4 Quick Starting Example #3: Using w = 64
4.5 Quick Starting Example #4: Using w = 128

5 Important Information on Alignment when Multiplying Regions

6 The Defaults
6.1 Changing the Defaults
6.2 Determining Supported Techniques with gf_methods
6.3 Testing with gf_unit, gf_time, and time_tool.sh
6.4 Calling gf_init_hard()
6.5 gf_size()

7 Further Information on Options and Algorithms
7.1 Inlining Single Multiplication and Division for Speed
7.2 Using different techniques for single and region multiplication
7.3 General w
7.4 Arguments to "SPLIT"
7.5 Arguments to "GROUP"
7.6 Considerations with "COMPOSITE"
7.7 "CARRY_FREE" and the Primitive Polynomial
7.8 More on Primitive Polynomials
7.9 "ALTMAP" considerations and extract_word()

8 Thread Safety

9 Listing of Procedures

10 Troubleshooting

11 Timings
11.1 Multiply()
11.2 Divide()
11.3 Multiply Region()


1 Introduction

Galois Field arithmetic forms the backbone of erasure-coded storage systems, most famously the Reed-Solomon erasure code. A Galois Field is defined over w-bit words and is termed GF(2^w). As such, the elements of a Galois Field are the integers 0, 1, ..., 2^w - 1. Galois Field arithmetic defines addition and multiplication over these closed sets of integers in such a way that they work as you would hope they would work. Specifically, every non-zero number has a unique multiplicative inverse. Moreover, there is a value, typically the value 2, which has the property that you can enumerate all of the non-zero elements of the field by taking that value to successively higher powers.

Addition in a Galois Field is equal to the bitwise exclusive-or operation. That's nice and convenient. Multiplication is a little more complex, and there are many, many ways to implement it. The Paper describes them all, and the following references provide more supporting material: [Anv09, GMS08, LHy08, LD00, LBOX12, Pla97]. The intent of this library is to implement all of the techniques. That way, their performance may be compared, and their tradeoffs may be analyzed.

When used for erasure codes, there are typically five important operations:
  1. Adding two numbers in GF(2^w). That's bitwise exclusive-or.
  2. Multiplying two numbers in GF(2^w). Erasure codes are usually based on matrices in GF(2^w), and constructing these matrices requires both addition and multiplication.
  3. Dividing two numbers in GF(2^w). Sometimes you need to divide to construct matrices (for example, Cauchy Reed-Solomon codes [BKK+95, Rab89]). More often, though, you use division to invert matrices for decoding. Sometimes it is easier to find a number's inverse than it is to divide. In that case, you can divide by multiplying by an inverse.
  4. Adding two regions of numbers in GF(2^w), which will be explained along with...
  5. Multiplying a region of numbers in GF(2^w) by a constant in GF(2^w). Erasure coding typically boils down to performing dot products in GF(2^w). For example, you may define a coding disk using the equation:

    c_0 = d_0 + 2d_1 + 4d_2 + 8d_3.

That looks like three multiplications and three additions. However, the way it is implemented in a disk system looks as in Figure 1. Large regions of disks are partitioned into w-bit words in GF(2^w). In the example, let us suppose that w = 8, and therefore that words are bytes. Then the regions pictured are 1 KB from each disk. The bytes on disk D_i are labeled d_{i,0}, d_{i,1}, ..., d_{i,1023}, and the equation above is replicated 1024 times. For 0 ≤ j < 1024:

    c_{0,j} = d_{0,j} + 2d_{1,j} + 4d_{2,j} + 8d_{3,j}.

    While it's possible to implement each of these 1024 equations independently, using the single multiplication and addition operations above, it is often much more efficient to aggregate. For example, most computer architectures support bitwise exclusive-or of 64 and 128 bit words. Thus, it makes much more sense to add regions of numbers in 64 or 128 bit chunks rather than as words in GF(2w). Multiplying a region by a constant can leverage similar optimizations.
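The aggregation described above can be sketched in plain C without GF-Complete. This is only an illustration under stated assumptions: the function names (gf8_mult, region_xor, region_mult_xor, encode_c0) are hypothetical and are not part of the library, and the polynomial 0x11d for w = 8 is an assumption made for the sketch. It adds regions eight bytes at a time with XOR, and multiplies a region by a constant through a small per-constant lookup table, rather than with any of the optimized techniques described later.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift-and-XOR multiplication in GF(2^8).
   The polynomial 0x11d is an assumption made for this sketch. */
uint8_t gf8_mult(uint8_t a, uint8_t b)
{
    uint16_t product = 0, shifted = a;
    while (b != 0) {
        if (b & 1) product ^= shifted;          /* add (XOR) this multiple of a */
        shifted <<= 1;                          /* multiply the multiple by 2   */
        if (shifted & 0x100) shifted ^= 0x11d;  /* reduce with the polynomial   */
        b >>= 1;
    }
    return (uint8_t) product;
}

/* dest[j] ^= src[j], eight bytes at a time where possible. */
void region_xor(uint8_t *dest, const uint8_t *src, size_t bytes)
{
    size_t j = 0;
    for (; j + 8 <= bytes; j += 8) {
        uint64_t d, s;
        memcpy(&d, dest + j, 8);   /* memcpy avoids alignment assumptions */
        memcpy(&s, src + j, 8);
        d ^= s;
        memcpy(dest + j, &d, 8);
    }
    for (; j < bytes; j++) dest[j] ^= src[j];
}

/* dest[j] ^= c * src[j] in GF(2^8), using a 256-entry table built for c. */
void region_mult_xor(uint8_t *dest, const uint8_t *src, uint8_t c, size_t bytes)
{
    uint8_t table[256];
    for (int v = 0; v < 256; v++) table[v] = gf8_mult(c, (uint8_t) v);
    for (size_t j = 0; j < bytes; j++) dest[j] ^= table[src[j]];
}

/* c0 = d0 + 2*d1 + 4*d2 + 8*d3, applied to whole regions at once. */
void encode_c0(uint8_t *c0, const uint8_t *d0, const uint8_t *d1,
               const uint8_t *d2, const uint8_t *d3, size_t bytes)
{
    memset(c0, 0, bytes);
    region_xor(c0, d0, bytes);
    region_mult_xor(c0, d1, 2, bytes);
    region_mult_xor(c0, d2, 4, bytes);
    region_mult_xor(c0, d3, 8, bytes);
}
```

Calling encode_c0() on four 1 KB buffers computes all 1024 instances of the equation, with the cost of building each multiplication table paid once per region rather than once per byte.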

GF-Complete supports multiplication and division of single values for all values of w ≤ 32, plus w = 64 and w = 128. It also supports adding two regions of memory (for any value of w, since addition equals XOR), and multiplying a region by a constant in GF(2^4), GF(2^8), GF(2^16), GF(2^32), GF(2^64) and GF(2^128). These values are chosen because words in GF(2^w) fit into machine words with these values of w. Other values of w don't lend themselves to efficient multiplication of regions by constants (although see the "CAUCHY" option in section 6.1.5 for a way to multiply regions for other values of w).







Figure 1: An example of adding two regions of numbers, and multiplying a region of numbers by a constant in GF(2^w). In this example, w = 8, and each disk is holding a 1 KB region. The same coding equation, c_{0,j} = d_{0,j} + a d_{1,j} + a^2 d_{2,j} + a^3 d_{3,j}, is applied 1024 times. However, rather than executing this equation 1024 times, it is more efficient to implement this with three region-constant multiplications and three region-region additions.

2     Files in the Library

This section provides an overview of the files that compose GF-Complete. They are partitioned among multiple directories.

2.1     Header files in the directory "include"

The following header files are part of GF-Complete.


2.2     Source files in the "src" directory

2.3     Library tools files in the "tools" directory




2.4     The unit tester in the "test" directory

The test directory contains the program gf_unit.c, which performs a battery of unit tests on GF-Complete. This is explained in more detail in section 6.3.

2.5    Example programs in the "examples" directory

There are seven example programs to help you understand various facets of GF-Complete. They are in the files gf_example_x.c in the examples directory. They are explained in sections 4.2 through 4.5, and section 7.9.

3     Compilation

From revision 1.02 forward, we are using autoconf. The old flag_tester directory is now gone, as it is no longer in use.

To compile and install, you should do the standard operations that you do with most open source Unix code:

UNIX> ./configure
...
UNIX> make
...
UNIX> sudo make install

If you perform the install, then the header, source, tool, and library files will be moved to system locations. In particular, you may then compile programs that use the library by linking with the flag -lgf_complete, and you may use the tools from a global executable directory (like /usr/local/bin).

If you don't perform the install, then the header and tool files will be in their respective directories, and the library will be in src/libgf_complete.la.

If your system supports the various Intel SIMD instructions, the compiler will find them, and GF-Complete will use them by default.

4     Some Tools and Examples to Get You Started

4.1 Three Simple Command Line Tools: gf_mult, gf_div and gf_add

Before delving into the library, it may be helpful to explore Galois Field arithmetic with the command line tools: gf_mult, gf_div and gf_add. These perform multiplication, division and addition on elements in GF(2^w). If these are not installed on your system, then you may find them in the tools directory. Their syntax is:
6     THE DEFAULTS 14


Table 1 shows the default methods used for each power-of-two word size, their alignment parameters s and t, their memory consumption and their rough performance. The performance tests are on an Intel Core i7-3770 running at 3.40 GHz, and are included solely to give a flavor of performance on a standard microprocessor. Some processors will be faster with some techniques and others will be slower, so we only put numbers in so that you can ballpark it. For other values of w between 1 and 31, we use table lookup when w ≤ 8, discrete logarithms when w ≤ 16 and "BYTWO_p" for w ≤ 32.



With SSE

w    Memory  multiply()      Performance   multiply_region()    s    t    Performance
     Usage   Implementation  (Mega Ops/s)  Implementation                 (MB/s)
4    <1K     Table           501           Table                16   16   11,659
8    136K    Table           501           Split Table (8,4)    16   16   11,824
16   896K    Log             260           Split Table (16,4)   32   16   7,749
32   <1K     Carry-Free      48            Split Table (32,4)   64   16   5,011
64   2K      Carry-Free      84            Split Table (64,4)   128  16   2,402
128  64K     Carry-Free      48            Split Table (128,4)  16   16   833

Without SSE

w    Memory  multiply()      Performance   multiply_region()    s    t    Performance
     Usage   Implementation  (Mega Ops/s)  Implementation                 (MB/s)
4    4K      Table           501           Double Table         16   16   11,659
8    128K    Table           501           Table                1    1    1,397
16   896K    Log             266           Split Table (16,8)   32   16   2,135
32   4K      Bytwo_p         19            Split Table (32,4)   4    4    1,149
64   16K     Bytwo_p         9             Split Table (64,4)   8    8    987
128  64K     Bytwo_p         1.4           Split Table (128,4)  16   8    833


Table 1: The default implementations, memory consumption and rough performance when w is a power of two. The variables s and t are alignment variables described in Section 5.

A few comments on Table 1 are in order. First, with SSE, the performance of multiply() is faster when w = 64 than when w = 32. That is because the primitive polynomial for w = 32, that has historically been used in Galois Field implementations, is sub-ideal for using carry-free multiplication (PCLMUL). You can change this polynomial (see section 7.7) so that the performance matches w = 64.

The region operations for w = 4 and w = 8 without SSE have been selected to have a low memory footprint. There are better options that consume more memory, or that only work on large memory regions (see section 6.1.5).

There are times that you may want to stray from the defaults. For example:


Our command line tools allow you to deviate from the defaults, and we have two C functions, gf_init_hard() and create_gf_from_argv(), that can be called from application code to override the default methods. There are six command-line tools that can be used to explore the many techniques implemented in GF-Complete:

To change the default behavior in application code, you need to call gf_init_hard() rather than gf_init_easy(). Alternatively, you can use create_gf_from_argv(), included from gf_method.h, which uses an argv-style array of strings to specify the options that you want. The procedure in gf_method.c parses the array and makes the proper gf_init_hard() procedure call. This is the technique used to parse the command line in gf_mult, gf_div, gf_unit et al.

6.1.1 Changing the Components of a Galois Field with create_gf_from_argv()

There are five main components to every Galois Field instance:

The procedures gf_init_hard() and create_gf_from_argv() allow you to specify these parameters when you create your Galois Field instance. We focus first on create_gf_from_argv(), because that is how the tools allow you to specify the components. The prototype of create_gf_from_argv() is as follows:


int create_gf_from_argv(gf_t *gf, int w, int argc, char **argv, int starting);

You pass it a pointer to a gf_t, which it will initialize. You specify the word size with the parameter w, and then you pass it an argc/argv pair as in any C or C++ program. You also specify a starting argument, which is where in argv the specifications begin. If it successfully parses argc and argv, then it creates the gf_t using gf_init_hard() (described below in section 6.4). It returns one past the last index of argv that it considered when creating the gf_t. If it fails, then it returns zero, and the gf_t is unmodified.

For example, gf_mult.c calls create_gf_from_argv() by simply passing argc and argv from its main() declaration, and setting starting to 4.




To choose defaults, argv[starting] should equal "-". Otherwise, you specify the component that you are changing with "-m" for multiplication technique, "-d" for division technique, "-r" for region technique, and "-p" for the polynomial. You may change multiple components. You end your specification with a single dash. For example, the following call multiplies 6 and 5 in GF(2^4) with polynomial 0x19 using the "SHIFT" technique for multiplication (we'll explain these parameters later):



UNIX> ./gf_mult 6 5 4 -p 0x19 -m SHIFT -
7
UNIX>

If create_gf_from_argv() fails, then you can call the procedure gf_error(), which prints out the reason why create_gf_from_argv() failed.

6.1.2 Changing the Polynomial

Galois Fields are typically implemented by representing numbers as polynomials with binary coefficients, and then using the properties of polynomials to define addition and multiplication. You do not need to understand any of that to use this library. However, if you want to learn more about polynomial representations and how they construct fields, please refer to The Paper.

Multiplication is based on a special polynomial that we will refer to here as the "defining polynomial." This polynomial has binary coefficients and is of degree w. You may change the polynomial with "-p" and then a number in hexadecimal (the leading "0x" is optional). It is assumed that the w-th bit of the polynomial is set; you may include it or omit it. For example, if you wish to set the polynomial for GF(2^16) to x^16 + x^5 + x^3 + x^2 + 1, rather than its default of x^16 + x^12 + x^3 + x + 1, you may say "-p 0x1002d," "-p 1002d," "-p 0x2d" or "-p 2d." We discuss changing the polynomial for three reasons in other sections:

Some words about nomenclature with respect to the polynomial. A Galois Field requires the polynomial to be irreducible. That means that it cannot be factored. For example, when the coefficients are binary, the polynomial x^5 + x^4 + x + 1 may be factored as (x^4 + 1)(x + 1). Therefore it is not irreducible and cannot be used to define a Galois Field. It may, however, be used to define a ring. Please see section 7.8.1 for a discussion of ring support in GF-Complete.

There is a subset of irreducible polynomials called primitive. These have an important property that one may enumerate all of the elements of the field by raising 2 to successive powers. All of the default polynomials in GF-Complete are primitive. However, so long as a polynomial is irreducible, it defines a Galois Field. Please see section 7.7 for a further discussion of the polynomial.
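The distinction between irreducible and primitive polynomials can be checked by brute force for small w. The following is a minimal sketch, not library code: the function names are hypothetical, and it assumes that the polynomial is irreducible, that its x^w bit is included, and that 2 ≤ w ≤ 16. It repeatedly multiplies by 2 and counts how many steps it takes to return to 1.

```c
#include <stdint.h>

/* Multiply an element of GF(2^w) by 2 (that is, by x), reducing with poly.
   poly includes the bit for x^w, e.g. 0x13 for x^4 + x + 1. */
uint32_t gf_times_two(uint32_t a, int w, uint32_t poly)
{
    a <<= 1;
    if (a & ((uint32_t) 1 << w)) a ^= poly;
    return a;
}

/* Multiplicative order of 2: the smallest k >= 1 such that 2^k == 1.
   Assumes poly is irreducible of degree w, with 2 <= w <= 16. */
uint32_t gf_order_of_two(int w, uint32_t poly)
{
    uint32_t value = 2, k = 1;
    while (value != 1) {
        value = gf_times_two(value, w, poly);
        k++;
    }
    return k;
}

/* poly is primitive exactly when the powers of 2 visit all 2^w - 1
   non-zero elements before returning to 1. */
int gf_poly_is_primitive(int w, uint32_t poly)
{
    return gf_order_of_two(w, poly) == ((uint32_t) 1 << w) - 1;
}
```

For w = 4, both 0x13 and 0x19 give an order of 15, so both are primitive. The polynomial 0x1f (x^4 + x^3 + x^2 + x + 1) is irreducible, but the powers of 2 cycle after only 5 steps, so it defines a field without being primitive.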

One thing that we want to stress here is that changing the polynomial changes the field, so fields with different polynomials may not be used interchangeably. So long as the polynomial is irreducible, it generates a Galois Field that is isomorphic to all other Galois Fields with the same w; however the multiplication and division of elements will differ. For example, the polynomials 0x13 (the default) and 0x19 in GF(2^4) are both irreducible, so both generate valid Galois Fields. However, their multiplication differs:


UNIX> gf_mult 8 2 4 -p 0x13 -
3
UNIX> gf_mult 8 2 4 -p 0x19 -
9
UNIX> gf_div 3 8 4 -p 0x13 -
2
UNIX> gf_div 9 8 4 -p 0x19 -
2
UNIX>
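To see why the two polynomials give different products, it helps to write out the definition directly. The following is a minimal sketch of shift-and-XOR multiplication followed by reduction with the polynomial. The function name gf_mult_shift is hypothetical and this is not the library's implementation; the sketch assumes 1 ≤ w ≤ 16 so that the unreduced product fits in 32 bits, and that the polynomial includes its x^w bit.

```c
#include <stdint.h>

/* Multiply a and b in GF(2^w) straight from the definition.
   Assumes 1 <= w <= 16 and that poly includes the bit for x^w. */
uint32_t gf_mult_shift(uint32_t a, uint32_t b, int w, uint32_t poly)
{
    uint32_t product = 0;

    /* Carry-free (polynomial) product: XOR shifted copies of a. */
    for (int i = 0; i < w; i++) {
        if (b & ((uint32_t) 1 << i)) product ^= a << i;
    }

    /* Reduction: cancel every bit at position >= w, highest first,
       by XOR-ing in a shifted copy of the polynomial. */
    for (int i = 2 * w - 2; i >= w; i--) {
        if (product & ((uint32_t) 1 << i)) product ^= poly << (i - w);
    }
    return product;
}
```

With these assumptions, gf_mult_shift(8, 2, 4, 0x13) returns 3 and gf_mult_shift(8, 2, 4, 0x19) returns 9, matching the transcript above: the unreduced product is x^4 in both cases, and only the reduction step differs.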

6.1.3     Changing the Multiplication Technique

The following list describes the multiplication techniques that may be changed with "-m". We keep the description here brief. Please refer to The Paper for detailed descriptions of these techniques.

  • "TABLE:" Multiplication and division are implemented with tables. The tables consume quite a bit of memory (2^w × 2^w × w/8 bytes), so they are most useful when w is small. Please see "SSE," "LAZY," "DOUBLE" and "QUAD" under region techniques below for further modifications to "TABLE" to perform multiply_region().

  • "LOG:" This employs discrete (or "Zeph") logarithm tables to implement multiplication and division. The memory usage is roughly (3 × 2^w × w/8 bytes), so they are most useful when w is small, but they tolerate larger w than "TABLE." If the polynomial is not primitive (see section 6.1.2), then you cannot use "LOG" as an implementation. In that case, gf_init_hard() or create_gf_from_argv() will fail.

  • "LOG_ZERO:" Discrete logarithm tables which include extra room for zero entries. This more than doubles the memory consumption to remove an if statement (please see [GMS08] or The Paper for more description). It doesn’t really make a huge deal of difference in performance.

  • "LOG_ZERO_EXT:" This expends even more memory to remove another if statement. Again, please see The Paper for an explanation. As with "LOG_ZERO," the performance difference is negligible.

  • "SHIFT:" Implementation straight from the definition of Galois Field multiplication, by shifting and XOR-ing, then reducing the product using the polynomial. This is slooooooooow, so we don’t recommend you use it.

  • "CARRY_FREE:" This is identical to "SHIFT," however it leverages the SSE instruction PCLMUL to perform carry-free multiplications in single instructions. As such, it is the fastest way to perform multiplication for large values of w when that instruction is available. Its performance depends on the polynomial used. See The Paper for details, and see section 7.7 below for the speedups available when w = 16 and w = 32 if you use a different polynomial than the default one.

  • "BYTWO_p:" This implements multiplication by successively multiplying the product by two and selectively XOR-ing the multiplicand. See The Paper for more detail. It can leverage Anvin’s optimization that multiplies 64 and 128 bits of numbers in GF(2^w) by two with just a few instructions. The SSE version requires SSE2.

  • "BYTWO_b:" This implements multiplication by successively multiplying the multiplicand by two and selectively XOR-ing it into the product. It can also leverage Anvin's optimization, and it has the feature that when you're multiplying a region by a very small constant (like 2), it can terminate the multiplication early. As such, if you are multiplying regions of bytes by two (as in the Linux RAID-6 Reed-Solomon code [Anv09]), this is the fastest of the techniques, regardless of the value of w. The SSE version requires SSE2.

  • "SPLIT:" Split multiplication tables (like the LR tables in [GMS08], or the SIMD tables for w ≤ 8 in [LHy08, Anv09, PGM13b]). This argument must be followed by two more arguments, wa and wb, which are the index sizes of the sub-tables. This implementation reduces the size of the table from "TABLE," but requires multiple




    With the exception of "COMPOSITE", only one multiplication technique can be provided for a given Galois Field instance. Composite fields may use composite fields as their base fields, in which case the specification will be recursive.
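To make one of the techniques above concrete, here is a minimal sketch of the idea behind "LOG" for w = 8. It is not the library's code: the names are hypothetical, and the polynomial 0x11d is an assumption chosen because the technique requires a primitive polynomial. The antilog table is stored twice in a row so that the sum of two logarithms can be used as an index without a modulo operation.

```c
#include <stdint.h>

#define GF8_POLY 0x11d   /* assumed primitive polynomial for this sketch */

static uint8_t gf8_log[256];      /* gf8_log[v] = k such that 2^k == v (v != 0) */
static uint8_t gf8_antilog[510];  /* gf8_antilog[k] = 2^k, stored twice in a row */

/* Fill both tables by enumerating the powers of 2. */
void gf8_log_init(void)
{
    uint32_t value = 1;
    for (int k = 0; k < 255; k++) {
        gf8_log[value] = (uint8_t) k;
        gf8_antilog[k] = (uint8_t) value;
        gf8_antilog[k + 255] = (uint8_t) value;
        value <<= 1;
        if (value & 0x100) value ^= GF8_POLY;
    }
}

/* a * b: add the logarithms, then look up the antilog. */
uint8_t gf8_mult_log(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) return 0;   /* zero has no logarithm */
    return gf8_antilog[gf8_log[a] + gf8_log[b]];
}

/* a / b: subtract the logarithms (offset by 255 to stay non-negative).
   The result is undefined when b == 0. */
uint8_t gf8_div_log(uint8_t a, uint8_t b)
{
    if (a == 0) return 0;
    return gf8_antilog[gf8_log[a] + 255 - gf8_log[b]];
}
```

The two tables occupy 766 bytes, which lines up with the rough 3 × 2^w × w/8 estimate given for "LOG." The checks for zero are the kind of if statement that, according to the descriptions above, "LOG_ZERO" and "LOG_ZERO_EXT" spend extra memory to remove; the exact table layouts those variants use are described in The Paper and are not reproduced here.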




    6.1.4       Changing the Division Technique

    There are two techniques for division that may be set with "-d". If "-d" is not specified, then appropriate defaults are employed. For example, when the multiplication technique is "TABLE," a table is created for division as well as multiplication. When "LOG" is specified, the logarithm tables are used for division. With "COMPOSITE," a special variant of Euclid's algorithm is employed that performs division using multiplication and division in the base field. Otherwise, Euclid's algorithm is used. Please see The Paper for a description of Euclid's algorithm applied to Galois Fields.

    If you use "-d", you must also specify the multiplication technique with "-m."

    To force Euclid's algorithm instead of the defaults, you may specify it with "-d EUCLID." If instead, you would rather convert elements of a Galois Field to a binary matrix and find an element's inverse by inverting the matrix, then specify "-d MATRIX." In all of our tests, "MATRIX" is slower than "EUCLID." "MATRIX" is also not defined for w > 32.
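For readers who want to see what division by Euclid's algorithm looks like, the following is a minimal textbook-style sketch of the extended Euclidean algorithm over binary polynomials, followed by division as multiplication by the inverse. It is an illustration only and is not taken from GF-Complete: the function names are hypothetical, and the sketch assumes w ≤ 16, an irreducible polynomial with its x^w bit included, and a non-zero divisor.

```c
#include <stdint.h>

/* Degree of a binary polynomial stored as a bit pattern; -1 for zero. */
int poly_degree(uint32_t p)
{
    int d = -1;
    while (p != 0) { d++; p >>= 1; }
    return d;
}

/* Shift-and-XOR multiplication in GF(2^w), w <= 16 (for this sketch). */
uint32_t gf_mult_shift(uint32_t a, uint32_t b, int w, uint32_t poly)
{
    uint32_t product = 0;
    for (int i = 0; i < w; i++)
        if (b & ((uint32_t) 1 << i)) product ^= a << i;
    for (int i = 2 * w - 2; i >= w; i--)
        if (product & ((uint32_t) 1 << i)) product ^= poly << (i - w);
    return product;
}

/* Inverse of a (a != 0) by the extended Euclidean algorithm.
   Invariants, modulo poly: g1 * a == u and g2 * a == v. */
uint32_t gf_inverse_euclid(uint32_t a, uint32_t poly)
{
    uint32_t u = a, v = poly;
    uint32_t g1 = 1, g2 = 0;
    while (u != 1) {
        int j = poly_degree(u) - poly_degree(v);
        if (j < 0) {                  /* keep u as the higher-degree one */
            uint32_t t;
            t = u;  u = v;   v = t;
            t = g1; g1 = g2; g2 = t;
            j = -j;
        }
        u ^= v << j;                  /* cancel the leading term of u */
        g1 ^= g2 << j;                /* apply the same step to g1    */
    }
    return g1;
}

/* a / b computed as a * (1 / b). */
uint32_t gf_divide(uint32_t a, uint32_t b, int w, uint32_t poly)
{
    return gf_mult_shift(a, gf_inverse_euclid(b, poly), w, poly);
}
```

Under these assumptions, gf_divide(3, 8, 4, 0x13) and gf_divide(9, 8, 4, 0x19) both return 2, which agrees with the gf_div transcript in section 6.1.2.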

    6.1.5     Changing the Region Technique

    The following are the region multiplication options ("-r"):


    It is possible to combine region multiplication options. This is fully supported as long as gf_methods has the combination listed. If multiple region options are required, they should be specified independently (as flags for gf_init_hard() and independent options for command-line tools and create_gf_from_argv()).

    6.2    Determining Supported Techniques with gf_methods

    The program gf_methods prints a list of supported methods on standard output. It is called as follows:

    ./gf_methods w -BADC -LUMDRB

    The first argument is w, which may be any legal value of w. The second argument has the following flags:

    You may specify multiple of these as the second argument. If you include both "B" and "A," then it uses the last one specified.

    The last argument determines the output format of gf_methods. If it is "L," then it simply lists methods. If it is "U," then the output contains gf_unit commands for each of the methods. For the others, the output contains time_tool.sh commands for Multiplication, Division, Region multiplications with multiple buffer sizes, and the Best region multiplication.

    gf_methods enumerates combinations of flags, and calls create_gf_from_argv() to see if the combinations are supported. Although it enumerates a large number of combinations, it doesn't enumerate all possible parameters for "SPLIT," "GROUP" or "COMPOSITE."

    Some examples of calling gf_methods are shown below in section 6.3.2.




    6.3 Testing with gf_unit, gf_time, and time_tool.sh

    gf_unit and gf_time may be used to verify that a combination of arguments works correctly and efficiently on your platform. If you plan to stray from the defaults, it is probably best to run both tools to ensure there are no issues with your environment. gf_unit will run a set of unit tests based on the arguments provided to the tool, and gf_time will time Galois Field methods based on the provided arguments.
    The usage of gf_unit is:

    gf_unit w tests seed method

    The usage of gf_time is:

    gf_time w tests seed buffer-size iterations method

    The seed is an integer; negative one uses the current time. The tests are specified by a listing of characters. The following tests are supported (all are supported by gf_time; only 'A', 'S' and 'R' are supported by gf_unit):

    Here are some examples of calling gf_unit and gf_time to verify that "-m SPLIT 32 4 -r ALTMAP -" works in GF(2^32), and to get a feel for its performance. First, we go to the test directory and call gf_unit:



    UNIX> cd test
    UNIX> ./gf_unit 32 A -1 -m SPLIT 32 4 -r ALTMAP -
    Args: 32 A -1 -m SPLIT 32 4 -r ALTMAP - / size (bytes): 684
    UNIX>

    gf_unit reports on the arguments and how many bytes the gf_t consumes. If it discovers any problems or inconsistencies with multiplication, division or region multiplication, it will report them. Here, there are no problems. Next, we move to the tools directory and run performance tests on a 10K buffer, with 10,000 iterations of each test:

    UNIX> cd ../tools
    UNIX> ./gf_time 32 A -1 10240 10000 -m SPLIT 32 4 -r ALTMAP -
    Seed: 1388435794
    Multiply: 4.090548 s Mops: 24.414 5.968 Mega-ops/s
    Divide: 37.794962 s Mops: 24.414 0.646 Mega-ops/s
    Inverse: 33.709875 s Mops: 24.414 0.724 Mega-ops/s
    Region-Random: XOR: 0 0.035210 s MB: 97.656 2773.527 MB/s
    Region-Random: XOR: 1 0.036081 s MB: 97.656 2706.578 MB/s
    Region-By-Zero:XOR: 0 0.003199 s MB: 97.656 30523.884 MB/s
    Region-By-Zero: XOR: 1 0.000626 s MB: 97.656 156038.095 MB/s
    Region-By-One: XOR: 0 0.003810 s MB: 97.656 25628.832 MB/s
    Region-By-One: XOR: 1 0.008363 s MB: 97.656 11677.500 MB/s
    Region-By-Two: XOR: 0 0.032942 s MB: 97.656 2964.486 MB/s
    Region-By-Two: XOR: 1 0.033488 s MB: 97.656 2916.153 MB/s
    UNIX>

    The first column of output displays the name of the test performed. Region tests will test with and without the XOR flag being set (see Section 4.3 for an example). The second column displays the total time the test took to complete measured in seconds (s). The third column displays the size of the test measured in millions of operations (Mops) for single tests and in Megabytes (MB) for the region tests. The final column displays the speed of the tests calculated from the second and third columns, and is where you should look to get an idea of a method's performance.
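The relationship between the columns can be checked with a little arithmetic. The sketch below uses hypothetical helper names and is not part of the tools; it assumes that "MB" in the output means 2^20 bytes, which is an inference from the numbers in the transcript rather than something the output states.

```c
#include <stdint.h>

/* Size of a region test in MB (assumed here to be 2^20 bytes):
   the buffer size multiplied by the number of iterations. */
double region_test_megabytes(uint64_t buffer_bytes, uint64_t iterations)
{
    return (double) (buffer_bytes * iterations) / (1024.0 * 1024.0);
}

/* Speed in the final column: the size column divided by the time column. */
double test_speed(double size, double seconds)
{
    return size / seconds;
}
```

With a 10240-byte buffer and 10,000 iterations, region_test_megabytes() gives about 97.656, the size reported for every region test above. Dividing 97.656 by the 0.035210 s of the first Region-Random test gives roughly 2773.5 MB/s, and dividing 24.414 by the 4.090548 s of the Multiply test gives roughly 5.968 Mega-ops/s, both in agreement with the final column. The single-operation size of 24.414 is one quarter of 97.656, which appears consistent with one operation per 4-byte word when w = 32.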

    If the output of gf_unit and gf_time are to your satisfaction, you can incorporate the method into application code using create_gf_from_argv() or gf_init_hard().

    The performance of "Region-By-Zero" and "Region-By-One" will not change from test to test, as all methods make the same calls for these. "Region-By-Zero" with "XOR: 1" does nothing except set up the tests. Therefore, you may use it as a control.

    6.3.1       time_tool.sh

    Finally, the shell script time_tool.sh makes a bunch of calls to gf_time to give a rough estimate of performance. It is called as follows:

    usage sh time_tool.sh M|D|R|B w method

    The values for the first argument are MDRB, for Multiplication, Division, Region multiplications with multiple buffer sizes, and the Best region multiplication. For the example above, let's call time_tool.sh to get a rough idea of performance:



    UNIX> sh time_tool.sh M 32 -m SPLIT 32 4 -r ALTMAP -
    M speed (MB/s): 6.03 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    UNIX> sh time_tool.sh D 32 -m SPLIT 32 4 -r ALTMAP -
    D speed (MB/s): 0.65 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    UNIX> sh time_tool.sh R 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 16K (MB/s): 3082.91 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 32K (MB/s): 3529.07 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 64K (MB/s): 3749.94 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 128K (MB/s): 3861.27 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 512K (MB/s): 3820.82 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 1M (MB/s): 3737.41 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 2M (MB/s): 3002.90 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Buffer-Size: 4M (MB/s): 2760.77 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    Region Best (MB/s): 3861.27 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    UNIX> sh time_tool.sh B 32 -m SPLIT 32 4 -r ALTMAP -
    Region Best (MB/s): 3929.09 W-Method: 32 -m SPLIT 32 4 -r ALTMAP -
    UNIX>

    We say that time_tool.sh is "rough" because it tries to limit each test to 5 ms or less. Thus, the time granularity is fine, which means that the numbers may not be as precise as they could be were the time granularity to be coarse. When in doubt, you should make your own calls to gf_time with a lot of iterations, so that startup costs and roundoff error may be minimized.


    6.3.2       An example of gf_methods and time_tool.sh



    Let's give an example of how some of these components fit together. Suppose we want to explore the basic techniques in GF(2^32). First, let's take a look at what gf_methods suggests as "basic" methods:

    UNIX> gf_methods 32 -B -L
    w=32: -
    w=32: -m GROUP 4 8 -
    w=32: -m SPLIT 32 4 -
    w=32: -m SPLIT 32 4 -r ALTMAP -
    w=32: -m SPLIT 32 8 -
    w=32: -m SPLIT 8 8 -
    w=32: -m COMPOSITE 2 - -
    w=32: -m COMPOSITE 2 - -r ALTMAP -
    UNIX>

    Note that this is on my old MacBook Pro, which doesn't support PCLMUL, so "CARRY_FREE" is not included as an option. Now, let's run the unit tester on these to make sure they work, and to see their memory consumption:



    UNIX> gf_methods 32 -B -U
    ../test/gf_unit 32 A -1 -
    ../test/gf_unit 32 A -1 -m GROUP 4 8 -
    ../test/gf_unit 32 A -1 -m SPLIT 32 4 -
    ../test/gf_unit 32 A -1 -m SPLIT 32 4 -r ALTMAP -
    ../test/gf_unit 32 A -1 -m SPLIT 32 8 -
    ../test/gf_unit 32 A -1 -m SPLIT 8 8 -
    ../test/gf_unit 32 A -1 -m COMPOSITE 2 - -
    ../test/gf_unit 32 A -1 -m COMPOSITE 2 - -r ALTMAP -
    UNIX> gf_methods 32 -B -U | sh
    Args: 32 A -1 - / size (bytes): 684
    Args: 32 A -1 -m GROUP 4 8 - / size (bytes): 1296
    Args: 32 A -1 -m SPLIT 32 4 - / size (bytes): 684
    Args: 32 A -1 -m SPLIT 32 4 -r ALTMAP - / size (bytes): 684
    Args: 32 A -1 -m SPLIT 32 8 - / size (bytes): 4268
    Args: 32 A -1 -m SPLIT 8 8 - / size (bytes): 1839276
    Args: 32 A -1 -m COMPOSITE 2 - - / size (bytes): 524648
    Args: 32 A -1 -m COMPOSITE 2 - -r ALTMAP - / size (bytes): 524648
    UNIX>

    As anticipated, "SPLIT 8 8" consumes quite a bit of memory! Now, let's see how well they perform with both single multiplications and region multiplications:



    UNIX> gf_methods 32 -B -M
    sh time_tool.sh M 32 -
    sh time_tool.sh M 32 -m GROUP 4 8 -
    sh time_tool.sh M 32 -m SPLIT 32 4 -
    sh time_tool.sh M 32 -m SPLIT 32 4 -r ALTMAP -
    sh time_tool.sh M 32 -m SPLIT 32 8 -
    sh time_tool.sh M 32 -m SPLIT 8 8 -
    sh time_tool.sh M 32 -m COMPOSITE 2 -
    sh time_tool.sh M 32 -m COMPOSITE 2 - -r ALTMAP
    UNIX> gf_methods 32 -B -M | sh
    M speed (MB/s): 5.90 W-Method: 32
    M speed (MB/s): 14.09 W-Method: 32 -m GROUP 4 8
    M speed (MB/s): 5.60 W-Method: 32 -m SPLIT 32 4
    M speed (MB/s): 5.19 W-Method: 32 -m SPLIT 32 4 -r ALTMAP
    M speed (MB/s): 5.98 W-Method: 32 -m SPLIT 32 8
    M speed (MB/s): 22.10 W-Method: 32 -m SPLIT 8 8
    M speed (MB/s): 34.98 W-Method: 32 -m COMPOSITE 2 -
    M speed (MB/s): 34.16 W-Method: 32 -m COMPOSITE 2 - -r ALTMAP
    UNIX> gf_methods 32 -B -B | sh
    Region Best (MB/s): 2746.76 W-Method: 32
    Region Best (MB/s): 177.06 W-Method: 32 -m GROUP 4 8
    Region Best (MB/s): 2818.75 W-Method: 32 -m SPLIT 32 4
    Region Best (MB/s): 3818.21 W-Method: 32 -m SPLIT 32 4 -r ALTMAP
    Region Best (MB/s): 728.68 W-Method: 32 -m SPLIT 32 8
    Region Best (MB/s): 730.97 W-Method: 32 -m SPLIT 8 8
    Region Best (MB/s): 190.20 W-Method: 32 -m COMPOSITE 2 -
    Region Best (MB/s): 1837.99 W-Method: 32 -m COMPOSITE 2 - -r ALTMAP
    UNIX>

    The default is quite a bit slower than the best performing methods for both single and region multiplication. So why are the defaults the way that they are? As detailed at the beginning of this chapter, we strive for lower memory consumption, so we don't use "SPLIT 8 8," which consumes 1.75 MB. We don't implement alternate fields by default, which is why we don't use "COMPOSITE." Finally, we don't implement alternate mappings of memory by default, which is why we don't use "-m SPLIT 32 4 -r ALTMAP -."

    Of course, you may change these defaults if you please.

    Test question: Given the numbers above, it would appear that "COMPOSITE" yields the fastest performance of single multiplication, while "SPLIT 32 4" yields the fastest performance of region multiplication. Should I use two gf_t's in my application – one for single multiplication that uses "COMPOSITE," and one for region multiplication that uses "SPLIT 32 4?"

    The answer to this is "no." Why? Because composite fields are different from the "standard" fields, and if you mix these two gf_t's, then you are using different fields for single multiplication and region multiplication. Please read section 7.2 for a little more information on this.

    6.4      Calling gf_init_hard()

    We recommend that you use create_gf_from_argv() instead of gf_init_hard(). However, there are extra things that you can do with gf_init_hard(). Here's the prototype:

    int gf_init_hard(gf_t *gf,
                     int w,
                     int mult_type,
                     int region_type,
                     int divide_type,
                     uint64_t prim_poly,
                     int arg1,
                     int arg2,
                     GFP base_gf,
                     void *scratch_memory);


    The arguments mult_type, region_type and divide_type allow for the same specifications as above, except the types are integer constants defined in gf_complete.h:

    typedef enum { GF_MULT_DEFAULT,
                   GF_MULT_SHIFT,
                   GF_MULT_CARRY_FREE,
                   GF_MULT_GROUP,
                   GF_MULT_BYTWO_p,
                   GF_MULT_BYTWO_b,
                   GF_MULT_TABLE,
                   GF_MULT_LOG_TABLE,
                   GF_MULT_LOG_ZERO,
                   GF_MULT_LOG_ZERO_EXT,
                   GF_MULT_SPLIT_TABLE,
                   GF_MULT_COMPOSITE } gf_mult_type_t;

    #define GF_REGION_DEFAULT (0x0)
    #define GF_REGION_DOUBLE_TABLE (0x1)
    #define GF_REGION_QUAD_TABLE (0x2)
    #define GF_REGION_LAZY (0x4)
    #define GF_REGION_SSE (0x8)
    #define GF_REGION_NOSSE (0x10)
    #define GF_REGION_ALTMAP (0x20)
    #define GF_REGION_CAUCHY (0x40)

    typedef enum { GF_DIVIDE_DEFAULT,
                   GF_DIVIDE_MATRIX,
                   GF_DIVIDE_EUCLID } gf_division_type_t;

    You can mix the region types with bitwise or. The arguments to GF_MULT_GROUP, GF_MULT_SPLIT_TABLE and GF_MULT_COMPOSITE are specified in arg1 and arg2. GF_MULT_COMPOSITE also takes a base field in base_gf. The base field is itself a gf_t, which should have been created previously with create_gf_from_argv(), gf_init_easy() or gf_init_hard(). Note that this base_gf has its own base_gf member and can be a composite field itself.

    You can specify an alternate polynomial in prim_poly. For w ≤ 32, the leftmost one (the one in bit position w) is optional. If you omit it, it will be added for you. For w = 64, there's no room for that one, so you have to leave it off. For w = 128, your polynomial can only use the bottom-most 64 bits. Fortunately, the standard polynomial only uses those bits. If you set prim_poly to zero, the library selects the "standard" polynomial.
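
    To make the optional leading one concrete, here is a minimal sketch of the convention for w ≤ 32. The helper name poly_with_leading_one() is ours, chosen for illustration; it is not part of the GF-Complete API, and we make no claim about how the library performs this step internally.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GF-Complete function): return the polynomial
   with its x^w term made explicit. Only meaningful for 1 <= w <= 32. */
uint64_t poly_with_leading_one(uint64_t prim_poly, int w)
{
    assert(w >= 1 && w <= 32);
    return prim_poly | ((uint64_t)1 << w);
}
```

    Under this convention, 0x400007 and 0x100400007 name the same polynomial when w = 32. For w = 64 the x^64 term does not fit in a uint64_t, which is why it has to be left off.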

    Finally, scratch_memory is there in case you don't want gf_init_hard() to call malloc(). You may call gf_scratch_size() to find out how much extra memory each technique uses, and then you may pass it a pointer for it to use in scratch_memory. If you set scratch_memory to NULL, then the extra memory is allocated for you with malloc(). If you use gf_init_easy() or create_gf_from_argv(), or you use gf_init_hard() and set scratch_memory to NULL, then you should call gf_free() to free memory. If you use gf_init_hard() and use your own scratch_memory, you can still call gf_free(), and it will not do anything.

    Both gf_init_hard() and gf_scratch_size() return zero if the arguments don't specify a valid gf_t. When that happens, you can call gf_error() to print why the call failed.


    We'll give you one example of calling gf_init_hard(). Suppose you want to make a gf_init_hard() call to be equivalent to "-m SPLIT 16 4 -r SSE -r ALTMAP -" and you want to allocate the scratch space yourself. Then you'd do the following:



    gf_t gf;
    void *scratch;
    int size;

    size = gf_scratch_size(16, GF_MULT_SPLIT_TABLE,
                           GF_REGION_SSE | GF_REGION_ALTMAP,
                           GF_DIVIDE_DEFAULT,
                           16, 4);
    if (size == 0) { gf_error(); exit(1); } /* It failed. That shouldn't happen */
    scratch = (void *) malloc(size);
    if (scratch == NULL) { perror("malloc"); exit(1); }
    if (!gf_init_hard(&gf, 16, GF_MULT_SPLIT_TABLE,
                      GF_REGION_SSE | GF_REGION_ALTMAP,
                      GF_DIVIDE_DEFAULT,
                      0, 16, 4, NULL, scratch)) {
        gf_error();
        exit(1);
    }

    6.5     gf_size()

    You can call gf_size(gf_t *gf) to learn the memory consumption of the gf_t. It returns all memory consumed by the gf_t, including the gf_t itself, any scratch memory required by the gf_t, and the memory consumed by the sub-field if the field is "COMPOSITE." If you provided your own memory to gf_init_hard(), it does not report the size of this memory, but what the size should be, as determined by gf_scratch_size(). gf_unit prints out the return value of gf_size() on the given field.

    7   Further Information on Options and Algorithms

    7.1   Inlining Single Multiplication and Division for Speed

    Obviously, procedure calls are more expensive than single instructions, and the mechanics of multiplication in "TABLE" and "LOG" are pretty simple. For that reason, we support inlining for "TABLE" when w = 4 and w = 8, and for "LOG" when w = 16. We elaborate below.

    When w = 4, you may inline multiplication and division as follows. The following procedures return pointers to the multiplication and division tables respectively:



    uint8_t *gf_w4_get_mult_table(gf_t * gf);
    uint8_t *gf_w4_get_div_table(gf_t * gf);

    The macro GF_W4_INLINE_MULTDIV(table, a, b) then multiplies or divides a by b using the given table. This of course only works if the multiplication technique is "TABLE," which is the default for w = 4. If the multiplication technique is not "TABLE," then gf_w4_get_mult_table() will return NULL.
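
    To illustrate why a table lookup is cheap enough to inline, here is a self-contained sketch for w = 4. It does not use GF-Complete: the table layout (index (a << 4) | b), the polynomial x^4 + x + 1, and every name below are assumptions made for this example, so treat gf_complete.h as the authority on the real macro.

```c
#include <assert.h>
#include <stdint.h>

#define W4_POLY 0x13 /* x^4 + x + 1: our assumption for this sketch */

/* Bit-by-bit multiplication in GF(2^4); used only to fill the table. */
static uint8_t w4_slow_mult(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    assert(a < 16 && b < 16);
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10) a ^= W4_POLY; /* x^4 becomes x + 1 */
    }
    return p;
}

/* Fill a 256-entry table so that table[(a << 4) | b] holds a * b. */
void w4_build_mult_table(uint8_t table[256])
{
    for (int a = 0; a < 16; a++)
        for (int b = 0; b < 16; b++)
            table[(a << 4) | b] = w4_slow_mult((uint8_t)a, (uint8_t)b);
}

/* One load per product, with no procedure call. */
#define W4_TABLE_MULT(table, a, b) ((table)[((a) << 4) | (b)])
```

    Once the table is built, W4_TABLE_MULT(table, a, b) compiles to a single indexed load, which is the kind of saving inlining is after.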


    When w = 8, the procedures gf_w8_get_mult_table() and gf_w8_get_div_table(), and the macro GF_W8_INLINE_MULTDIV(table, a, b) work identically to the w = 4 case.

    When w = 16, the following procedures return pointers to the logarithm table, and the two inverse logarithm tables respectively:


    uint16_t *gf_w16_get_log_table(gf_t * gf);
    uint16_t *gf_w16_get_mult_alog_table(gf_t * gf);
    uint16_t *gf_w16_get_div_alog_table(gf_t * gf);

    The first inverse logarithm table works for multiplication, and the second works for division. They actually point to the same table, but to different places in the table. You may then use the macro GF_W16_INLINE_MULT(log, alog, a, b) to multiply a and b, and the macro GF_W16_INLINE_DIV(log, alog, a, b) to divide a and b. Make sure you use the alog table returned by gf_w16_get_mult_alog_table() for multiplication and the one returned by gf_w16_get_div_alog_table() for division. Here are some timings:



    UNIX> gf_time 4 M 0 10240 10240 -
    Seed: 0
    Multiply: 0.228860 s Mops: 100.000 436.949 Mega-ops/s
    UNIX> gf_inline_time 4 0 10240 10240
    Seed: 0
    Inline mult: 0.096859 s Mops: 100.000 1032.424 Mega-ops/s
    UNIX> gf_time 8 M 0 10240 10240 -
    Seed: 0
    Multiply: 0.228931 s Mops: 100.000 436.812 Mega-ops/s
    UNIX> gf_inline_time 8 0 10240 10240
    Seed: 0
    Inline mult: 0.114300 s Mops: 100.000 874.889 Mega-ops/s
    UNIX> gf_time 16 M 0 10240 10240 -
    Seed: 0
    Multiply: 0.193626 s Mops: 50.000 258.229 Mega-ops/s
    UNIX> gf_inline_time 16 0 10240 10240
    Seed: 0
    Inline mult: 0.310229 s Mops: 100.000 322.342 Mega-ops/s
    UNIX>
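
    The remark that the two inverse logarithm tables are the same table viewed from different places can be pictured with the sketch below. It is a standalone illustration, not the library's code: the doubled table, the offset of 65535, the polynomial 0x1100b and all function names are our assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define W16_POLY  0x1100b /* x^16 + x^12 + x^3 + x + 1, assumed primitive */
#define W16_ORDER 65535   /* number of nonzero elements in GF(2^16) */

static uint16_t w16_log[W16_ORDER + 1];
static uint16_t w16_alog[2 * W16_ORDER];

/* Fill the log table and a doubled antilog table: alog[i] = 2^(i mod 65535). */
void w16_build_tables(void)
{
    uint32_t x = 1;
    for (int i = 0; i < W16_ORDER; i++) {
        w16_log[x] = (uint16_t)i;
        w16_alog[i] = w16_alog[i + W16_ORDER] = (uint16_t)x;
        x <<= 1;
        if (x & 0x10000) x ^= W16_POLY;
    }
    assert(x == 1); /* the powers of 2 must come back to 1 */
}

/* Two pointers into one table: the start for products, the middle for quotients. */
const uint16_t *w16_mult_alog(void) { return w16_alog; }
const uint16_t *w16_div_alog(void)  { return w16_alog + W16_ORDER; }

uint16_t w16_mult(uint16_t a, uint16_t b)
{
    if (a == 0 || b == 0) return 0;
    return w16_mult_alog()[(int)w16_log[a] + (int)w16_log[b]];
}

uint16_t w16_div(uint16_t a, uint16_t b)
{
    assert(b != 0);
    if (a == 0) return 0;
    /* The difference may be negative; starting mid-table absorbs that. */
    return w16_div_alog()[(int)w16_log[a] - (int)w16_log[b]];
}
```

    Because the sum of two logarithms never exceeds 2 * 65534 and the difference never drops below -65534, neither lookup needs a modulo operation, which keeps the inlined code short.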

    7.2     Using different techniques for single and region multiplication

    You may want to "mix and match" the techniques. For example, suppose you'd like to use "-m SPLIT 8 8" for multiply() in GF(2^32), because it's fast, and you don't mind consuming all of that space for tables. However, for multiply_region(), you'd like to use "-m SPLIT 32 4 -r ALTMAP," because that's the fastest way to implement multiply_region(). Unfortunately, there is no way to create a gf_t that does this combination. In this case, you should simply create two gf_t's, and use one for multiply() and the other for multiply_region(). All of the implementations may be used interchangeably with the following exceptions:


    7.5    Arguments to "GROUP"

    The "GROUP" multiplication option takes two arguments, gs and gr. It implements multiplication in the same manner as "SHIFT," except it uses a table of size 2^gs to perform gs shifts at a time, and a table of size 2^gr to perform gr reductions at a time. The program gf_methods only prints the options 4 4 and 4 8 as arguments for "GROUP." However, other values of gs and gr are legal and sometimes desirable:

    1. For w ≤ 32 and w = 64, any values of gs and gr may be used, so long as they are less than or equal to w and so long as the tables fit into memory. There are four exceptions to this, listed below.

    2. For w = 4, "GROUP" is not supported.

    3. For w = 8, "GROUP" is not supported.

    4. For w = 16, "GROUP" is only supported for gs = gr = 4.

    5. For w = 128, "GROUP" only supports gs = 4 and gr ∈ {4, 8, 16}.

    The way that gs and gr impact performance is as follows. The "SHIFT" implementation works by performing a carry-free multiplication in w steps, and then performing reduction in w steps. In "GROUP," the carry-free multiplication is reduced to w/gs steps, and the reduction is reduced to w/gr steps. Both require tables. The table for the carry-free multiplication must be created at the beginning of each multiply() or multiply_region(), while the table for reduction is created when the gf_t is initialized. For that reason, it makes sense for gr to be bigger than gs.
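
    The difference in step counts can be sketched in a few lines. The code below is our own illustration, not the library's implementation: shift_mult() mirrors the "SHIFT" description, and group4_mult() replaces the w shifting steps with w/4 table lookups (gs = 4). To keep the sketch short, the reduction is left bit-by-bit in both, so the gr table is not shown. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a carry-free product modulo poly (leading x^w bit included),
   clearing one high bit per step. */
static uint32_t reduce_bitwise(uint64_t p, uint64_t poly, int w)
{
    for (int i = 2 * w - 2; i >= w; i--)
        if (p & ((uint64_t)1 << i))
            p ^= poly << (i - w);
    return (uint32_t)p;
}

/* "SHIFT"-style: w shift-and-XOR steps, then reduction. */
uint32_t shift_mult(uint32_t a, uint32_t b, uint64_t poly, int w)
{
    uint64_t p = 0;
    assert(w >= 1 && w <= 32);
    for (int i = 0; i < w; i++)
        if (a & ((uint32_t)1 << i))
            p ^= (uint64_t)b << i;
    return reduce_bitwise(p, poly, w);
}

/* "GROUP"-style shifting with gs = 4: w/4 lookups instead of w steps. */
uint32_t group4_mult(uint32_t a, uint32_t b, uint64_t poly, int w)
{
    uint64_t shift_table[16], p = 0;
    assert(w >= 4 && w <= 32 && w % 4 == 0);
    /* The table depends on b, so it is rebuilt on every call. */
    shift_table[0] = 0;
    for (int i = 1; i < 16; i++)
        shift_table[i] = (shift_table[i >> 1] << 1) ^ ((i & 1) ? b : 0);
    /* Consume a four bits at a time, most significant nibble first. */
    for (int i = w - 4; i >= 0; i -= 4)
        p = (p << 4) ^ shift_table[(a >> i) & 0xf];
    return reduce_bitwise(p, poly, w);
}
```

    Both functions expect a and b to be smaller than 2^w. The per-call cost of filling shift_table is why small gs suits multiply(), while multiply_region() can spread that cost over a whole region.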

    To give a flavor for the impact of these arguments, Figure 3 shows the performance of varying gs and gr for single multiplication and region multiplication respectively, in GF(2^32) and GF(2^64). As the graphs demonstrate, multiply() performs better with smaller values of gs, while multiply_region() amortizes the creation of the shifting table, and can tolerate larger values of gs. When gs equals gr, there are some optimizations that we hand-encode. These can be seen clearly in the multiply_region() graphs.


    Figure 3: The performance of multiply() and multiply_region() using "GROUP," and varying the arguments
    gs and gr. All graphs are heat maps with black equaling zero. The region size is 100KB.

    7.6  Considerations with "COMPOSITE"

    As mentioned above, using "ALTMAP" with "COMPOSITE" allows multiply_region() to recursively call multiply_region(), rather than simply calling multiply() on every word in the region. The difference can be pronounced:

    gf_time 32 G 0 10240 10240 -m COMPOSITE 2 - - Speed = 322 MB/s
    gf_time 32 G 0 10240 10240 -m COMPOSITE 2 - -r ALTMAP - Speed = 3,368 MB/s
    gf_time 32 G 0 10240 10240 -m COMPOSITE 2 -m SPLIT 16 4 -r ALTMAP - -r ALTMAP - Speed = 3,925 MB/s


    There is support for performing multiply() inline for the "TABLE" implementations for w ∈ {4, 8} and for the "LOG" implementation for w = 16 (see section 7.1). These are leveraged by multiply() in "COMPOSITE," and by multiply_region() if you are not using "ALTMAP." To demonstrate this, in the table below, you can see that the performance of multiply() with "SPLIT 8 4" is 88 percent as fast as the default in w = 8 (which is "TABLE"). When you use each as a base field for "COMPOSITE" with w = 16, the one with "SPLIT 8 4" is now just 37 percent as fast. The difference is the inlining of multiplication in the base field when "TABLE" is employed:



    gf_time 8 M 0 1048576 100 - Speed = 501 Mega-ops/s
    gf_time 8 M 0 1048576 100 -m SPLIT 8 4 - Speed = 439 Mega-ops/s
    gf_time 16 M 0 1048576 100 -m COMPOSITE 2 - - Speed = 207 Mega-ops/s
    gf_time 16 M 0 1048576 100 -m COMPOSITE 2 -m SPLIT 8 4 - - Speed = 77 Mega-ops/s


    You can keep making recursive definitions of composite fields if you want. For example, this one's not too slow for region operations (641 MB/s):

    gf_time 128 G 0 1048576 100 -m COMPOSITE 2 -m COMPOSITE 2 -m COMPOSITE 2
    -m SPLIT 16 4 -r ALTMAP - -r ALTMAP - -r ALTMAP - -r ALTMAP -

    Please see section 7.8.1 for a discussion of polynomials in composite fields.

    7.7       "CARRY_FREE" and the Primitive Polynomial

    If your machine supports the PCLMUL instruction, then we leverage that in "CARRY_FREE." This implementation first performs a carry free multiplication of two w-bit numbers, which yields a 2w-bit number. It does this with one PCLMUL instruction. To reduce the 2w-bit number back to a w-bit number requires some manipulation of the polynomial. As it turns out, if the polynomial has a lot of contiguous zeroes following its leftmost one, the number of reduction steps may be minimized. For example, with w = 32, we employ the polynomial 0x100400007, because that is what other libraries employ. This only has 9 contiguous zeros following the one, which means that the reduction takes four steps. If we instead use 0x1000000c5, which has 24 contiguous zeros, the reduction takes just two steps. You can see the difference in performance:

    gf_time 32 M 0 1048576 100 -m CARRY_FREE - Speed = 48 Mega-ops/s
    gf_time 32 M 0 1048576 100 -m CARRY_FREE -p 0xc5 - Speed = 81 Mega-ops/s


    This is relevant for w = 16 and w = 32, where the "standard" polynomials are sub-optimal with respect to "CARRY_FREE." For w = 16, the polynomial 0x1002d has the desired property. It’s less important, of course, with w = 16, because "LOG" is so much faster than CARRY_FREE.
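
    To see why the run of zeros matters, consider the following sketch. It is a portable stand-in rather than the library's PCLMUL code: clmul32() emulates the carry-free multiply in software, and fold_reduce32() reduces by repeatedly multiplying the high half by the low bits of the polynomial and XORing the result into the low half, counting the folds. All names are ours, and the library's actual instruction sequence may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for PCLMUL: carry-free product of two 32-bit values. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t p = 0;
    for (int i = 0; i < 32; i++)
        if (a & ((uint32_t)1 << i))
            p ^= (uint64_t)b << i;
    return p;
}

/* Reduce p modulo x^32 + low. Since x^32 equals low in the field, the high
   half times low can be folded into the low half. The number of folds that
   were needed is stored in *steps. */
uint32_t fold_reduce32(uint64_t p, uint32_t low, int *steps)
{
    assert(steps != NULL);
    *steps = 0;
    while (p >> 32) {
        p = clmul32((uint32_t)(p >> 32), low) ^ (p & 0xffffffffu);
        (*steps)++;
    }
    return (uint32_t)p;
}
```

    For the largest possible product, x^62, this loop folds four times with low = 0x400007 and twice with low = 0xc5, which matches the step counts quoted above: each fold shortens the high half by the length of the zero run plus one.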

    7.8   More on Primitive Polynomials

    7.8.1   Primitive Polynomials that are not Primitive

    The library is willing to work with most polynomials, even if they are not primitive or irreducible. For example, the polynomial x^4 + x^3 + x^2 + x + 1 is irreducible, and therefore generates a valid Galois Field for GF(2^4). However, it is not primitive, because 2^5 = 1. For that reason, if you use this polynomial, you cannot use the "LOG" method. The other methods will work fine:

    UNIX> gf_mult 2 2 4 -p 0xf -
    4
    UNIX> gf_mult 4 2 4 -p 0xf -
    8
    UNIX> gf_mult 8 2 4 -p 0xf -
    15
    UNIX> gf_mult 15 2 4 -p 0xf -
    1
    UNIX> gf_div 1 15 4 -p 0xf -
    2
    UNIX> gf_div 1 15 4 -p 0xf -m LOG -
    usage: gf_div a b w [method] - does division of a and b in GF(2^w)
    Bad Method Specification: Cannot use Log tables because the polynomial is not primitive.
    UNIX>
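
    The claim that 2^5 = 1 under this polynomial is easy to check by hand or with a few lines of code. The sketch below is ours and independent of the library: it repeatedly multiplies by 2 and reports how many steps it takes to return to 1. A degree-w polynomial is primitive exactly when that count is 2^w - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest k > 0 with 2^k = 1 modulo the degree-w polynomial poly
   (leading bit included), or 0 if no power of 2 equals 1. */
int order_of_two(uint32_t poly, int w)
{
    uint32_t x = 1;
    assert(w >= 2 && w <= 16);
    for (int k = 1; k < (1 << w); k++) {
        x <<= 1;
        if (x & ((uint32_t)1 << w)) x ^= poly;
        if (x == 1) return k;
    }
    return 0;
}
```

    Here order_of_two(0x1f, 4) is 5, so the powers of 2 only reach 5 of the 15 nonzero elements and a logarithm table cannot cover the field. With x^4 + x + 1, order_of_two(0x13, 4) is 15.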

    If a polynomial is reducible, then it does not define a Galois Field, but instead a ring. GF-Complete attempts to work here where it can; however certain parts of the library will not work:


    1. Division is a best-effort service. The problem is that quotients are often not unique. If divide() returns a non-zero number, then that number will be a valid quotient, but it may be one of many. If the multiplication technique is "TABLE," then if a quotient exists, one is returned. Otherwise, zero is returned. Here are some examples: the polynomial x^4 + 1 is reducible, and therefore produces a ring. Below, we see that with this polynomial, 1*6 = 6 and 14*6 = 6. Therefore, 6/6 has two valid quotients: 1 and 14. GF-Complete returns 14 as the quotient:

       UNIX> gf_mult 1 6 4 -p 0x1 -
       6
       UNIX> gf_mult 14 6 4 -p 0x1 -
       6
       UNIX> gf_div 6 6 4 -p 0x1 -
       14
       UNIX>

    2. When "EUCLID" is employed for division, it uses the extended Euclidean algorithm for GCD to find a number's inverse, and then it multiplies by the inverse. The problem is that not all numbers in a ring have inverses. For example, in the above ring, there is no number a such that 6a = 1. Thus, 6 has no inverse. This means that even though 6/6 has quotients in this ring, "EUCLID" will fail on it because it is unable to find the inverse of 6. It will return 0:

       UNIX> gf_div 6 6 4 -p 0x1 -m TABLE -d EUCLID -
       0
       UNIX>

    3. Inverses only work if a number has an inverse. Inverses may not be unique.

    4. "LOG" will not work. In cases where the default would be "LOG," "SHIFT" is used instead.

    Due to problems with division, gf_unit may fail on a reducible polynomial. If you are determined to use such a polynomial, don't let this error discourage you.
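
    The ring behaviour described in the list above can be reproduced with a brute-force search. The following sketch is ours and does not call GF-Complete: it multiplies modulo x^4 + 1 and enumerates every q with q * b = a.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply modulo x^4 + 1 (0x11). The polynomial is reducible, so this
   is arithmetic in a ring, not a field. */
static uint8_t ring_mult(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    assert(a < 16 && b < 16);
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10) a ^= 0x11; /* x^4 becomes 1 */
    }
    return p;
}

/* Store every q with q * b == a in out[] and return how many were found. */
int ring_quotients(uint8_t a, uint8_t b, uint8_t out[16])
{
    int n = 0;
    for (uint8_t q = 0; q < 16; q++)
        if (ring_mult(q, b) == a)
            out[n++] = q;
    return n;
}
```

    Calling ring_quotients(6, 6, out) finds exactly two quotients, 1 and 14, while ring_quotients(1, 6, out) finds none, which is the missing inverse that defeats "EUCLID."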

    7.8.2 Default Polynomials for Composite Fields

    GF-Complete will successfully select a default polynomial in the following composite fields:

    7.8.3 The Program gf_poly for Verifying Irreducibility of Polynomials

    The program gf_poly uses the Ben-Or algorithm [GP97] to determine whether a polynomial with coefficients in GF(2^w) is reducible. Its syntax is:

    gf_poly w method power:coef power:coef ...

    You can use it to test for irreducible polynomials with binary coefficients by specifying w = 1. For example, from the discussion above, we know that x^4 + x + 1 and x^4 + x^3 + x^2 + x + 1 are both irreducible, but x^4 + 1 is reducible. gf_poly confirms:


    UNIX> gf_poly 1 - 4:1 1:1 0:1
    Poly: x^4 + x + 1
    Irreducible.
    UNIX> gf_poly 1 - 4:1 3:1 2:1 1:1 0:1
    Poly: x^4 + x^3 + x^2 + x + 1
    Irreducible.
    UNIX> gf_poly 1 - 4:1 0:1
    Poly: x^4 + 1
    Reducible.
    UNIX>

    For composite fields GF((2^l)^2), we are looking for a value s such that x^2 + sx + 1 is irreducible. That value depends on the base field. For example, for the default field GF(2^32), a value of s = 2 makes the polynomial irreducible. However, if the polynomial 0xc5 is used (so that PCLMUL is fast - see section 7.7), then s = 2 yields a reducible polynomial, but s = 3 yields an irreducible one. You can use gf_poly to help verify these things, and to help define s if you need to stray from the defaults:


    UNIX> gf_poly 32 - 2:1 1:2 0:1
    Poly: x^2 + (0x2)x + 1
    Irreducible.
    UNIX> gf_poly 32 -p 0xc5 - 2:1 1:2 0:1
    Poly: x^2 + (0x2)x + 1
    Reducible.
    UNIX> gf_poly 32 -p 0xc5 - 2:1 1:3 0:1
    Poly: x^2 + (0x3)x + 1
    Irreducible.
    UNIX>

    gf_unit does random sampling to test for problems. In particular, it chooses a random a and a random b, multiplies them, and then tests the result by dividing it by a and b. When w is large, this sampling does not come close to providing complete coverage to check for problems. In particular, if the polynomial is reducible, there is a good chance that gf_unit won't discover any problems. For example, the following gf_unit call does not flag any problems, even though the polynomial is reducible.


    UNIX> gf_unit 64 A 0 -m COMPOSITE 2 -p 0xc5 - -p 2 -
    UNIX>

    How can we demonstrate that this particular field has a problem? Well, when the polynomial is 0xc5, we can factor x^2 + 2x + 1 as (x + 0x7f6f95f9)(x + 0x7f6f95fb). Thus, in the composite field, when we multiply 0x17f6f95f9 by 0x17f6f95fb, we get zero. That's the problem:

    UNIX> gf_mult 7f6f95f9 7f6f95fb 32h -p 0xc5 -
    1
    UNIX> gf_mult 17f6f95f9 17f6f95fb 64h -m COMPOSITE 2 -p 0xc5 - -p 2 -
    0
    UNIX>
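
    The two products above can be reproduced without the library. The sketch below is our own construction under stated assumptions: base_mult() is a plain shift-and-reduce multiply in GF(2^32) modulo x^32 + low, and composite_mult() treats a 64-bit value as a1*x + a0 (high and low halves) and reduces with x^2 = s*x + 1. The names and the word layout are ours, not GF-Complete's.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-reduce multiplication in GF(2^32) modulo x^32 + low. */
static uint32_t base_mult(uint32_t a, uint32_t b, uint32_t low)
{
    uint32_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (a << 1) ^ ((a & 0x80000000u) ? low : 0);
    }
    return p;
}

/* Multiply a1*x + a0 by b1*x + b0 modulo x^2 + s*x + 1. */
uint64_t composite_mult(uint64_t a, uint64_t b, uint32_t s, uint32_t low)
{
    uint32_t a1 = (uint32_t)(a >> 32), a0 = (uint32_t)a;
    uint32_t b1 = (uint32_t)(b >> 32), b0 = (uint32_t)b;
    uint32_t hh, c1, c0;
    assert(s != 0); /* s = 0 gives x^2 + 1 = (x + 1)^2, always reducible */
    hh = base_mult(a1, b1, low); /* coefficient of x^2 */
    c1 = base_mult(a1, b0, low) ^ base_mult(a0, b1, low) ^ base_mult(s, hh, low);
    c0 = base_mult(a0, b0, low) ^ hh;
    return ((uint64_t)c1 << 32) | c0;
}
```

    With low = 0xc5 and s = 2, base_mult(0x7f6f95f9, 0x7f6f95fb, 0xc5) is 1 and composite_mult(0x17f6f95f9, 0x17f6f95fb, 2, 0xc5) is 0: two nonzero elements whose product is zero, which cannot happen in a field.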

    7.9 "ALTMAP" considerations and extract_word()

    There are two times when you may employ alternate memory mappings:
    1. When using "SPLIT" and wb = 4.
    2. When using "COMPOSITE."
    Additionally, by default, the "CAUCHY" region option also employs an alternate memory mapping.

    When you use alternate memory mappings, the exact mapping of words in GF(2^w) to memory depends on the situation, the size of the region, and the alignment of the pointers. To help you figure things out, we have included the procedures extract_word.wxx() as part of the gf_t struct. This procedure takes four parameters:

    It then returns the n-th word in memory. When the standard mapping is employed, this simply returns the n-th contiguous word in memory. With alternate mappings, each word may be split over several memory regions, so extract_word() grabs the relevant parts of each memory region to extract the word. Below, we go over each of the above situations in detail. Please refer to Figure 2 in Section 5 for reference.

    7.9.1 Alternate mappings with "SPLIT"

    The alternate mapping with "SPLIT" is employed so that we can best leverage _mm_shuffle_epi8(). Please read [PGM13b] for details as to why. Consider an example when w = 16. In the main region of memory (the middle region in Figure 2), multiplication proceeds in units of 32 bytes, which are each broken into two 16-byte regions. The first region holds the high bytes of each word in GF(2^16), and the second region holds the low bytes. Let's look at a very detailed example, from gf_example_5.c. This program makes the following call, where gf has been initialized for w = 16, using "SPLIT" and "ALTMAP:"

    gf.multiply_region.w32(&gf, a, b, 0x1234, 30*2, 0);

    In other words, it is multiplying a region a of 60 bytes (30 words) by the constant 0x1234 in GF(2^16), and placing the result into b. The pointers a and b have been set up so that they are not multiples of 16. The first line of output prints a and b:


    a: 0x10010008c b: 0x10010015c

    As described in Section 5, the regions of memory are split into three parts:
    1. 4 bytes starting at 0x10010008c / 0x10010015c.
    2. 32 bytes starting at 0x100100090 / 0x100100160.
    3. 24 bytes starting at 0x1001000b0 / 0x100100180.

    In the first and third parts, the bytes are laid out according to the standard mapping. However, the second part is split into two 16-byte regions: one that holds the high bytes of each word and one that holds the low bytes. To help illustrate, the remainder of the output prints the 30 words of a and b as they appear in memory, and then the 30 return values of extract_word.w32():


    0 1 2 3 4 5 6 7 8 9
    a: 640b 07e5 2fba ce5d f1f9 3ab8 c518 1d97 45a7 0160
    b: 1ba3 644e 84f8 be3c 4318 4905 b2fb 46eb ef01 a503


    10 11 12 13 14 15 16 17 18 19
    a: 3759 b107 9660 3fde b3ea 8a53 75ff 46dc c504 72c2
    b: da27 e166 a0d2 b3a2 1699 3a3e 47fb 39af 1314 8e76


    20 21 22 23 24 25 26 27 28 29
    a: b469 1b97 e91d 1dbc 131e 47e0 c11a 7f07 76e0 fe86
    b: 937c a5db 01b7 7f5f 8974 05e1 cff3 a09c de3c 4ac0


    Word 0: 0x640b * 0x1234 = 0x1ba3 Word 15: 0x4575 * 0x1234 = 0xef47
    Word 1: 0x07e5 * 0x1234 = 0x644e Word 16: 0x60dc * 0x1234 = 0x03af
    Word 2: 0xba59 * 0x1234 = 0xf827 Word 17: 0x0146 * 0x1234 = 0xa539
    Word 3: 0x2f37 * 0x1234 = 0x84da Word 18: 0xc504 * 0x1234 = 0x1314
    Word 4: 0x5d07 * 0x1234 = 0x3c66 Word 19: 0x72c2 * 0x1234 = 0x8e76
    Word 5: 0xceb1 * 0x1234 = 0xbee1 Word 20: 0xb469 * 0x1234 = 0x937c
    Word 6: 0xf960 * 0x1234 = 0x18d2 Word 21: 0x1b97 * 0x1234 = 0xa5db
    Word 7: 0xf196 * 0x1234 = 0x43a0 Word 22: 0xe91d * 0x1234 = 0x01b7
    Word 8: 0xb8de * 0x1234 = 0x05a2 Word 23: 0x1dbc * 0x1234 = 0x7f5f
    Word 9: 0x3a3f * 0x1234 = 0x49b3 Word 24: 0x131e * 0x1234 = 0x8974
    Word 10: 0x18ea * 0x1234 = 0xfb99 Word 25: 0x47e0 * 0x1234 = 0x05e1
    Word 11: 0xc5b3 * 0x1234 = 0xb216 Word 26: 0xc11a * 0x1234 = 0xcff3
    Word 12: 0x9753 * 0x1234 = 0xeb3e Word 27: 0x7f07 * 0x1234 = 0xa09c
    Word 13: 0x1d8a * 0x1234 = 0x463a Word 28: 0x76e0 * 0x1234 = 0xde3c
    Word 14: 0xa7ff * 0x1234 = 0x01fb Word 29: 0xfe86 * 0x1234 = 0x4ac0

    In the first region are words 0 and 1, which are identical to how they appear in memory: 0x640b and 0x07e5. In the second region are words 2 through 17. These words are split among the two sixteen-byte regions. For example, word 2, which extract_word() reports is 0xba59, is constructed from the low byte in word 2 (0xba) and the low byte in word 10 (0x59). Since 0xba59 * 0x1234 = 0xf827, we see that the low byte in word 2 of b is 0xf8, and the low byte in word 10 is 0x27.

    When we reach word 18, we are in the third region of memory, and words are once again identical to how they appear in memory.
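
    The mapping inside one 32-byte unit of the middle region can be written down directly from the example. The function below is our own sketch of that mapping for w = 16, not the library's extract_word(); example_chunk holds bytes 4 through 35 of region a from the printout, assuming the 16-bit values shown there are stored little-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Word k (0..15) of one 32-byte unit of the middle region for w = 16:
   the high byte comes from the first 16 bytes, the low byte from the
   second 16 bytes. */
uint16_t altmap16_extract(const uint8_t chunk[32], int k)
{
    assert(k >= 0 && k < 16);
    return (uint16_t)((chunk[k] << 8) | chunk[16 + k]);
}

/* Memory words 2 through 17 of a, byte by byte (little-endian assumed). */
const uint8_t example_chunk[32] = {
    0xba, 0x2f, 0x5d, 0xce, 0xf9, 0xf1, 0xb8, 0x3a,
    0x18, 0xc5, 0x97, 0x1d, 0xa7, 0x45, 0x60, 0x01,
    0x59, 0x37, 0x07, 0xb1, 0x60, 0x96, 0xde, 0x3f,
    0xea, 0xb3, 0x53, 0x8a, 0xff, 0x75, 0xdc, 0x46
};
```

    With this data, altmap16_extract(example_chunk, 0) is 0xba59 and altmap16_extract(example_chunk, 15) is 0x0146, matching words 2 and 17 in the listing above.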

    While this is confusing, we stress that so long as you call multiply_region() with pointers of the same alignment and regions of the same size, your results with ALTMAP will be consistent. If you call it with pointers of different alignments, or with different region sizes, then the results will not be consistent. To reiterate, if you don't use ALTMAP, you don't have to worry about any of this - words will always be laid out contiguously in memory.

    When w = 32, the middle region is a multiple of 64, and each word in the middle region is broken into bytes, each of which is in a different 16-byte region. When w = 64, the middle region is a multiple of 128, and each word is stored in eight 16-byte regions. And finally, when w = 128, the middle region is a multiple of 128, and each word is stored in 16 16-byte regions.


    7.9.2   Alternate mappings with "COMPOSITE"

    With "COMPOSITE," the alternate mapping divides the middle region in half. The lower half of each word is stored in the first half of the middle region, and the higher half is stored in the second half. To illustrate, gf_example_6 performs the same example as gf_example_5, except it is using "COMPOSITE" in GF((2^16)^2), and it is multiplying a region of 120 bytes rather than 60. As before, the pointers are not aligned on 16-byte quantities, so the region is broken into three regions of 4 bytes, 96 bytes, and 20 bytes. In the first and third regions, each consecutive four-byte word is a word in GF(2^32). For example, word 0 is 0x562c640b, and word 25 is 0x46bc47e0. In the middle region, the low two bytes of each word come from the first half, and the high two bytes come from the second half. For example, word 1 as reported by extract_word() is composed of the lower two bytes of word 1 of memory (0x07e5), and the lower two bytes of word 13 (0x3fde). The product of 0x3fde07e5 and 0x12345678 is 0x211c880d, which is stored in the lower two bytes of words 1 and 13 of b.

    a: 0x10010011c b: 0x1001001ec

    0 1 2 3 4 5 6 7 8 9
    a: 562c640b 959407e5 56592fba cbadce5d 1d1cf1f9 35d73ab8 6493c518 b37c1d97 8e4545a7 c0d80160
    b: f589f36c f146880d 74f7b349 7ea7c5c6 34827c1a 93cc3746 bfd9288b 763941d1 bcd33a5d da695e64


    10 11 12 13 14 15 16 17 18 19
    a: 965b3759 cb3eb107 1b129660 95a33fde 95a7b3ea d16c8a53 153375ff f74646dc 35aac504 98f972c2
    b: fd70f125 3274fa8f d9dd34ee c01a211c d4402403 8b55c08b da45f0ad 90992e18 b65e0902 d91069b5


    20 21 22 23 24 25 26 27 28 29
    a: 5509b469 7f8a1b97 3472e91d 9ee71dbc de4e131e 46bc47e0 5bc9c11a 931d7f07 c85cfe86 fe86
    b: fc92b8f5 edd59668 b4bc0d90 a679e4ce 1a98f7d0 6038765f b2ff333f e7937e49 fa5a5867 79c00ea2


    Word 0: 0x562c640b * 0x12345678 = 0xf589f36c Word 15: 0xb46945a7 * 0x12345678 = 0xb8f53a5d
    Word 1: 0x3fde07e5 * 0x12345678 = 0x211c880d Word 16: 0x55098e45 * 0x12345678 = 0xfc92bcd3
    Word 2: 0x95a39594 * 0x12345678 = 0xc01af146 Word 17: 0x1b970160 * 0x12345678 = 0x96685e64
    Word 3: 0xb3ea2fba * 0x12345678 = 0x2403b349 Word 18: 0x7f8ac0d8 * 0x12345678 = 0xedd5da69
    Word 4: 0x95a75659 * 0x12345678 = 0xd44074f7 Word 19: 0xe91d3759 * 0x12345678 = 0x0d90f125
    Word 5: 0x8a53ce5d * 0x12345678 = 0xc08bc5c6 Word 20: 0x3472965b * 0x12345678 = 0xb4bcfd70
    Word 6: 0xd16ccbad * 0x12345678 = 0x8b557ea7 Word 21: 0x1dbcb107 * 0x12345678 = 0xe4cefa8f
    Word 7: 0x75fff1f9 * 0x12345678 = 0xf0ad7c1a Word 22: 0x9ee7cb3e * 0x12345678 = 0xa6793274
    Word 8: 0x15331d1c * 0x12345678 = 0xda453482 Word 23: 0x131e9660 * 0x12345678 = 0xf7d034ee
    Word 9: 0x46dc3ab8 * 0x12345678 = 0x2e183746 Word 24: 0xde4e1b12 * 0x12345678 = 0x1a98d9dd
    Word 10: 0xf74635d7 * 0x12345678 = 0x909993cc Word 25: 0x46bc47e0 * 0x12345678 = 0x6038765f
    Word 11: 0xc504c518 * 0x12345678 = 0x0902288b Word 26: 0x5bc9c11a * 0x12345678 = 0xb2ff333f
    Word 12: 0x35aa6493 * 0x12345678 = 0xb65ebfd9 Word 27: 0x931d7f07 * 0x12345678 = 0xe7937e49
    Word 13: 0x72c21d97 * 0x12345678 = 0x69b541d1 Word 28: 0xd40676e0 * 0x12345678 = 0xfa5a5867
    Word 14: 0x98f9b37c * 0x12345678 = 0xd9107639 Word 29: 0xc85cfe86 * 0x12345678 = 0x79c00ea2

    As with "SPLIT," using multiply_region() with "COMPOSITE" and "ALTMAP" will be consistent only if the alignment of pointers and region sizes are identical.
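
    The word layout in this example can be reproduced with a few lines of C. The sketch below is not GF-Complete code: composite_altmap_word() is a hypothetical helper, written only to mirror the printed output. It assumes the middle region is passed as an array of 32-bit memory words exactly as gf_example_6 prints them (the same little-endian view), and it returns the word that extract_word() reports for a given position in that region.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GF-Complete.
 * middle:    the middle region, viewed as 32-bit memory words as printed
 *            by gf_example_6 (assumes the same little-endian view).
 * mem_words: number of 32-bit memory words in the middle region (even).
 * i:         index of the logical word inside the middle region,
 *            0 <= i < mem_words.
 * Each half of the region is treated as an array of 16-bit pieces:
 * piece i of the first half is the low half of logical word i, and
 * piece i of the second half is its high half. */
uint32_t composite_altmap_word(const uint32_t *middle, size_t mem_words, size_t i)
{
    const uint32_t *first_half = middle;
    const uint32_t *second_half = middle + mem_words / 2;
    unsigned shift = (i % 2) ? 16 : 0;  /* odd pieces sit in the upper 16 bits */

    uint32_t low = (first_half[i / 2] >> shift) & 0xffff;
    uint32_t high = (second_half[i / 2] >> shift) & 0xffff;
    return (high << 16) | low;
}
```

    In the example, the middle region is memory words 1 through 24 of a, so index 0 of the middle region corresponds to word 1. Passing those 24 words and i = 0 yields 0x3fde07e5, and i = 14 yields 0xb46945a7, matching the values listed for words 1 and 15.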

    7.9.3 The mapping of "CAUCHY"

    With "CAUCHY," the region is partitioned into w subregions, and each word in the region is broken into w bits, each of which is stored in a different subregion. To illustrate, gf_example_7 multiplies a region of three bytes by 5 in GF(2^3) using "CAUCHY:"

    UNIX> gf_example_7
    a: 0x100100190 b: 0x1001001a0

    a: 0x0b 0xe5 0xba
    b: 0xee 0xba 0x0b

    a bits: 00001011 11100101 10111010
    b bits: 11101110 10111010 00001011

    Word 0: 3 * 5 = 4
    Word 1: 5 * 5 = 7
    Word 2: 2 * 5 = 1
    Word 3: 5 * 5 = 7
    Word 4: 4 * 5 = 2
    Word 5: 6 * 5 = 3
    Word 6: 2 * 5 = 1
    Word 7: 6 * 5 = 3
    UNIX>

    The program prints the three bytes of a and b in hexadecimal and in binary. To see how words are broken up, consider word 0, which is the lowest bit of each of the three bytes of a (and b). These are the bits 1, 1 and 0 in a, and 0, 0, and 1 in b. Accordingly, the word is 3 in a, and 3*5 = 4 in b. Similarly, word 7 is the high bit in each byte: 0, 1, 1 (6) in a, and 1, 1, 0 (3) in b.
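
    As a minimal sketch of this layout (not GF-Complete's own code), the hypothetical helper cauchy_extract_word() below gathers one bit from each of the w subregions to rebuild a word. With one byte per subregion, as in gf_example_7, word k is simply bit k of each byte. How words are numbered when a subregion holds more than one byte is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GF-Complete.
 * region holds w subregions of sub_bytes bytes each; subregion i stores
 * bit i of every word. Word `index` is assumed to use bit (index % 8) of
 * byte (index / 8) in each subregion. */
unsigned cauchy_extract_word(const uint8_t *region, size_t sub_bytes,
                             int w, size_t index)
{
    unsigned word = 0;

    for (int i = 0; i < w; i++) {
        uint8_t byte = region[(size_t)i * sub_bytes + index / 8];
        unsigned bit = (unsigned)(byte >> (index % 8)) & 1u;
        word |= bit << i;  /* subregion i supplies bit i of the word */
    }
    return word;
}
```

    Applied to the three bytes of a (0x0b, 0xe5, 0xba) with w = 3, this returns 3 for word 0 and 6 for word 7. Applied to b (0xee, 0xba, 0x0b), it returns 4 and 3, as in the output above.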

    With "CAUCHY," multiply_region() may be implemented exclusively with XOR operations. Please see [BKK+95] for more information on the motivation behind "CAUCHY."
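
    To make the XOR-only claim concrete, here is a hedged sketch of the standard bit-matrix technique, written for this manual rather than taken from GF-Complete. The function name and parameters are hypothetical. It assumes w < 32, non-overlapping source and destination, and a polynomial passed with its x^w term. For w = 3 the polynomial x^3 + x + 1 (0xb) reproduces the products printed by gf_example_7.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply a bit-sliced region by the constant c in GF(2^w) using only
 * XORs. src and dst each hold w subregions of sub_bytes bytes; subregion
 * i holds bit i of every word. prim_poly includes the x^w term.
 * Illustrative sketch only; not GF-Complete's implementation. */
void cauchy_multiply_region_xor(const uint8_t *src, uint8_t *dst,
                                size_t sub_bytes, unsigned c, int w,
                                unsigned prim_poly)
{
    unsigned column = c;  /* c * x^j reduced: column j of the bit matrix */

    memset(dst, 0, (size_t)w * sub_bytes);
    for (int j = 0; j < w; j++) {
        for (int i = 0; i < w; i++) {
            if (column & (1u << i)) {
                /* Input bit j contributes to output bit i. */
                for (size_t k = 0; k < sub_bytes; k++)
                    dst[(size_t)i * sub_bytes + k] ^= src[(size_t)j * sub_bytes + k];
            }
        }
        column <<= 1;  /* multiply by x ... */
        if (column & (1u << w))
            column ^= prim_poly;  /* ... and reduce */
    }
}
```

    For c = 5, w = 3 and prim_poly = 0xb, the loop reduces to b[0] = a[0] ^ a[1], b[1] = a[2] and b[2] = a[0]. With a = 0x0b, 0xe5, 0xba this gives 0xee, 0xba, 0x0b, the bytes shown for b.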

    8   Thread Safety

    Once you initialize a gf_t, you may use it wantonly in multiple threads for all operations except for the ones below. With the implementations listed below, the scratch space in the gf_t is used for temporary tables, and therefore you cannot call multiply_region(), and in some cases multiply(), from multiple threads because they will overwrite each other's tables. In these cases, if you want to call the procedures from multiple threads, you should allocate a separate gf_t for each thread:


    9  Listing of Procedures

    The following is an alphabetical listing of the procedures, data types and global variables for users to employ in GF-Complete.

    10     TROUBLESHOOTING 41


    11     Timings

    We don't want to get too detailed with timing, because it is quite machine specific. However, here are the timings on an Intel Core i7-3770 CPU running at 3.40 GHz, with 4 × 256 KB L2 caches and an 8MB L3 cache. All timings are obtained with gf_time or gf_inline_time, in user mode with the machine dedicated solely to running these jobs.


    Figure 4: Speed of doing single multiplications for w ∈ {4, 8, 16}.

    11.1   Multiply()

    The performance of multiply() is displayed in Figure 4 for w ∈ {4, 8, 16} and in Figure 5 for w ∈ {32, 64, 128}. These numbers were obtained by calling gf_time with the size and iterations both set to 10240. We plot the speed in mega-ops per second.

    As would be anticipated, the inlined operations (see section 7.1) outperform the others. Additionally, in all cases with the exception of w = 32, the defaults are the fastest performing implementations. With w = 32, "CARRY_FREE" is the fastest with an alternate polynomial (see section 7.7). Because we require the defaults to use a "standard" polynomial, we cannot use this implementation as the default.

    11.2   Divide()

    Figure 5: Speed of doing single multiplications for w ∈ {32, 64, 128}.

    For the "TABLE" and "LOG" implementations, the performance of division is the same as multiplication. This means that for w ∈ {4, 8, 16}, it is very fast indeed. For the other implementations, division is implemented with Euclid's method, and is several times slower than multiplication. In Figure 6, we plot the speed of a few implementations of the larger word sizes. Compared to the "TABLE" and "LOG" implementations for the smaller word sizes, where the speeds are in the hundreds of mega-ops per second, these are very slow. Of note is the "COMPOSITE" implementation for w = 32, which is much faster than the others because it uses a special application of Euclid's method, which relies on division in GF(2^16), which is very fast.
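
    To show why Euclid-based division costs more than a table lookup, the following is a textbook sketch of the extended Euclidean algorithm on polynomials over GF(2). It is not GF-Complete's implementation, and the sketch_ function names are hypothetical. It assumes w ≤ 32 and an irreducible polynomial passed with its x^w term, for example 0x11d (x^8 + x^4 + x^3 + x^2 + 1) for w = 8.

```c
#include <stdint.h>

/* Degree of a polynomial over GF(2) stored as a bit mask; -1 for zero. */
int sketch_poly_degree(uint64_t p)
{
    int d = -1;
    while (p) { p >>= 1; d++; }
    return d;
}

/* Shift-and-add multiplication in GF(2^w); a and b must be below 2^w. */
uint64_t sketch_gf_multiply(uint64_t a, uint64_t b, uint64_t prim_poly, int w)
{
    uint64_t prod = 0;
    while (b) {
        if (b & 1) prod ^= a;
        b >>= 1;
        a <<= 1;                                      /* a = a * x */
        if (a & ((uint64_t)1 << w)) a ^= prim_poly;   /* reduce */
    }
    return prod;
}

/* Inverse of a by the extended Euclidean algorithm.
 * Invariant: t0 * a = r0 and t1 * a = r1 (mod prim_poly).
 * Returns 0 for a == 0, which has no inverse. */
uint64_t sketch_gf_inverse(uint64_t a, uint64_t prim_poly)
{
    uint64_t r0 = prim_poly, r1 = a;
    uint64_t t0 = 0, t1 = 1;

    if (a == 0) return 0;
    while (r1 != 1) {
        int shift = sketch_poly_degree(r0) - sketch_poly_degree(r1);
        if (shift < 0) {          /* keep the higher-degree remainder in r0 */
            uint64_t tmp;
            tmp = r0; r0 = r1; r1 = tmp;
            tmp = t0; t0 = t1; t1 = tmp;
            shift = -shift;
        }
        r0 ^= r1 << shift;        /* one step of polynomial long division */
        t0 ^= t1 << shift;
    }
    return t1;
}

/* a / b computed as a * b^-1; the caller must ensure b != 0. */
uint64_t sketch_gf_divide(uint64_t a, uint64_t b, uint64_t prim_poly, int w)
{
    return sketch_gf_multiply(a, sketch_gf_inverse(b, prim_poly), prim_poly, w);
}
```

    Each inverse takes a data-dependent number of shift and XOR steps, which is the kind of work a single table lookup avoids in the "TABLE" and "LOG" implementations.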

    11.3   Multiply_Region()

    Tables 3 through 8 show the performance of the various region operations. It should be noted that for GF(2^16) through GF(2^128), the default is not the fastest implementation of multiply_region(). The reasons for this are outlined in section 6.

    For these tables, we performed 1GB worth of multiply_region() calls for all regions of size 2^i bytes for 10 ≤ i ≤ 30. In the tables, we report the fastest speed obtained.
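
    The arithmetic behind this setup is simple, and the sketch below spells it out. Both helpers are hypothetical and are not part of gf_time. They assume that 1GB means 2^30 bytes and that 1 MB means 2^20 bytes, which the text does not state.

```c
#include <stdint.h>

/* Number of calls needed so that regions of 2^log2_bytes bytes add up to
 * total_bytes (assumes total_bytes is a multiple of the region size). */
uint64_t calls_for_total(unsigned log2_bytes, uint64_t total_bytes)
{
    return total_bytes >> log2_bytes;
}

/* Throughput in MB/s, assuming 1 MB = 2^20 bytes. */
double throughput_mb_per_s(uint64_t total_bytes, double seconds)
{
    return (double)total_bytes / (1024.0 * 1024.0) / seconds;
}
```

    Under these assumptions, the smallest region (i = 10) needs 2^20 calls to cover 1GB, and the largest (i = 30) needs a single call.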

    We note that the performance of "CAUCHY" can be improved with techniques from [LSXP13] and [PSR12].




    Figure 6: Speed of doing single divisions for w ∈ {32, 64, 128}.

    Method                          Speed (MB/s)
    -m TABLE (Default) -               11879.909
    -m TABLE -r CAUCHY -                9079.712
    -m BYTWO_b -                        5242.400
    -m BYTWO_p -                        4078.431
    -m BYTWO_b -r NOSSE -               3799.699
    -m TABLE -r QUAD -                  3014.315
    -m TABLE -r DOUBLE -                2253.627
    -m TABLE -r NOSSE -                 2021.237
    -m TABLE -r NOSSE -                 1061.497
    -m LOG -                             503.310
    -m SHIFT -                           157.749
    -m CARRY_FREE -                       86.202


    Table 3: Speed of various calls to multiply_region() for w = 4.

    References

    [Anv09] H. P. Anvin. The mathematics of RAID-6. http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf, 2009.

    [BKK+95] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.

    [GMS08] K. Greenan, E. Miller, and T. J. Schwartz. Optimizing Galois Field arithmetic for diverse processor architectures and applications. In MASCOTS 2008: 16th IEEE Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, Baltimore, MD, September 2008.

    [GP97] S. Gao and D. Panario. Tests and constructions of irreducible polynomials over finite fields. In Foundations of Computational Mathematics, pages 346–361. Springer Verlag, 1997.


    Method                          Speed (MB/s)
    -m SPLIT 8 4 (Default)             13279.146
    -m COMPOSITE 2 - -r ALTMAP -        5516.588
    -m TABLE -r CAUCHY -                4968.721
    -m BYTWO_b -                        2656.463
    -m TABLE -r DOUBLE -                2561.225
    -m TABLE -                          1408.577
    -m BYTWO_b -r NOSSE -               1382.409
    -m BYTWO_p -                        1376.661
    -m LOG_ZERO_EXT -                   1175.739
    -m LOG_ZERO -                       1174.694
    -m LOG -                             997.838
    -m SPLIT 8 4 -r NOSSE -              885.897
    -m BYTWO_p -r NOSSE -                589.520
    -m COMPOSITE 2 - -                   327.039
    -m SHIFT -                           106.115
    -m CARRY_FREE -                      104.299


    Table 4: Speed of various calls to multiply_region() for w = 8.


    [LBOX12] J. Luo, K. D. Bowers, A. Oprea, and L. Xu. Efficient software implementations of large finite fields GF(2^n) for secure storage applications. ACM Transactions on Storage, 8(2), February 2012.

    [LD00] J. Lopez and R. Dahab. High-speed software multiplication in F_{2^m}. In Annual International Conference on Cryptology in India, 2000.

    [LHy08] H. Li and Q. Huan-yan. Parallelized network coding with SIMD instruction sets. In International Symposium on Computer Science and Computational Technology, pages 364-369. IEEE, December 2008.

    [LSXP13] J. Luo, M. Shrestha, L. Xu, and J. S. Plank. Efficient encoding schedules for XOR-based erasure codes. IEEE Transactions on Computing, May 2013.

    [Mar94] G. Marsaglia. The mother of all random generators. ftp://ftp.taygeta.com/pub/c/mother.c, October 1994.

    [PGM13a] J. S. Plank, K. M. Greenan, and E. L. Miller. A complete treatment of software implementations of finite field arithmetic for erasure coding applications. Technical Report UT-CS-13-717, University of Tennessee, September 2013.

    [PGM13b] J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois Field arithmetic using Intel SIMD instructions. In FAST-2013: 11th Usenix Conference on File and Storage Technologies, San Jose, February 2013.

    [Pla97] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software - Practice & Experience, 27(9):995-1012, September 1997.


    Method                          Speed (MB/s)
    -m SPLIT 16 4 -r ALTMAP -          10460.834
    -m SPLIT 16 4 -r SSE (Default) -    8473.793
    -m COMPOSITE 2 - -r ALTMAP -        5215.073
    -m LOG -r CAUCHY -                  2428.824
    -m TABLE -                          2319.129
    -m SPLIT 16 8 -                     2164.111
    -m SPLIT 8 8 -                      2163.993
    -m SPLIT 16 4 -r NOSSE -            1148.810
    -m LOG -                            1019.896
    -m LOG_ZERO -                       1016.814
    -m BYTWO_b -                         738.879
    -m COMPOSITE 2 - -                   596.819
    -m BYTWO_p -                         560.972
    -m GROUP 4 4 -                       450.815
    -m BYTWO_b -r NOSSE -                332.967
    -m BYTWO_p -r NOSSE -                249.849
    -m CARRY_FREE -                      111.582
    -m SHIFT -                            95.813


    Table 5: Speed of various calls to multiply_region() for w = 16.


    [PMG+13] J. S. Plank, E. L. Miller, K. M. Greenan, B. A. Arnold, J. A. Burnum, A. W. Disney, and A. C. McBride. GF-Complete: A comprehensive open source library for Galois Field arithmetic. version 1.0. Technical Report UT-CS-13-716, University of Tennessee, September 2013.

    [PSR12] J. S. Plank, C. D. Schuman, and B. D. Robison. Heuristics for optimizing matrix-based erasure codes for fault-tolerant storage systems. In DSN-2012: The International Conference on Dependable Systems and Networks, Boston, MA, June 2012. IEEE.

    [Rab89] M. O. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the Association for Computing Machinery, 36(2):335-348, April 1989.


    Method                                                  Speed (MB/s)
    -m SPLIT 32 4 -r SSE -r ALTMAP -                            7185.440
    -m SPLIT 32 4 (Default)                                     5063.966
    -m COMPOSITE 2 -m SPLIT 16 4 -r ALTMAP - -r ALTMAP -        4176.440
    -m COMPOSITE 2 - -r ALTMAP -                                3360.860
    -m SPLIT 8 8 -                                              1345.678
    -m SPLIT 32 8 -                                             1340.656
    -m SPLIT 32 16 -                                            1262.676
    -m SPLIT 8 8 -r CAUCHY                                      1143.263
    -m SPLIT 32 4 -r NOSSE                                       480.859
    -m CARRY_FREE -p 0xc5                                        393.185
    -m COMPOSITE 2 -                                             332.964
    -m BYTWO_b -                                                 309.971
    -m BYTWO_p -                                                 258.623
    -m GROUP 4 8 -                                               242.076
    -m GROUP 4 4 -                                               227.399
    -m CARRY_FREE -                                              226.785
    -m BYTWO_b -r NOSSE -                                        143.403
    -m BYTWO_p -r NOSSE -                                        111.956
    -m SHIFT -                                                    52.295


    Table 6: Speed of various calls to multiply_region() for w = 32.


    Method                                                  Speed (MB/s)
    -m SPLIT 64 4 -r ALTMAP -                                   3522.798
    -m SPLIT 64 4 -r SSE (Default) -                            2647.862
    -m COMPOSITE 2 -m SPLIT 32 4 -r ALTMAP - -r ALTMAP -        2461.572
    -m COMPOSITE 2 - -r ALTMAP -                                1860.921
    -m SPLIT 64 16 -                                            1066.490
    -m SPLIT 64 8 -                                              998.461
    -m CARRY_FREE -                                              975.290
    -m SPLIT 64 4 -r NOSSE -                                     545.479
    -m GROUP 4 4 -                                               230.137
    -m GROUP 4 8 -                                               153.947
    -m BYTWO_b -                                                 144.052
    -m BYTWO_p -                                                 124.538
    -m SPLIT 8 8 -                                                98.892
    -m BYTWO_p -r NOSSE -                                         77.912
    -m COMPOSITE 2 - -                                            77.522
    -m BYTWO_b -r NOSSE -                                         36.391
    -m SHIFT -                                                    25.282


    Table 7: Speed of various calls to multiply_region() for w = 64.





    Method                                                  Speed (MB/s)
    -m SPLIT 128 4 -r ALTMAP -                                  1727.683
    -m COMPOSITE 2 -m SPLIT 64 4 -r ALTMAP - -r ALTMAP -        1385.693
    -m COMPOSITE 2 - -r ALTMAP -                                1041.456
    -m SPLIT 128 8 (Default) -                                   872.619
    -m CARRY_FREE -                                              814.030
    -m SPLIT 128 4 -                                             500.133
    -m COMPOSITE 2 -                                             289.207
    -m GROUP 4 8 -                                               133.583
    -m GROUP 4 4 -                                               116.187
    -m BYTWO_p -                                                  25.162
    -m BYTWO_b -                                                  25.157
    -m SHIFT -                                                    14.183


    Table 8: Speed of various calls to multiply_region() for w = 128.