We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (F 2 ). In particular we present our implementationin the M4RI library-of Strassen-Winograd matrix multiplication and the "Method of the Four Russians for Multiplication" (M4RM) and compare it against other available implementations. Good performance is demonstrated on AMD's Opteron processor and particulary good performance on Intel's Core 2 Duo processor. The open-source M4RI library is available as a stand-alone package as well as part of the Sage mathematics system.
INTRODUCTION
We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (F 2 ). Matrix-matrix multiplication is an important primitive in computational linear algebra and as such, the fundamental algorithms we implement have been well known for some time. Therefore, this article focuses on the numerous techniques employed for the special case of F 2 in the M4RI library and the benefits so derived.
We note that even for problems that do not reduce to matrix-matrix multiplication many of the techniques presented in this paper are still applicable. For instance, Gaussian Elimination can be achieved via the "Method of the Four Russians for Inversion" (M4RI)(cf. Bard [2007, Chap. 5] and Bard [2008] ) and borrows ideas from the "Method of the Four Russians for Multiplication" (M4RM) [Arlazarov et al. 1970] , [Aho et al. 1974 ] which we present here.
The M4RI library implements dense linear algebra over F 2 and is used by Sage [Stein et al. 2008] and POLYBORI [Brickenstein and Dreyer 2007] .
Our optimization efforts focus on 64-bit x86 architectures (x86 64), specifically the Intel Core 2 Duo and the AMD Opteron. Thus, we assume in this paper that each native CPU word has 64 bits (w s = 64). However it should be noted that our code also runs on 32-bit CPUs and on non-x86 CPUs such as the PowerPC.
In machine terms, addition in F 2 is logical-XOR, and multiplication is logical-AND, thus a machine word of 64 bits allows one to operate on 64 elements of F 2 in parallel, that is, at most one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations over F 2 are relatively cheap. In fact, in this article, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.
The second author proposed, in Bard [2006] and Bard [2007, Chap. 5] , to count memory accesses rather than arithmetic operations to estimate the complexity of such algorithms and the empirical results of this paper lend further support to this model. However, this model is a simplification as memory access is not uniform, that is, an algorithm which randomly accesses memory will perform much worse than an algorithm with better spatial and temporal locality. These differences only affect the constant of a complexity estimation, if we assume that memory access is O 1 . However, in practice they make a very significant difference, as our empirical results will demonstrate.
The article is structured as follows. We proceed from basic arithmetic (Section 2) via the classical cubic multiplication algorithm (Section 2.3), through a detailed discussion of the Method of the Four Russians (Section 3) to the Strassen-Winograd algorithm (Section 4). We start by introducing our basic data structures and conclude by presenting timing experiments to show the ACM Transactions on Mathematical Software, Vol. 37, No. 1, Article 9, Publication date: January 2010. validity of our approach (Section 6) and a brief discussion of these timing experiments.
The main contribution of this work are variants of the M4RM algorithm which make better use of the memory hierarchy found in modern x86 64 CPUs (cf . Table I ). Particulary, we give a more cache friendly version of the M4RM algorithm, a variant of the M4RM that uses more than one lookup table and tuning parameters for the two architectures considered in this work.
Note that all timings in this article time Strassen-Winograd multiplication (cf. Section 4) but with different base cases.
BASIC ARITHMETIC

Our Matrix Data Structure
We use a "flat row-major representation" for our matrices. Thus 64 consecutive entries in one row are packed into one machine word. Consequently, bulk operations on whole rows are considerably cheaper than on whole columns and addressing a single column is more expensive than addressing a single row. Additionally, we maintain an array-called rowswap-containing the address in memory of the first word for each row in the matrix. To represent in-place submatrices (i.e., without copying out the data) we also use this rowswap array. We call these in-place submatrices "matrix windows" and they consist of addresses of the first word of each row and the number of columns each row contains. This approach is limited to matrix windows which start and end at full word borders but this is sufficient for our application. The advantages and disadvantages of the "flat row-major" data structure are, for instance, analyzed in Pernet [2001] .
Row Additions
Since this basic operation-addition of two rows-is at the heart of every algorithm in this paper, we should briefly mention the SSE2 instruction set [Fog 2008 ] which is available on modern x86 64 architectures. This instruction set offers an XOR operation for 128-bit wide registers, allowing one to handle two 64-bit machine words in one instruction. The use of these instructions does provide a considerable speed improvement on Intel CPUs. Table II shows that up to a 25% improvement is possible when enabling SSE2 instructions. However, in our experiments performance declined on Opteron CPUs when using SSE2 instructions. The authors were unable to identify a cause of this phenomenon. Note however that MAGMA also does not use SSE2 instructions on the Opteron (A. Steel, private communication), which seems to agree with our findings (cf. 
Cubic Multiplication
The simplest multiplication operation involving matrices is a matrix-vector product which can easily be extended to classical cubic matrix-matrix multiplication. To compute the matrix-vector product Av we have to compute the dot product of each row i of A and the vector v. If the vector v is stored as a row rather than a column, this calculation becomes equivalent to wordwise logical-AND and accumulation of the result in a word p via logical-XOR. Finally, the parity of p needs to be computed. However, as there is no native parity instruction in the x86 64 instruction set this last step is quite expensive compared to the rest of the routine. To account for this, 64 parity bits can be computed in parallel [Warren 2002, Chap. 5] . To extend this matrix-vector multiplication to matrix-matrix multiplication B must be stored transposed. Alternatively, we may compute the matrix-matrix product as vB over all rows v of A. This strategy avoids transposing a matrix and the expensive parity operation.
THE METHOD OF THE FOUR RUSSIANS
The Method of the Four Russians matrix multiplication algorithm can be derived from the original algorithm published by Arlazarov et al. [1970] for computing one step in the transitive closure of a directed graph, but does not directly appear there. It has appeared in books including Aho et al. [1974, Chap. 6] .
Consider a product of two matrices C = AB where A is an m × matrix and B is an × n matrix, yielding an m × n for C. A can be divided into /k vertical "stripes" A 0 · · · A /k−1 of k columns each, and B into /k horizontal stripes B 0 . . . B /k−1 of k rows each. For simplicity, assume k divides . The product of two stripes, A i B i requires an m× /k by /k×n matrix multiplication, and yields an m × n matrix C i . The sum of all k of these C i equals C. 
and consequently
Finally, we have
The principal benefit of multiplying in narrow stripes is that the bits across each row of a stripe of A determine which linear combination of rows of B will contribute to the product, for example, in the above example a 0 , . . . , a 3 dictate which linear combination of (b 0 , b 2 ) and (b 1 , b 3 ) must be written to the rows of C. However, if the stripe is relatively narrow as in this example, there is only a small number of binary values each row of the stripe can take, and thus only a small number of possible linear combinations of the rows of B that will be "selected". If we precompute all possible linear combinations of rows of B that could be selected, we can create a lookup table into which the rows of the stripes of A can index.
Returning to our example, if a 0 = a 2 and a 1 = a 3 then the same linear combination would be written to the first and the second row of C. Precomputation of all 2 4 − 1 non-zero linear combinations,
ensures that the repeated linear combination has only been computed once. In our trivial example, this did not reduce the number of operations, but for much larger matrices reuse of the precomputed combinations yields a reduction in the number of operations. Precomputing a table in this fashion is also called "greasing".
The technique just described gives rise to Algorithm 1. In Algorithm 1, the subroutine ReadBits(A, r, sc, k) reads k bits from row r starting at column sc and returns the bit string interpreted as an integer. Meanwhile, AddRowFromTable(C, r, T, x) adds the row x from table T to the row j of matrix C. The subroutine MakeTable(B, r, c, k) in Algorithm 1 constructs a table T of all 2 k − 1 non-zero linear combinations of the rows of B starting in row r and column c. The traditional way of performing this calculation is to use the reflected binary code.
Gray Codes
The Gray code [Gray 1953 ], named after Frank Gray and also known as reflected binary code, is a numbering system where two consecutive values differ in only
Examples of Gray codes for two, three and four bits are given in Figure 1 .
Gray code tables for n-bits can be computed from n − 1-bit Gray code tables by prepending each entry of the n − 1-bit Gray code table with 0. Then, the order of the entries is reversed and a 1 is prepended to each entry. These two half-tables are then concatenated. Of course, there are other more direct ways of constructing these tables, but since we precompute these tables in our code, we are not concerned with optimizing their creation in this article.
These articles can then be used to construct all 2 k − 1 non-zero linear combinations of k rows where each new entry in the table costs one row addition as its index differs in exactly one bit from that of the preceding row. Thus computing all 2 k − 1 non-zero linear combinations of k rows can be done in 2 k − 1 row additions, rather than (k/2 − 1)2 k − 1 as would be expected if each vector were to be tabulated separately.
Overall, the complexity of the algorithm for multiplying two n × n matrices is as follows: The outer loop is repeated n/k times, the construction of the table costs 2 k × n operations and adding the table to C costs n 2 operations: n/k × (2 k × n + n 2 ). If k = log n, this simplifies to O n 3 / log n (cf. Bard [2006] ). From this complexity analysis, it seems one should always choose the parameter k = log 2 n for an n × n matrix. However, in practice this is not the case. First, experimental evidence indicates [Bard 2007, Chap . 5] that 0.75 × log 2 n seems to be a better choice. Also, for cache efficiency it makes sense to split the input matrices into blocks such that these blocks fit into L2 cache (see below). If that technique is employed then the block sizes dictate k and not the total dimensions of the input matrices. Thus, a much smaller k than log 2 n is found to be optimal, in practice (see below); restraining k in this way actually improves performance.
In our implementation, we precompute the Gray Code tables up to size 16. For matrices of dimension > 20 million rows and columns, this is not enough. But, such a dense matrix would have nearly half a quadrillion entries, and this is currently beyond the capabilities of existing computational hardware. Also, for these dimensions the Strassen-Winograd algorithm should be used. Of course, if so desired we may generate the tables on the fly or generate the 2 k − 1 linear combinations using some other technique which also achieves an optimal number of required row additions.
A Cache Friendly Version
Note that the M4RM algorithm creates a table for each stripe of B and then iterates over all rows of C and A in the inner loop. If the matrices C and A are bigger than L2 cache then this means that for each single row addition a new row needs to be loaded from RAM. This row will evict an older row from L2. However, as this row is used only once per iteration of all rows of A and C we cannot take advantage of the fact that it is now in L2 cache. Thus, if the matrices A and C do not fit into L2 cache then the algorithm does not utilize this faster memory. Note that since T instead of B is used in the inner loop, we can ignore the size of B for now.
Thus, it is advantageous to re-arrange the algorithm in such a way that it iterates over the upper part of A completely with all tables for B before going ACM Transactions on Mathematical Software, Vol. 37, No. 1, Article 9, Publication date: January 2010. on to the next part. This gives rise to Algorithm 2, a cache friendly version of the M4RM algorithm. For simplicity we assume that m, , n are all multiples of some fixed block size in the presentation of Algorithm 2. This cache-friendly rearrangement is at the expense of the repeated regeneration of the table T . In fact, the complexity of this cache-friendly version is stricly worse than the original algorithm. Namely it is O n 3 if we set k = log n and treat BlockSize as a constant. However, our experiments indicate that this effect is outweighed by the better data locality for the dimensions we consider (cf. Section 5 below). Table IV shows that this strategy provides considerable performance improvements.
Increasing the Number of Precomputation Tables
Recall that the actual arithmetic is quite cheap compared to memory reads and writes and that the cost of memory accesses greatly depends on where in memory data is located: the L1 cache is approximately 50 times faster than main memory. It is thus advantageous to try to fill all of L1 with tables of linear combinations. For example, consider n = 10000, k = 10 and one such table. In this situation, we work on 10 bits at a time. If we use k = 9 and two tables, we still use the same memory for the tables but can deal with 18 bits at once. The price we pay is one additional row addition, which is cheap if the operands are all in cache. To implement this enhancement, the algorithm remains almost 
AddTwoRowsFromTable(C, j, T, r 1 , TT, r 2 ) end end return C end unchanged, except that t tables are generated for tk consecutive rows of B, tk values x are read for consecutive entries in A and t rows from t different tables are added to the target row of C. This gives rise to Algorithm 3 where we assume that tk divides and fix t = 2. Table IV shows that increasing the number of tables is advantageous. Our implementation uses eight tables, which appears to be a good default value according to our experiments.
STRASSEN-WINOGRAD MULTIPLICATION
In 1969, Strassen [1969] published an algorithm that multiplies two block matrices
with only seven submatrix multiplications and 18 submatrix additions rather than eight multiplications and eight additions. As matrix multiplication (O(n ω ), ω ≥ 2) is much more expensive than matrix addition (O n 2 ), this is an improvement. Later the algorithm was improved by Winograd [1971] to use 15 submatrix additions only, the result is commonly referred to as Strassen-Winograd multiplication. While both algorithms are to a degree less numerically stable than classical cubic multiplication over floating point numbers [Higham 2002, Chap. 26.3 .2] this problem does not affect matrices over finite fields and thus the improved complexity of O n log 2 7 [Strassen 1969; Bard 2007 ] is applicable here. 
Let m, and n be powers of two. Let A and B be two matrices of dimension m × and × n and let C = A × B. Consider the block decomposition
where A 00 and B 00 have dimensions m/2 × /2 and /2 × n/2 respectively. The Strassen-Winograd algorithm, which computes the m × n matrix C = A × B, is given in Algorithm 4. At each recursion step, the matrix dimensions must be divisible by two which explains the requirement of them being powers of two. However, in practice the recursion stops at a given cutoff dimension (c s )-sometimes called "crossover" dimension-and switches over to another multiplication algorithm. In our case, this is the M4RM algorithm. Thus, the requirement can be relaxed to the requirement that for each recursion step the matrix dimensions must be divisible by two.
However, this still is not general enough. Additionally, in case of F 2 the optimal case is when m, n, are 64 times powers of 2 to avoid cutting within • 9:11 words. To deal with odd-dimensional matrices, two strategies are known in the literature [Huss-Lederman et al. 1996] : One can either increase the matrix dimensions-this is called "padding"-to the next "good" value and fill the additional entries with zeros, yielding A + and B + . Then one can compute C + = A + B + and finally cut out the actual product matrix C from the bigger matrix C + . A variant of this approach is to only virtually append rows and columns, that is, we pretend they are present. Another approach is to consider the largest submatrices A − and B − of A and B so that the dimensions of A − and B − match our requirements-this is called "peeling". Then, once the product C − = A − B − is computed, one resolves the remaining rows and columns of C from the remaining rows and columns of A and B that are not in A − and B − (cf. Huss-Lederman et al. [1996] ). For those remaining pieces, Strassen-Winograd is not used but an implementation which does not cut the matrices into submatrices. We use the "peeling" strategy in our implementation, but note that it is easy to construct a case where our strategy is clearly not optimal, Table V gives an example where "padding" would only add one row and one column, while "peeling" has to remove many rows and columns. This is an area for future improvement.
To represent the submatrices in Algorithm 4, we use matrix windows as described earlier, in Section 2.1. While this has the benefit of negligible required additional storage compared to out-of-place submatrices, this affects data locality negatively. To restore data locality, we copy out the target matrix C when switching from Strassen-Winograd to M4RM. On the other hand our experiments show that copying out A and B at this crossover point does not improve performance. Data locality for B is achieved through the precomputation tables and it appears that the read of x from A (cf. Algorithm 1) does not significantly contribute to the runtime.
However, even with matrix windows Strassen-Winograd requires more memory than classical cubic multiplication. Additional storage is required to store intermediate results. The most memory-efficient scheduler (cf. Dumas and Pernet [2007] ) uses two additional temporary submatrices and is utilized in our implementation. We also tried the "proximity schedule" used in FFLAS [Pernet 2001 ] but did not see any improved performance.
TUNING PARAMETERS
Our final implementation calls Strassen-Winograd, which switches over to M4RM if the input matrix dimensions are less than a certain parameter c s . If B then has fewer columns than w s (word size in bits) the classical cubic algorithm This last case is quite common in the fix-up step of "peeling". This strategy gives three parameters for tuning. The first is c s , the crossover point where we switch from Strassen-Winograd to M4RM. Second, b s is the size for block decomposition inside M4RM for cache friendliness. Third, k dictates the size of the tables containing 2 k − 1 linear combination of k rows. We always fix the number of precomputation tables to t = 8, which appears to be a good default value according to our experiments. By default c s is chosen such that two matrices fit into L2 cache, because this provides the best performance in our experiments. For the Opteron (1MB of L2 cache) this results in c s = 2048 and for the Core 2 Duo (4MB of L2 cache) this results in c s = 4096. We only fit two matrices, rather than all three matrices in L2 cache as b s reduces the size of the matrices we are working with to actually fit three matrices in L2 cache. The default value is fixed at b s = c s /2. The value k is set to 0.75 × log 2 b s − 2. We subtract 2 as a means to compensate for the use of 8 precomputation tables. However, if additionally reducing k by 1 would result in fitting all precomputation tables in L1 cache, we do that. Thus, k is either 0.75 × log 2 b s − 2 or 0.75 × log 2 b s − 3 depending on the input dimensions and the size of the L1 cache. These values have been determined empirically and seem to provide the best compromise across platforms.
On the Opteron, these values-c s = 2048, b s = 1024, k = 5, t = 8 tablesmean that the two input matrices fit into the 1MB of L2 cache, while the eight tables fit exactly into L1: 8·2 5 ·2048/8 = 64Kb. The influence of the parameter b s in the final implementation is shown in Table VI for fixed k = 5 and c s = 2048.
On the Core 2 Duo, these values are c s = 4096, b s = 2048, k = 6, t = 8 and ensure that all data fits into L2 cache. Since the Core 2 Duo has only 32kb of L1 cache we do not try to fit all tables into it. So far in our experiments, performance did not increase when we tried to optimize for L1 cache.
RESULTS
To evaluate the performance of our implementation we provide benchmark comparisons against the best known implementations we are aware of. First, MAGMA [Bosma et al. 1997 ] is widely known for its high performance implementations of many algorithms. Second, GAP [The GAP Group 2007] (or equivalently the C-MeatAxe [Ringe 2007 ]) is to our knowledge the best available open-source implementation of dense matrix multiplication over F 2 . Note, that the high-performance FFLAS [Pernet 2001 ] library does not feature a dedicated implementation for F 2 . We note that all three projects implement different variants of matrix multiplication. GAP implements Algorithm 1 with a fixed k = 8 but no asymptotically fast matrix multiplication algorithm. MAGMA implements Strassen-Winograd matrix multiplication with "padding" and a version of Algorithm 1 as base case (A. Steel, private communication). The crossover from Strassen to Algorithm 1 in MAGMA is hardcoded at c s = 2048 for the Core 2 Duo and c s = 1800 for the Opteron. To achieve cache efficiency MAGMA divides the input matrices into submatrices of dimensions 256×512 and 512×2048 on the Opteron before applying Algorithm 1 and into submatrices of dimensions 2048 × 512 and 512 × 2048 on the Core 2 Duo. We note that while dense matrix multiplication over F 2 in MAGMA was optimized for the Core 2 Duo and the Opteron, it was not optimized for any other architecture.
In the Tables VII and VIII, we give the average of ten observed runtimes and RAM usage for multiplying two random square matrices. The timings for M4RI were obtained using Sage [Stein et al. 2008] . M4RI was compiled with GCC 4.3.1 on both machines and we used the options -O2 on the Opteron machine and -O2 -msse2 on the Core 2 Duo machine. (see Table IX .)
We note that the advantage of our approach over other implementations varies greatly with the architecture considered. On the one hand, these timings demonstrate the validity of our approach by showing a 1.2 − 1.3 speedup over the best known implementation on the Core 2 Duo. On the other hand, our approach seems to offer little if any advantage over the simpler approach followed by MAGMA on the Opteron. It seems unclear whether significant gains can be achieved on the Opteron without any further theoretical advancements in the field of matrix multiplication or whether in fact the comparable performance indicates optimal performance using current techniques.
We note that whilst the advantage over MAGMA is considerable on the Itanium this does not allow one to draw conclusions about the underlying strategy, as MAGMA was not optimized for this platform. Also MAGMA hardcodes its optimization parameters whereas we rely on compile time parameters that allow greater flexibility across platforms.
