This paper presents a hardware-optimized variant of the well-known Gaussian elimination over GF(2) 
Introduction
Solving linear systems of equations is a very common computational problem appearing in numerous research disciplines. From a complexity theoretical point of view, the solution of an LSE is efficiently computable, i.e., using for example the well-known Gaussian elimination algorithm any LSE can be solved in at most cubic time. However, for some areas current algorithms and their software implementations are insufficient or unsatisfying in practice. This is often due to the large dimension or number of LSEs that must be solved in order to accomplish a specific task.
For instance, this is the case in the cryptanalysis of asymmetric encryption schemes like RSA or ElGamal, where very large but sparse LSEs over GF (2) or GF(p) (p prime) need to be solved to obtain a secret key. Since these LSEs must be solved accurately, approximative solutions using fast iterative algorithms are simply useless.
Another kind of problems -which we primarily address in this paper -appears in the cryptanalysis of symmetric ciphers. Here, one is faced with the problem of solving individual large dense LSEs or a large number of mediumsized 1 dense LSEs over GF(2) (for some special classes of stream ciphers). Efficiently accomplishing these tasks would potentially mean the reassessment of the security level of almost all symmetric ciphers, since the existing methods for solving the LSEs in software are rather inefficient with respect to brute force attacks.
However, to the best of our knowledge the alternative of an efficient implementation in hardware has not been analyzed thoroughly yet. In this paper we show that there is still room for drastic efficiency improvements of current algorithms when targeting special purpose hardware.
In order to develop such a hardware architecture we analyzed the suitability of various algorithms while taking possible simplifications over GF (2) into account. Gaussian elimination turned out to be basically best suited for hardware implementation due to its simplicity. Moreover, the coefficient matrices of LSEs are not required to have any special properties like being symmetric positive definite as it is the case, e.g., for the conjugate gradient methods [14] . By slightly modifying the logic of the Gauß algorithm and parallelizing element and row operations, we were able to develop a highly efficient architecture for solving mediumsized LSEs and computing inverse matrices. In a similar way, an architecture for fast matrix multiplication can be obtained and possibly be used in conjunction with the other architecture to solve large LSEs by means of Strassen's algorithm [17] . This paper is roughly structured as follows. We start with a brief discussion of previous work in this area. After reviewing standard Gaussian elimination with backward substitution, we present our optimized version of the algorithm for hardware implementation and analyze its performance. After that, further applications of the hardware architecture are discussed including its particular suitability in the area of cryptanalysis. We then present the actual design of the hardware architecture for solving medium-sized LSEs. Finally, we describe our proof-of-concept implementation on a contemporary low-cost FPGA and provide estimates for a possible ASIC implementation.
Previous Work
The special case of efficiently using Gaussian elimination for solving LSEs over GF (2) is not well documented yet. To the best of our knowledge, no previous work covers the particular topic of improving time efficiency by implementing a hardware-optimized, highly parallelized variant of Gaussian elimination. All research done until now deals with parallelized implementations of the standard Gauß algorithm without altering the algorithm itself. Furthermore, there is almost no performance data available.
In [6] , Gaussian elimination over GF(2) is implemented as part of the DARPA HPCS Discrete Mathematics Benchmark. However, since the author's focus was on a different part (pattern matching) of the benchmark, no explicit performance data for the Gauß implementation is provided.
In [16] , an implementation of Gaussian elimination over GF(2) is described for the ICL-DAP
2
. The limit is a matrix containing 4096 × 4096 elements according to the 4096 processors of the ICL-DAP. This architecture allows to compute the componentwise XOR of two rows within one clock cycle which results in an overall quadratic time complexity. However, the need for a supercomputer makes this solution quite expensive.
Recent work describes the implementation of an adapted algorithm for LU-decomposition on an NVidia GeForce 7800 GPU which is able to solve a 3500×3500 matrix with 32-bit real number coefficients in approximately 5 seconds, where the asymptotic time complexity is still cubic [12] . This impressively shows the power of non-x86 hardware, the cost (around $500 for a GeForce 7800 GTX based graphics card at the time of writing) and time complexity are not yet optimal, though.
Further work deals with (hardware) implementations of the Wiedemann and Lanczos algorithms for very large but sparse matrices [5, 15, 13] . Here, the problem of solving an LSE is reduced to the problem of computing matrix-vector products. However, for matrices of dimensions up to a magnitude of 10 algorithms more than compensates their better asymptotic time complexity. Since our focus is on medium sized and preferably not sparse matrices, the work cited above follows a different approach which is not considered in this paper.
Basic Notation
In order to maintain simplicity and consistency throughout this paper, we will use the following basic notation:
• Matrices are denoted by capital letters, e.g., A.
• The unit matrix of dimension n × n is denoted by I n .
• The i-th row vector of a matrix A is denoted by a i .
• The j-th column vector of a matrix A is denoted by a j .
• An element in the i-th row and j-th column of a matrix
A is denoted by a ij .
• Lowercase letters, e.g., b, denote column vectors.
Mathematical Background
The relevant approach for the proposed architecture is Gaussian elimination carried out as follows: A given LSE of the form A · x = b is transformed to the equivalent system U · x = b , where U is an upper triangular matrix, by applying elementary row operations (i.e., swapping rows and adding a multiple of one row to another). The system U · x = b can then trivially be solved by using backward substitution.
For implementation, the right hand side b is treated as an additional passive column to the coefficient matrix A. We call a column passive if it is not involved in the control flow of the algorithm, i.e., the elements of such a column do not influence the execution of the algorithm in any way but every executed row operation also affects these elements. Thus, in the following we do not need to take care of the right hand side while discussing the algorithms.
Gaussian Elimination over GF(2)
For matrices over GF (2) , Gaussian elimination becomes quite simple: The required row operations consist of the conditional XOR of two rows and the swapping of two rows.
Given a binary regular matrix A, Algorithm 1 performs Gaussian elimination and backward substitution in parallel, i.e., when the algorithm terminates the result matrix is not just an upper triangular matrix U but the identity matrix I n , so the corresponding LSE is already solved. Algorithm 1 works as follows: The outer for-loop of the algorithm is executed for each column of A. Within the for-loop, two things happen: First lines 2 to 5 take care of the necessary pivoting by choosing a row below the diagonal with a nonzero element (pivot element) in the currently examined column (pivot column) as the pivot row. Next, lines 6 to 9 execute a column elimination: The pivot row is XORed to each row having a nonzero element in the pivot column, effectively eliminating all these elements except the pivot element itself. Since the pivot element is always moved to the diagonal, the identity matrix remains after the outer for-loop has been executed for every column of A.
Algorithm 1 Binary Gaussian Elimination with Pivoting
Require: Regular matrix A ∈ {0, 1} exchange a k with a s
6:
for each row i = 1 : n do 7: if (i = k) ∧ (a ik = 1) then 8: for each element j = k : n do 9: a ij := a ij ⊕ a kj Worst and average case runtime of Algorithm 1. The worst case runtime overall needed for pivoting is quadratic. Furthermore, n column eliminations are necessary, each consisting of n row operations. A row operation during the k-th column elimination consists itself of n − k + 1 XOR operations. Hence, in total about n 3 XOR operations are required yielding a worst case runtime of O(n 3 ). For the average case let us assume a matrix A = (a ij ) with uniformly distributed entries, i.e., P r(a ij = 1) = 0.5, as input to the algorithm. In this case the overall runtime for pivoting is linear. Moreover, during a column elimination a row operation will only be executed with probability 0.5 (due to the condition in line 7). Thus, we obtain the following (expected) number of required XOR operations
Hence, the running time of Algorithm 1 in the average case as defined above is still cubic.
Proposed Algorithm

Gaussian Elimination in Hardware
Since a naive implementation of Algorithm 1 in hardware would not yield the desired speed-up compared to a software implementation, we have done several slight modifications to the algorithm described in the following.
Clearly, our goal is to make use of parallelization in hardware. Obviously, the following parts of Algorithm 1, responsible for the actual elimination, are basically well suited for this purpose:
• The per-element operations (Lines 8-9).
• The row operations as a whole (Lines 6-9).
Moreover, we change the logic of the algorithm in order to simplify an implementation in hardware: It is equivalent to have a fixed matrix and change the row under examination or to always examine the first row and cyclic shift the whole matrix accordingly. Obviously, this is also true for columns. The shifting approach has the following advantages: We always need only to consider the first column in order to find a pivot element. Furthermore, always the first row can be used for elimination. Hence, we choose this approach since it seems to be better suited targeting hardware.
Applying the changes described above, Gaussian elimination over GF (2) A := shif tup(n − k + 1, A)
4:
A := eliminate(A) works as described in the following. As before, we iterate over all columns of the matrix A. In the k-th iteration we perform the following steps to obtain a pivot element: We do a cyclic shift-up of all rows not yet used for elimination (due to the shifting approach these are the first n − k + 1 rows) until the element at the fixed pivot position (a 11 ) equals 1. More precisely, the mapping shif tup
(in one step) until a 11 = 1, where shif tup is defined as:
. . a n )
T After a pivot element has been found in the k-th iteration, we add the first row a 1 to all other rows a i where i = 1 and a i1 = 1. In addition, we do a cyclic shift-up of all rows and a cyclic shift-left of all columns. By doing the cyclic shift-up operations after an elimination, rows already used for elimination are "collected" at the bottom of the matrix, which ensures that these rows are not involved in subsequent pivoting steps. The cyclic shift-left ensures that the elements which should be eliminated in the next iteration are located in the first column of the matrix. More precisely, the following (partial) mapping is computed which combines the actual elimination, the cyclic shift-up and the cyclic shift-left:
Indeed, the mapping eliminate can be computed within a single clock cycle using our hardware architecture described in Section 7.
In order to illustrate Algorithm 2, Figure 1 depicts an example application to the LSE
For the sake of concreteness we explicitly treat the right hand side as an additional passive column to A and view the changes on this extended matrix resulting from each execution step. Rows already used for elimination are visualized by shading the corresponding elements. For convenience, a small arrow indicates the position of the passive column. When the algorithm terminates, the coefficient matrix has been transformed into the identity matrix I 3 . The passive column is the leftmost column and contains the solution
Performance
In the following we analyze the running time of Algorithm 2 on input A ∈ {0, 1} n×n . Since we assume that an application of shif tup and eliminate can be computed in a single clock cycle, we simply have to count the required total number of these primitive operations. As one can easily see from Algorithm 2, always exactly n applications of eliminate are performed and only the number of shif tup applications varies depending on the concrete matrix.
Best and Worst Case Running Time. In the best case we have no application of shif tup at all and in the worst we have exactly n − k applications of shif tup during the k-th iteration (assuming A is uniquely solvable). Hence, we obtain the following bounds on the running time:
Proposition 1 (Bounds on Running Time) The time complexity T (n) of Algorithm 2 is bounded by
Average Running Time for Random Matrices. Let us consider a random matrix A as input to Algorithm 2 where each entry a ij equals 1 with probability α ∈ (0, 1), i.e., P r(a ij = 1) = α for all
denote the matrix A after the k-th iteration of the for-loop and let . Now, it is important to observe that during the k-th iteration only the elements of the first column of
are considered in the pivoting step. If α = 0.5 then it is easy to see that the probability P r(h (k) ij = 1) = 0.5 stays the same for all 0 ≤ k < n. 4 Thus, the expected number of required shif tup operations equals 1 in each iteration. Hence, for α = 0.5 we obtain the following average running time:
Proposition 2 (Average Running Time for α = 0.5) The average running time of Algorithm 2 on a binary n × n matrix A with P r(a ij = 1) = 0.5 is 2n.
Doing similar considerations (and some simplifying assumptions), the following estimate of the running time can be established for the general case α ∈ (0, 1):
n×n be a random matrix with P r(a ij = 1) = α ∈ (0, 1). Then the average running time T (n) of Algorithm 2 on input A can be estimated by
where P 1 = α and
It is easy to see that the sequence P k converges to 0.5 for all α ∈ (0, 1). Since this sequence converges quite fast for α belonging to a large interval around 0.5, the running time is very close to 2n for almost all matrices except for very sparse (α < 0.05) or very dense (α > 0.95) matrices. The accuracy of our estimations has been approved by a benchmark of an emulation of the proposed hardware architecture, where the number of required clock cycles to solve matrices of different dimensions and densities has been counted. The results of this benchmark are depicted in Figure 2 . 
Solving Overdetermined LSEs
In practice, we are often interested in solutions to overdetermined LSEs. An LSE A · x = b, where A ∈ {0, 1} m×n and b ∈ {0, 1} m , is called overdetermined if m > n (i.e., more equations than unknowns). To compute a solution to a uniquely solvable 5 overdetermined LSE using Algorithm 2, simply the additional rows must be taken into account in the definition (and actual implementation) of shif tup and eliminate by replacing n with m every time it refers to the row count. More precisely, this means that in the k-th iteration the first m − k + 1 rows must be shifted up and each application of the elimination operation must also involve the additional rows. Then, by applying Algorithm 2, the matrix A and the passive column b are transformed to
where the n-bit vector x is the solution to the LSE.
Multiple b-Vectors and Matrix Inversion
Sometimes, solutions to LSEs of the form A · x i = b i (1 ≤ i ≤ k) are required for multiple different right hand sides b i and the same m × n coefficient matrix A. Using Algorithm 2 (without modifications), all solutions 6 x i can be computed in parallel by simply maintaining k passive columns (see Section 4) instead of 1. Clearly, the running time of the algorithm remains the same as in the case of a single passive column. In other words, a solution to A·X = B is computed where X = (x 1 , . . . , x k ) is an n × k and B = (b 1 , . . . , b k ) is an m × k matrix. Hence, as a special case we can compute the inverse of a regular n × n matrix A by simply setting B = I n . 5 A · x = b is uniquely solvable iff rank(A) = rank(A|b) = n, where A|b denotes the matrix A extended by the column vector b. 6 The algorithm still works if A · x i = b i is not uniquely solvable for some b i .
Further Applications
Besides solving possibly overdefined medium-sized binary LSEs and computing matrix inverses, the presented architecture can be used for a whole suite of other useful applications. Moreover, the idea behind the architecture can lead to other very efficient matrix algorithms. In Subsection 6.1, we show how to use the approach to derive a very efficient matrix-by-matrix multiplication algorithm. Furthermore, we sketch how to solve large matrix problems with our design and with the help of Strassen's algorithm. Finally, Subsection 6.3 shows how cryptanalysis of symmetric ciphers is affected by the design at hand.
Matrix Multiplication
Matrix multiplication is computationally equivalent to matrix inversion. To compute the product of two binary n × n matrices C = A · B over GF(2) using matrix inversion, and thus by exploiting our hardware architecture, the following idea can be used. Let the 3n × 3n binary matrix D be defined as follows:
Then it is easy to see that the inverse of D has the form
Thus, by applying Algorithm 2 as outlined in Subsection 5.4 for the inversion of D, one obtains a working and quite efficient algorithm for matrix multiplication. However, further improvements are possible by observing that actually one does not need to run the whole inversion algorithm: To get the result it suffices to perform elimination for the submatrix A only using the elements of the I n in the middle of D for pivoting. Then the desired product could be directly read out of the upper right n×n submatrix of the resulting matrix. Exploiting this observation leads to the following optimized algorithm for matrix multiplication which can be implemented in hardware in a similar way as Algorithm 2.
Algorithm 3 Parallelized Binary Matrix Multiplication
Require: A, B, C ∈ {0, 1} n×n 1: C := 0 2: for k = 1 : n do 3:
Note that the correctness of Algorithm 3 immediately follows from the standard definition of a matrix product. Its running time is n which includes loading the row vector b k and the column vector a k in each iteration. The space complexity of the algorithm is O(n 2 ).
Solving Large Matrix Problems
Due to constraints on the fabrication process of custom designed hardware, Algorithms 2 and 3 can only be implemented for some limited value of n. This circumstance motivates the study of (recursively) splitting large matrix problems into smaller tasks which could in turn be solved using the above described algorithms directly in hardware.
Strassen suggested a number of matrix algorithms breaking down the complexity of the basic matrix operations [17] . Here, only his algorithms for multiplication and inversion of matrices (over GF(2)) are discussed.
The idea behind Strassen's algorithm for matrix multiplication is that of partitioning matrices and exploiting specific block representations to reduce the overall number of required bit operations. For two binary n × n matrices A and B, n being even, the product
can be computed in the following way using the n 2 × n 2 matrix subblocks A ij and B ij : 
Strassen's matrix multiplication algorithm requires 7 multiplications of n 2 × n 2 matrices instead of 8 in the case of the classical algorithm. This is achieved at the cost of several auxiliary matrix additions. In this way asymptotically O(n log 2 7 ) (with a small constant) bit operations are required compared to about O(n 3 ) for the classical multiplication. As a rule, the crossover point lies in the range of n = 16 to n = 128 depending on the concrete implementation platform and optimization techniques used [2, 3, 7] . The algorithm is perfectly parallelizable: The 7 submatrix products P 1 , . . . , P 7 can be computed in parallel, the same holding for the subsequent addition steps yielding C 11 , C 12 , C 21 and C 22 .
Strassen's algorithm for matrix inversion is related to the algorithm for multiplication. Let the n × n matrix C represent the inverse of A:
Then the inverse of A can be computed as follows:
Above algorithm needs 2 inversions and 6 multiplications of half-sized submatrices. These operations are highly dependent. However, the computation of P 2 , P 3 and C 12 , C 21 can be performed in parallel. So the running time of the algorithm is determined by the performance of 2 inversions, 4 multiplications and 2 additions of n 2 × n 2 matrices. Applying Strassen's matrix multiplication and inversion algorithms recursively to solve the half-sized subproblems, we obtain a total complexity of O(n log 2 7 ) operations. Large LSEs, large matrix inversion problems and large matrix multiplication problems can be split into smaller subproblems using Strassen's algorithms for matrix inversion and matrix multiplication recursively. As the optimal minimum splitting size is achieved, our hardware algorithms for inversion and multiplication are to be applied.
Matrix Problems in Cryptanalysis
Matrix problems over GF (2) are in general strongly related to cryptanalysis and in particular to algebraic attacks on symmetric ciphers which are generically applicable to any block or stream cipher. This is due to the fact that the overwhelming majority of deterministic symmetric ciphers can be represented as finite state machine whose output can be described by a (sometimes rather complicated) boolean function of the initial internal state and input values (if any) giving rise to LSEs over GF (2) .
In [1] , an efficient algebraic attack on the keystream generator underlying the encryption scheme E 0 used in the Bluetooth specification is proposed. The key length of the E 0 keystream generator is 128 bit. The equation generation and linearization methods succeed to output an LSE with at most 2
24.056
unknowns. The system of linear equations can be constructed offline once and implemented in one or several ASICs. For each attack (i.e., for each new key) the only input data are the elements of the concrete output stream. Using our algorithms for binary matrix inversion and multiplication in a linear number of clock cycles and following the proposed approach for splitting large systems of linear equations into smaller matrix problems, it seems possible to build a hardware system which could significantly reduce the actual running time of the attack and to achieve a good time/money ratio value.
Other examples are algebraic attacks on the A5/2 algorithm used in GSM networks. Here the attack from [4] is mentioned only. A5/2 is based on 4 maximal-length linear feedback shift registers clocked irregularly. The register R4 controls the clocking process for the registers R1, R2 and R3. The 64-bit key is the initial filling of the four 16-bit registers. The output is computed using a non-linear function of some positions in R1, R2 and R3. It proves to be possible to construct a system of 656 linearly independent equations by trying all 2 16 different values of R4. The solution of a system which agrees with the output stream is the desired session key. That is, one has to solve at most 2 16 systems of 656 linear equations over GF (2) to restore the initial filling of the 4 registers completely. Our algorithm for solving LSEs can be directly applied in this attack breaking down the overall running time significantly. If we neglect the costs for data transmission and constructing the systems of equations, the attack can be completed in 2 16 ·2·656 ≈ 2
26.35
clock cycles on a single ASIC using the proposed hardware architecture. If the hardware runs with 1 GHz, the key could be found within about 86 ms. This running time can be compared to 40 minutes on a Linux 800 MHz PIII PC [4] (where the major part of the running time is needed for solving the LSEs). Of course, further significant improvements of the running time can be achieved by using multiple ASICs in parallel.
Moreover, all stream ciphers (e.g., LILI-128, Toyocrypt, Phelix, Snow, Turing, etc.) based on simple combiners and on combiners with memory 7 are potentially vulnerable to algebraic attacks [9, 8, 11, 10] . In this context the hardware approach to solving large systems of linear equations in hardware may lead to a revaluation of the security of a numerous well-known symmetric ciphers.
The Proposed Hardware Architecture
Functional Description
In order to implement Algorithm 2 in hardware, we use a mesh structure of "smart memory" cells: The whole device consists of memory cells, which are able to execute the two basic operations defined in Subsection 5.1 in a single clock cycle. The design at hand comprises a parallel implementation of both operations where the pivot element is used as multiplexing signal for the appropriate result. This way, we save a clock cycle for deciding between the two operations. Besides performing the actual computation, the operations can also be used for initially loading the values into the design. Below, we will describe the basic constituents of the design and its interrelation. The terms in typewriter describe the name of the signals used in the figures. If appropriate, the corresponding variable of Algorithm 2 is provided additionally. The Basic Cell and its Interconnection. Figure 3 depicts a single "smart memory" cell with all required connections. Each cell stores a single bit of the matrix and has local connections to 4 direct neighbors: above (out1,lock row), below (in1,lock lower row), left above (out2), and right below (in2). Furthermore, it is connected to a global network: The global network comprises the signals from the pivot element (add=a 11 ), the first element in the actual row (row add=a i1 ), and the first element in the actual column (col add=a 1j ). Note that the used-flags for the actual row and the row below are provided by the signals lock row and lock lower row, respectively. Additionally, every element is connected to the global clock (clk) and reset (rst) network. All cells are aligned in a rectangular array as shown in Figure 4 . The upmost left cell always contains the pivot element. Its out1-signal is used as add-signal for the whole design. The out1-signals of all other cells in the first column are used as row add-signals for the actual row. Similarly, the shif tup-Operation. In case of the actual pivot element being '0', the architecture performs a simple shif tupoperation. All elements in the matrix with the used-flag set to '0' (lock row=0) simply shift their value one row up. Values in the upmost row will be shifted to the lowest unused row, resulting in a cyclic shift of all unused rows. All rows with the used-flag set to '1' (lock row=1) are not affected by the shift and stay constant. For implementation, we use the col add-signals to realize the cyclic shift from the first row to the last unused row.
In the very beginning, values have to be loaded into the design. Since a reset of the design sets all elements to '0', a shift-up is applied n times until the matrix is completely loaded. In order not to load more than n values into the design, the loading phase is coordinated by an input multiplexer, switching off the input after n steps.
eliminate-Operation. Elimination is performed when the pivot element equals '1' (add=1=a 11 ). Every element except for those in the first row will compute an XOR with the upmost column entry (col add=a 1j ) if the first entry in the row is '1' (row add=1=a i1 ). In the same step, the result is shifted to the element on the left in the upper row. The leftmost elements need no computation, since this step always yields a column consisting of zeroes and a single '1'. The first row (with the pivot element) is unaffected and will be shifted to the last row and its used-flag (lock row) will be set to '1'. The latter procedure prevents this row from being used again for pivoting.
End of Computation. When all rows have actively been used for pivoting, the used-flag reaches the topmost row and the architecture stops. In our design, this signal is used for indicating the end of computation. In the case of a single b-vector, the result can simply be read from the first column by reading the column add-signals, which are wired out explicitly. Depending on the outer control logic, the topmost used-flag can be used to indicate the finished state and eventually stop the computation until the result has been read.
Improvement of AT Complexity
A dramatic improvement in complexity of the average running time from a software implementation of binary Gauß (Algorithm 1) of O( 3 ) to the architecture at hand of O(2n) is achieved. We simply compute on all elements in parallel, yielding an improvement of O(n 2 ) in the running time. However, we do not simply trade this improvement in running time with an equivalent increase in area. In fact, the design scales with the same asymptotic complexity as the standard Gauß in software.
In both algorithms, the memory requirement scales with O(n 2 ), i.e., with the number of elements of the matrix. The only difference is that the type of memory cells in our design is a bit more complex (smart memory). However, the area occupied by a single cell is still constant and does not depend on the problem size. Hence, our hardware implementation does not only exhibit a much better running time but also a lower overall AT product compared to a software implementation of the standard algorithm.
Handling Un-and Not Uniquely Solvable LSEs
For the sake of simplicity, we always assumed uniquely solvable LSEs in this paper. However, in some applications the solvability of an LSE might not be known prior to the computation. The design can easily be extended to also detect and handle unsolvable and not uniquely solvable LSEs by observing the following facts:
• A uniquely solvable LSE is always solved after n applications of the eliminate-operation. Hence, the used-flag is the stop criterion.
• In the case of a solution manifold the stop criterion will never be reached since infinitely many shif tupoperations will occur at some point of the computation. To avoid such a dead-lock, we have to initialize a counter after every elimination step to limit the number of subsequent shif tup-operations to be less than the number of rows minus the number of locked rows.
• In the case of an unsolvable non-overdetermined LSE we are in the same situation as described above.
• In the case of unsolvable overdetermined LSEs the computation may terminate with an invalid solution having a '1' at a position corresponding to a zero row in the resulting matrix. Thus, all such positions have to be checked for '0'.
Pipelining
In order to achieve improved performance when solving many LSEs on a single "smart memory" chip, one can exploit the fact that once a column has undergone column elimination it does not have to be kept in memory anymore. Instead of keeping these columns, it is possible to load a new column of the matrix which will be processed next each time a column elimination is executed on the current matrix. With this approach, solving the current matrix and loading the next matrix could be done simultaneously.
However, there are some potential caveats to take care of: It has to be ensured that neither row permutations nor XOR operations can corrupt the columns of the partially loaded new matrix. This can for example be achieved with a flag similar to the already implemented used-flag protecting the rightmost columns. In fact, the already implemented usedflag can easily be reused for this purpose.
Implementation
This section explains the methodology used to implement the previously presented architecture on FPGAs. Along with implementational results such as running time and area consumption, we provide estimates about a possible ASIC implementation.
Methodology
The architecture has been programmed in VHDL and is kept generic to allow for easily changing to arbitrary matrix sizes and different dimensions of the b-vector. The presented results originate from the running implementation.
For loading data from and storing data into the FPGA, we additionally implemented a simple interface to a host PC. It facilitates automated tests with several test values for architectures of different size. Note that the speed of the data in-and output is critical in case of an ultimate implementation. Hence, interfaces should be capable of running at high-speed (e.g., by using PCI-express or proprietary systems such as those on a Cray-XD1). Table 1 presents the results of the implementation in terms of occupied FPGA resources depending on the size of the matrix. We considered quadratic matrices of dimension n = 5, 10, 20 and 50. All values represent the results for the running and verified design on the actual FPGA.
Results
A basic element can be implemented using only about 1.5 slices on average, which limits implementations on FPGAs to a maximum possible matrix size (see Table 1 ). With the current low-cost FPGA (Xilinx XC3S1000), matrix sizes of up to 70x70 are feasible. Note that we did not optimize the code for a certain FPGA, i.e., FPGA-specific optimizations might yield better area usage. The implementation at hand should be seen as proof-of-concept and not as ultimate design. However, the frequency of the matrix elements can be as high as 300 MHz which is close to the maximum possible frequency of the FPGA at hand. All designs have been tested at that frequency and displayed the correct results. Due to the restrictive nature of conventional FPGAs concerning the matrix size, an ASIC implementation should be considered for larger matrices and is discussed in the following section.
Estimates for an ASIC Design
Conventional FPGAs dispose of several features which can not be used for our design and, hence, the FPGA can not be utilized completely for the algorithm presented. Furthermore, the FPGA's inherent overhead to maintain its programmability prevents from an optimal and AT efficient design. When targeting a high-performant matrix solver for medium-sized matrices, a realization as Application Specific Integrated Circuit (ASIC) which can be tweaked completely to fit the requirements should be considered. Since our architecture has been successfully implemented on an FPGA, we now analyze its area consumption as ASIC and conclude with an estimate of the expected performance. For the estimate, we will not take drivers and buffers for the high-fanout wires into account. Furthermore, the maximum chip size will be limited due to the long data paths which presumably prevent the technical feasibility for matrices much larger than 1000 × 1000.
Equivalent Gate Count. As previously shown, the presented design is extremely simple and does not require complicated control logic. The matrix cell can be implemented with a few basic gates, as shown in Figure 5 . Thus, the area consumption in terms of transistors or equivalent gates (standard CMOS logic) can be evaluated pretty accurate. As essential building blocks, we need XORs, ORs, ANDs, NOTs and memory cells. For a single element of the matrix, we estimate an equivalent of 54 gates (two-input NAND).
For comparison, a Pentium 4 "Prescott" processor accumulates approximately 1.25 · 10 8 transistors (0.09 microns). Hence, with current technology such an architecture for matrices up to a dimension of 1000 could be implemented on a die size of less than that for a conventional desktop CPU. Note that power consumption and cooling might become an issue since, in contrast to conventional CPUs with cache, the whole circuit is active during computation.
Remark: The number of estimated gates is based upon simple 2-input logic gates. Hence, further optimization is possible and will result in a lower transistor count. Due to the regularity and relatively low latency of the design, clock frequencies in the range of 500 MHz up to a few GHz should be feasible.
Expected Performance. Loading an LSE of dimension 1000 into such a device, solving it, and reading out the solution could be done in 4 µs on average, assuming a clock rate of 1 GHz. Hence, up to 250 000 LSEs could be solved per second, assuming an adequate data input and output rate. For applications, an integrated pre-and post processing might be of interest in order to avoid the architecture from running idle. Only a high-speed I/O stream can bring out the best performance.
Conclusion
The paper at hand seems to be the first contribution towards a highly efficient architecture for solving linear systems of binary equations. We formulate an algorithmic improvement for the well-known Gaussian elimination for the binary case regarding hardware implementations. We provide lower and upper bounds on the runtime as well as the average running time. As a result, an architecture implementing the proposed algorithm can solve an LSE over GF(2) of dimension n × n in 2n steps on average, compared to A hardware implementation for LSEs up to a dimension of 50×50 has been realized using VHDL. The resulting implementation can be operated at high speed (up to 300MHz) and was verified on a contemporary low-cost FPGA (Xilinx Spartan3-1000). In addition, we provide a first estimate of the area complexity of an ASIC design: With current technology, the design can be implemented for LSEs up to dimension of 1000 × 1000 yielding conventional chip sizes. As a remarkable feature, the area complexity of the architecture scales with O(n 2 ), which is similar to the standard Gaussian approach, though the complexity of the running time is O(n 2 ) lower. Thus, the overall improvement regarding the AT -product is of order O(n 2 ) compared to the standard algorithm in software.
Due to technological constraints, we always face a maximum possible matrix size which can be realized in practice. Consequently, we describe how to solve larger LSEs by breaking down the initial matrix size with Strassen's algorithms. As shown within the paper, further applications include fast matrix-by-matrix multiplication and utilization in cryptanalysis. In the latter case, time consuming parts of actual cryptanalytic algorithms can be accelerated by order of several magnitudes with the architecture presented.
