We analyze how fast we can solve general systems of multivariate equations of various low degrees over F2; this is a well known hard problem which is important both in itself and as part of many types of algebraic cryptanalysis. Compared to the standard exhaustive search technique, our improved approach is more efficient both asymptotically and practically. We implemented several optimized versions of our techniques on CPUs and GPUs. Our technique runs more than 10 times faster on modern graphic cards than on the most powerful CPU available. Today, we can solve 48+ quadratic equations in 48 binary variables on a 500-dollar NVIDIA GTX 295 graphics card in 21 minutes. With this level of performance, solving systems of equations supposed to ensure a security level of 64 bits turns out to be feasible in practice with a modest budget. This is a clear demonstration of the computational power of GPUs in solving many types of combinatorial and cryptanalytic problems.
Introduction
Solving a system of m nonlinear polynomial equations in n variables over F q is a natural mathematical problem that has been investigated by various research communities. The cryptographers are among the interested parties since an NP-complete problem whose random instances seem hard could be used to design cryptographic primitives, as witness the development of multivariate cryptography in the last few decades, using one-way trapdoor functions such as HFE, SFLASH, and QUARTZ [12, 20, 21] , as well as stream ciphers such as QUAD [4] . Conversely, in "algebraic cryptanalysis" one distills from a cryptographic primitive a system of multivariate polynomial equations with the secret among the variables. This does not break AES as first advertised, but does break KeeLoq [11] , for a recent example, and find a faster collision on 58-round SHA-1 [24] .
Since the pioneering work by Buchberger [9] , Gröbner-basis techniques have been the most prominent tool for this problem, especially after the emergence of faster algorithms such as F 4 or F 5 [15, 16] , which broke the first HFE challenge [17] . The cryptographic community independently rediscovered some of the ideas underlying efficient Gröbner-basis algorithms as of the XL algorithm [13] and its variants. They also introduced techniques to deal with special cases, particularly that of sparse systems [1, 23] .
In this paper we take a different path, namely improving the standard and seemingly well-understood exhaustive search algorithm. When the system consists of n randomly chosen quadratic equations in n variables, all the known solution techniques have exponential complexity. In particular, Gröbner-basis methods have an advantage on very overdetermined systems (with many more equations than unknowns) and systems with certain algebraic "weaknesses", but were shown to be exponential on "generic" enough systems in [2, 3] . In addition, the computation of a Gröbner basis is often a memorybound process; since memory is more expensive than time at the scale of interest, such sophisticated techniques can be inferior in practice when compared to simple testing of all the possible solutions, which uses almost no memory.
For "generic" quadratic systems, experts believe [2, 25] that Gröbner basis methods will go up to degree D 0 , which is the minimum possible D where the coefficient of t
−m goes negative, and then require the solution of a system of linear equations with T n D0−1 variables. This will take at least poly(n)·T 2 bit-operations, assuming we can afford a sufficiently large amount of memory and that we can solve such a linear system of equations with non-negligible probability in O(N 2+o (1) ) time for N variables. For example, if we assume we can operate a Wiedemann solver on a T × T submatrix of the extended Macaulay matrix of the original system, then the polynomial is 3n(n − 1)/2. When m = n = 200, D 0 = 25, making the value of T exceeds 2 102 ; even taking into consideration guessing before solving [6, 26] , we can still easily conclude that Gröbner-basis methods would not outperform exhaustive search in the practically interesting range of m = n ≤ 200.
The questions we address are therefore: how far can we go, on both theoretical and practical sides, by pushing exhaustive search further? Is it possible to design more efficient exhaustive search algorithms? Can we get better performance using different hardware such as GPUs? Is it possible to solve in practice, with a modest budget, a system of 64 equations in 64 unknowns over F 2 ? Less than 15 years ago, this was considered so difficult that it even underlied the security of a particular signature scheme [19] . Intuitively, some people may consider an algebraic attack that reduces a cryptosystem to 64 equations of degree 4 in 64 F 2 -variables to be a successful practical attack. However, the matter is not that easily settled because the complexity of a naïve exhaustive search algorithm would actually be much higher than 2 64 : simply testing all the solutions in a naïve way results in 2 · 64 4 · 2 64 ≈ 2 84 logical operations, which would make the attack hardly feasible even on today's best available hardware.
Our Contribution. Our contribution is twofold. On the theoretical side, we present a new type of exhaustive search algorithm which is both asymptotically and practically faster than existing techniques. In particular, we show that finding all zeroes of a single degree-d polynomial in n variables requires just d · 2 n bit operations. We then extend this technique and show how to find all the common zeroes of m random quadratic polynomials in log 2 n · 2 n+2 bit operations, which is only slightly higher. Surprisingly, this complexity is independent of the number of equations m.
On the practical side, we have implemented our new algorithms on x86 CPUs and on NVIDIA GPUs. While our CPU implementation is fairly optimized using vector instructions, our GPU implementation running on one single NVIDIA GTX 295 graphics card runs up to 13 times faster than the CPU implementation using all four cores of an Intel quad-code Core i7 at 3 GHz, one of the fastest CPUs currently available. Today, we can solve 48+ quadratic equations in 48 binary variables using just an NVIDIA GTX 295 graphics card in 21 minutes. This device is available for about $500. It would be 36 minutes for cubic equations and two hours for quartics. The 64-bit signature challenge [19] can thus be broken with 10 such cards in 3 months, using a budget of $5000. Even taking into account Moore's law, this is still quite an achievement. In contrast, the implementation of F 4 in MAGMA-2.16, often cited as the best Gröbner-basis solver commercially available today, will completely use up 64 GB of RAM in solving just 25 cubic equations in as many F 2 -variables. We have also tested it with overdefined systems, for which Gröbner-basis algorithms are known to work better. While it does not run out of memory, the results are not satisfying: 2.5 hours to solve 20 cubic equations in 20 variables, half an hour for 45 quadratic equations in 30 variables, and 7 minutes for 60 quadratic equations in 30 variables on one 2.2-GHz Opteron core. Some very recent improvements on Gröbner-basis solvers have reported speed-up over MAGMA F 4 of two-to five-fold [10] . However, even with such significant improvements, Gröbner-basis solvers do not seem to be able to compete with exhaustive search algorithms in this range, as each of the above is solved in a split second using negligible amount of memory on the same CPU by the latter.
Implications. The new exhaustive search algorithm can be used as a black box in cryptanalysis that needs to solve quadratic equations. This includes, for instance, several algorithms for the Isomorphism of Polynomials problem [7, 22] , as well as attacks that rely on such algorithms, e.g., [8] .
We also show with a concrete example that (relatively simple) computations requiring 2 64 operations can be easily carried out in practice with readily available hardware and a modest budget. Lastly, we highlight the fact that GPUs have been used successfully by the cryptographic community to obtain very efficient implementations of combinatorial algorithms or cryptanalytic attacks, in addition to the more numeric-flavored cryptanalysis algorithm demonstrated by the implementation of the ECM factorization algorithm on GPUs [5] .
Organization of the Paper. Section 2 establishes a formal framework of exhaustive search algorithms including useful results on Gray Codes and derivatives of multivariate polynomials. Known exhaustive search algorithms are reviewed in Section 3. Our algorithm to find the zeroes of a single polynomial of any degree is given in Section 4, and it is extended to find the common zeroes of a collection of polynomials in Section 5. Section 6 describes the two platforms on which we implemented the algorithm, and Section 7 describes the implementation and performance evaluation results.
Generalities
In this paper, we will mostly be working over the finite vector space (F 2 )
n . The canonical basis is denoted by (e 0 , . . . , e n−1 ). We use ⊕ to denote addition in (F 2 ) n , and + to denote integer addition. We use i k (resp. i k) to denote binary left-shift (resp. right shift) of the integer i by k bits. 
Definition 1. GRAYCODE(i)
Derivatives. Define the F 2 derivative ∂f ∂i of a polynomial with respect to its i-th variable as
. Then for any vector x, we have:
If f is of total degree d, then ∂f ∂i is a polynomial of degree d − 1. In particular, if f is quadratic, then ∂f ∂i is an affine function. In this case, it is easy to isolate the constant part (which is a constant in 
n . More precisely, we have
Then equation (1) becomes:
Enumeration Algorithms. We are interested in enumeration algorithms, i.e., algorithms that evaluate a polynomial f over all the points of (F 2 ) n to find its zeroes. 
. Finding all the zeroes of f is then achieved with the loop shown below.
if State.y = 0 then State.x is a zero of f 5:
NEXT(State) 6: end for 7: end procedure
Known Techniques for Quadratic Polynomials
We briefly discuss the enumeration techniques known to the authors. Naïve Evaluation. The simplest way to implement an enumeration algorithm is to evaluate the polynomial f from scratch at each point of (F 2 )
n . This requires two AND per quadratic monomial, and (almost) as many XORs. Since the evaluation takes place many times for the same f with different values of the variables, we will usually assume that the polynomial can be hard-coded, i.e., that it is not necessary to include the terms for which a ijk = 0. Each call to NEXT would then require at most n(n + 1) bit operations, half-AND and half-XOR (not counting the cost of enumerating (F 2 ) n , i.e., incrementing a counter). This can be improved a bit, by factoring out the monomials:
The bit-operation count falls down to n(n + 3)/2, and in general for degree-d polynomials to a sum dominated by n d . This method is simple but not without its advantages, chiefly (a) insensitivity to the order in which the points of (F 2 ) n are enumerated, and (b) we can bit-slice and get a speed up of nearly ω, where ω is the maximum width of the CPU logical instructions.
The Folklore Differential Technique. It was pointed out in Sec. 2 that once f (x) is known, computing f (x + e i ) amounts to compute ∂f ∂i (x), and this derivative happens to be a linear function which can be efficiently evaluated by computing a vectorvector product and a few scalar additions. This strongly suggests to evaluate f on (F 2 ) n using a Gray Code, i.e., an ordering of the elements of (F 2 ) n such that two consecutive elements differ in only one bit. This leads to the algorithm shown below.
y ← y ⊕ z 6:
x ← x ⊕ e k 7: end function (a) Initialize (b) Update
We believe this technique to be folklore, and in any case it appears more or less explicitly in the existing literature. Each call to NEXT requires n ANDs, as well as n + 2 XORs, which makes a total bit operation count of 2(n + 1). This is about n/4 times less than the naive method. Note that when we describe an enumeration algorithm, the variables that appear inside NEXT are in fact implicit functions of State. The dependency has been removed to lighten the notational burden.
A Faster Recursive Algorithm for Any Degree
We now describe one of the main contributions of this paper, a new algorithm which is both asymptotically and practically faster than standard exhaustive search in enumerating the solutions of one polynomial equation, as summarized by Theorem 1 below. The proof is given in the full version.
Construction of the Recursive Enumeration Algorithm. We will construct an enumeration algorithm in two stages. First, if f is of degree 0, then we only need to "enumerate" through all vectors by updating with x ← x ⊕ e b1(i) at the i-th step. When f is of higher degree, we need a little more effort. The main idea is that in the folklore differential algorithm of Sec. 3, the computation of z essentially amounts to evaluate ∂f ∂k on something that looks like a Gray Code. We may then use the enumeration algorithm recursively on ∂f ∂k , since it is a polynomial of strictly smaller degree. The resulting algorithm is shown below.
It is not difficult to see that the complexity of NEXT is O (d), where d is the degree of f . The temporal complexity of INIT is n d times the time of evaluating f , which is itself upper-bounded by n d and its spatial complexity is also of order O n d . This means that the complexity of the algorithm is
, this is about n times faster than the algorithm described in Sec. 3.
for i from 0 to 2 n − 1
x ← x ⊕ e k+k 0 5:
for v from 0 to 2 u − 2 10:
end for 16 : end for (a) General Setting (b) Unrolled version for quadratic f
Common Zeroes of Several Multivariate Polynomials
We will use several time the following simple idea: all the techniques we discussed above perform a sequence of operations that is independent of the coefficients of the polynomials. Therefore, m instances of (say) algorithm in Sec. 4 could be run in parallel on f 1 , . . . , f m . All the parallel runs would execute the same instruction on different data, making it efficient to implement on vector or SIMD architectures. In each iteration of the main loop, it is easy to check if all the polynomials vanished on the current point of (F 2 ) n . Evaluating all the m (quadratic) polynomials in parallel using algorithm in Sec. 4 would take 2m2 n bit operations. The point of this section is that it is possible to do much better than this.
Note that for the sake of simplicity, we limit our discussion to the case of quadratic polynomials (this case being the most relevant in practice). Our objective is now to show the following result. Theorem 2. The common zeroes of m (random) quadratic polynomials in n variables can be found after having performed in expectation log 2 n · 2 n+2 bit operations.
We sketch a proof (a complete one given in the extended version) to the theorem. Let us introduce a useful notation. Given an ordered set U , we denote the common zeroes of
and
. It should be clear that Z = Z m is the set of common zeroes of the polynomials, and therefore the object we wish to obtain. The key idea is to compute Z k using k parallel runs of algorithm in Sec. 4, and then computing Z k+1 , . . . , Z m one by one. Computing Z k requires 2k2 n bit operations. It then remains to compute Z m from Z k , and to find the best possible value of k. If we use the naïve evaluation strategy with early abort to compute Z m from Z k , then it can be shown that the best value of k is k = 2 log 2 n − log 2 log 2 n + o(log log n), yielding a total complexity of about 8 log 2 n · 2 n . In general, for degree-d systems, the same reasoning would lead to a total complexity of about 4d · log 2 n · 2 n . In practice, it makes more sense to choose k to be the word width on a microprocessor in order to use the hardware in the most efficient way.
A Brief Description of the Hardware Platforms

Vector Units on x86-64
The most prevalent SIMD (single instruction, multiple data) instruction set today is SSE2, available on all current Intel-compatible CPUs. SSE2 instructions operate on 16 architectural xmm registers, each of which is 128-bit wide. We use integer operations, which treat xmm registers as vectors of 8-, 16-, 32-or 64-bit operands.
The highly non-orthogonal SSE instruction set includes Loads and Stores (to/from xmm registers, memory -both aligned and unaligned, and traditional registers), Packing/Unpacking/Shuffling, Logical Operations (AND, OR, NOT, XOR, Shifts Left, Right Logical and Arithmetic -bit-wise on units and byte-wise on the entire xmm register), and Arithmetic (add, substract, multiply, max-min) with some or all of the arithmetic widths. The interested reader is referred to Intel and AMD's manuals for details on these instructions, and to references such as [18] for throughput and latencies.
G2xx-Series Graphics Processing Units from NVIDIA
We choose NVIDIA's G2xx GPUs as they have the least hostile GPU parallel programming environment called CUDA (Compute Unified Device Architecture). In CUDA, we program in the familiar C/C++ programming language plus a small set of GPU extensions.
An NVIDIA GPU contains anywhere from 2-30 streaming multiprocessors (MPs). There are 8 ALUs (streaming processors or SPs in market-speak) and two super function units (SFUs) on each MP. A top-end "GTX 295" card has two GPUs, each with 30 MPs, hence the claimed "480 cores". The theoretical throughput of each SP per cycle is one 32-bit integer or floating-point instruction (including add/subtract, multiply, bitwise AND/OR/XOR, and fused multiply-add), and that of an SFU 2 floating-point multiplications or 1 special operation. The arithmetic units have 20+-stage pipelines.
Main memory is slow and forms a major bottleneck in many applications. The read bandwidth from memory on the card to the GPU is only one 32-bit read per cycle per MP and has a latency of > 200 cycles. To ease this problem, the MP has a register file of 64 KB (16,384 registers, max. of 128 per thread), a 16-bank shared memory of 16 KB, and an 8-KB cache for read-only access to a declared "constant region" of at most 64 KB. Every cycle, each MP can achieve one read from the constant cache, which can broadcast to many thread at once.
Each MP contains a scheduling and dispatching unit that can handle a large number of lightweight threads. However, the decoding unit can only decode once every 4 cycles. This is typically 1 instruction, but certain common instructions are "half-sized", so two such instructions can be issued together if independent. Since there are 8 SPs in an MP, CUDA programming is always on a Single Program Multiple Data basis, where a "warp" of threads (32) should be executing the same instruction. If there is a branch which is taken by some thread in a warp but not others, we are said to have a "divergent" warp; from then on only part of the threads will execute until all threads in that warp are executing the same instruction again. Further, as the latency of a typical instruction is about 24 cycles, NVIDIA recommends a minimum of 6 warps on each MP, although it is sometimes possible to get acceptable performance with 4 warps.
Implementations
We describe the structure of our code, the approximate cost structure, our design choices and justify what we did. Our implementation code always consists of three stages:
Partial Evaluation: We substitute all possible values for s variables (x n−s , . . . , x n−1 ) out of n, thus splitting the original system into 2 s smaller systems, of w equations each in the remaining (n − s) variables (x 0 , . . . , x n−s−1 ), and output them in a form that is suitable for ... Enumeration Kernel: Run the algorithm of Sec. 4 to find all candidate vectors x satisfying w equations out of m (≈ 2 n−w of them), which are handed over to ... Candidate Checking: Checking possible solutions x in remaining m − w equations.
CPU Enumeration Kernel
Typical code fragments from the unrolled inner loops can be seen below: testing. All zeroes in one byte, word, or dword in a XMM register can be tested cheaply on x86-64. We hence wrote code to test 16 or 32 equations at a time. Strangely enough, even though the code above is for 16 bits, the code for checking 32/8 bits at the same time is nearly identical, the only difference being that we would subtitute the intrinsics _mm_cmpeq_epi32/8 for _mm_cmpeq_epi16 (leading to the SSE2 instruction pcmpeqd/b instead of pcmpeqw). Whenever one of the words (or double words or bytes, if using another testing width) is non-zero, the program branches away and queues the candidate solution for checking.
unrolling. One common aspect of our CPU and GPU code is deep unrolling by upwards of 1024× to avoid the expensive bit-position indexing. To illustrate with quartics as an example, instead of having to compute the positions of the four rightmost non-zero bits in every integer, we only need to compute the first four rightmost non-zero bits in bit 10 or above, then fill in a few blanks. This avoids most of the indexing calculations and all the calculations involving the most commonly used differentials. We wrote similar Python scripts to generate unrolled loops in C and CUDA code. Unrolling is even more critical with GPU, since divergent branching and memory accesses are prohibitively expensive.
GPU Enumeration Kernel
register usage. Fast memory is precious on GPU and register usage critical for CUDA programmers. Our algorithms' memory complexity grows exponentially with the degree d, which is a serious problem when implementing the algorithm for cubic and quartic systems, compounded by the immaturity of NVIDIA's nvcc compiler which tends to allocate more registers than we expected.
Take quartic systems as an example. Recall that each thread needs to maintain third derivatives, which we may call d ijk for 0 ≤ i < j < k < K, where K is the number of variables in each small system. For K = 10, there are 120 d ijk 's and we cannot waste all our registers on them, especially as all differentials are not equal -d ijk is accessed with probability 2 −(k+1) . Our strategy for register use is simple: Pick a suitable bound u, and among third differentials d ijk (and first and second differentials d i and d ij ), put the most frequently used -i.e., all indices less than u -in registers, and the rest in device memory (which can be read every 8 instructions without choking). We can then control the number of registers used and find the best u empirically.
fast conditional move. We discovered during implementation an undocumented feature of CUDA for G2xx series GPUs, namely that nvcc reliably generates conditional (predicated) move instructions, dispatched with exceptional adeptness.
// d_y^=d_yz xor.b32 $p1|$r20, $r17, $r20 mov.b32 $r3, $r1 mov.b32 $r1, s[$ofs1+0x0038] xor.b32 $r4, $r4, c0[0x0010] xor.b32 $p0|$r20, $r19, $r20 // res^=d_y @$p1.eq mov.b32 $r3, $r1 @$p1.eq mov.b32 $r1, s[$ofs1+0x003c] xor.b32 $r19, $r19, c0[0x0000] xor.b32 $p1|$r20, $r4, $r20 @$p0.eq mov.b32 $r3, $r1 // cmov @$p0.eq mov.b32 $r1, s[$ofs1+0x0040] // cmov ... ... diff0^= deg2_block [ 3 ] ; // d_y^=d_yz res^= diff0; // res^=d_y if( res == 0 ) y = z; // cmov if( res == 0 ) z = code233; // cmov diff1^= deg2_block [ 4 ] ; res^= diff1; if( res == 0 ) y = z; if( res == 0 ) z = code234; diff0^= deg2_block[ 0 ]; res^= diff0; if( res == 0 ) y = z; if( res == 0 ) z = code235; ... Consider, for example, the code displayed above right. According to our experimental results, the repetitive 4-line code segments average less than three SP (streamprocessor) cycles. However, decuda output of our program shows that each such code segment corresponds to at least 4 instructions including 2 XORs and 2 conditional moves [as marked in above left]. The only explanation is that conditional moves can be dispatched by the SFUs (Special Function Units) so that the total throughput can exceed 1 instruction per SP cycle. Further note that the annotated segment on the right corresponds to actual instructions far apart because an NVIDIA GPU does opportunistic dispatching but is nevertheless a purely in-order architecture, so proper scheduling must interleave instructions from different parts of the code.
testing. The inner loop for GPUs differs from CPUs due to the fast conditional moves.
Here we evaluate 32 equations at a time using Gray code. The result is used to set a flag if it happens to be all zeroes. We can now conditional move of the index based on the flag to a register variable z, and at the end of the loop write z out to global memory.
However, how can we tell if there are too many (here, two) candidate solutions in one small subsystem? Our answer to that is to use a buffer register variable y and a second conditional move using the same flag. At the end of the thread, (y, z) is written out to a specific location in device memory and sent back to the CPU. Now subsystems which have all zero constant terms are automatically satisfied by the vector of zeroes. Hence we note them down during the partial evaluation phase include the zeros with the list of candidate solutions to be checked, and never have to worry about for all-zero candidate solution. The CPU reads the two doublewords corresponding to y and z for each thread, and:
1. z==0 means no candidate solutions, 2. z!=0 but y==0 means exactly one candidate solution, and 3. y!=0 means 2+ candidate solutions (necessitating a re-check).
Checking Candidates
Checking candidate solutions is always done on CPU because the programming involves branching and hence is difficult on a GPU even with that available. However, the checking code for CPU enumeration and GPU enumeration is different.
CPU.
With the CPU, the check code receives a list of candidate solutions. Today the maximum machine operation is 128-bit wide. Therefore we should collect solutions into groups of 128 possible solutions. We would rearrange 128 inputs of n bits such that they appear as n __int128's, then evaluate one polynomial for 128 results in parallel using 128-bit wide ANDs and XORs. After we finish all candidates for one equation, go through the results and discard candidates that are no longer possible. Repeat the result for the second and any further equations (cf. Sec. 3).
We need to transpose a bit-matrix to achieve the effect of a block of w inputs n-bit long each, to n machine-words of w-bit long. This looks costly, however, there is an SSE2 instruction PMOVMSKB (packed-move-mask-bytes) that packs the top bit of each byte in an XMM register into a 16-bit general-purpose register with 1 cycle throughput. We combine this with simultaneous shifts of bytes in an XMM and can, for example, on a K10+ transpose a 128-batch of 32-bit vectors (0.5kB total) into 32 __int128's in about 800 cycles, or an overhead of 6.25 cycles per 32-bit vector. In general the transposition cost is at most a few cycles per byte of data, negligible for large systems.
GPU. As explained above, for the GPU we receive a list with 3 kinds of entries:
1. The knowledge that there are two or more candidate solutions within that same small system, with only the position of the last one in the Gray code order recorded. 2. A candidate solution (and no other within the same small system). 3. Marks to subsystems that have all zero constant terms.
For Case 1, we take the same small system that was passed into the GPU and run the Enumerative Kernel subroutine in the CPU code and find all possible small systems.
Since most of the time, there are exactly two candidate solutions, we expected the Gray code enumeration to go two-thirds of the way through the subsystem. Merge remaining candidate solutions with those of Case 2+3, collate for checking in a larger subsystem if needed, and pass off to the same routine as used in the CPU above. Not unexpectedly, the runtime is dominated by the thread-check case, since those does millions of cycles for two candidate solutions (most of the time).
Partial Evaluation
The algorithm for Partial Evaluation is for the most part the same Gray Code algorithm as used in the Enumeration Kernel. Also the highest degree coefficients remain constant, need no evaluation and and can be shared across the entire Enumeration Kernel stage. As has been mentioned in the GPU description, these will be stored in the constant memory, which is reasonably cached on NVIDIA CUDA cards. The other coefficients can be computed by Gray code enumeration, so for example for quadratics we have 
Peculiarities of GPUS.
Many warps of threads are required for GPUs to run at full speed, hence we must split a kernel into many threads, the initial state of each small system being provided by Partial Evaluation. In fact, for larger systems on GPUs, we do two stages of partial evaluation because 1. there is a limit to how many threads can be spawned, and how many small systems the device memory can hold, which bounds how small we can split; but 2. increasing s decreases the fast memory pressure; and 3. a small systems reporting two or more candidate solutions is costly, yet we can't run a batch check on a small system with only one candidate solution -hence, an intermediate partial evaluation so we can batch check with fewer variables.
More Test Data and Discussion
Some minor points which the reader might find useful in understanding the test data, a full set of which will appear in the extended version.
Candidate Checking. The check code is now 6-10% of the runtime. In theory (cf. Sec. 3) evaluation should start with a script which hard-wires a system of equations into C and compiling to an excutable, eliminating half of the terms, and leading to n−s d SSE2 (half XORs and half ANDs) operations to check one equation for w = 128 inputs. The check code can potentially become more than an order of magnitude faster. We do not (yet) do so presently, because compiling may take more time than the checking code. However, we may want to go this route for even larger systems, as the overhead from testing for zero bits, re-collating the results, and wasting due to the number of candidate solutions is not divisible by w would all go down proportionally. 
2(s+w−n)+1 , provided that n < s+w. As an example, for n = 48, s = 22, w = 32, the thread-recheck probability is about 1 in 2 13 , and we must re-check about 2 9 threads using Gray Code. This pushes up the optimal s for GPUs.
Architecture and Differences. All our tests with a huge variety of machines and video cards show that the kernel time in cycles per attempt is almost a constant of the architecture, and the speed-up in multi-cores is almost completely linear for almost all modern hardware. So we can compute the time complexity given the architecture, the frequency, the number of cores, and n. The marked cycle count difference between Intel and AMD cores is explained by Intel dispatching three XMM (SSE2) logical instructions to AMD's two per cycle and handling branch prediction and caching better. As the Degree d increases. We plot how many cycles is taken by the inner loop (which is 8 vectors per core for CPUs and 1 vector per SP for GPUs) on different architectures in Fig. 1 . As we can see, all except two architectures have inner loop cycle counts that are increasing roughly linearly with the degree. The exceptions are the AMD K10 and NVIDIA G200 architectures, which is in line with fast memory pressure on the NVIDIA GPUs and fact that K10 has the least cache among the CPU architectures.
More Tuning. We can conduct a Gaussian elimination among the m equations and such that m/2 selected terms in m/2 of the equations are all zero. We can of course make this the most commonly used coefficients (i. e., c 01 , c 02 , c 12 , . . . for the quadratic case). The corresponding XOR instructions can be removed from the code by our code generator. This is not yet automated and we have to test everything by hand. However, this clearly leads to significant savings. On GPUs, we have a speed up of 21% on quadratic cases, 18% for cubics, and 4% for quadratics. [The last is again due to the memory problems.] Notes and Acknowledgements C. Bouillaguet thanks Jean Vuillemin for helpful discussions. The Taiwanese authors thank Ming-Shing Chen for assistance with programming and fruitful discussion, Taiwan's National Science Council for partial sponsorship under grants NSC96-2221-E-001-031-MY3, 98-2915-I-001-041, and 98-2219-E-011-001 (Taiwan Information Security Center), and Academia Sinica for the Career Development Award. Questions and esp. corrections about the extended version should be addressed to by@crypto.tw.
