We present techniques to obtain small circuits which also have low depth. The techniques apply to typical cryptographic functions, as these are often specified over the field $GF(2)$, and they produce circuits containing only AND, XOR and XNOR gates. The emphasis is on the linear components (those portions containing no AND gates). A new heuristic, DCLO (for depth-constrained linear optimization), is used to create small linear circuits given depth constraints. DCLO is repeatedly used in a See-Saw method, alternating between optimizing the upper linear component and the lower linear component. The depth constraints specify both the depth at which each input arrives and restrictions on the depth for each output. We apply our techniques to cryptographic functions, obtaining new results for the S-Box of the Advanced Encryption Standard, for multiplication of binary polynomials, and for multiplication in finite fields. Additionally, we constructed a 16-bit S-Box using inversion in $GF(2^{16})$ which may be significantly smaller than alternatives.
Mathematics Subject Classification (2010) 94C10

Introduction
Constructing optimal combinational circuits is an intractable problem under almost any meaningful metric (gate count, depth, energy consumption, etc.). In practice, no known techniques can reliably find optimal circuits for functions with as few as eight Boolean inputs and one Boolean output (there are $2^{256}$ such functions). Thus, heuristic or specialized techniques are necessary in practice.
Reducing the number of gates is important for reducing area and power consumption. Reducing depth, i.e. the number of gates on a longest path, leads to faster circuits. However, obvious ways to reduce depth lead to an explosion in size. In this paper we reduce size and depth simultaneously.
Combinational circuit optimization
Many different logically complete bases are possible for circuits. Since the operations in the basis (XOR, AND) are equivalent to addition and multiplication modulo 2 (i.e., in $GF(2)$), much work on circuits for cryptographic functions uses this basis. For logical completeness, we use the basis (XOR, AND, XNOR), although most of the paper uses only (XOR, AND). This platform-independent basis leads to easy comparison with previous results.
Classic results by Shannon [16] and Lupanov [9] show that almost all predicates on $n$ bits have circuit complexity about $2^n/n$. The multiplicative complexity of a function is the number of AND gates necessary and sufficient to compute the function. Analogous to the Shannon-Lupanov bound, it was shown in [3, 11] that almost all Boolean predicates on $n$ bits have multiplicative complexity about $2^{n/2}$. Strictly speaking, these theorems say nothing about the class of functions with polynomial circuit complexity. However, it is reasonable to expect that, in practice, the multiplicative complexity of these functions is significantly smaller than their Boolean complexity. This is one of the principles that guide our design strategy.
Circuits with few AND gates will naturally have large sections which are purely linear, i.e., contain no AND gates. Boyar and Peralta [2] and Courtois et al. [7] have used this insight to construct circuits much smaller than previously known for a variety of applications (see [15]). Both of those papers use a two-step process which first reduces multiplicative complexity and then optimizes linear components. The second of these steps involves solving a problem which is NP-hard and MAX-SNP hard [2], implying limits to its approximability. Early published heuristics for this step [1, 2, 14] do not consider depth. We do so here, and obtain circuits that are smaller in both size and depth for several functions of interest to cryptography. In particular, we improved on the results in [5, 13] for the S-Box of the Advanced Encryption Standard (AES).
We note that our results should not be interpreted as trading AND gates for XOR gates. We typically are able to produce circuits which have fewer XOR gates, fewer AND gates, and smaller depth than previously published circuits for the same functions.
Algorithm to find small low-depth circuits
We can consider a circuit as a directed acyclic graph where the nodes are either gates or inputs of fan-in zero. Nodes for gates that produce circuit outputs are referred to as "outputs". The depth of a node $X = A \;\mathrm{op}\; B$ is $\mathrm{depth}(X) = 1 + \max\{\mathrm{depth}(A), \mathrm{depth}(B)\}$, where op is a binary operation (AND, XOR, XNOR). Note that this definition allows us to assign arbitrary depths to the input nodes, though nodes which are inputs to the entire circuit are always assigned depth zero.
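The depth definition can be evaluated mechanically on a straight-line program. The following is a minimal sketch, assuming a hypothetical representation in which each gate is a tuple `(name, op, in_a, in_b)` listed in topological order; the function name `node_depths` and the example program are illustrative only and do not come from the paper.

```python
def node_depths(program, input_depths):
    """Return the depth of every node of a straight-line program.

    program: list of (name, op, in_a, in_b) tuples in topological order
             (hypothetical format chosen for this sketch).
    input_depths: dict mapping each input name to its assigned depth.
    """
    depth = dict(input_depths)
    for name, _op, a, b in program:
        # depth(X) = 1 + max{depth(A), depth(B)}, independent of the gate type
        depth[name] = 1 + max(depth[a], depth[b])
    return depth


# Illustrative program: x4 is assumed to arrive at depth 2, the rest at depth 0.
program = [
    ("t1", "XOR", "x1", "x2"),
    ("t2", "AND", "t1", "x3"),
    ("y", "XOR", "t2", "x4"),
]
depths = node_depths(program, {"x1": 0, "x2": 0, "x3": 0, "x4": 2})
print(depths["t1"], depths["t2"], depths["y"])
```

Assigning a nonzero depth to `x4` mirrors the remark that inputs of a component may be given arbitrary depths.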
The depth-constrained linear optimization (DCLO) problem
We consider a linear component of a circuit as a set of functions with fixed depths associated to the input variables and depth constraints associated to the outputs. This problem is best represented as a matrix with input depth constraints associated with columns and goal depth constraints associated with rows. The optimization problem is to find a circuit that satisfies the constraints and minimizes the number of gates. We call this problem DCLO (for Depth-Constrained Linear Optimization). For example, the matrix of Fig. 1 represents the problem of computing the four functions
(recall that addition is modulo 2).
The column heading $x_i : d_i$ states that input $x_i$ has input depth $d_i$. The row heading $y_i : t_i$ states that a solution must compute function $y_i$ at depth no more than $t_i$.
For example, the straight-line program
is not allowed because the depth of y 1 is 3. A valid straight-line program is
The See-Saw method
We now describe the method to find a small circuit given an overall depth constraint TargetD. We alternate restructuring the upper linear and lower linear components until there is no further improvement in size or depth. When one of these linear components is being restructured, the other is fixed. Each restructuring step is an instance of the DCLO problem described in the previous section. It is solved using a heuristic, DCLO, which we define later.
The process starts with the upper linear component. The input depth constraints for the upper linear component are always set to 0, and initially the goal depth constraints are set to the minimum feasible. The minimum feasible goal depths for the upper linear component are calculated as follows: If the Hamming weight (number of 1s) of the row corresponding to the output is $w$, at least depth $\lceil \log_2 w \rceil$ is necessary. This required depth can be achieved by placing the XOR gates in a balanced binary tree. Starting with these required depths allows us to jump-start the process and also will give us a lower bound on the depth of any solution that does not restructure the middle component.
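The initial goal depths for the upper linear component follow directly from the row weights. The sketch below assumes the matrix is given as a list of 0/1 lists and that all inputs arrive at depth 0; the helper name `min_goal_depths` is hypothetical.

```python
def min_goal_depths(matrix):
    """Minimum feasible goal depth ceil(log2(w)) for each row of weight w.

    Assumes every input has depth 0, as for the upper linear component.
    """
    goals = []
    for row in matrix:
        w = sum(row)
        # For w >= 1, (w - 1).bit_length() equals ceil(log2(w)).
        goals.append((w - 1).bit_length() if w > 0 else 0)
    return goals


example = [
    [1, 0, 0, 0, 0],  # weight 1: no gate needed
    [1, 1, 0, 0, 0],  # weight 2: one XOR
    [1, 1, 1, 0, 0],  # weight 3: depth 2
    [1, 1, 1, 1, 1],  # weight 5: depth 3
]
print(min_goal_depths(example))
```

A balanced binary tree of XOR gates over the `w` inputs of a row attains exactly this depth.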
After finding a new circuit for the upper linear component, we replace the subcircuit for this upper linear component with the new circuit. We thus create a new circuit, possibly with lower depth than the original. Now the top linear component, together with the nonlinear component, are fixed while we apply DCLO to the bottom linear component. For the lower linear component, if we know that some inputs are available at lower depth than others, this slack may help in creating an implementation with fewer gates. The input depths of the inputs to the lower linear component are set to the depth at which these inputs are calculated in the portion of the circuit which we have fixed. The goal depths of all outputs of the lower linear component are set to TargetD (if this depth is not feasible, the algorithm aborts).
The new circuit for the lower linear component replaces the old subcircuit. For the upper linear component, we may now be able to allow some of its outputs to be computed at a larger depth than in the previous phase. This allows us to force the DCLO to be more strict for some outputs which are critical for the total depth of the circuits and allows us to be less strict for others. If, for example, an output, $y_i$, of the upper linear component requires depth $w_1$ because of its Hamming weight, but is not used before depth $w_2 > w_1$ in the entire circuit, it might be possible to create a circuit using fewer gates if output $y_i$ is allowed to be computed at depth $w_2$. Starting with the second iteration of See-Saw, we calculate these allowed depths as follows: the height of a node $v$, denoted by $\mathrm{height}(v)$, is the length of the longest path from the node to an output of the entire circuit. If $v$ is an output node of the upper linear component, then we set its goal depth to $\mathrm{TargetD} - \mathrm{height}(v)$. Note that this goal depth is at least as large as the required depth.
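The relaxed goal depths can be computed with one backward pass over the circuit. This is a minimal sketch under stated assumptions: the whole circuit is a list of hypothetical `(name, in_a, in_b)` gates in topological order, and every gate that feeds no other gate is treated as a circuit output. The function name and the toy circuit are illustrative.

```python
def relaxed_goal_depths(program, upper_outputs, target_depth):
    """Goal depth TargetD - height(v) for each output v of the upper component.

    program: list of (name, in_a, in_b) gates in topological order (assumed format).
    """
    height = {}
    # Walking backwards guarantees that every user of a node has been
    # processed before the node itself, so height[name] is already final.
    for name, a, b in reversed(program):
        height.setdefault(name, 0)  # no users seen: treated as a circuit output
        for src in (a, b):
            height[src] = max(height.get(src, 0), height[name] + 1)
    return {v: target_depth - height[v] for v in upper_outputs}


# Toy circuit of depth 4: u1 and u2 play the role of upper linear outputs.
circuit = [
    ("u1", "x1", "x2"),
    ("u2", "x3", "x4"),
    ("a", "u1", "x1"),
    ("b", "a", "u2"),
    ("y", "b", "a"),
]
goals = relaxed_goal_depths(circuit, ["u1", "u2"], target_depth=4)
print(goals)
```

In this toy circuit `u1` lies on a longest path and keeps goal depth 1, while `u2` is first used one level later and may be computed at depth 2.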
Note that, for the lower linear component, using the variable, calculated input depths is important to get the minimal depth, even if one is unconcerned about size. One might assume that if minimum depth circuits are found for the upper linear and lower linear components (where one can also assume that the circuit for the upper linear component satisfies the goal depths for each of its outputs), then attaching them to the middle nonlinear component would always give the smallest depth circuit (given the fixed middle component). However, this is not always true. For the AES S-Box, one of the outputs of the lower linear component has twelve 1s, so the minimum depth circuit computing it is not a complete binary tree. Some of the inputs to this circuit from the nonlinear component can be at higher depth than others, still allowing a depth 16 circuit. However, if the wrong inputs are combined first, the total depth can become 17.
To summarize, with the See-Saw Method, we alternate between improving the upper and lower linear components, updating the values for the goal depths and the input depths after each improvement. After the first iteration, the goal depth has been achieved, so the goal is to reduce the number of gates.
Paar's algorithm
The heuristic DCLO used within the See-Saw Method uses ideas from a well-known algorithm due to Paar [14] . Paar's technique keeps a list of the variables already computed, which is initially only the inputs. Then it repeatedly determines which two variables, XORed together, occur in most outputs. One such pair is selected and XORed together. This result is added as a new variable which appears in all outputs where both variables previously appeared. This is repeated until all required outputs have been computed. Paar's technique is implemented by starting with the initial matrix with input columns corresponding to the inputs and rows corresponding to the outputs which the circuit should calculate. The algorithm adds columns corresponding to the new variables which are computed. When a new column is added, this corresponds to adding two existing variables, u and v. In all rows in the matrix which currently have a one in both of the columns corresponding to u and v, those two 1s are changed to 0s, and a one is placed in the corresponding row of the new column. All other values in the new column are set to 0. This operation, adding a new column to the matrix, corresponding to a new XOR gate in the circuit, is called a Paar-like operation. The 1s in a row indicate which variables still need to be added together to produce that output. The algorithm terminates when all rows have Hamming weight 1.
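The matrix formulation of Paar's technique can be sketched as follows. This is an illustrative, unoptimized version that breaks ties by taking the first pair with the maximum count; the function name `paar` and the gate format `(new_column, u, v)` are assumptions of this sketch.

```python
from itertools import combinations


def paar(matrix):
    """Greedy cancelation-free XOR synthesis on a 0/1 matrix (list of rows).

    Returns (gates, rows): gates is a list of (new_column, u, v) and rows is
    the final matrix, in which every nonzero row has Hamming weight 1.
    """
    rows = [list(r) for r in matrix]
    gates = []
    while any(sum(r) > 1 for r in rows):
        ncols = len(rows[0])
        best, best_pair = 0, None
        for u, v in combinations(range(ncols), 2):
            count = sum(1 for r in rows if r[u] and r[v])
            if count > best:  # strict comparison keeps the first best pair
                best, best_pair = count, (u, v)
        u, v = best_pair
        for r in rows:
            if r[u] and r[v]:
                r[u] = r[v] = 0  # Paar-like operation: two 1s become 0s ...
                r.append(1)      # ... and a 1 is placed in the new column
            else:
                r.append(0)
        gates.append((ncols, u, v))
    return gates, rows


# Outputs x0+x1+x2, x0+x1+x3 and x0+x1+x2+x3 over four inputs.
M = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]
gates, final_rows = paar(M)
print(len(gates), gates[0])
```

The single remaining 1 in each row identifies the column that computes the corresponding output.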
Cancelation occurs in a circuit when the inputs to an XOR gate are of the form (f + g, h + g). The XOR gate in that case computes the function f + h, "canceling" the term g. In addition to being oblivious to depth, Paar's algorithm incurs a significant cost in size due to the fact that it can only produce circuits which do not allow cancelation.
Proposition 1 Paar-like operations are cancelation-free.
Proof We show by induction that the 1s in any row represent sums modulo 2 of disjoint sets of variables. Initially, each 1 represents a single input variable, and they are all distinct. When a Paar-like operation occurs, 1s representing two disjoint sets are removed and the new column represents the union of those two sets. Since the two sets were disjoint, no cancelation occurs in that operation, and the new set is disjoint from all other sets represented by 1s in that row.
It is shown in [4] that non-cancelation can increase the size of a circuit by a factor (n/ log 2 n). The techniques described in the following sections allow both depth restriction and cancelation.
Randomized construction heuristic, RAND-GREEDY-ALG
RAND-GREEDY-ALG, our first algorithm for finding small programs with depth restrictions, is a randomized version of Paar's algorithm. It keeps track of the depths of the gates and only adds gates if the global depth restrictions can be satisfied.
Recall that the input to the algorithm is a matrix representing the linear combinations (outputs) to be computed in a linear component, plus the input depths of each of the inputs and the goal depths of each output. For each column, $c$, let $v(c)$ be a bit vector indicating which variables are present in the linear combination computed by column $c$. Initially, each column represents a single input variable, and each vector has exactly one 1. When creating a new gate for adding columns $c_1, c_2$ in the matrix, a new column $c_3$ is created with $v(c_3)$ set to the symmetric difference of $v(c_1)$ and $v(c_2)$. For the subset of the rows where the new gate is used, the bits in positions $c_1$ and $c_2$ are flipped and the bit in position $c_3$ is set to 1 (for the other rows, it is set to 0). We call this an update operation. Note that in RAND-GREEDY-ALG, updates only occur if the bits in positions $c_1$ and $c_2$ are both 1, as in Paar's algorithm, so no cancelation occurs. In DCLO, this is not the case.
The following invariant holds for the matrix input and will hold while running RAND-GREEDY-ALG and DCLO:
Row-sum invariant: For any row $r$, the linear combination to be computed is the sum of all $v(c)$ for which position $(r, c)$ in the matrix is 1.
As in Paar's algorithm, when a gate is added, RAND-GREEDY-ALG updates each row by changing the entries corresponding to the two inputs from 1 to 0 and placing a 1 in the new column corresponding to that gate. There is, however, a feasibility requirement: An update is only applied to a row r if, after doing the update to row r, it would still be possible to produce the output for that row in its goal depth. See Section 3.5 for a description of how feasibility is checked.
In order to choose two columns, $c_1$ and $c_2$, as inputs for the next gate, RAND-GREEDY-ALG determines how many of the output rows could benefit from having that gate, i.e., how many rows have 1s in columns $c_1$ and $c_2$, both of which can be flipped while maintaining feasibility. For each possible next gate, the number of output rows which would benefit from the gate is calculated, giving an improvement. The inputs to the new gate are chosen at random from those that give the largest improvement. The algorithm is shown in Fig. 2. The function FEASIBLE-UPDATE is shown in Fig. 4.
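The selection rule and the feasibility requirement can be combined into a compact sketch. This is not the authors' implementation (their pseudocode is in the figures); it is a minimal reading of the description, assuming a list-of-lists matrix, using the priority-queue feasibility check described in the section on feasibility, and using hypothetical names `feasible`, `rand_greedy` and `usable`.

```python
import heapq
import random
from itertools import combinations


def feasible(depths, goal):
    """True if values arriving at `depths` can be XORed within depth `goal`."""
    heap = list(depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, 1 + max(a, b))
    return not heap or heap[0] <= goal


def rand_greedy(matrix, input_depths, goal_depths, rng=random):
    """Sketch of a randomized, depth-aware Paar-style heuristic.

    Assumes every row is feasible initially; otherwise max() below may fail.
    Returns (gates, depths, rows), with gates as (new_column, c1, c2).
    """
    rows = [list(r) for r in matrix]
    depth = list(input_depths)
    gates = []

    def usable(r, c1, c2):
        # Row r benefits only if it has 1s in both columns and stays feasible.
        row = rows[r]
        if not (row[c1] and row[c2]):
            return False
        rest = [depth[c] for c, bit in enumerate(row) if bit and c not in (c1, c2)]
        return feasible(rest + [1 + max(depth[c1], depth[c2])], goal_depths[r])

    while any(sum(r) > 1 for r in rows):
        counts = {}
        for c1, c2 in combinations(range(len(depth)), 2):
            k = sum(usable(r, c1, c2) for r in range(len(rows)))
            if k:
                counts[(c1, c2)] = k
        best = max(counts.values())
        c1, c2 = rng.choice([p for p, k in counts.items() if k == best])
        use = [usable(r, c1, c2) for r in range(len(rows))]
        for r, u in enumerate(use):
            if u:
                rows[r][c1] = rows[r][c2] = 0
            rows[r].append(1 if u else 0)
        depth.append(1 + max(depth[c1], depth[c2]))
        gates.append((len(depth) - 1, c1, c2))
    return gates, depth, rows


# Two outputs over six inputs at depth 0, with goal depths 2 and 3.
M = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 1]]
gates, depth, rows = rand_greedy(M, [0] * 6, [2, 3], random.Random(0))
print(len(gates), [depth[r.index(1)] for r in rows])
```

Because the choice among the best pairs is random, repeated runs can yield circuits of different sizes, which is why many runs are made and the smallest result kept.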
Computing the feasibility of a gate
The algorithm, RAND-GREEDY-ALG, adds one gate at a time to the circuit. For each candidate for the next gate, RAND-GREEDY-ALG checks how many output rows can benefit from the candidate. We require that a row have 1s in the columns for the two inputs to the candidate gate in order to benefit from that gate. In addition, it must still be possible to calculate the row within its goal depth after using this gate. If both of these conditions hold, then we say the candidate is feasible for that output. For example, suppose two required outputs for a linear component are $x_1 \oplus x_2 \oplus x_3 \oplus x_4$ with goal depth 2 and $x_1 \oplus x_2 \oplus x_3 \oplus x_5 \oplus x_7$ with goal depth 3. A candidate gate computing $((x_1 \oplus x_2) \oplus x_3)$ at depth two would not be feasible for the first output, but would be feasible for the second.
In earlier work [5] , computing a depth-16 circuit for the AES S-Box, feasibility of the gates was maintained by working in phases, never using gates produced in the current phase within that same phase. Thus, there was an upper bound on the depth of the gates produced in the same phase. Here, instead of using phases, we check feasibility explicitly, giving more freedom as to which candidate gates can be chosen.
First, we define a function FEASIBLE, which is used to check if a row is currently feasible. The depth of a column $c$ is $d(c)$; it is the input depth for the inputs and is $1 + \max\{d(c_1), d(c_2)\}$ for a column created using columns $c_1$ and $c_2$. FEASIBLE places the depths of the columns that have a 1 in the row in a priority queue and repeatedly replaces the two smallest depths, $d_1$ and $d_2$, by $1 + \max\{d_1, d_2\}$. At the end, if the only depth value remaining is no larger than the goal depth for that output, the row is feasible. Otherwise, it is not. Pseudocode for this calculation can be found in Fig. 3.
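A minimal sketch of this check, using Python's `heapq` as the min-heap, is given below. The function name and signature are assumptions of the sketch; the two calls reproduce the earlier example of a candidate gate $((x_1 \oplus x_2) \oplus x_3)$ at depth 2.

```python
import heapq


def feasible(depths, goal_depth):
    """Return True if the XOR of values available at `depths` fits in `goal_depth`.

    Repeatedly combines the two smallest depths d1, d2 into 1 + max(d1, d2).
    """
    heap = list(depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        d1 = heapq.heappop(heap)
        d2 = heapq.heappop(heap)
        heapq.heappush(heap, 1 + max(d1, d2))
    # An empty row needs no gate; otherwise the last value is the output depth.
    return not heap or heap[0] <= goal_depth


# First output: the candidate gate (depth 2) plus x4 (depth 0), goal depth 2.
print(feasible([2, 0], 2))
# Second output: the candidate gate plus x5 and x7 (depth 0), goal depth 3.
print(feasible([2, 0, 0], 3))
```

The first call reports infeasibility because combining the depth-2 gate with $x_4$ already needs depth 3; the second succeeds because $x_5 \oplus x_7$ can be formed at depth 1 before the final XOR at depth 3.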
Assuming that $H$ is a priority queue (min-heap order) containing the current depths of the variables to be XORed for row $r$, the algorithm, FEASIBLE, correctly determines whether or not there exists a circuit which computes the function for row $r$ in depth $gd$. FEASIBLE determines if repeatedly taking the two columns with lowest depth for the next gate results in a circuit having no more depth than the goal depth for that output. The algorithm is thus correct if for any linear combination $F = x_{i_1} \oplus x_{i_2} \oplus \cdots \oplus x_{i_k}$, where the $x_{i_j}$ could be original inputs to the component or outputs of XOR gates produced earlier, and any given depths for these $k$ inputs, no circuit implementing this linear combination with the given depth constraints has lower depth than any circuit, $C'$, produced in this way.
Lemma 1 The algorithm, FEASIBLE, for determining feasibility returns 1 if and only if there exists a circuit computing the XOR of the set of variables available at the depths in the priority queue, $H$, within goal depth $gd$.
Proof We consider any minimum depth circuit $C$ for $F$. Note that we can assume without loss of generality that all gates in $C$ are cancelation-free, since the inputs where cancelation occurs can be computed (for example, by essentially copying the computation of the original inputs, but removing the variables the inputs have in common from both subcircuits) in a cancelation-free manner without increasing the depth. Since a cancelation-free XOR circuit for $k$ inputs has $k - 1$ gates, $C$ has $k - 1$ gates. $C'$ also has $k - 1$ gates, since each iteration of the while loop decreases the number of remaining elements in the priority queue by 1. Assume without loss of generality that the lowest depth of any input is 0.
We show that for any depth $d$, the number of gates at depth at most $d$ in $C'$ is at least as large as the number of gates at depth at most $d$ in $C$. This clearly holds for all gates at depth 1. Note that in $C'$ at most one input at depth 0 is not an input to a gate at depth
which is an increasing function of $S_{d-1}$. Inductively, $C'$ has at least as many gates at level at most $d$ as $C$.
If feasibility holds for all rows, there exists a circuit which evaluates all required outputs within the given depth constraints. By Lemma 1, we can say that feasibility holds for a row, r, if and only if FEASIBLE returns 1 for r. In a valid DCLO initial matrix, all goal depths are initially feasible. Feasibility is the second invariant in our algorithm:
Feasibility invariant: For any row $r$, the goal depth remains feasible.
We ensure this invariant by explicitly testing for feasibility before each matrix update. To calculate if an update (candidate gate) is feasible for an output corresponding to row r and columns c 1 , c 2 at depths d 1 , d 2 , we need to calculate
depth(r)).
For RAND-GREEDY-ALG, we also require that the entries in columns $c_1$ and $c_2$ of row $r$ are both 1s (Fig. 4). We can now show that RAND-GREEDY-ALG runs in polynomial time.
Theorem 1 RAND-GREEDY-ALG is correct. Let $M$ be an $m \times n$ 0-1 matrix containing $H$ 1s. Suppose that every row in $M$ is initially feasible according to the function FEASIBLE. The running time of RAND-GREEDY-ALG is $O(tm(t^2 + n \log n))$, where $t$ is the final number of columns and is at most $H + n - m$.
Proof By Lemma 1, no input or gate is considered as possible inputs to a new gate for a particular row unless that row can still be computed within its goal depth with that new gate. The algorithm continues as long as any row has Hamming weight greater than 1. As long as it continues, by the Feasibility Invariant, there is a candidate gate which would be feasible for some row. Since the updates of the matrix are Paar-like operations and the depths are calculated correctly, the algorithm is correct.
Within FEASIBLE, at most two Insert and DeleteMin priority queue operations are performed for each 1 in row $r$, since the number of elements in the priority queue is decreased by one each time through the while loop. The number of 1s in any row is at most the initial number of columns in the matrix, $n$. Thus, the running time of FEASIBLE is $O(n \log n)$. The for each loop in RAND-GREEDY-ALG takes the most time within the outer while loop. Each pair of the first $s - 1$ columns is considered, $O(t^2)$ pairs. For each pair, only constant work is done in a row by FEASIBLE-UPDATE unless there are 1s in that row for both columns. If there are 1s for both columns, FEASIBLE is called. Thus, the running time for each row, in one iteration of the while loop, is $O(t^2 + n \log n)$.
Since the while loop is executed once for each new gate, it is executed at most $t$ times. There are $m$ rows to process, so RAND-GREEDY-ALG runs in time $O(tm(t^2 + n \log n))$.
Since there are at most $n$ 1s in every row initially, each row will be computed using at most $n - 1$ XORs, and all $m$ rows will be computed with at most $m(n - 1)$ XORs. There are $n$ columns initially, so in all $t \le H + n - m \le mn + n - m$.
Improvements to RAND-GREEDY-ALG: DCLO
In the following, we introduce three new techniques that were added to the simplified algorithm, RAND-GREEDY-ALG, to give the algorithm we used to produce the improved circuits for several functions useful for cryptography that are mentioned in this paper, with straightline programs for the circuits available at [15] . We refer to the algorithm with these three improvements as DCLO.
Reducing the size of the matrix: preprocessing
Using the See-Saw Method, particularly after the first iteration, the upper linear component may have some outputs which can be computed at a larger depth than some others, without affecting the total depth of the circuit. In the case of the AES S-Box, and presumably in some other applications, there are cases where an output $g$ can be computed as the sum of exactly two of the other outputs, and both of those other outputs have to be computed at a lower depth than is allowed for $g$. Thus, there is no reason for the row representing output $g$ to be included in the input matrix to RAND-GREEDY-ALG. Adding one extra gate at the end of the computation, adding those two other outputs, will suffice. This extra gate at the end will typically, but not necessarily, have some cancelation. In our experiments with the upper linear component of the AES S-Box, there were actually 5 such outputs which could be automatically computed in this manner, yielding the 27-gate circuit for the upper linear component. The preprocessing is never relevant for the lower linear component, since all outputs are given the same goal depths.
Finding these triples of outputs is relatively straightforward: check all triples of rows in the matrix, check that one of the three rows has a larger goal depth than the other two, and check that the bitwise XOR of the three rows is 0. This preprocessing can be done each time, before RAND-GREEDY-ALG is run. The gates found are saved and then added to the end of the straight-line program for the circuit.
Allowing more cancelation: the generalized Paar operation
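The triple search can be sketched directly from this description. The code below is illustrative only: the function name is hypothetical, rows are 0/1 lists, and as a conservative simplification of this sketch a row that has been removed, or that already serves as a source for a removed row, is not reused in a later triple.

```python
from itertools import combinations


def find_derivable_outputs(matrix, goal_depths):
    """Find outputs g that equal the XOR of two other outputs with smaller goals.

    Returns a list of (g, p, q) row indices meaning row g = row p XOR row q,
    where goal_depths[g] exceeds both goal_depths[p] and goal_depths[q].
    """
    found, removed, sources = [], set(), set()
    for a, b, c in combinations(range(len(matrix)), 3):
        if removed & {a, b, c}:
            continue
        # The three rows must XOR to the all-zero vector.
        if any(x ^ y ^ z for x, y, z in zip(matrix[a], matrix[b], matrix[c])):
            continue
        for g, p, q in ((a, b, c), (b, a, c), (c, a, b)):
            if g not in sources and goal_depths[g] > max(goal_depths[p], goal_depths[q]):
                found.append((g, p, q))
                removed.add(g)
                sources.update((p, q))
                break
    return found


# Row 2 is the XOR of rows 0 and 1 and is allowed a larger depth.
M = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
print(find_derivable_outputs(M, [1, 1, 3, 1]))
```

Each reported triple corresponds to one saved gate that is appended to the end of the straight-line program, and the row `g` is dropped from the matrix passed to the heuristic.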
The simplified algorithm, RAND-GREEDY-ALG, without the preprocessing of the matrix, produces cancelation-free circuits since it only does Paar-like operations. Recall that a Paar-like operation updates a row when a new gate is created based on the inputs to that gate. If the inputs come from columns $i$ and $j$, the row in question must have 1s in both columns $i$ and $j$. The 1s must be changed to 0s, and a 1 must be placed in the new column for that gate. No other changes are made to the row.
In the following example, every cancelation-free circuit is suboptimal. Consider running Paar's algorithm on the following matrix:
Note that a flip uses (or omits using) the gate defined by column $c$ in row $r$ and preserves the row-sum invariant. We do a flip if the Hamming weight of the row decreases as a result. The flip operation introduces the possibility of cancelation.
In our example, note that v (
Allowing second best pairs
In order to choose a pair of columns for the update, RAND-GREEDY-ALG counts for each possible pair the number of rows where it is feasible, computes the maximum of these counts, and chooses randomly among those pairs with the maximum count. This greedy approach of choosing among those pairs with the maximum count is intuitively reasonable. However, it may sometimes give a suboptimal solution overall. Thus, with probability 2%, in DCLO a random choice is made among the column pairs with the second-largest count, rather than the maximum count.
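The second-best rule amounts to a small change in the selection step. A minimal sketch, assuming the feasible-row counts have already been collected in a dictionary keyed by column pair (the name `choose_pair` and the parameter `p_second` are hypothetical):

```python
import random


def choose_pair(counts, rng=random, p_second=0.02):
    """Pick a column pair, usually among the best, occasionally among the second best.

    counts: dict mapping (c1, c2) to the number of rows where the pair is feasible.
    """
    levels = sorted(set(counts.values()), reverse=True)
    target = levels[0]
    # With probability p_second, fall back to the second-largest count if one exists.
    if len(levels) > 1 and rng.random() < p_second:
        target = levels[1]
    return rng.choice([pair for pair, k in counts.items() if k == target])


counts = {(0, 1): 3, (0, 2): 3, (1, 2): 2, (2, 3): 1}
rng = random.Random(1)
print(choose_pair(counts, rng))
```

Setting `p_second` to 0 recovers the purely greedy choice of RAND-GREEDY-ALG.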
The See-Saw Method applied to the AES S-Box, an example
The algorithm DCLO, which we applied to obtain our new, small, low-depth circuits, was RAND-GREEDY-ALG, plus the improvements of the last three subsections, the preprocessing, the Generalized Paar Operations, and allowing the second best pairs. We applied the See-Saw Method to the AES S-Box (in the forward direction) and obtained a circuit of size 125 and depth 16. This required very few iterations of DCLO.
With each iteration, we ran RAND-GREEDY-ALG 10,000 times and chose one of the circuits produced with the smallest number of gates.
In the forward direction, there are four outputs of the AES S-Box which are negated. These negations were ignored until the end. We started with the middle nonlinear component found in [5] . This had 63 gates and was used in all of our circuits. For the upper linear and lower linear components, we used the original Paar algorithm [14] , with no regard for depth, always choosing the first of the pairs of columns where 1s occurred in the most rows. The circuit we created had 27 gates for the upper linear component, 34 for the lower linear, for a total of 124 gates and depth 19.
Then we created a new upper linear component with our algorithm, setting the goal depths for all outputs to the minimum possible; for a row with Hamming weight $h$, this minimum is $\lceil \log_2 h \rceil$. Since all rows have weight less than 8, the values ranged from 1 to 3. This resulted in a component with 29 gates, increasing the total circuit size to 126, but decreasing the total depth to 18. Note that for a function, such as the AES S-Box, consisting only of an upper linear component, a middle nonlinear component and a lower linear component, initially creating an upper linear circuit with minimum goal depths will make it possible to obtain a minimum depth circuit (assuming the nonlinear component is fixed) in the next iteration. This holds since the See-Saw Method next applies DCLO to the lower linear component with input depths determined by the current circuit and goal depths all equal to the optimal depth (given the middle nonlinear component being used).
Then we ran the algorithm on the lower linear component, with goal depths of 16 (15 is impossible given the middle nonlinear component used; three of the outputs of the lower linear component could not be computed in depth 15, though the others could). The smallest circuit found had size 35, giving us a circuit with 127 gates and depth 16. One generalized Paar operation was used. This lower linear component was then used to try to get a smaller upper linear component, but still depth 16. The goal depths were relaxed as much as possible and the preprocessing was used. The smallest number of gates found was 27, giving us our final circuit of depth 16 with only 125 gates in all. One of the outputs was at depth 15 and the others were all at depth 16. (In our second try for a lower linear component of size 35, we found one giving depth 16 for all outputs and chose this instead. Our final circuit has depth 16 for all outputs. Five preprocessing gates were used.)
Working on the inverse AES S-Box, starting with the middle nonlinear component found in [5] and minimum goal depths for the top linear component, we got a component with 29 gates. Working on the lower linear component with goal depths of 16, we got 35 gates, but 10,000 runs were not enough for this, and we increased the number of runs to 100,000. Introducing the slack goal depths for the upper linear component gave us 28 gates using 5 preprocessing gates. Thus, the total size is 126 gates. Note that this uses some XNOR gates in the upper linear component, while the forward direction only uses them on outputs that need to be negated. For the inverse, the inputs corresponding to the negated outputs from the forward direction need to be negated. This effect can be achieved by computing the desired circuit without the negations and XNORs and then changing some XORs to XNORs when exactly one of the inputs should have been negated.
The depth cannot be reduced without changing the circuit for the middle nonlinear component. Of course, if the logical base is expanded, one could probably decrease the sizes slightly. For example, if NAND gates are used in the circuit for inversion in $GF(2^4)$, it is not hard to reduce the number of gates by two without increasing the depth (see Appendix A). Since there are only 256 possible inputs, we verified the circuits fully against the specifications in [12].
Rows with Hamming weight 2
Previous work, also obtaining a small depth-16 circuit for the AES S-Box [5], was less automated, obtained a slightly worse result, and had a minor error in the algorithm. In this section, we explain that error. The algorithm works in phases: During phase $i \ge 0$, no row in the current matrix has Hamming weight more than $2^{k-i}$ and only inputs or gates already produced at depth $i$ or less are considered as possible inputs to gates in phase $i$. Thus, the depth of gates in phase $i$ is at most $i + 1$. At the beginning of each phase of that algorithm, there is a check to see if there are any rows with Hamming weight 2. If so, the algorithm created the final gate for any such rows at that point. At first glance, it seems as if this can only help the algorithm, since that gate would need to be produced at some point anyway. However, there are a couple of problems with this strategy.
If handling a row with Hamming weight 2 only takes place at the beginning of a phase, there is no conflict with the definition of a phase, since the columns chosen must have been created in the previous phase, so their depth would be acceptable for the new phase. However, if such handling had been allowed in the middle of a phase, one of the columns chosen could be from the current phase and the depth of the new gate could be one too large to be used for another row in the current phase. 
Now only five more gates are necessary to finish. Thus, one cannot arbitrarily choose a row with Hamming weight 2 and assume that this cannot have a negative effect on the number of gates used.
Circuits
This work concentrated on optimizing the linear components of circuits. In [5], the search technique in [2], to find circuits for nonlinear components with few AND gates, was modified to reject candidate gates with too large depth. This decreased the depth of the $GF(2^4)$ inversion from 9 to 4 while only increasing the number of gates from 16 to 17, changing 5 AND gates and 11 XOR gates to 7 AND gates and 10 XOR gates. The techniques presented in this paper and in [5] appear, not surprisingly, to lead to a trade-off between size and depth. To illustrate this trade-off, we list first the depths and sizes of some of the circuits for the AES S-Box which we obtained earlier:
We applied RAND-GREEDY-ALG to the multiplication of binary polynomials of degree 9, starting from the straight-line program given by Bernstein on http://binary.cr.yp.to/m.html. Running RAND-GREEDY-ALG on the lower linear component of that circuit, we obtained the same size as Bernstein with 155 gates, but reduced the depth from 9 to 6. Cenk and Hasan [6] report the same number of gates, but depth 8. Running RAND-GREEDY-ALG on the lower linear component of their circuit also gave 154 gates, but depth 7. Depth 6 was not feasible, given their nonlinear component. No Generalized Paar Operations were used, and of course no preprocessing of the matrix, since all goal depths were 7.
We applied RAND-GREEDY-ALG to computing the product of degree 12 polynomials over $GF(2)$. Bernstein has a circuit with depth 9 and 256 gates. Our techniques, starting with the straight-line program given on his homepage http://binary.cr.yp.to/m.html, gave depth 8 and 255 gates, which is the same result obtained by Cenk and Hasan [6]. We have no reason to believe that RAND-GREEDY-ALG would not produce similar results for computing the products of binary polynomials of other degrees, but did not pursue this since it did not seem to reduce the number of gates.
The tower field construction (see, for example, [10]) for a Galois Field with $2^{2^k}$ elements lends itself well to the methods described here. We built circuits for multiplication and inversion in $GF(2^k)$ for $k = 2, 4, 8, 16$. We constructed $GF(2^{2k})$ from an optimized circuit for $GF(2^k)$
These results are much better than anything previously known. However, the depths of the inversion circuits seem much bigger than what is likely possible. This may be due to the specific inversion formulas that are natural in tower fields (see Appendix B). We may attempt to improve on these depths in future research.
Inversion in $GF(2^{16})$ is the basis of the 16-bit S-Box proposed by Kelly et al. in [8]. The paper quotes a size of 1382 gates (1238 XOR gates and 144 AND gates, no depth is given). This size is derived from the work in [17, 18]. If instead we use the tower field representation of Appendix B, the resulting S-Box has 446 gates (113 AND gates, 324 XOR gates, and 9 XNOR gates) and depth 35. This is not exactly the same S-Box as the one by Kelly et al. A quick way to derive the latter S-Box from the tower field S-Box is by doing a change of basis. Without further optimization, the resulting circuit has 537 gates.
The circuit for inversion in $GF(2^{16})$ was verified using an automatically generated circuit for multiplication and then using this circuit to verify that $x$ multiplied by $x^{-1}$ is the identity element of the field for all nonzero values of $x$. We found it necessary to add this verification method after an anonymous referee determined that a previous circuit was incorrect. We are grateful for his/her contribution.
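The same style of exhaustive check can be illustrated in a smaller field. The sketch below is not the verification code used for the $GF(2^{16})$ circuit: it assumes the polynomial-basis representation of $GF(2^8)$ with the AES modulus $x^8 + x^4 + x^3 + x + 1$, and it uses exponentiation $x^{254}$ as a stand-in for the candidate inversion circuit. The names `gf_mul` and `gf_inv` are hypothetical.

```python
def gf_mul(a, b, poly=0x11B, k=8):
    """Multiply two elements of GF(2^k) in polynomial basis modulo `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> k:       # degree reached k: reduce by the field polynomial
            a ^= poly
    return result


def gf_inv(a, poly=0x11B, k=8):
    """Candidate inverse a^(2^k - 2); stands in for the circuit under test."""
    result, base, e = 1, a, (1 << k) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, poly, k)
        base = gf_mul(base, base, poly, k)
        e >>= 1
    return result


# Exhaustive check: x * x^{-1} must be the identity for every nonzero x.
all_ok = all(gf_mul(x, gf_inv(x)) == 1 for x in range(1, 256))
print(all_ok)
```

For a 16-bit field the same loop would range over all 65535 nonzero elements, with the field arithmetic replaced by evaluation of the circuits being tested.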
The circuits described here have been posted at [15] .
Conclusion
Automated techniques for finding small, low-depth circuits for cryptographic functions were presented. The See-Saw Method and the algorithm, DCLO, used within it were successful in finding better results for the AES S-Box and other functions.
In the following, bases will be defined for each of the finite fields. Each base $(b_1, b_2)$ will be such that $b_1 + b_2 = 1$. This identity can be verified by repeated squaring of the defining irreducible polynomial and adding a telescoping sequence (verify $GF(2^k)$ before $GF(2^{2k})$). For each $k$, the irreducible polynomial for $GF(2^{2k})$ was found using the circuits for multiplication and addition in $GF(2^k)$
The operation $(a + b)^2$ is usually referred to as "square-scaling". Both square-scaling and inversion in the equations for $c, d$ are operations in the lower field $GF(2^8)$.
B.1 Multiplication and inversion in
