Abstract: Galois field (GF) arithmetic circuits find numerous applications in communications, signal processing, and security engineering. Formal verification techniques for GF circuits are scarce and limited to circuits with known bit positions of the primary inputs and outputs. They also require knowledge of the irreducible polynomial P(x), which affects the final hardware implementation. This paper presents a computer algebra technique that performs verification and reverse engineering of GF(2^m) multipliers directly from the gate-level implementation. The approach is based on extracting a unique irreducible polynomial in a parallel fashion and proceeds in three steps: 1) determine the bit position of the output bits; 2) determine the bit position of the input bits; and 3) extract the irreducible polynomial used in the design. We demonstrate that this method is able to reverse engineer GF(2^m) multipliers in m threads. Experiments performed on synthesized Mastrovito and Montgomery multipliers with different P(x), including NIST-recommended polynomials, demonstrate the high efficiency of the proposed method.
I. INTRODUCTION
Despite considerable progress in verification of random and control logic, advances in formal verification of arithmetic circuits have been lagging. This can be attributed to the difficulty in efficient modeling of arithmetic circuits and datapaths, without resorting to computationally expensive Boolean methods. Contemporary formal techniques, such as Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT), Satisfiability Modulo Theories (SMT), etc., are not directly applicable to verification of integer and finite field arithmetic circuits [1] [2]. This paper concentrates on formal verification and reverse engineering of finite (Galois) field arithmetic circuits.
Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations can be derived from those two [3] . GF arithmetic plays an important role in coding theory, cryptography, and their numerous applications. Therefore, developing formal techniques for hardware implementations of GF arithmetic circuits, and particularly for finite field multiplication, is essential.
The elements of the field GF(2^m) can be represented using polynomial rings. The field of size 2^m is constructed using an irreducible polynomial P(x), which includes terms of degree d ∈ [0, m] with coefficients in GF(2). Arithmetic operations in the field are then performed modulo P(x). The choice of the irreducible polynomial has a significant impact on the hardware implementation of the GF circuit and its performance. Typically, the irreducible polynomial with the minimum number of terms gives the best performance [4], but this is not always the case.
Due to the rising number of threats in hardware security, analyzing finite field circuits becomes important. Computer algebra techniques with polynomial representation seem to offer the best solution for analyzing arithmetic circuits. Several works address the verification and functional abstraction problems, both in Galois field arithmetic [1] [5] [6] and integer arithmetic implementations [7] [2] [8] [9] [10] . Symbolic computer algebra methods have also been used to reverse engineer the word-level operations for GF circuits and integer arithmetic circuits to improve verification performance [11] [12] [5] . The verification problem is typically formulated as proving that the implementation satisfies the specification, and is accomplished by performing a series of divisions of the specification polynomial by the implementation polynomials. In the work of Yu et al. [11] , the authors proposed an original spectral method based on analyzing the internal algebraic expressions during the rewriting procedure. SayedAhmed et al. [12] introduced a reverse engineering technique in Algebraic Combinational Equivalence Checking (ACEC) process by converting the function into canonical polynomials and using Gröbner Basis.
However, the above-mentioned algebraic techniques have several limitations. Firstly, they are restricted to implementations with a known binary encoding of the inputs and outputs. This information is needed to generate the specification polynomial that describes the circuit functionality in terms of its inputs and outputs, which is necessary for the polynomial reduction process (described in Section II-D). Secondly, these methods are unable to exploit the parallelism inherent in GF circuits, as they require that the polynomial division be applied iteratively in reverse-topological order [2] [9] [6]. Thirdly, the approaches applied specifically to GF(2^m) arithmetic circuits [5] [6] require knowledge of the irreducible polynomial P(x) of the circuit.
arXiv:1802.06870v1 [cs.SC] 16 Feb 2018
In this work, we present a formal approach to reverse engineer gate-level finite field arithmetic circuits that exploits the inherent parallelism of GF circuits. The method is based on a parallel algebraic rewriting approach [13] and is applied specifically to multipliers. The objective of reverse engineering is as follows: given the netlist of a gate-level GF multiplier, extract the bit positions of the input and output bits and the irreducible polynomial used in constructing the GF multiplication; then extract the specification of the design using this information. Bit position i indicates the location of the bit in the binary word according to its significance (LSB vs MSB). Our approach solves this problem by transforming the algebraic expressions of the output bits into an algebraic expression of the input bits (the specification); this is done in parallel for each output bit. Specifically, it includes the following steps:
• Extract the algebraic expression of each output bit.
• Determine the bit position of the outputs.
• Determine the bit position of the inputs.
• Extract the irreducible polynomial P (x).
• Extract the specification by algebraic rewriting.
We demonstrate the efficiency of our method using GF(2^m) Mastrovito and Montgomery multipliers of up to 571-bit width in a bit-blasted format (i.e., flattened to bit-level), implemented using various irreducible polynomials.
II. BACKGROUND

A. Canonical Diagrams
Several approaches have been proposed to check an arithmetic circuit against its functional specification. Different variants of canonical, graph-based representations have been proposed, including Binary Decision Diagrams (BDDs) [14] , Binary Moment Diagrams (BMDs) [15] [16], Taylor Expansion Diagrams (TED) [17] , and other hybrid diagrams. While BDDs have been used extensively in logic synthesis, their application to verification of arithmetic circuits is limited by the prohibitively high memory requirement for complex arithmetic circuits, such as multipliers. BDDs are being used, along with many other methods, for local reasoning, but not as monolithic data structure [18] . BMDs and TEDs offer a better space complexity but require word-level information of the design, which is often not available or is hard to extract from bit-level netlists. While the canonical diagrams have been used extensively in logic synthesis, high-level synthesis, and verification, their application to verify large arithmetic circuits remains limited by the prohibitively high memory requirement of complex arithmetic circuits [2] [1].
B. SAT, ILP and SMT Solvers
Arithmetic verification problems have been typically modeled using Boolean satisfiability (SAT). Several SAT solvers have been developed to solve Boolean decision problems, including ABC [19], MiniSAT [20], and others. Some of them, such as CryptoMiniSAT [21], target specifically XOR-rich circuits, and are potentially useful for arithmetic circuit verification, but are all based on a computationally expensive DPLL (Davis, Putnam, Logemann, Loveland) decision procedure [22]. Some techniques combine automatic test pattern generation (ATPG) and modular arithmetic constraint-solving techniques for the purpose of test generation and assertion checking [23]. Others integrate linear arithmetic constraints with Boolean SAT in a unified algebraic domain [24], but their effectiveness is limited by constraint propagation across the Boolean and word-level boundary. To avoid this problem, methods based on ILP models of arithmetic operators have been proposed [25] [26], but in general ILP techniques are known to be computationally expensive and not scalable to large-scale systems. SMT solvers depart from treating the problem in a strictly Boolean domain and integrate different well-defined theories (Boolean logic, bit vectors, integer arithmetic, etc.) into a DPLL-style SAT decision procedure [27]. Some of the most effective SMT solvers, potentially applicable to our problem, are Boolector [28], Z3 [29], and CVC [30]. However, SMT solvers still model functional verification as a decision problem and, as demonstrated by extensive experimental results, neither SAT nor SMT solvers can efficiently solve the verification problem of large arithmetic circuits [1] [10].
C. Theorem Provers
Another class of solvers includes Theorem Provers, deductive systems for proving that an implementation satisfies the specification, using mathematical reasoning. The proof system is based on a large and strongly problem-specific database of axioms and inference rules, such as simplification, term rewriting, induction, etc. Some of the most popular theorem proving systems are: HOL [31], PVS [32], ACL2 [33], and the term rewriting method described in [34]. These systems are characterized by high abstraction and powerful logic expressiveness, but they are highly interactive and require intimate domain knowledge, extensive user guidance, and expertise for efficient use. The success of verification using theorem provers depends on the set of available axioms and rewrite rules, and on the choice and order in which the rules are applied during the proof process, with no guarantee of a conclusive answer [35].
D. Computer Algebra Approaches
The most advanced techniques that have potential to solve the arithmetic verification problems are those based on Symbolic Computer Algebra. The verification problem is typically formulated as a proof that the implementation satisfies the specification [2]. This is accomplished by performing a series of divisions of the specification polynomial F by a set of implementation polynomials B, representing circuit components, a process referred to as reduction of F modulo B. Polynomials f_1, ..., f_s ∈ B are called the bases, or generators, of the ideal J. Given a set f_1, ..., f_s of generators of J, the set of all simultaneous solutions to the system of equations f_1 = 0, ..., f_s = 0 is called the variety V(J). The verification problem is then formulated as a test of whether the specification F vanishes on V(J). This is known in computer algebra as the ideal membership testing problem [1].
The standard procedure to test whether F vanishes on V(J) is to divide polynomial F by the polynomials {f_1, ..., f_s} of B, one by one. The goal is to cancel, at each iteration, the leading term of F using one of the leading terms of f_1, ..., f_s. If the remainder r of the division is 0, then F vanishes on V(J), proving that the implementation satisfies the specification. However, if r ≠ 0, such a conclusion cannot be made; B may not be sufficient to reduce F to 0, and yet the circuit may be correct. To reliably check if F is reducible to zero, a canonical set of generators, G = {g_1, ..., g_t}, called a Gröbner basis, is needed. It has been shown that for combinational circuits with no feedback, certain conditions automatically make the set B a Gröbner basis [36]. Specifically, if the polynomials f_1, ..., f_s ∈ B are ordered in reverse topological order of logic gates, from primary outputs to primary inputs, and the leading term of each polynomial is the output of a logic gate, then set B is automatically a Gröbner basis. Some authors use Gaussian elimination, rather than explicit polynomial division, to speed up the polynomial reduction process [1] [8]. The polynomials corresponding to fanout-free logic cones can be precomputed to reduce the size of the problem [8].
The polynomial reduction technique has been successfully applied to both integer arithmetic circuits [9] and Galois field arithmetic [1]. Verification work on Galois field arithmetic has been presented in [1] [5]. Formulation of problems in GF arithmetic takes advantage of known properties of Galois fields during polynomial reductions. Specifically, the problem reduces to ideal membership testing over a larger ideal that includes the ideal J_0 = ⟨x^2 − x⟩ in F_2, for each internal signal x of the circuit. Inclusion of this ideal ensures that each signal assumes a binary value. In this paper, we provide a comparison between this technique and our approach.
E. Function Extraction
Function extraction is an arithmetic verification method originally proposed in [2] for arithmetic circuits in modular integer arithmetic Z_{2^m}. It extracts a unique bit-level polynomial function implemented by the circuit directly from its gate-level implementation. Instead of expensive polynomial division, extraction is done by backward rewriting, i.e., transforming the polynomial representing the encoding of the primary outputs (called the output signature) into a polynomial at the primary inputs (the input signature) using algebraic models of the logic gates of the circuit. That is, the rewriting is performed in reverse topological order. This technique has been successfully applied to large integer arithmetic circuits, such as 512-bit integer multipliers. However, it is not directly applicable to large Galois field multipliers because of a potentially exponential number of polynomial terms before the internal term cancellations take place during rewriting. Fortunately, arithmetic GF(2^m) circuits offer an inherent parallelism which can be exploited in backward rewriting, without memory explosion.
In the rest of the paper, we first describe how to apply such parallel rewriting in GF(2 m ) circuits while avoiding memory explosion experienced in integer arithmetic circuits. Using this approach, we extract the function of each output bit in F 2 m and the function is represented in a pseudo-Boolean polynomial expression, where all variables are Boolean. Finally, we propose a method to reverse engineer the GF(2 m ) designs by analyzing these expressions.
III. GALOIS FIELD MULTIPLICATION
Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations such as division can be derived from those two [3]. A Galois field with p elements is denoted as GF(p). The most widely used finite fields are Prime Fields and Extension Fields, and particularly Binary Extension Fields. A prime field, denoted GF(p), is a finite field consisting of the integers {0, 1, 2, ..., p − 1}, where p is a prime number, with addition and multiplication performed modulo p. A binary extension field, denoted GF(2^m) (or F_{2^m}), is a finite field with 2^m elements. Unlike in prime fields, however, the operations in extension fields are not computed modulo 2^m. Instead, in one possible representation (called polynomial basis), each element of GF(2^m) is a polynomial with m terms (degree at most m − 1) and coefficients in GF(2), taken modulo P(x). Addition of field elements is the usual addition of polynomials, with coefficient arithmetic performed modulo 2. Multiplication of field elements is performed modulo an irreducible polynomial P(x) of degree m with coefficients in GF(2). The irreducible polynomial P(x) is analogous to the prime number p in prime fields GF(p). In this work, we focus on the verification problem of GF(2^m) multipliers that appear in many cryptographic and some DSP applications.
A. GF Multiplication Principle
Two different GF multiplication structures, constructed using different irreducible polynomials P_1(x) and P_2(x), are shown in Figure 1. Integer multiplication takes two n-bit operands as input and generates a 2n-bit word, where the values computed at lower significant bits ripple through the carry chain all the way to the most significant bit (MSB). In contrast, in GF(2^m) implementations the number of outputs is reduced to n using the irreducible polynomial P(x). The product terms are added for each column (output bit position) modulo 2, hence there is no carry propagation. For example, to represent the result in GF(2^4), with only four output bits, the four most significant bits in the result of the integer multiplication have to be reduced to GF(2^4). The result of such a reduction is shown in Figure 1. In GF(2^4), the input and output operands are represented using polynomials A(x), B(x) and Z(x):
$A(x) = \sum_{n=0}^{3} a_n x^n, \quad B(x) = \sum_{n=0}^{3} b_n x^n, \quad Z(x) = \sum_{n=0}^{3} z_n x^n$
The function of each multiplication bit s_i (i ∈ [0, 6]) is represented using polynomials in GF(2), namely:
$s_0 = a_0 b_0$
$s_1 = a_1 b_0 + a_0 b_1$
$s_2 = a_2 b_0 + a_1 b_1 + a_0 b_2$
$s_3 = a_3 b_0 + a_2 b_1 + a_1 b_2 + a_0 b_3$
$s_4 = a_3 b_1 + a_2 b_2 + a_1 b_3$
$s_5 = a_3 b_2 + a_2 b_3$
$s_6 = a_3 b_3$
The output bits z_n (n ∈ [0, 3]) are computed modulo the irreducible polynomial P(x). Using P_2(x) = x^4 + x + 1, we obtain:
$z_0 = s_0 + s_4, \quad z_1 = s_1 + s_4 + s_5, \quad z_2 = s_2 + s_5 + s_6, \quad z_3 = s_3 + s_6$
The coefficients of the multiplication results are shown in Figure 2 . In digital circuits, partial products are implemented using AND gates, and addition modulo 2 is done using XOR gates. Note that, unlike in integer multiplication, in GF(2 m ) circuits there is no carry out to the next bit. For this reason, as we can see in Figure 1 , the function of each output bit can be computed independently of other bits.
Figure 2: Extracted algebraic expressions of the four output bits of GF(2 4 ) multiplier for P (x) = x 4 + x + 1.
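The multiplication principle above can be illustrated with a minimal word-level sketch. It is not the gate-level circuit studied in this paper; the function name `gf2m_mul` and the bit-mask encoding of P(x) (bit k set when x^k is present) are assumptions made for illustration only.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m); poly encodes P(x) as a bit mask including x^m."""
    # Carry-less multiplication: partial products are combined with XOR,
    # so the raw product has degree at most 2m-2 and no carries occur.
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    # Reduce modulo P(x): clear the out-of-field bits from degree 2m-2 down to m.
    for k in range(2 * m - 2, m - 1, -1):
        if (prod >> k) & 1:
            prod ^= poly << (k - m)
    return prod


P1 = 0b11001  # x^4 + x^3 + 1
P2 = 0b10011  # x^4 + x + 1

# x^3 * x = x^4, which reduces differently under the two polynomials.
print(bin(gf2m_mul(0b1000, 0b0010, 4, P1)))
print(bin(gf2m_mul(0b1000, 0b0010, 4, P2)))
```

The two calls differ only in the reduction step, which is the point made in the text: the partial products are identical and the choice of P(x) alone decides where the out-of-field terms land.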
B. Irreducible Polynomials
In general, there are various irreducible polynomials that can be used for a given field size, each resulting in a different multiplication result. For constructing efficient arithmetic functions over GF(2^m), the irreducible polynomial is typically chosen to be a trinomial, x^m + x^a + 1, or a pentanomial, x^m + x^a + x^b + x^c + 1 [37]. For efficiency reasons, the exponents m and a are chosen such that m − a ≥ m/2.
An example of constructing GF(2 4 ) multiplication using two different irreducible polynomials is shown in Figure 1 . We can see that each polynomial produces a unique multiplication result. The size of the corresponding multiplier can be estimated by counting the number of XOR operations in each multiplication. Since the number of AND and XOR operations for generating partial products (variables s i in Figure 1 ) is the same, the difference is only caused by the reduction of the corresponding polynomials modulo P (x). The number of two-input XOR operations introduced by the reduction with P (x) can be obtained as the number of terms in each column minus one. For example, the number of XORs using P 1 (x) is 3+1+2+3=9; and using P 2 (x), the number of XORs is 1+2+2+1=6.
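The XOR counts quoted above can be reproduced with a short sketch. The helper names `reduction_columns` and `reduction_xors` and the bit-mask encoding of P(x) are assumptions for illustration; the code only models the reduction table, not a synthesized netlist.

```python
def reduction_columns(m, poly):
    """For each output z_0..z_{m-1}, list the indices k of the sets s_k added into it."""
    cols = [[i] for i in range(m)]            # in-field set s_i goes to column z_i
    for k in range(m, 2 * m - 1):             # out-of-field sets s_m .. s_{2m-2}
        r = 1 << k                            # start from x^k
        for d in range(k, m - 1, -1):         # reduce x^k modulo P(x)
            if (r >> d) & 1:
                r ^= poly << (d - m)
        for i in range(m):
            if (r >> i) & 1:
                cols[i].append(k)
    return cols


def reduction_xors(m, poly):
    """Two-input XORs per column: number of terms in the column minus one."""
    return [len(c) - 1 for c in reduction_columns(m, poly)]


P1 = 0b11001  # x^4 + x^3 + 1
P2 = 0b10011  # x^4 + x + 1
for name, p in (("P1", P1), ("P2", P2)):
    xors = reduction_xors(4, p)               # ordered z_0 .. z_3
    print(name, list(reversed(xors)), "total", sum(xors))
```

The per-column list is printed from z_3 down to z_0, which is the order in which the sums 3+1+2+3 and 1+2+2+1 appear in the text.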
As will be shown in the next section, given the structure of the GF(2 m ) multiplication, such as the one shown in Figure 1 , one can readily identify the irreducible polynomial P (x) used during the GF reduction. This can be done by extracting the terms s k corresponding to the entry s m (here s 4 ) in the table and generating the irreducible polynomial beyond x m . We know that P (x) must contain x m , and the remaining terms x k of P (x) are obtained from the non-zero terms corresponding to the entry s m . For example, for the irreducible polynomial P 1 (x) = x 4 + x 3 + x 0 , the terms x 3 and x 0 are obtained by noticing the placement of s 4 in columns z 3 and z 0 . Similarly, for P 2 (x) = x 4 + x 1 + x 0 , the terms x 1 and x 0 are obtained by noticing that s 4 is placed in columns z 1 and z 0 . The reason for it and the details of this procedure will be explained in the next section.
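As a small illustration of this observation, the sketch below reads P(x) off a column table by locating s_m. The function names and the list-of-lists table format are hypothetical; the column contents are transcribed from the GF(2^4) placements described in this paper for P_1(x) and derived by reduction for P_2(x).

```python
def irreducible_from_columns(columns, m):
    """columns[i] lists the indices k of the product sets s_k summed into z_i."""
    # P(x) always contains x^m; the remaining exponents are the columns holding s_m.
    tail = [i for i in range(m) if m in columns[i]]
    return [m] + sorted(tail, reverse=True)


def show(exps):
    """Render a list of exponents as a polynomial string."""
    return " + ".join("1" if e == 0 else "x" if e == 1 else f"x^{e}" for e in exps)


# Columns z_0..z_3 for P_1(x): s_4 sits in z_3 and z_0.
cols_p1 = [[0, 4, 5, 6], [1, 5, 6], [2, 6], [3, 4, 5, 6]]
# Columns z_0..z_3 for P_2(x): s_4 sits in z_1 and z_0.
cols_p2 = [[0, 4], [1, 4, 5], [2, 5, 6], [3, 6]]

print(show(irreducible_from_columns(cols_p1, 4)))
print(show(irreducible_from_columns(cols_p2, 4)))
```

Only the placement of s_m is consulted; the entries for s_5 and s_6 are carried along to show that they do not affect the result.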
IV. PARALLEL EXTRACTION IN GALOIS FIELD
In this section, we introduce our method for extracting the unique algebraic expressions of the output bits (e.g., Figure 2) using a computer algebra method. This can be used to verify GF(2^m) multipliers when the binary encoding of the inputs and outputs and the irreducible polynomial are given. We introduce a parallel function extraction framework in GF(2^m), which allows us to individually extract the algebraic expression of each output bit. This framework is used for reverse engineering, since our reverse engineering approach is based on analyzing the algebraic expressions of the output bits in GF(2), as introduced in Section I.
A. Computer Algebraic model
The circuit is modeled as a network of logic elements of arbitrary complexity, including basic logic gates (AND, OR, XOR, INV) and complex standard cell gates (AOI, OAI, etc.) generated by logic synthesis and technology mapping. We extend the algebraic model of Boolean operators developed in [10] for integer arithmetic to finite field arithmetic in GF(2), i.e., modulo 2. For example, the pseudo-Boolean model XOR(a, b) = a + b − 2ab reduces to (a + b − 2ab) mod 2 = (a + b) mod 2. The following algebraic equations are used to describe basic logic gates in GF(2^m) [1]:
$\neg a = 1 + a, \quad a \wedge b = a \cdot b, \quad a \vee b = a + b + a \cdot b, \quad a \oplus b = a + b \qquad (1)$
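The gate models can be checked by exhaustive evaluation. The sketch below assumes the usual GF(2) models (NOT a = 1 + a, AND = ab, OR = a + b + ab, XOR = a + b, all modulo 2) and compares them with Python's Boolean operators; the function names are illustrative only.

```python
from itertools import product


def inv_gf2(a):
    return (1 + a) % 2


def and_gf2(a, b):
    return (a * b) % 2


def or_gf2(a, b):
    return (a + b + a * b) % 2


def xor_gf2(a, b):
    return (a + b) % 2


for a, b in product((0, 1), repeat=2):
    assert inv_gf2(a) == int(not a)
    assert and_gf2(a, b) == (a & b)
    assert or_gf2(a, b) == (a | b)
    assert xor_gf2(a, b) == (a ^ b)
    # The integer pseudo-Boolean XOR model a + b - 2ab agrees with a + b modulo 2.
    assert (a + b - 2 * a * b) % 2 == xor_gf2(a, b)

print("all gate models agree with their truth tables")
```

Complex cells such as AOI or OAI can be handled the same way by composing these four models.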
B. Outline of the Approach
Similarly to the work of [2] and [10], the arithmetic function computed by the circuit is obtained by transforming (rewriting) the polynomial representing the encoding of the primary outputs (called the output signature) into the polynomial at the primary inputs, the input signature. The output signature of a GF(2^m) multiplier is
$Sig_{out} = \sum_{i=0}^{m-1} z_i x^i$
with z_i ∈ GF(2). The input signature is
$Sig_{in} = \sum_{i=0}^{m-1} P_i x^i$
with coefficients P_i ∈ GF(2) being product terms, and the addition operation performed modulo 2. If the irreducible polynomial P(x) is provided, Sig_in is known; otherwise, it will be computed by backward rewriting from Sig_out. The goal is to transform the output signature, Sig_out, using the polynomial representation of the internal logic elements (1), into an input signature Sig_in in GF(2^m), which determines the arithmetic function (specification) computed by the circuit.
Theorem 1: Given a combinational arithmetic circuit in GF(2^m), composed of logic gates described by Eq. (1), the input signature Sig_in computed by backward rewriting is unique and correctly represents the function implemented by the circuit in GF(2^m).
Proof: The proof of correctness relies on the fact that each transformation step (rewriting iteration) is correct. That is, each internal signal is represented by an algebraic expression, which always evaluates to a correct value in GF(2^m). This is guaranteed by the correctness of the algebraic model in Eq. (1), which can be proved easily by inspection. For example, the algebraic expression of XOR(a, b) in GF(2^m) is represented by a + b. The proof of uniqueness is done by induction on i, the step of transforming polynomial F_i into F_{i+1}. A detailed induction proof of this theorem is provided in [2].
Theorem 1, together with the algebraic model of Eq. (1), provides the basis for polynomial reduction using backward rewriting. This is described by Algorithm 1. The method takes the gate-level netlist of a GF(2^m) multiplier as input and first converts each logic gate into an algebraic expression using Eq. (1). The rewriting process starts with the output signature F_0 = Sig_out and performs rewriting in reverse topological order, from outputs to inputs. It ends when all the variables in F_i are primary inputs, at which point it becomes the input signature Sig_in [2].
Each iteration includes two basic steps: 1) substitute the variable of the gate output using the expression in the inputs of the gate (Eq. 1), and name the new expression F_{i+1} (lines 3 - 6); and 2) simplify the new expression in two ways: a) by eliminating terms that cancel each other (as in the integer arithmetic case [2]), and b) by removing all the monomials (including constants) that reduce to 0 in GF(2). The final expression can be used to verify whether the circuit performs the desired arithmetic function by checking whether the computed polynomial Sig_in matches the expected specification, if known. This equivalence check can be readily performed using canonical word-level representations, such as BMD [15] or TED [17], which can efficiently check equivalence of two polynomials. Alternatively, if the specification is not known, the computed signature can serve as the specification extracted from the circuit.
Figure 3: The gate-level netlist of a post-synthesized and mapped 2-bit multiplier over GF(2^2). The irreducible polynomial is P(x) = x^2 + x + 1.
Example 2 (Figure 3): We illustrate our method using a post-synthesized 2-bit multiplier in GF(2^2), shown in Figure 3. The irreducible polynomial is P(x) = x^2 + x + 1. The output signature is Sig_out = z_0 + z_1 x, and the input signature is
$Sig_{in} = (a_0 b_0 + a_1 b_1) + (a_1 b_1 + a_1 b_0 + a_0 b_1) x$
Sig_out is transformed into F_8 using the polynomial of gate g_8, z_1 = i_5 + i_6, and simplified to F_8 = z_0 + i_5 x + i_6 x. Then, the polynomials F_i are successively derived from F_{i+1} and checked for a possible reduction. The first reduction happens when F_5 is transformed into F_4, where i_4 (at gate g_4) is replaced by (1 + a_0 b_1). After simplification, a monomial 2x is identified and removed modulo 2 from F_4. Similar reductions are applied during the transformations F_3 → F_2 and F_2 → F_1. Finally, the function of the design is extracted as expression F_1. The complete rewriting process is shown in Figure 4. We can see that F_1 = Sig_in, which indicates that the circuit indeed implements the GF(2^2) multiplication with P(x) = x^2 + x + 1. An important observation is that the potential reductions take place only within the expression associated with the same degree of the polynomial ring (Sig_out). In other words, the reductions happen in the logic cone of every output bit independently of other bits, regardless of logic sharing between the cones. For example, the reductions in F_4 and F_2 happen within the logic cone of output z_1 only. Similarly, in F_1, the reduction is within the logic cone of z_0. Details of the proof are provided in [13].
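Backward rewriting on this example can be sketched as follows. This is not the authors' Algorithm 1 or their C++ implementation: polynomials over GF(2) are modeled as Python sets of monomials (each monomial a frozenset of variable names), so that modulo-2 cancellation becomes a symmetric difference. The gate equations for g8..g1 are an assumption, reconstructed from the rewriting trace listed with Figure 6, and the symbol x is carried as an ordinary tag variable.

```python
ONE = frozenset()                      # the constant monomial 1


def V(*names):
    """Monomial that is the product of the named variables."""
    return frozenset(names)


def substitute(f, var, expr):
    """Replace var in polynomial f by expr, cancelling equal monomials modulo 2."""
    out = set()
    for mono in f:
        if var in mono:
            rest = mono - {var}
            for term in expr:
                out ^= {rest | term}   # a monomial produced twice cancels out
        else:
            out ^= {mono}
    return out


# Gates in reverse topological order (g8 first), as (output, expression) pairs.
GATES = [
    ("z1", {V("i5"), V("i6")}),        # g8: XOR
    ("z0", {V("i1"), V("i2")}),        # g7: XOR
    ("i5", {ONE, V("i2")}),            # g6: INV
    ("i6", {V("i3"), V("i4")}),        # g5: XOR
    ("i4", {ONE, V("a0", "b1")}),      # g4: NAND
    ("i3", {ONE, V("a1", "b0")}),      # g3: NAND
    ("i2", {ONE, V("a1", "b1")}),      # g2: NAND
    ("i1", {ONE, V("a0", "b0")}),      # g1: NAND
]


def show(f):
    return " + ".join(sorted("".join(sorted(m)) or "1" for m in f))


sig = {V("z0"), V("z1", "x")}          # Sig_out = z0 + z1*x
for var, expr in GATES:
    sig = substitute(sig, var, expr)
print(show(sig))
```

Under these assumptions the loop ends with the five monomials of z0 = a0b0 + a1b1 and z1 = x(a1b1 + a1b0 + a0b1), and the monomials that cancel along the way are the constants and the repeated x terms described in the example.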
C. Implementation
This section describes the implementation of our parallel verification method for Galois field multipliers. Our approach takes the gate-level netlist as input, and outputs the extracted function of the design. It includes four steps:
Step1
Step3: Parallel extraction. Apply Algorithm 1 to each equation file to extract the polynomial expression of each output in parallel. In contrast to the work on integer arithmetic [2], the internal expression of each output bit does not offer any polynomial reduction (monomial cancellations) with other bits. Ideally, our approach can extract a GF(2^m) multiplier in m threads. However, due to limited computing resources, it is impractical to run m threads when m is very large. Hence, our approach puts a limit T on the number of parallel threads (T = 5, 10, 20 and 30 have been tested in this work). This process is illustrated in Figure 5. The m extraction tasks are organized into several task sets, ordered from LSB to MSB. In each set, the extractions are performed in parallel. Since the runtime of each extraction within the set can differ, a task in the next set starts as soon as any previous task terminates.
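The scheduling policy of this step can be sketched with a bounded worker pool. This is only an illustration under stated assumptions: the names `extract` and `parallel_extract` are hypothetical, the netlist is the GF(2^2) example reconstructed from the Figure 6 trace, and Python threads are used to show the task ordering rather than to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

ONE = frozenset()


def V(*names):
    return frozenset(names)


# Gate equations in reverse topological order (assumed, see lead-in).
GATES = [
    ("z1", {V("i5"), V("i6")}), ("z0", {V("i1"), V("i2")}),
    ("i5", {ONE, V("i2")}), ("i6", {V("i3"), V("i4")}),
    ("i4", {ONE, V("a0", "b1")}), ("i3", {ONE, V("a1", "b0")}),
    ("i2", {ONE, V("a1", "b1")}), ("i1", {ONE, V("a0", "b0")}),
]


def extract(output):
    """Rewrite the signature of a single output bit back to the primary inputs."""
    f = {V(output)}
    for var, expr in GATES:
        nxt = set()
        for mono in f:
            if var in mono:
                rest = mono - {var}
                for term in expr:
                    nxt ^= {rest | term}   # cancellation modulo 2
            else:
                nxt ^= {mono}
        f = nxt
    return f


def parallel_extract(outputs, max_threads):
    """Run one extraction per output bit with at most max_threads running at once."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Tasks are submitted LSB first; a free worker picks up the next one.
        return dict(zip(outputs, pool.map(extract, outputs)))


result = parallel_extract(["z0", "z1"], max_threads=2)
for name, poly in result.items():
    print(name, "=", " + ".join(sorted("".join(sorted(m)) for m in poly)))
```

Because no monomial of one output cone can cancel against another cone, the tasks share no mutable state, which is what makes this embarrassingly parallel formulation possible.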
Step4: Finalization. Compute the final function of the multiplier. Once the algebraic expression of each output bit in GF(2) is computed, our method computes the final function by constructing the Sig_out using the rewriting process in step 3.
Our algorithm uses a data structure that efficiently implements iterative substitution and elimination during backward rewriting. It is similar to the data structure employed in function extraction for integer arithmetic circuits [2], suitably modified to support simplifications in finite field algebra. Specifically, in addition to cancellation of terms with opposite signs, it performs modulo 2 reduction of monomials and constants. The data structure maintains a record of the terms (monomials) in the expression that contain the variable to be substituted. This reduces the cost of finding the terms that will have their coefficients changed during substitution. Each element represents one monomial, consisting of the variables in the monomial and its coefficient. The expression data structure is a C++ object that represents a pseudo-Boolean expression and consists of all the elements in the data structure. It supports both fast addition and fast substitution with two C++ maps, implemented as binary search trees: a terms map and a substitution map. This data structure includes two cases of simplification: 1) after substitution, the coefficients of all the monomials are updated and the monomials with coefficient zero are eliminated; 2) the monomials whose coefficients evaluate to 0 modulo 2 are eliminated. The second case is applied after each substitution.
Example 3 (Figure 6): We illustrate our parallel extraction method using a 2-bit multiplier in GF(2^2) in Figure 3. The output signature Sig_out = z_0 + z_1 x is split into two signatures, Sig_out0 = z_0 and Sig_out1 = z_1. Then, the rewriting process is applied to Sig_out0 and Sig_out1 in parallel.
When Sig out0 and Sig out1 have been successfully extracted, the two signatures are merged into Sig out0 + x·Sig out1 , resulting in the polynomial Sig in . In Figure 4 , we can see that elimination happens three times (F 4 , F 2 , and F 1 ). As expected, this happens within each element in GF(2 n ). In Figure 6 one elimination in Sig out0 and two eliminations in Sig out1 have been done independently, as shown earlier (refer to Example 2).
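The two-map expression structure described above can be approximated as follows. This is a hedged Python sketch, not the authors' C++ object: Python dictionaries (hash tables) stand in for the binary-search-tree maps, and the class and method names are hypothetical. The gate equations are again the assumed GF(2^2) netlist reconstructed from the Figure 6 trace.

```python
from collections import defaultdict

ONE = frozenset()


def V(*names):
    return frozenset(names)


class Expression:
    """Pseudo-Boolean expression with a terms map and a substitution map."""

    def __init__(self):
        self.terms = {}                    # monomial -> integer coefficient
        self.index = defaultdict(set)      # variable -> monomials containing it

    def _remove(self, mono):
        if self.terms.pop(mono, None) is not None:
            for v in mono:
                self.index[v].discard(mono)

    def add(self, mono, coeff=1):
        c = self.terms.get(mono, 0) + coeff
        if c == 0:                         # case 1: coefficient became zero
            self._remove(mono)
        else:
            self.terms[mono] = c
            for v in mono:
                self.index[v].add(mono)

    def reduce_mod2(self):
        for mono, c in list(self.terms.items()):
            if c % 2 == 0:                 # case 2: coefficient is 0 modulo 2
                self._remove(mono)
            else:
                self.terms[mono] = 1

    def substitute(self, var, expr):
        """expr is a list of (monomial, coefficient) pairs for the gate output var."""
        # The substitution map gives the affected monomials without a full scan.
        for mono in list(self.index.get(var, ())):
            coeff = self.terms[mono]
            self._remove(mono)
            rest = mono - {var}
            for term, c in expr:
                self.add(rest | term, coeff * c)
        self.reduce_mod2()


# Cone of z1, rewritten gate by gate (gate equations assumed, see lead-in).
cone_z1 = [
    ("z1", [(V("i5"), 1), (V("i6"), 1)]),
    ("i5", [(ONE, 1), (V("i2"), 1)]),
    ("i6", [(V("i3"), 1), (V("i4"), 1)]),
    ("i4", [(ONE, 1), (V("a0", "b1"), 1)]),
    ("i3", [(ONE, 1), (V("a1", "b0"), 1)]),
    ("i2", [(ONE, 1), (V("a1", "b1"), 1)]),
]
e = Expression()
e.add(V("z1"))
for var, expr in cone_z1:
    e.substitute(var, expr)
print(sorted("".join(sorted(m)) for m in e.terms))
```

Keeping integer coefficients until `reduce_mod2` runs mirrors the description in the text, where a constant such as 2 is first accumulated and then eliminated modulo 2 after the substitution.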
Figure 5: Parallel extraction flow. Each task takes the equations of the netlist together with one output signature: Sig_out = z_0, Sig_out = z_1, ..., Sig_out = z_{m-2}, ...
Figure 6: Rewriting trace of the two output signatures.
Gate  Sig_out0 = z0      elim  Sig_out1 = x·z1            elim
G8    z0                 -     i5x+i6x                    -
G7    i1+i2              -     i5x+i6x                    -
G6    i1+i2              -     i2x+x+i6x                  -
G5    i1+i2              -     i2x+x+i3x+i4x              -
G4    i1+i2              -     i2x+x+i3x+a0b1x+x          2x
G3    i1+i2              -     i2x+a1b0x+x+a0b1x          -
G2    i1+a1b1+1          -     a1b1x+x+a1b0x+x+a0b1x      2x
G1    1+a0b0+a1b1+1      2     x(a1b1+a1b0+a0b1)          -
Result: z0 = a0b0+a1b1, z1 = x(a1b1+a1b0+a0b1)
V. REVERSE ENGINEERING
In this section, we present our approach to perform reverse engineering of GF(2 m ) multipliers. Using the extraction technique presented in the previous section, we can extract the algebraic expression of each output bit. In contrast to the algebraic techniques of [6] [10], our extraction technique can extract the algebraic expression of each output bit independently. This means that the extraction can be done without the knowledge of the bit position of the inputs and outputs. Two theorems are provided and proved to support this claim.
In a GF(2^m) multiplication, let s_i (i ∈ {0, 1, ..., 2m−2}) be a set of partial products generated by AND gates and combined with XOR operations. For example, in Figure 1, there are seven product sets, s_0, s_1, ..., s_6, where s_1 = a_1 b_0 + a_0 b_1; or written as a set: s_1 = {a_1 b_0, a_0 b_1}, etc. These product sets are divided into two groups: those with index i ≤ m − 1, called in-field product sets; and those with index i ≥ m, called out-of-field product sets. The in-field product sets s_i, in this case s_0, s_1, s_2, s_3, correspond to the output bits z_i. The out-of-field product sets will be reduced into the field GF(2^m) using the mod P(x) operation, and assigned to the respective output bits, as determined by P(x). In the case of Figure 1, the out-of-field sets are s_4, s_5, s_6. In general, for a GF(2^m) multiplication, m product sets are in-field, and m−1 product sets are out-of-field [38].
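The grouping of partial products into product sets can be sketched as below. The function name `product_sets` and the tuple encoding (i, j) for a_i·b_j are assumptions for illustration.

```python
def product_sets(m):
    """Return [s_0, ..., s_{2m-2}], where s_k holds the pairs (i, j) with i + j = k."""
    sets = [[] for _ in range(2 * m - 1)]
    for i in range(m):
        for j in range(m):
            sets[i + j].append((i, j))     # partial product a_i * b_j
    return sets


m = 4
s = product_sets(m)
in_field, out_of_field = s[:m], s[m:]
print("s_1 =", s[1])
print("in-field sets:", len(in_field), "out-of-field sets:", len(out_of_field))
print("set sizes:", [len(k) for k in s])
```

For m = 4 this gives four in-field and three out-of-field sets, matching the m and m−1 counts stated above, and the in-field set s_i contains i + 1 products.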
A. Output Encoding Determination
We will now demonstrate how to determine the encoding, and hence bit position, of the outputs.
Theorem 2: Given a GF(2^m) multiplication, the in-field product sets (s_0, s_1, ..., s_{m−1}) appear in exactly one element of GF(2^m) each, and the out-of-field product sets (s_m, s_{m+1}, ..., s_{2m−2}) appear in at least two elements (outputs) of GF(2^m), as a result of reduction mod P(x).
Proof: An irreducible polynomial in GF($2^m$) has the standard form $P(x) = x^m + P'(x)$, where the tail polynomial $P'(x)$ contains at least two monomials $x^d$ with degree $d < m$. For example, there are two such monomials for a trinomial, four for a pentanomial, etc. Since $P(x) = 0$, we have $x^m = P'(x)$ in GF($2^m$). Hence the variable $x^m$, associated with the first out-of-field partial product set $s_m$, will appear in at least two outputs, determined by $P'(x)$. Other variables $x^k$, associated with the out-of-field partial product sets $s_k$ for $k > m$, can be expressed as
$$x^k = x^{k-m} x^m = x^{k-m} P'(x)$$
and will contain at least two elements. QED
In fact, the number of outputs in which the out-of-field set $s_k$ appears is equal to the number of monomials in the above product $x^{k-m} P'(x)$, provided that every monomial $x^j$ with $j \ge m$ is recursively reduced mod $P(x)$, i.e., by using the relation $x^m = P'(x)$. We illustrate this fact with an example of multiplication in GF($2^4$) using the irreducible polynomial $P_1(x) = x^4 + x^3 + 1$, shown on the left side of Figure 1. The in-field sets, associated with outputs $z_0, z_1, z_2, z_3$, are $s_0, s_1, s_2, s_3$. Since $P_1(x) = x^4 + x^3 + 1 = 0$, we obtain $x^4 = x^3 + 1$. This means that set $s_4$ appears in two output columns, $z_3$ and $z_0$. Then
$$x^5 = x \cdot x^4 = x(x^3 + 1) = x^4 + x = x^3 + x + 1,$$
which means that $s_5$ appears in three outputs: $z_3, z_1, z_0$. Finally,
$$x^6 = x \cdot x^5 = x(x^3 + x + 1) = x^4 + x^2 + x = x^3 + x^2 + x + 1,$$
that is, $s_6$ appears in four outputs: $z_3, z_2, z_1, z_0$. As expected, this matches the left table in Figure 1. Note the recursive derivation of $x^k$ for $k > m$, which increases the number of columns to which a given set $s_k$ is assigned.
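The recursive reduction used in this example can be replayed mechanically. The following Python sketch is illustrative only; the function name `reduce_power` and the representation of the tail polynomial as a set of exponents are assumptions made here, not part of the paper's implementation.

```python
def reduce_power(k, m, tail):
    """Return the exponents d < m such that x^k = sum of x^d (mod P).

    P(x) = x^m + sum(x^d for d in tail). Coefficients live in GF(2), so a
    monomial generated twice cancels; toggling set membership models this.
    """
    exps = {k}
    while exps and max(exps) >= m:
        top = max(exps)
        exps.remove(top)
        # Replace x^top by x^(top-m) * P'(x), one tail monomial at a time.
        for d in tail:
            exps ^= {top - m + d}
    return exps


# P_1(x) = x^4 + x^3 + 1, so the tail exponents are {3, 0}.
for k in (4, 5, 6):
    outputs = sorted(reduce_power(k, 4, {3, 0}), reverse=True)
    print(f"s_{k} appears in outputs", [f"z_{d}" for d in outputs])
```

Running it for $k = 4, 5, 6$ lists two, three, and four outputs respectively, in agreement with the derivation above.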
Based on Theorem 2, we can find the in-field product sets, $s_0, s_1, \ldots, s_{m-1}$, by searching for the unique products in the resulting algebraic expressions of the output bits. In this context, unique products are the products that exist in only one of the extracted algebraic expressions. Since the in-field product set indicates the bit position of the output, we can determine the bit positions of the output bits as soon as all the in-field product sets are identified.
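A possible way to express this search is shown below. It is a sketch under stated assumptions: each extracted expression is given as a set of products, the output names (`n7`, `n2`, ...) are arbitrary placeholders carrying no position information, and the helper `s` that builds the demo data for $P(x) = x^4 + x + 1$ is hypothetical. The position is recovered from the fact that $s_i$ contains $i + 1$ products.

```python
from collections import Counter


def output_positions(expressions):
    """Map each output name to its bit position.

    `expressions` maps an output name to the set of products in its
    extracted expression. Products seen in exactly one expression form
    the in-field set s_i, which has i + 1 products.
    """
    seen = Counter(p for prods in expressions.values() for p in prods)
    positions = {}
    for name, prods in expressions.items():
        unique = {p for p in prods if seen[p] == 1}
        positions[name] = len(unique) - 1
    return positions


def s(i, m=4):
    """Demo helper: the product set s_i of an m-bit multiplication."""
    return {(f"a{j}", f"b{i - j}") for j in range(m) if 0 <= i - j < m}


# Expressions of a GF(2^4) multiplier with P(x) = x^4 + x + 1, outputs shuffled.
exprs = {
    "n7": s(1) | s(4) | s(5),
    "n2": s(3) | s(6),
    "n9": s(0) | s(4),
    "n4": s(2) | s(5) | s(6),
}
print(output_positions(exprs))  # {'n7': 1, 'n2': 3, 'n9': 0, 'n4': 2}
```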
Example 4 (Figure 2): We illustrate the procedure of determining bit positions with an example of a GF($2^4$) multiplier implemented using the irreducible polynomial $P_2(x) = x^4 + x + 1$ (see Figure 1). The extracted algebraic expressions of the four output bits are shown in Figure 2. Note that the labels of the variables do not offer any knowledge of the bit positions of the inputs and outputs. We first identify the unique products that include set $s_0 = a_0 b_0$ in the algebraic expression of $z_0$; set
Note that the number of products in the in-field product set $s_i$ is $i + 1$. Hence, we find all the in-field product sets and their relation to the extracted algebraic expressions to be as follows:
B. Input Encoding Determination
We can now determine the bit position of the input variables using the procedure outlined in Algorithm 2. The input bit position can be determined by analyzing the in-field product sets obtained in the previous step. Based on the GF multiplication algorithm, we know that $s_0$ is generated by an AND function of the two LSBs of the two inputs, and that the two products in $s_1$ are generated by AND and XOR operations using the two LSBs and the two 2nd input bits, etc. For example, in a GF($2^4$) multiplication (Figure 1), $s_0 = a_0 b_0$ is formed from the LSBs $a_0$ and $b_0$, and $s_1 = a_1 b_0 + a_0 b_1$ adds the 2nd LSBs $a_1$ and $b_1$. This allows us to determine the bit position of the input bits recursively by analyzing the algebraic expression of $s_i$. We illustrate this with the GF($2^4$) multiplier implemented using $P_2(x) = x^4 + x + 1$ (Figure 2).
Example 5 (Algorithm 2): The input of our algorithm is a set of algebraic expressions of the in-field product sets, $s_0, s_1, s_2, s_3$ (line 1). We initialize vector $V$ to store the variables whose bit positions have been assigned (line 2). The first algebraic expression is $s_0$. Since the two variables $a_0$ and $b_0$ are not in $V$, the bit positions of these two variables are assigned index $i = 0$ (lines 4-8). In the second iteration, $V = \{a_0, b_0\}$, and the input algebraic expression is $s_1$, which includes variables $a_0$, $b_0$, $a_1$, and $b_1$. Because $a_1$ and $b_1$ are not in $V$, their bit position is $i = 1$. The loop ends when all the algebraic expressions in $S$ have been visited, and returns $V = \{(a_0, b_0)_0, (a_1, b_1)_1, (a_2, b_2)_2, (a_3, b_3)_3\}$. The subscripts are the bit position values of the variables returned by the algorithm. Note that this procedure only gives the bit position of the input bits; how the input words are constructed remains unknown.
There are $2^{m-1}$ combinations from which the words can be constructed using the information returned in $V$. For example, the two input words can be $W_0 = a_0 + 2a_1 + 4b_2 + 8a_3$ and $W_1 = b_0 + 2b_1 + 4a_2 + 8b_3$; or they can be $W'_0 = a_0 + 2a_1 + 4b_2 + 8b_3$ and $W'_1 = b_0 + 2b_1 + 4a_2 + 8a_3$. Although there may be many combinations for constructing the input words, the specification of the GF($2^m$) multiplication is unique.
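The procedure walked through in Example 5 can be approximated by the short Python sketch below. It follows the description in the text rather than the original Algorithm 2 listing; the function name `input_bit_positions`, the dictionary standing in for vector $V$, and the demo helper `s` are assumptions made for illustration.

```python
def input_bit_positions(in_field_sets):
    """Assign a bit position to every input variable.

    `in_field_sets` is the list [s_0, s_1, ..., s_(m-1)], each a set of
    products given as pairs of variable names. A variable first seen in
    s_i receives position i.
    """
    assigned = {}                          # plays the role of vector V
    for i, products in enumerate(in_field_sets):
        for product in products:
            for var in product:
                if var not in assigned:    # not yet in V
                    assigned[var] = i
    return assigned


def s(i, m=4):
    """Demo helper: the product set s_i of an m-bit multiplication."""
    return {(f"a{j}", f"b{i - j}") for j in range(m) if 0 <= i - j < m}


positions = input_bit_positions([s(0), s(1), s(2), s(3)])
print(sorted(positions.items()))
```

The result pairs $a_i$ and $b_i$ at position $i$ but, as noted above, says nothing about which of the two variables belongs to which input word.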
C. Extraction of the Irreducible Polynomial
Theorem 3: Given a multiplication in GF($2^m$), let the first out-of-field product set be $s_m$. Then the irreducible polynomial $P(x)$ includes the monomial $x^m$, and it includes $x^i$, for $i < m$, iff all products in the set $s_m$ appear in the algebraic expression of the $i$-th output bit.
Proof: Based on the definition of field arithmetic for GF($2^m$), the polynomial basis representation of $s_m$ is $x^m s_m$. To reduce $s_m$ into elements in the range $[0, m-1]$, the field reductions are performed modulo the irreducible polynomial $P(x)$ with highest degree $m$ (cf. the proof of Theorem 2). As before, let $P(x) = x^m + P'(x)$. Then,
$$x^m s_m \bmod P(x) = P'(x)\, s_m.$$
Hence, if $x^i$ exists in $P'(x)$, it also exists in $P(x)$. Therefore, the products of $s_m$ appear in the $i$-th output bit exactly when $x^i$ is a monomial of $P(x)$.
Even though the input bit positions have been determined in the previous step, we cannot directly generate $s_m$, since the combination of the input bits used to construct the input words is still unknown. In Example 5 ($m = 4$), $s_m = \{a_1 b_3, a_2 b_2, a_3 b_1\}$ for the first pair of input words, $W_0$ and $W_1$, but $s_m = \{a_1 a_3, a_2 b_2, b_1 b_3\}$ for the second pair, $W'_0$ and $W'_1$. To overcome this limitation, we create a set of products $s'_m$, which includes all the possible products that can be generated based on all input combinations. The set $s'_m$ includes the true products, i.e., those that exist in the first out-of-field product set, and it also includes some dummy products. The dummy products are those that never appear in the resulting algebraic expressions. Hence, we first generate the set $s'_m$ and eliminate the dummy products by searching the algebraic expressions. After this, we obtain $s_m$. Then, we use $s_m$ to extract the irreducible polynomial $P(x)$ using Algorithm 3.
Example 6: We illustrate the method of reverse engineering the irreducible polynomial using the GF($2^4$) multiplier of Figure 1. The procedure is outlined in Algorithm 3. The extracted algebraic expressions $S$ (line 1 of Algorithm 3) are shown in Figure 2. The bit position of the input bits is determined by Algorithm 2 (line 2). Based on the result of Algorithm 2, we generate $s'_m = \{a_1 a_3, b_1 b_3, a_2 b_2, a_3 b_1, a_1 b_3\}$. To eliminate the dummy products from $s'_m$, we search all algebraic expressions in $S$ and eliminate the products that do not appear in any of them. In this case, we find that $a_1 a_3$ and $b_1 b_3$ are the dummy products. Hence, we get $s_m = \{a_3 b_1, a_2 b_2, a_1 b_3\}$ (line 3). Based on the definition of the irreducible polynomial, $P(x)$ must include $x^m$; in this example $m = 4$ (line 4). While looping over all the algebraic expressions, we find that the expressions for $z_0$ and $z_1$ contain all the products of $s_m$. Hence, $x^0$ and $x^1$ are included in $P(x)$, so that $P(x) = x^4 + x^1 + x^0$. We can see that it is the same as $P_2(x)$ in Figure 1.
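Example 6 can be replayed with the following Python sketch. It is written from the prose description above, not from the original Algorithm 3 listing; the function names `candidate_sm` and `extract_polynomial`, the use of `frozenset` for products, and the demo helper `s` are assumptions introduced here.

```python
from itertools import combinations


def candidate_sm(positions, m):
    """Build s'_m: every product of two variables whose positions sum to m."""
    names = sorted(positions)
    return {frozenset(pair) for pair in combinations(names, 2)
            if positions[pair[0]] + positions[pair[1]] == m}


def extract_polynomial(expressions, positions, m):
    """Return the exponents of P(x), highest first.

    `expressions[i]` is the set of products of the output bit at position i.
    """
    all_products = set().union(*expressions)
    # Drop dummy products: candidates that never occur in any expression.
    s_m = {p for p in candidate_sm(positions, m) if p in all_products}
    exponents = [m]                       # P(x) always contains x^m
    for i, expr in enumerate(expressions):
        if s_m <= expr:                   # all products of s_m exist in exp_i
            exponents.append(i)
    return sorted(exponents, reverse=True)


def s(i, m=4):
    """Demo helper: the product set s_i of an m-bit multiplication."""
    return {frozenset((f"a{j}", f"b{i - j}")) for j in range(m) if 0 <= i - j < m}


positions = {f"{v}{i}": i for i in range(4) for v in "ab"}
# Output expressions z_0..z_3 of a multiplier built with x^4 + x + 1.
exprs = [s(0) | s(4), s(1) | s(4) | s(5), s(2) | s(5) | s(6), s(3) | s(6)]
print(extract_polynomial(exprs, positions, 4))  # [4, 1, 0]
```

The candidate set contains five products, of which $a_1 a_3$ and $b_1 b_3$ are discarded as dummies, as in the example.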
Input: the algebraic expressions of output bits $S$
Input: the first out-of-field product set $s_m$
Output: irreducible polynomial $P(x)$
if all products in $s_m$ exist in $exp_i$ then
In summary, using the framework presented in Section IV-C, we first extract the algebraic expressions of all output bits. Then, we analyze the algebraic expressions to find the bit positions of the input bits and the output bits, and extract the irreducible polynomial $P(x)$. In the example of the GF($2^4$) multiplier implemented using $P(x) = x^4 + x + 1$, shown in Figure 1, the final results returned by our approach are the following: 1) the input bit set $\{(a_0, b_0)_0, (a_1, b_1)_1, (a_2, b_2)_2, (a_3, b_3)_3\}$, where the subscripts represent the bit position; 2) $z_0$ is the least significant bit (LSB), $z_1$ is the 2nd output bit, $z_2$ is the 3rd output bit, and $z_3$ is the most significant bit (MSB); 3) the irreducible polynomial is $P(x) = x^4 + x + 1$; 4) the specification can be verified using the approach presented in Section IV with the reverse engineered information.
VI. RESULTS
The experimental results of our method are presented in two subsections: 1) evaluation of parallel verification of GF($2^m$) multipliers; and 2) evaluation of reverse engineering of GF($2^m$) multipliers. The results given in this section include data (total time and maximum memory) for the entire verification or reverse engineering process, including translating the gate-level Verilog netlist to algebraic equations, performing backward rewriting, and other required functions.
A. Parallel Verification of GF($2^m$) Multipliers
The verification technique for GF($2^m$) multipliers presented in Section IV was implemented in C++. It performs backward rewriting with variable substitution and polynomial reductions in Galois field in a parallel fashion. The program was tested on a number of combinational gate-level GF($2^m$) multipliers taken from [6], including the Montgomery multipliers [39] and Mastrovito multipliers [40]. The bit-width of the multipliers varies from 32 to 571 bits. The verification results for various Galois field multipliers obtained using SAT, SMT, ABC [41], and Singular [42] have already been presented in [1] and [6]. They clearly demonstrate that techniques based on computer algebra perform significantly better than other known techniques. Hence, in this work, we only compare our approach to those two, and specifically to the tool described in [6]. However, in contrast to the previous work on Galois field verification, all the GF($2^m$) multipliers used in this paper are bit-blasted gate-level implementations. The bit-level multipliers are taken from [6] and mapped onto gate-level circuits using ABC [41].
TABLE II: Results of verifying Montgomery multipliers using our parallel approach. T is the number of threads. TO = timeout of 18 hours. (*T = 1 shows the maximum memory usage of a single thread.)
Our experiments were conducted on a PC with Intel(R) Xeon CPU E5-2420 v2 2.20 GHz ×12 with 32 GB memory. As described in Section IV, our technique can verify Galois field multipliers in multiple threads by applying Algorithm 1 to each output bit in parallel. The number of threads is given as input to the tool.
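The idea of extracting each output bit independently, with the thread count supplied by the user, can be sketched as follows. This Python example is not the authors' C++ tool: the netlist format, the names `NETLIST`, `poly_of`, and `extract_all`, and the recursive output-to-input substitution that stands in for backward rewriting are all assumptions. The toy circuit is a GF($2^2$) multiplier for $P(x) = x^2 + x + 1$. Because of Python's global interpreter lock, the sketch shows only the structure of the computation and is not expected to reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy netlist (hypothetical format): gate output -> (operation, input, input).
NETLIST = {
    "p0": ("AND", "a0", "b0"),
    "p1": ("AND", "a1", "b0"),
    "p2": ("AND", "a0", "b1"),
    "p3": ("AND", "a1", "b1"),
    "z0": ("XOR", "p0", "p3"),
    "n1": ("XOR", "p1", "p2"),
    "z1": ("XOR", "n1", "p3"),
}


def poly_of(signal, netlist):
    """Return the GF(2) polynomial of a signal as a set of monomials.

    A monomial is a frozenset of input names. XOR is addition mod 2
    (symmetric difference); AND is multiplication with x*x = x.
    """
    if signal not in netlist:                 # primary input
        return {frozenset([signal])}
    op, x, y = netlist[signal]
    px, py = poly_of(x, netlist), poly_of(y, netlist)
    if op == "XOR":
        return px ^ py
    result = set()
    for mx in px:
        for my in py:
            result ^= {mx | my}               # equal monomials cancel mod 2
    return result


def extract_all(outputs, netlist, threads):
    """Extract every output expression, one independent task per output bit."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        polys = pool.map(lambda out: poly_of(out, netlist), outputs)
        return dict(zip(outputs, polys))


for name, poly in extract_all(["z0", "z1"], NETLIST, threads=2).items():
    print(name, sorted("*".join(sorted(mono)) for mono in poly))
```

No task reads the result of another, which is the property that makes one-thread-per-output-bit scheduling possible.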
The experimental results of our approach and the comparison with [6] are shown in Table I for gate-level Mastrovito multipliers with bit-width varying from 32 to 571 bits. These multipliers are directly mapped using ABC without any optimization. The largest circuit includes over 1.6 million gates. This is also the number of polynomial equations and the number of rewriting iterations (see Section IV). The results generated by the tool presented in [6] are shown in columns 3 and 4 of Table I. We performed four series of experiments, with the number of threads T set to 5, 10, 20, and 30. The table shows CPU runtime and memory usage for the different values of T. The timeout limit (TO) was set to 12 hours and the memory limit (MO) to 32 GB. The experimental results show that our approach provides on average 26.2×, 37.8×, 42.7×, and 44.3× speedup for T = 5, 10, 20, and 30 threads, respectively. Our approach can verify multipliers up to 571 bits wide in 1.5 hours, while that of [6] fails after 12 hours.
The reported memory usage of our approach is the maximum memory usage per thread, Mem. Our tool reaches its peak memory usage when all T threads are running at the same time; in this case, the total memory usage is T · Mem. This is why the 571-bit Mastrovito multipliers could be successfully verified with T = 5 and 10, but failed with T = 20 and 30 threads. For example, the peak memory usage of the 571-bit Mastrovito multiplier with T = 20 is 2.6 × 20 = 52 GB, which exceeds the available memory limit.
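The memory estimate above is simple enough to check before launching a run. The sketch below is only an illustration of the T · Mem rule; the function names are hypothetical, and it assumes that the per-thread figure of 2.6 GB quoted for T = 20 also applies to other thread counts, which the text does not state.

```python
def peak_memory_gb(threads, per_thread_gb):
    """Worst-case total memory when all threads peak at the same time."""
    return threads * per_thread_gb


def fits(threads, per_thread_gb, limit_gb=32.0):
    """True if the worst-case total stays within the memory limit."""
    return peak_memory_gb(threads, per_thread_gb) <= limit_gb


# 571-bit Mastrovito multiplier, 2.6 GB per thread (assumed constant in T).
for t in (5, 10, 20, 30):
    print(t, round(peak_memory_gb(t, 2.6), 1), fits(t, 2.6))
```

Under this assumption T = 5 and T = 10 stay below 32 GB while T = 20 and T = 30 do not, consistent with the outcome reported for Table I.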
We also tested Montgomery multipliers with bit-width varying from 32 to 283 bits; the results are shown in Table II. These experiments are different from those in [6]. In our work, we first flatten the Montgomery multipliers before applying our verification technique. That is, we assume that only the positions of the primary inputs and outputs are known, without knowledge of any high-level structure. In contrast, [6] verifies Montgomery multipliers that are represented with four hierarchical blocks. For 32- to 163-bit Montgomery multipliers, our approach provides on average a 9.2×, 15.9×, 16.6×, and 17.4× speedup for T = 5, 10, 20, and 30, respectively. Notice that [6] cannot verify the flattened Montgomery multipliers larger than 233 bits in 12 hours.
Comparing Tables I and II, we observe that the rewriting technique of our approach requires significantly more time when applied to Montgomery multipliers than to Mastrovito multipliers. The main reason for this difference is the internal architecture of the two multiplier types. Mastrovito multipliers are obtained directly from the standard multiplication structure, with the partial product generator followed by an XOR-tree structure, as in modular arithmetic. Since the algebraic model of XOR in GF arithmetic is linear, the size of the polynomial expressions generated during rewriting of this architecture is relatively small. In contrast, in a Montgomery multiplier the two inputs are first transformed into Montgomery form; the products of these Montgomery forms are called Montgomery products.
Since the polynomial expressions in Montgomery forms are much larger than partial products, the increase in size of intermediate expressions is unavoidable.
1) Dependence on P (x): In Table II , we observe that CPU runtime for verifying a 163-bit multiplier is greater than that of a 233-bit multiplier. This is because the computational complexity depends not only on the bit-width of the multiplier, but also on the irreducible polynomial P (x) used in constructing the multiplier.
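The dependence on $P(x)$ can be made concrete by counting how many partial products feed each output bit. The Python sketch below does this for the two GF($2^4$) polynomials of Figure 1; the function name `products_per_output` and the exponent-set encoding of the tail polynomial are assumptions made for this illustration, and logic sharing between outputs is ignored.

```python
def products_per_output(m, tail):
    """Count the partial products feeding each output bit z_0..z_(m-1).

    P(x) = x^m + sum(x^d for d in tail). Set s_k contributes all of its
    products to every output z_d with x^d in (x^k mod P).
    """
    counts = [0] * m
    for k in range(2 * m - 1):
        size = min(k, 2 * m - 2 - k) + 1       # number of products in s_k
        exps = {k}
        while exps and max(exps) >= m:          # reduce x^k mod P(x) over GF(2)
            top = max(exps)
            exps.remove(top)
            for d in tail:
                exps ^= {top - m + d}
        for d in exps:
            counts[d] += size
    return counts


for label, tail in (("x^4 + x^3 + 1", {3, 0}), ("x^4 + x + 1", {1, 0})):
    counts = products_per_output(4, tail)
    # An output with n products needs n - 1 two-input XORs to combine them.
    print(label, counts, "max XORs on one output:", max(counts) - 1)
```

The two polynomials give different per-output product counts for the same bit-width, so the number of XOR gates, and with it the amount of rewriting, changes with $P(x)$.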
We illustrate this fact using two GF($2^4$) multiplications implemented using two different irreducible polynomials (cf. Figure 1). For $P_1(x) = x^4 + x^3 + 1$, the longest logic paths, those of $z_3$ and $z_0$, include ten and seven products, respectively, that need to be combined using XORs. However, for $P_2(x) = x^4 + x + 1$, the two longest paths, those of $z_1$ and $z_2$, have only seven and six products. This means that the longest path of the GF($2^4$) multiplication requires 9 XOR operations using $P_1(x)$ and 6 XOR operations using $P_2(x)$. In other words, the gate-level implementation of the multiplier implemented using $P_1(x)$ has more gates than the one using $P_2(x)$. In conclusion, the irreducible polynomial $P(x)$ has a significant impact on both the design cost and the verification time of GF($2^m$) multipliers.
2) Runtime vs. Memory: In this section, we discuss the tradeoff between runtime and memory usage of our parallel approach to Galois field multiplier verification. The plots in Figure 7 show the average runtime and memory usage for different numbers of threads, over the set of multipliers shown in Tables I and II (32 to 283 bits). The vertical axis on the left is CPU runtime (in seconds), and the one on the right is memory usage (MB). The horizontal axis represents the number of threads T, ranging from 1 to 30. The runtime is significantly improved for T ranging from 5 to 15. However, there is not much speedup when T is greater than 20, most likely due to the memory management synchronization overhead between the threads. Similarly to the results for Mastrovito multipliers (Table I), our approach is limited here by the memory usage when the size of the multiplier and the number of threads T are large. In our work, T = 20 seems to be the best choice. Obviously, the best value of T varies for different platforms, depending on the number of cores and the memory.
We also analyzed the runtime complexity of our verification algorithm for a single-thread (T = 1) computation; it is shown in Figure 8. The y-axis shows the total runtime of rewriting the polynomial expressions, and the x-axis indicates the size of the Mastrovito multiplier. The result shows that the overall speedup is roughly the same for each value of T. Montgomery multipliers exhibit similar behavior, regardless of the choice of the irreducible polynomial.
3) Effect of Synthesis on Verification: In [10] the authors conclude that highly bit-optimized integer arithmetic circuits are harder to verify than their original, pre-synthesized netlists. This is because the efficiency of the rewriting technique relies on the amount of cancellations between the different terms of the polynomial, and such cancellations strongly depend on the order in which signals are rewritten. A good ordering of signals is difficult to achieve in highly bit-optimized synthesized circuits.
To see the effect of synthesis on parallel verification of GF circuits, we applied our approach to post-synthesized Galois field multipliers with operands up to 409 bits (571-bit multipliers could not be synthesized in a reasonable time). We synthesized Mastrovito and Montgomery multipliers using the ABC tool [41]. We repeatedly used the commands resyn2 and dch 3 until the number of AIG levels or nodes could not be reduced anymore. The synthesized multipliers were mapped using a 14nm technology library. The verification experiments shown in Table III were performed by our tool with T = 20 threads. Our tool was able to verify both 409-bit Mastrovito and Montgomery multipliers within just 13 minutes. We observed that, in our parallel approach, Galois field multipliers are easier to verify after optimization than in their original form. For example, the verification of a 283-bit Montgomery multiplier takes 15,300 seconds for T = 20. After optimization, the runtime dropped to just 169.2 seconds, which means that the verification is 90× faster than that of the original implementation. The memory usage was also reduced, from 488 MB to 194 MB. In summary, in contrast to the verification of integer multipliers [10], bit-level optimization actually reduces the complexity of the backward rewriting process. This is because extracting the function of an output bit of a GF multiplier depends only on the logic cone of that bit and does not require logic expressions from other bits to be simplified (cf. Theorem 3). Hence, the complexity of function extraction is naturally reduced if the logic cone is minimized.
B. Reverse Engineering of GF($2^m$) Multipliers
The reverse engineering technique presented in this paper was implemented in C++ in the framework described in Section V. It reverse engineers bit-blasted GF($2^m$) multipliers by analyzing the algebraic expressions of each element using the approach presented in Section IV. The program was tested on a number of gate-level GF($2^m$) multipliers with different irreducible polynomials, including Montgomery multipliers and Mastrovito multipliers. The multiplier generator, taken from [1], takes the bit-width and the irreducible polynomial as inputs and generates the multipliers in the equation format. The experimental results show that our technique can successfully reverse engineer various GF($2^m$) multipliers, regardless of the GF($2^m$) algorithm and the irreducible polynomial.
We set the number of threads to 16 for all the reverse engineering evaluations in this section. This is dictated by the fact that T = 16 gives the most promising performance (runtime) and scalability (memory usage) metrics on our platform, based on the analysis presented in Section VI-A2 (Figure 7). Our program takes the netlist/equations of the GF($2^m$) implementations and the number of threads as input. Hence, users can adjust the degree of parallelism to the limitations of their machines.
Typical designs that require reverse engineering are those that have been bit-optimized and mapped using a standard-cell library. Hence, we apply our technique to the bit-optimized Mastrovito and Montgomery multipliers (Table IV). For the purpose of our experiments, the multipliers are optimized and mapped using ABC [41]. Compared to the verification runtime of synthesized multipliers (Table III), the CPU time spent on analyzing the extracted expressions for reverse engineering is less than 10% of the extraction process.
This is because most computations of the reverse engineering approach are those for extracting the algebraic expressions, as presented in Section VI-A2, Table III. The reverse engineering approach has been further evaluated using four Mastrovito multipliers, each implemented with a different irreducible polynomial $P(x)$ in GF($2^{233}$). The polynomials are obtained from [43], and the multipliers are optimized using the ABC synthesis tool. The results are shown in Figure 9. We can see that the multipliers implemented with a trinomial $P(x)$ are much easier to reverse engineer than those based on a pentanomial $P(x)$. This is because the multipliers implemented with a pentanomial $P(x)$ contain more gates and have a longer critical path, since the reduction over a pentanomial requires more XOR operations. The CPU runtime for irreducible polynomials of the same class (trinomials or pentanomials) is almost the same. As discussed in Section III-B, comparison of the two trinomials shows that the efficient trinomial irreducible polynomial, $x^m + x^a + 1$, typically satisfies $m - a > m/2$. The results for designs synthesized with a 14nm technology library are shown in Figure 10. It shows that the area and delay of the Mastrovito multiplier implemented with $P(x) = x^{233} + x^{74} + 1$ are 5.7% and 7.4% less than for $P(x) = x^{233} + x^{159} + 1$, respectively.
VII. CONCLUSION
This paper presents a parallel approach to verification and reverse engineering of gate-level Galois field multipliers using a computer algebraic approach. It introduces a parallel rewriting method that can efficiently extract the functional specification of Galois field multipliers as polynomial expressions. We demonstrate that, compared to the best known algorithms, our approach tested on T = 30 threads provides on average 44× and 17× speedup in verification of Mastrovito and Montgomery multipliers, respectively.
We presented a novel approach that reverse engineers the gate-level Galois Field multipliers, in which the irreducible polynomial, as well as the bit position of the inputs and outputs are unknown. We demonstrated that our approach can efficiently reverse engineer the Galois Field multipliers implemented using different irreducible polynomials. Future work will focus on formal verification of prime field arithmetic circuits and complex cryptography circuits.
