Abstract-In this paper, a novel multiplier based on the Karatsuba-Ofman Algorithm (KOA) is presented. A binary field multiplication in polynomial basis is typically viewed as a two-step process: a polynomial multiplication followed by a modular reduction. This research proposes a modification to the original Karatsuba-Ofman Algorithm that integrates the modular reduction into the polynomial multiplication step. Modular reduction is achieved by using parallel linear feedback shift registers. The new algorithm is described in detail, and results from a hardware implementation on FPGA technology are discussed. The hardware architecture is described in VHDL and synthesized for a Virtex-6 device. Although the proposed field multiplier can be implemented for arbitrary finite fields, the targeted finite fields are those recommended for Elliptic Curve Cryptography. Compared with other KOA multipliers, the proposed multiplier uses 36% fewer area resources and improves the maximum delay by 10%.
I. INTRODUCTION
Nowadays, binary field arithmetic has achieved great importance thanks to applications such as cryptography and error-correcting codes. Several algorithms used by these applications are based on this kind of arithmetic [1]. Among them, one of the most relevant is Elliptic Curve Cryptography, which provides the same security levels as RSA but uses shorter keys, which is desirable for wireless and mobile environments.
Among binary field arithmetic operations, multiplication is one of the most expensive. Typically, a multiplication in $GF(2^m)$ is a two-step process: 1) a polynomial multiplication, and 2) a modular reduction. The Karatsuba-Ofman Algorithm (KOA) [2] performs the first step. Techniques such as Barrett reduction [3] or lazy reduction [4] can be used for the modular reduction. Improving multiplication performance is tackled in [5]-[18].
There are different algorithms to perform binary field multiplications, such as the Montgomery [19], the FFT [20] and the Cantor [21] multipliers. The Karatsuba-Ofman algorithm [2] was the first to achieve a complexity below $O(n^2)$ and, additionally, it is well suited for hardware implementation because its structure is highly parallel.
In this paper, a novel multiplier based on the original KOA algorithm that integrates the modular reduction step is introduced. Usually, the reduction step is performed independently and is not considered in the original KOA. Here, the reduction step is executed by parallel linear feedback shift registers. An analysis of the theoretical cost in terms of area and maximum delay is carried out for the proposed multiplier and for the classical KOA with a separate reduction step. This analysis considers finite fields defined by irreducible trinomials. The new KOA algorithm is implemented on FPGA technology, using VHDL for hardware description and the Xilinx ISE tools for implementation. Results in terms of area and time are presented for finite fields recommended for Elliptic Curve Cryptography. The proposed multiplier improves resource usage and processing time when compared to the KOA algorithm with Classical Reduction.
The rest of this document is organized as follows: Section II explains LFSRs, their use in binary field arithmetic, and their realization in hardware. Section III explains the KOA algorithm for multiplication in $GF(2^m)$, including the Classical Reduction step. Section IV describes the proposed modification of the KOA algorithm and presents the hardware architecture, providing a comparison with the classical approach. Details of the architecture implementation and results are discussed in Section V. Finally, conclusions of this research are drawn in Section VI.
An LFSR is an $n$-bit shift register that pseudo-randomly cycles through $2^n - 1$ states at high speed [22]. It requires minimal logic to generate binary sequences. After reaching all states, the output sequence is repeated cyclically.
An LFSR of length $n$ has $n$ memory cells, which together hold the initial state.
where $A(x)$ is normally represented as a bit vector containing all the coefficients that define its corresponding polynomial.
Thus, $xA(x)$ becomes a left-shift operation on the bit vector of $A(x)$, and the resulting polynomial is then reduced modulo $f(x)$.
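The shift-and-reduce step described above can be sketched in Python as follows. This is a minimal illustration, not the hardware design: the function name `lfsr_step`, the integer encoding (bit $i$ holds the coefficient of $x^i$) and the toy field $GF(2^4)$ with $f(x) = x^4 + x + 1$ are assumptions chosen only for the example.

```python
def lfsr_step(a: int, f: int, m: int) -> int:
    """Return x*A(x) mod f(x) for an element a of GF(2^m).

    Bit i of an integer holds the coefficient of x^i; f includes the x^m term.
    """
    a <<= 1                 # multiply by x: every coefficient moves up one place
    if (a >> m) & 1:        # a coefficient reached x^m, so feedback is needed
        a ^= f              # XOR with f(x), which also clears bit m
    return a


# Toy field GF(2^4) with f(x) = x^4 + x + 1 (illustrative choice only).
M, F = 4, 0b10011
state = 0b0001
seen = []
for _ in range(2**M - 1):
    seen.append(state)      # record each state before stepping
    state = lfsr_step(state, F, M)
```

With this primitive polynomial the register visits all $2^4 - 1$ nonzero states before returning to the initial one, which matches the cyclic behaviour described for LFSRs.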
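A multiplication of two field elements in $GF(2^m)$ results in another field element that is computed in two steps: 1) a polynomial multiplication, and 2) a modular reduction.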
There are several algorithms to compute the polynomial product; among them, the most widely known is the classical or Schoolbook method, which consists of a shift-and-add scheme. Most of the proposed field multiplication algorithms are based on this method, whose complexity is $O(n^2)$. In 1962, Karatsuba and Ofman [2] published a multiplication algorithm with $O(n^{\log_2 3})$ complexity. The KOA algorithm computes the first step of a field multiplication by using the divide-and-conquer technique. The multiplication is computed recursively using three multiplications with lower-order operands. KOA splits the multiplier and the multiplicand as shown in the following equation:
Thus, the following equations hold:
where,
At this point, $A(x) \cdot B(x)$ requires four multiplications with operands that are half the size of the initial ones. KOA can be applied recursively to compute these new multiplications, and it reduces the number of multiplications to three, at the cost of some additional additions, by redefining $z_1$ as shown in Equation (9).
In $GF(2^m)$, additions and subtractions are the same operation and are performed as bitwise XORs; thus, redefining $z_1$ has no substantial cost. The recursive Karatsuba-Ofman method for multiplying two polynomials is given in Algorithm 1.
The KOA algorithm receives as input the multiplier and the multiplicand, as well as their bit length $n$. In the first call, $n = m$. At each recursive call, the operands are divided, resulting in $n/2$-bit vectors. The recursion finishes when $n = 1$, returning as a result the bitwise AND of the two 1-bit operands.
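The recursion just described can be sketched in Python for operands whose length is a power of 2. This is an illustrative software model, not Algorithm 1 itself: the names `koa`, `z0`, `z1`, `z2` and the integer encoding (bit $i$ holds the coefficient of $x^i$) are assumptions, and `schoolbook` is included only as a shift-and-add reference for checking.

```python
def koa(a: int, b: int, n: int) -> int:
    """Multiply two n-bit GF(2) polynomials (n a power of 2), without reduction."""
    if n == 1:
        return a & b                      # 1-bit product is a single AND
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h         # split both operands into halves
    b_lo, b_hi = b & mask, b >> h
    z0 = koa(a_lo, b_lo, h)               # low product
    z2 = koa(a_hi, b_hi, h)               # high product
    # Middle term: one extra product replaces two, XOR acts as add and subtract.
    z1 = koa(a_lo ^ a_hi, b_lo ^ b_hi, h) ^ z0 ^ z2
    return (z2 << n) ^ (z1 << h) ^ z0


def schoolbook(a: int, b: int) -> int:
    """Shift-and-add reference multiplier, used only to check koa()."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r
```

For example, `koa(0b11, 0b11, 2)` returns `0b101`, since $(x + 1)^2 = x^2 + 1$ over $GF(2)$.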
Steps 7-8 in Algorithm 1 perform the recursive calls to KOA, and the three resulting polynomials are $(n-1)$-bit vectors. In step 9, the final product is calculated, resulting in a polynomial of order $2n-2$. When all recursive calls are finished, the final result is a $(2m-1)$-bit vector. This operation is depicted in Fig. 2. Up to this point, it is assumed that $m$ is a power of 2, but in many applications, such as cryptography, $m$ is not a power of 2. One strategy is to pad the bit-vector representation of the input operands with zeros until a power-of-2 length is reached, but with this strategy many gates remain unused. Thus, a modification to KOA called the Binary Karatsuba Multiplier (BKM) was proposed in [17]. More details on this technique are provided next.
For general irreducible polynomials $f(x)$, specialized reduction methods must be applied, such as the Barrett [23] or the Montgomery method [24].
For special classes of $f(x)$, such as trinomials and pentanomials, the reduction step of the KOA algorithm can be performed using a matrix of XOR gates [25]. This technique has been used in KOA hardware implementations [5], [16], [17]. The reduction technique is based on the fact that, for a trinomial $f(x) = x^m + x^a + 1$, the congruence $x^m \equiv x^a + 1 \pmod{f(x)}$ holds, which allows the product polynomial $C(x)$ to be expressed in the following way:
The last expression in Equation (10) states that $C(x) \bmod f(x)$ can be formulated as an $m$-bit vector that results from adding five terms obtained from $C(x)$, achieving the desired reduction.
Graphically, this reduction is shown in Fig. 3 .
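One way to arrive at five XOR-ed terms for a trinomial $f(x) = x^m + x^a + 1$ is sketched below in Python. The names `c_lo`, `c_hi`, `t_lo`, `t_hi` and the restriction $a \le (m+1)/2$ are assumptions of this sketch, and the grouping of terms may differ from the one used in Equation (10); `reduce_bitwise` is a plain bit-serial reference used only for checking.

```python
def reduce_trinomial(c: int, m: int, a: int) -> int:
    """Reduce a (2m-1)-bit product c modulo x^m + x^a + 1 by XOR-ing five terms."""
    assert 0 < a <= (m + 1) // 2          # assumed so that one folding pass suffices
    mask = (1 << m) - 1
    c_lo, c_hi = c & mask, c >> m         # c = c_hi * x^m + c_lo
    t = c_hi << a                         # c_hi * x^a, may overflow past x^(m-1)
    t_lo, t_hi = t & mask, t >> m         # overflow part t_hi is folded once more
    return c_lo ^ c_hi ^ t_lo ^ t_hi ^ (t_hi << a)


def reduce_bitwise(c: int, f: int, m: int) -> int:
    """Bit-serial reference reduction modulo f(x), used only to check the sketch."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)             # cancel the coefficient of x^i
    return c
```

Each of the five terms is a shifted or truncated copy of part of the product, so in hardware the whole reduction amounts to wiring plus XOR gates.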
B. Theoretical cost analysis for KOA with Classical Reduction
Let $S(m)$ be the area cost of a KOA hardware implementation for $m$-bit operands. If $m = 1$, the total cost is only one 1-bit AND gate. If $m > 1$, the total cost is given by three recursive KOA calls with half-size operands, $3S(m/2)$. In addition, the following XOR gates are also needed:
• Two $n/2$-bit XORs to add the two halves of each input operand.
• Two $(n-1)$-bit XORs to add three $(n-1)$-bit numbers (Algorithm 1, step 8).
• One $(n-1)$-bit XOR to concatenate the partial results (Algorithm 1, step 9).
The total number of XOR gates required is $4n - 3$. The reduction step cost is given by the number of XOR gates necessary to add the five terms of Equation (10), which is $2m + a$, where $a$ is the exponent of the second term of the irreducible polynomial $f(x)$.
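The total area cost of the KOA algorithm with the Classical Reduction technique for trinomials is given by the sum of the multiplier cost and the reduction cost above. Adding the five terms of Equation (10) requires an XOR tree three levels deep. Thus, the delay for the reduction step is $3T_X$, where $T_X$ denotes the delay of one XOR gate.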
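The area recurrence can be evaluated numerically with a short Python sketch. It assumes the operand length is a power of 2, takes the $4n - 3$ XOR gates per recursion level and the $2m + a$ XOR gates of the classical trinomial reduction from the text, and uses the hypothetical names `koa_gates` and `classical_reduction_xors`.

```python
def koa_gates(n: int) -> tuple[int, int]:
    """Return (AND gates, XOR gates) of a fully parallel KOA for n-bit operands.

    Assumes n is a power of 2: one AND gate at n = 1, otherwise three
    half-size multipliers plus 4n - 3 XOR gates to combine them.
    """
    if n == 1:
        return 1, 0
    ands, xors = koa_gates(n // 2)
    return 3 * ands, 3 * xors + 4 * n - 3


def classical_reduction_xors(m: int, a: int) -> int:
    """XOR gates of the classical reduction for the trinomial x^m + x^a + 1."""
    return 2 * m + a


ands, xors = koa_gates(8)                 # (27, 113)
total_xors = xors + classical_reduction_xors(8, 4)
```

The AND count grows as $3^{\log_2 n}$, which reflects the three recursive calls per level.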
The time complexity of the KOA algorithm is given by the recurrence in the following equation:
The proposed approach takes advantage of the modulo operation and integrates the modular reduction step within the KOA algorithm through Equation (13).
In the previous section, it was demonstrated that Equation (13) can be solved using LFSRs. Following this approach, two PLFSRs are required for the computation. BKM [17] considers $m$ as the sum of the largest power of 2 that is smaller than $m$ and the remaining bits. Then, instead of splitting the input polynomials into two equal-size bit vectors, both input polynomials are split according to the next equation:
and a reduction is also necessary.
Before analyzing subsequent recursive calls in the BKM strategy, it is observed that several PLFSRs are required, resulting in an expensive hardware architecture. Hence, a different strategy that reduces the number of PLFSRs is adopted. The proposed strategy is similar to that used in [7] and [12]. It consists of splitting the input bit vectors in half, using the ceiling function to ensure an integer result, since $n$ could be an odd number; see the next equation:
Algorithm 2 presents the proposed novel Karatsuba-Ofman algorithm based on LFSRs. It is worth noting that the result is already reduced, that is, it equals $C(x) \bmod f(x)$. Steps 4 and 5 use the splitting strategy explained before in Equation (15), whereas steps 6-8 perform the recursive calls.
Step 9 evaluates $n = m$, which is true only for the first call, where PLFSRs are used. For the rest of the calls, $n < m$ and the partial results are smaller than $m$ bits; therefore, a reduction is not needed.
[Algorithm 2: proposed Karatsuba-Ofman algorithm based on LFSR. Its input includes $n$, an integer smaller than or equal to $m$; step 9 tests whether $n = m$.]
In Fig. 5a, the first-call case using PLFSRs is shown. In Fig. 5b, the recursive-call case is drawn, where simple shifts are used instead of PLFSRs.
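A software model of this idea is sketched below in Python. It is only one reading of the scheme and not the authors' Algorithm 2 or their VHDL architecture: the recursive calls use the ceiling split and plain shifts, while the first call shifts the two upper partial results through a bit-serial model of an LFSR so that the output is already reduced. All names (`clmul_koa`, `shift_mod`, `koa_lfsr`, `mul_ref`) are hypothetical, and `mul_ref` is a reference used only for checking.

```python
def clmul_koa(a: int, b: int, n: int) -> int:
    """Unreduced Karatsuba product of two n-bit GF(2) polynomials, ceiling split."""
    if n == 1:
        return a & b
    h = (n + 1) // 2                      # ceil(n / 2), valid for odd n as well
    mask = (1 << h) - 1
    a_lo, a_hi, b_lo, b_hi = a & mask, a >> h, b & mask, b >> h
    z0 = clmul_koa(a_lo, b_lo, h)
    z2 = clmul_koa(a_hi, b_hi, h)
    z1 = clmul_koa(a_lo ^ a_hi, b_lo ^ b_hi, h) ^ z0 ^ z2
    return (z2 << (2 * h)) ^ (z1 << h) ^ z0   # plain shifts, no reduction


def shift_mod(v: int, k: int, f: int, m: int) -> int:
    """Multiply v (degree < m) by x^k modulo f(x), one LFSR step per shift."""
    for _ in range(k):
        v <<= 1
        if (v >> m) & 1:
            v ^= f
    return v


def koa_lfsr(a: int, b: int, f: int, m: int) -> int:
    """First call (n = m): combine partial products through LFSR-style shifts."""
    h = (m + 1) // 2
    mask = (1 << h) - 1
    a_lo, a_hi, b_lo, b_hi = a & mask, a >> h, b & mask, b >> h
    z0 = clmul_koa(a_lo, b_lo, h)         # each partial result has degree < m
    z2 = clmul_koa(a_hi, b_hi, h)
    z1 = clmul_koa(a_lo ^ a_hi, b_lo ^ b_hi, h) ^ z0 ^ z2
    return shift_mod(z2, 2 * h, f, m) ^ shift_mod(z1, h, f, m) ^ z0


def mul_ref(a: int, b: int, f: int, m: int) -> int:
    """Reference: shift-and-add product followed by bit-serial reduction."""
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a << i
    for i in range(2 * m - 2, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)
    return c
```

In this model two shifted terms need reduction in the first call, which is consistent with the two PLFSRs mentioned in the text; a parallel LFSR would perform the $k$ shift steps of `shift_mod` in combinational logic rather than in a loop.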
A. Theoretical cost analysis for KOA-LFSR multiplier
This novel approach leads to the following space and time complexity analysis. To simplify this analysis, only the special case of an even operand length is considered, that is:
In Table I, a theoretical cost comparison between the KOA algorithm with Classical Reduction and the proposed KOA-LFSR is presented for the trinomial case. It is observed that the proposed KOA-LFSR algorithm achieves a reduction in both the hardware cost and the time delay required to implement the multiplier on a hardware platform.
V. ARCHITECTURE IMPLEMENTATION AND RESULTS ANALYSIS
To validate the proposed modification of the KOA algorithm, a fully parallel Karatsuba-Ofman Multiplier has been designed, simulated and synthesized. Different fields with irreducible polynomials, considering both trinomials and pentanomials, are assessed; see Table II. Some of these polynomials define finite fields recommended by NIST for cryptographic applications [26], while the others are proposed by CERTICOM as a challenge. For comparative purposes, results for a fully parallel Binary Karatsuba Multiplier using the Classical Reduction are presented.
The proposed architecture was implemented using VHDL as the description language. For design validation, a C routine was created to generate test data vectors, and ModelSim PE Student Edition 10.1c was used as the simulation environment. For synthesis, Xilinx ISE 13.2 was used, targeting a Xilinx Virtex-6 (xc6vlx240t) device. In Fig. 6, the improvement of the proposed KOA-LFSR algorithm in time and area is presented in comparison with the BKM technique with Classical Reduction. The theoretical cost of the BKM technique with Classical Reduction is not provided.
These results not only confirm the theoretical improvement shown in Table I, but also demonstrate that the proposed multiplier helps the synthesis tool to optimize the usage of the FPGA's resources. The KOA-LFSR algorithm has a very regular structure, which the synthesis tool exploits to optimize the result. In Fig. 6, the area and time trends as the field size increases can be observed, showing a better performance for the proposed KOA-LFSR algorithm.
In Table III, the proposed multiplier is compared to different Karatsuba-Ofman multipliers using the same device. A direct comparison with other works is difficult because, to the best of our knowledge, other works do not consider the cost of the KOA multiplication and the reduction step together. Some authors only work on the polynomial multiplier; others focus on the reduction for general polynomials. Moreover, in order to compare different hardware architectures, the same FPGA devices should be considered, because it would not be fair to compare the area required on a 4-input LUT FPGA with that on a 6-input LUT FPGA.
In [5], a multiplier based on the BKM technique is presented. It truncates the recursion at a predefined number of bits and then uses a more efficient multiplier, based on the idea that for small operand sizes there are better multipliers than the KOA approach. That work reports the number of slices used and the time required. For the reduction step, the classical method explained in Section III.A is used.
In [9], a KOA-based multiplier with pipelining is presented. This multiplier truncates KOA's recursive calls after some steps, and thereafter the Classic Method is used. Pipeline registers are placed between every KOA recursive call. That design is assessed with several numbers of pipeline stages in order to find the best compromise between area and time, and the proposed approach is compared to its most similar experimental case. The modular reduction strategy used is not explicitly mentioned.
In [7], the authors perform a detailed analysis of several KOA-based multipliers implemented in FPGAs and ASICs. That work considers multipliers that are a mix of the KOA and the Classic algorithms. First, it analyzes both approaches separately and finds that the classic method is better for small fields. Then, it implements a KOA multiplier that truncates the recursive calls and executes the small multiplications with the classic method. The KOA multiplier used in that work uses a splitting strategy very similar to the one used in this research. In [7], experiments with several multipliers were carried out. In order to provide a fair comparison with the research presented herein, those approaches that do not consider the modular reduction but are closely related to this study were chosen. That work also presents place-and-route results; however, a direct comparison is not possible because the classic method implementation is carried out manually. Their results show the number of LUTs required by their design.
In [8], several combinations of parallel and sequential multipliers are provided. Results for a sequential 240-bit multiplier are presented; for comparison with the proposed KOA-LFSR approach, a 239-bit multiplier is selected.
In [10], the number of slices used by the architecture is reported; for comparison, the same parameter has been used. That paper explores different architectures of Karatsuba multipliers, some of which are fully parallel while others are a hybrid of parallel and sequential multipliers. The fastest (fully parallel) and the smallest architectures are shown. The reduction step is not considered in that research. Because $m$ is only considered as a power of 2, fields close to 128 and 256 are chosen for comparison. Fields of exactly 128 and 256 bits are not selected because, to the best of our knowledge, no irreducible polynomials are reported for these fields.
VI. CONCLUSIONS
In this paper, a novel multiplier called KOA-LFSR has been presented. The proposed approach is a modification of the original Karatsuba-Ofman algorithm (KOA) to perform modular multiplication in $GF(2^m)$. Contrary to the original Karatsuba-Ofman multiplier, which performs only the multiplication step, the KOA-LFSR performs both multiplication and modular reduction. An array of Linear Feedback Shift Registers connected in cascade is used to carry out the reduction; this array is computed during the KOA recursive calls. The proposed multiplier performs better than the original KOA with Classical Reduction, saving area resources and achieving better timing. It is important to note that the way of splitting the input operands is crucial for achieving optimal performance. The splitting of input operands shown in Equation (15) proved to be the best way to integrate the reduction step into the KOA algorithm. Because the LFSR is a regular and compact module, the synthesis tool mapped this module optimally, leading to a better usage of hardware resources. For future work, a hybrid multiplier will be tackled, where the recursive calls can be truncated at a specific size and simpler multipliers, such as the Schoolbook one or the multipliers embedded in the same FPGA device, would be used.
