In this paper, we first propose a fast division algorithm in GF(2^163) using standard basis representation, and then map it onto a divider for GF(2^163) with an iterative hardware structure. The proposed algorithm is based on the binary Extended GCD algorithm, and the arithmetic operations for modular reduction are performed within only one "while-statement", unlike the conventional approach, which uses two "while-statements". We use the reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1, which is recommended in SEC2 (Standards for Efficient Cryptography), with standard basis representation, where the degree is m = 163. We have also implemented the proposed iterative architecture on an FPGA using Verilog HDL; it operates at a clock frequency of 85 MHz on a Xilinx Virtex-II XC2V8000 FPGA device. The implementation results show that the computation speed of the proposed scheme is significantly higher than that of the two existing approaches.
Introduction

ECC (Elliptic Curve Cryptography) public-key cryptosystems have been widely used in wireless applications. In the ECC algorithm, the most time-consuming part is scalar multiplication, which is computed by point addition and point doubling operations. In either case, the most time-consuming operations are field multiplication and field inversion, while squaring and field addition take less computation time [1] [2] [3].
Several algorithms based on the Extended Euclidean algorithm have been introduced for computing the field inversion/division operation [2] [3] [4] [5] [6]. Although these algorithms can be easily implemented in software on a general-purpose computer, they are slow and inefficient for public-key cryptosystems that use a very large field [3, 4]. To resolve these problems, the first sublinear-time parallel algorithm using a polynomial number of processors was introduced by Kannan, Miller, and Rudolph [7], and a parallel extended GCD (Greatest Common Divisor) algorithm was presented that uses the concurrent-read concurrent-write (CRCW) parallel RAM (PRAM) model of computation [8].
The binary Extended GCD algorithm is known to be simple, but it is difficult to implement in hardware [2] [3] [4]. In [3], an efficient algorithm based on a modified version of Euclid's GCD algorithm is presented. Although this algorithm is suitable for implementing a GF divider with a systolic array structure, it is still time-consuming. Thus, a fast algorithm that performs the arithmetic operations in fewer clock cycles [9, 10] and is suitable for iterative hardware implementation is required.
In this paper, we propose the hardware implementation of an iterative divider based on a fast division algorithm in GF(2^163) using standard (polynomial) basis representation.
The proposed algorithm is based on the binary Extended GCD algorithm, and the arithmetic operations for modular reduction are performed in only one while-statement, unlike the conventional approach. The implementation results show that the computation speed of our approach is significantly higher than that of the conventional approaches [2, 4] owing to the reduced number of clock cycles.

This paper is organized as follows. Section 2 introduces the problems of the two conventional algorithms for performing field division based on the Extended GCD algorithm. Section 3 describes the proposed division algorithm and the fast iterative divider design for speeding up division in GF(2^163). Section 4 gives simulation results and a performance analysis based on the improved division algorithm. Finally, Section 5 concludes the paper.

A procedure for performing the inversion operation "1/B(x) mod G(x)" over GF(2^m), called the binary Extended GCD algorithm [2], is shown in (Fig. 1). This algorithm can be divided into three steps, labeled (1), (2), and (3). In step (1), to calculate "U/x mod G", the algorithm examines the LSB (Least Significant Bit) of U to determine whether U is even (u0 = 0) or odd (u0 ≠ 0). If it is even, the algorithm performs U/x; otherwise it performs (U+G)/x. In this algorithm, modular reduction is thus accomplished by a simple shift operation.

Now consider implementing this algorithm in an iterative hardware structure. In the first clock cycle, the initial parameters stored in four 163-bit registers are transferred to the outputs of the control block. Then, in the module for step (1), the modular reduction of the variable set (R, U) is performed depending on the control bits r0 and u0. Subsequently, the updated values are fed back to the same registers, and the control bit r0 is tested again in the module. The variable set is updated again if r0 = 0, which requires one clock cycle. As a result, the number of clock cycles used equals the number of iterations in the module (see Table 1). In the module for step (2), the modular reduction is performed likewise for the variable set (S, V), and at least one clock cycle is used in step (3) to obtain a division result. Note that U holds the division result P(x) = A(x)/B(x) mod G(x) if the initial value U = 1 is replaced by U = A(x) [4].
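The three-step procedure described above can be sketched in software as follows. Since (Fig. 1) is not reproduced here, this is only a reconstruction from the prose: the initial values (R, S, U, V) = (B, G, A, 0), the termination test S = 0, the integer comparison in step (3), and all function names are assumptions rather than the authors' exact formulation. Polynomials over GF(2) are stored as Python integers, with bit i holding the coefficient of x^i.

```python
def div_by_x(u: int, g: int) -> int:
    """Return U(x)/x mod G(x); bit i of an int is the coefficient of x^i."""
    if u & 1:        # u0 = 1: U is "odd", so add G (constant term 1) first
        u ^= g       # addition in GF(2) is XOR
    return u >> 1    # dividing by x is a one-bit right shift


def gf2m_divide_two_loops(a: int, b: int, g: int) -> int:
    """Compute A(x)/B(x) mod G(x) with two separate while-statements.

    Assumes G is irreducible with constant term 1, B != 0,
    and deg A, deg B < deg G.
    """
    if b == 0:
        raise ZeroDivisionError("B(x) must be nonzero")
    r, s, u, v = b, g, a, 0              # assumed initial register values
    while s != 0:                        # assumed termination test
        while (r & 1) == 0:              # step (1): reduce (R, U) while r0 = 0
            r, u = r >> 1, div_by_x(u, g)
        while (s & 1) == 0:              # step (2): reduce (S, V) while s0 = 0
            s, v = s >> 1, div_by_x(v, g)
        if s >= r:                       # step (3): compare S to R, then add
            s, v = s ^ r, v ^ u
        else:
            r, u = r ^ s, u ^ v
    return u                             # here R = 1, so U = A/B mod G


def gf2m_mul(a: int, b: int, g: int) -> int:
    """Shift-and-add product A(x)*B(x) mod G(x); used only to check results."""
    m = g.bit_length() - 1
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:                 # degree reached m: reduce by G
            a ^= g
    return p
```

As a small check in GF(2^4) with the (assumed) reduction polynomial G(x) = x^4 + x + 1, `gf2m_divide_two_loops(1, 0b0010, 0b10011)` returns `0b1001`, that is, 1/x = x^3 + 1, which agrees with x(x^3 + 1) = x^4 + x = 1 mod G(x).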
Related Works
<Table 1> shows an example of computing division in GF(2^4) based on the algorithm of (Fig. 1). [...] clocks are used in total, since one cycle is used in each of steps (1), (2), and (3). In the 2nd iteration, step (2) takes three cycles since the while-statement repeats three times. Thus, 5 clocks are used in total after the 2nd iteration. The algorithm terminates after 4 iterations, and 16 clocks are needed to obtain the final division result, U = x+1.
Proposed Algorithm and Fast Divider Design
Division algorithm
To speed up the division operation in GF(2^163), we present an improved division algorithm obtained by modifying the binary Extended GCD algorithm described in (Fig. 1) [2] without affecting its basic function. (Fig. 2) shows the proposed algorithm for performing fast division in GF(2^163).
Now, we reconsider the classical binary Extended GCD algorithm described in (Fig. 1). To perform the GCD operation, in step (1), A(x) and B(x) are computed depending on the control bits u0 and r0, respectively. Subsequently, G(x) and V are computed after checking the two conditions s0 and v0, respectively. Finally, both GCD (S, R) and GCD (V, U) are computed by comparing S to R in step (3).
For a hardware implementation of this algorithm, a considerable amount of processing time is needed, because in every iteration the final results are obtained in step (3) only after the conditions of the two while-statements in steps (1) and (2) have been checked.
In the proposed algorithm, only one while-statement (see step (1)), controlled by the two bits r0 and s0, is executed first, and two if-statements perform the modular reduction within it. Thus, the modular reduction for (R, U) is performed in the statement "if r0 == 0 then" and that for (S, V) in the statement "if s0 == 0 then", depending on the conditions of r0 and s0, respectively. If the proposed division algorithm is implemented in an iterative hardware structure, these two if-statements can each be built as an independent module, with both modules controlled by the same clock signal. As a result, processing is fast, since the input variables needed to obtain both GCD (S, R) and GCD (V, U) are calculated on the same clock signal.
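The single-while structure described above can be modeled in software along the following lines. This is a hedged sketch reconstructed from the prose, not a transcription of (Fig. 2): the initial values (R, S, U, V) = (B, G, A, 0), the termination test S = 0, and the function names are assumptions. Each pass of the inner loop stands for one clock in which both reduction branches may update their registers.

```python
# Reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 as a bit mask.
G163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def div_by_x(u: int, g: int) -> int:
    """Return U(x)/x mod G(x): add G first if the constant term is 1."""
    return ((u ^ g) if (u & 1) else u) >> 1


def gf2m_divide_one_loop(a: int, b: int, g: int = G163) -> int:
    """Compute A(x)/B(x) mod G(x) using a single while-statement for reduction.

    Assumes G is irreducible with constant term 1, B != 0,
    and deg A, deg B < deg G.
    """
    if b == 0:
        raise ZeroDivisionError("B(x) must be nonzero")
    r, s, u, v = b, g, a, 0                      # assumed initial values
    while s != 0:                                # assumed termination test
        # step (1): one while-statement controlled by r0 and s0
        while (r & 1) == 0 or (s & 1) == 0:
            if (r & 1) == 0:                     # "if r0 == 0 then": reduce (R, U)
                r, u = r >> 1, div_by_x(u, g)
            if (s & 1) == 0:                     # "if s0 == 0 then": reduce (S, V)
                s, v = s >> 1, div_by_x(v, g)
        # step (2): compare S to R, then add (XOR)
        if s >= r:
            s, v = s ^ r, v ^ u
        else:
            r, u = r ^ s, u ^ v
    return u


def gf2m_mul(a: int, b: int, g: int = G163) -> int:
    """Shift-and-add product mod G(x); used only to check the quotient."""
    m = g.bit_length() - 1
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= g
    return p
```

A quotient `q = gf2m_divide_one_loop(a, b)` can be checked by confirming that `gf2m_mul(q, b) == a`. The software model only mirrors the control flow; it says nothing about the clock counts achieved by the hardware.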
<Table 2> demonstrates the proposed algorithm of (Fig. 2). [...] (1) of (Fig. 2), respectively. These two modules are driven by one clock to execute the modular operation, whereas two clock cycles are required for this operation in the conventional approach [2].
Thus, the proposed architecture requires no more than 3m clock cycles (after m iterations) to yield the final division result. (Fig. 4) and (Fig. 5) illustrate detailed block diagrams of RU_Blk and SV_Blk shown in (Fig. 3), respectively.
These two reduction modules can be directly designed from step (1) of our algorithm using three operators of [...]. (Fig. 6) shows a detailed block diagram of GCD_Cal shown in (Fig. 3). It can also be derived directly from step (2) of our algorithm using four operators, XOR, MUX, INV, and CMP (Comparator), where INV (Inverter) is used for handling the else-statement described in step (2). This module mainly performs the arithmetic operations of comparison (CMP) and addition (XOR) using the four parameters calculated in the previous modules, and it outputs the results after a given number of clock cycles.
For hardware implementation, the previous division algorithm [2] needs more processing time because the reduction operation for the variable set (U, R) is first performed in the 1st while-statement of step (1), and the variable set (S, V) is then calculated in the 2nd while-statement of step (2).
It should be noted that the two modules performing modular reduction in the designed divider are controlled by the same common clock, whereas the conventional approach [2] requires two different clock signals for this modular reduction.
Implementation Results
In this paper, we use the reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 recommended in SEC2 (Standards for Efficient Cryptography) [11] using standard basis representation, where an irreducible binary polynomial of degree m = 163 is used.
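To make the choice of reduction polynomial concrete, the following minimal sketch encodes f(x) = x^163 + x^7 + x^6 + x^3 + 1 as a bit mask and runs a consistency check. The names `M`, `F`, `gf2m_mul`, and `frobenius_of_x` are hypothetical and not taken from the paper. The check relies on the fact that every element a of GF(2^m) satisfies a^(2^m) = a, so squaring x exactly m times modulo an irreducible f(x) of degree m must return x.

```python
M = 163
# Bit i holds the coefficient of x^i: x^163 + x^7 + x^6 + x^3 + 1.
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_mul(a: int, b: int, f: int = F) -> int:
    """Shift-and-add product A(x)*B(x) mod f(x) over GF(2)."""
    m = f.bit_length() - 1
    p = 0
    while b:
        if b & 1:
            p ^= a               # accumulate the current shifted copy of A
        b >>= 1
        a <<= 1                  # multiply A by x
        if (a >> m) & 1:         # degree reached m: subtract (XOR) f
            a ^= f
    return p


def frobenius_of_x(f: int = F, m: int = M) -> int:
    """Return x^(2^m) mod f(x), computed by m successive squarings of x."""
    t = 0b10                     # the polynomial x
    for _ in range(m):
        t = gf2m_mul(t, t, f)
    return t
```

With this encoding, `frobenius_of_x()` evaluates to `0b10` (the polynomial x), as expected for a field of 2^163 elements. This is a necessary property of the representation, not a substitute for the irreducibility result cited from SEC2.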
The proposed division algorithm was described in Verilog HDL at the behavioral level and has been successfully implemented on a Xilinx FPGA using the ISE 6.x tool. To verify the functionality of the designed divider, timing simulation was performed using the Xilinx simulator and Mentor Graphics ModelSim™. (Fig. 7) shows the timing simulation result of the designed divider for GF(2^163) using the Xilinx simulator.
To compare the performance of the conventional algorithms with that of the proposed algorithm, the set of input data used for [...] In (Fig. 7), a_x, b_x, g_x, and p_x represent A(x), B(x), G(x), and P(x), respectively, each of which has 163 bits. From the simulation result, we can see that [...] In <Table 7>, "Total delay" means the overall delay time required to obtain the final division result. In terms of total delay time, the proposed method improves on the two conventional methods [2, 4] by approximately 26% and 50% for GF(2^32), and by 24% and 44% for GF(2^64), respectively. This is because the number of clock cycles used is dramatically reduced compared with the two conventional approaches. The designed divider operates at a clock frequency of 85 MHz on a Xilinx Virtex-II XC2V8000 FPGA device.
Conclusion
