This paper proposes a new systolic array architecture to perform division over $GF(2^m)$ based on the modified Stein's algorithm. The systolic structure is extracted by applying a regular approach to the division algorithm. This approach starts by obtaining the dependency graph of the algorithm, assigns a time value to each node of the graph using a scheduling function, and ends by projecting several nodes of the graph onto each processing element to form the systolic array. The resulting structure reduces the number of flip-flops required to store the intermediate variables of the algorithm and therefore has a much lower total gate count than related designs. The analytical results show that the proposed design outperforms the related designs in terms of area (at least 32% reduction) and speed (at least 60% reduction in total computation time), and has the lowest AT complexity, with reductions ranging from 80% to 94%.
Introduction
Polynomial division over finite fields is extensively used in applications such as data coding, error detection, digital communication systems, cryptography, and signal processing. Among the arithmetic operations in finite fields, division is considered the most complicated and expensive. Therefore, many algorithms have been presented in the literature to perform field division, together with high-performance realizations in both software and hardware. These algorithms are based on several schemes, such as Fermat's little theorem [1, 2], the extended Euclid's algorithm (EEA) [3, 4, 5, 6, 7], and the extended Stein's algorithm (ESA) [8, 9, 10, 11, 12].
There are two hardware techniques used to implement field division algorithms. The first is the conventional technique based on lookup tables, which is efficient for VLSI implementation of field division over $GF(2^m)$ when the field size $m$ is small [3, 6]. As $m$ grows, this technique becomes impractical in VLSI implementations because of the increasing area overhead. The second is the systolic array technique, which is considered the efficient hardware technique for implementing division over $GF(2^m)$ for large values of $m$ [12, 13]. This is due to the distinguishing features of systolic architectures, such as regularity, modularity, concurrency, and local communication between processing elements, which make them well suited to high-performance VLSI system design.
In this paper, we propose a new systolic array architecture to perform modular division over $GF(2^m)$ based on the ESA scheme [12]. The architecture consists of a one-dimensional systolic array of processing elements (PEs) and has area and AT complexities of $O(m)$ and $O(m^2)$, respectively. The latency of the proposed design is $2m - 1$, which makes it outperform the previously reported designs in terms of speed.
The paper is organized as follows: Section 2 briefly discusses the adopted division algorithm over $GF(2^m)$ [12]. Section 3 explains the proposed systolic divider structure. Section 4 presents the complexities of the proposed design and compares it with the previously reported designs. Section 5 concludes this work.
Division over $GF(2^m)$
Let $G(x) = \sum_{i=0}^{m} g_i x^i$ be the irreducible polynomial used to define the finite field $GF(2^m)$, where $g_i \in \{0, 1\}$ for $0 < i < m$, and $g_m = g_0 = 1$. Any field element $Q \in GF(2^m)$ can be represented in polynomial form as $Q(x) = \sum_{i=0}^{m-1} q_i x^i$, where $q_i \in \{0, 1\}$ for $0 \le i < m$. The reciprocal polynomial $Q^{*}(x)$ of $Q(x)$ can be represented as $Q^{*}(x) = \sum_{i=0}^{m-1} q_i x^{m-i-1}$, where $q_i \in \{0, 1\}$ for $0 \le i < m$. In [12], the authors proposed Algorithm 1 to obtain $V(x) = A(x)/B(x) \bmod G(x)$. In this algorithm, $\Delta$ represents the upper bounds on $\deg(R)$ and $\deg(S)$ of the intermediate polynomials $R$ and $S$, where $\deg(\cdot)$ denotes the polynomial degree.
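The representation above can be illustrated with a minimal Python sketch in which bit $i$ of an integer holds the coefficient of $x^i$. The function names are hypothetical, and the division routine is only a software reference based on Fermat's little theorem (one of the schemes cited in the introduction); it is not Algorithm 1 of [12] and is meant solely for checking results on small fields.

```python
def gf2m_mul(a: int, b: int, g: int, m: int) -> int:
    """Multiply A(x) * B(x) mod G(x); bit i of each int is the coefficient of x^i."""
    result = 0
    for _ in range(m):
        if b & 1:              # add the current multiple of A(x) (addition is XOR)
            result ^= a
        b >>= 1
        a <<= 1                # multiply A(x) by x
        if (a >> m) & 1:       # degree reached m: reduce with G(x), whose bit m is 1
            a ^= g
    return result


def reciprocal(q: int, m: int) -> int:
    """Q*(x) = sum of q_i x^(m-i-1): reverse the order of the m coefficients."""
    r = 0
    for i in range(m):
        if (q >> i) & 1:
            r |= 1 << (m - i - 1)
    return r


def gf2m_div_reference(a: int, b: int, g: int, m: int) -> int:
    """Reference A(x)/B(x) mod G(x) using B^(2^m - 2) = B^(-1) (Fermat)."""
    if b == 0:
        raise ZeroDivisionError("division by the zero element")
    inverse, base, exponent = 1, b, (1 << m) - 2
    while exponent:            # square-and-multiply exponentiation
        if exponent & 1:
            inverse = gf2m_mul(inverse, base, g, m)
        base = gf2m_mul(base, base, g, m)
        exponent >>= 1
    return gf2m_mul(a, inverse, g, m)
```

For example, with $m = 3$ and $G(x) = x^3 + x + 1$ (encoded as `0b1011`), `gf2m_div_reference(1, 0b010, 0b1011, 3)` gives `0b101`, since $x \cdot (x^2 + 1) = x^3 + x \equiv 1 \pmod{G(x)}$.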
Algorithm 1 Finite field division algorithm [12].
1: Input: $A(x)$, $B(x)$, $G(x)$, and $\Delta$
2: Output: $V(x) = (A(x)/B(x)) \bmod G(x)$
3: Initialization: $R^{0}(x) \leftarrow B(x)$, $S^{0}(x) \leftarrow G(x)$, $U^{0}(x) \leftarrow A(x)$, $V^{0}(x) \leftarrow 0$, $P(x) \leftarrow G(x)$, $\Delta^{0} \leftarrow -1$
4: Algorithm:

For large values of the field size $m$, the counter used to compute the variable $\Delta$ will dominate the critical path delay (CPD) and reduce the divider speed to a large extent. To reduce the effect of the increasing CPD, $\Delta$ is replaced by two variables $h$ and $D$ such that [12]: $h$ denotes the sign of
This representation of Δ replaces the counter operation with a shift for the value of D and a bit inversion for the sign h as shown in Algorithm 2.
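The idea of trading a counter for a shift and a sign bit can be sketched as follows. The encoding used here is an assumption made for illustration: $h = 1$ when $\Delta < 0$ and $D = 2^{|\Delta|}$ (a one-hot word). It agrees with the initial values $\Delta^{0} = -1$, $h^{0} = 1$, $D^{0} = 2$ of Algorithms 1 and 2, but the exact update rules of [12] may differ, and the function names are hypothetical.

```python
def encode(delta: int):
    """Assumed encoding: h is the sign bit (1 if negative), D = 2**|delta|."""
    return (1 if delta < 0 else 0, 1 << abs(delta))


def decode(h: int, d: int) -> int:
    """Recover delta from (h, D); D is one-hot, so its bit position is |delta|."""
    magnitude = d.bit_length() - 1
    return -magnitude if h else magnitude


def increment(h: int, d: int):
    """delta + 1 without an adder: shift D and adjust the sign bit."""
    if h == 1:                 # delta < 0: |delta| shrinks, shift right
        d >>= 1
        if d == 1:             # reached delta = 0 (LSB of D set): clear the sign
            h = 0
        return h, d
    return 0, d << 1           # delta >= 0: |delta| grows, shift left


def decrement(h: int, d: int):
    """delta - 1 without an adder: shift D and adjust the sign bit."""
    if h == 0 and d > 1:       # delta > 0: |delta| shrinks, shift right
        return 0, d >> 1
    return 1, d << 1           # delta <= 0: |delta| grows, sign bit becomes 1
```

Under this assumed encoding, testing $\Delta = 0$ reduces to inspecting the LSB of $D$, so no comparator on a multi-bit counter is needed.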
In Algorithm 2, variables $R^{i}(x)$, $S^{i}(x)$, $U^{i}(x)$, and $V^{i}(x)$ represent the computed values of polynomials $R(x)$, $S(x)$, $U(x)$, and $V(x)$, respectively, after iteration $i$. We can express the recurrence equations for updating $R^{i}(x)$, $U^{i}(x)$, $S^{i}(x)$, and $V^{i}(x)$ in bit-level form as:
Algorithm 2 Modified finite field division algorithm [12].
1: Input: $A(x)$, $B(x)$, $G(x)$, $h$, $D$
2: Output: $V(x) = (A(x)/B(x)) \bmod G(x)$
3: Initialization: $R^{0}(x) \leftarrow B(x)$, $S^{0}(x) \leftarrow G(x)$, $U^{0}(x) \leftarrow A(x)$, $V^{0}(x) \leftarrow 0$, $P(x) \leftarrow G(x)$, $h^{0} \leftarrow 1$, $D^{0} \leftarrow 2$
4: Algorithm:
$S^{i}(x) \leftarrow R^{i-1}(x)$
10:
end if
36: end if
37: end for

where,
Converting variable $D^{i}$ to bit-level form would increase the complexity of the hardware design, so we preferred to keep it in word-level form. It can be updated along with $h^{i}$ using the following recurrence equations:
where,
The subscript $j$ used in the previous equations refers to the $j$th coefficient bit of polynomials $R(x)$, $U(x)$, $S(x)$, $V(x)$, and $P(x)$, where $0 \le j \le m$. $u^{i-1}_{0}$, $r^{i-1}_{0}$, $v^{i-1}_{0}$, and $d^{i-1}_{0}$ indicate the least significant bits (LSBs) of $U(x)$, $R(x)$, $V(x)$, and $D$, respectively, at iteration $i$.
Proposed systolic array design of the division algorithm
The proposed systolic array design is obtained using an approach previously explained by the second author in [14]. This approach starts by obtaining the dependency graph (DG) of the algorithm and assigning a time value to each node in the DG using a scheduling function, as explained in [14, 15, 16, 17, 18, 20]. The approach ends by projecting several nodes of the DG onto each processing element (PE) to form the systolic array [14, 17, 19, 20, 21, 22].
We can extract the DG from Equations (1)-(10) for $m = 5$ as shown in Fig. 1. The equations have two indices $(i, j)$, so the computation domain is the convex hull in two-dimensional (2D) space, and the circles in this domain define the operations. The inputs of the DG are $s^{0}_{j}$, $r^{0}_{j}$, $u^{0}_{j}$, $v^{0}_{j}$, $p^{0}_{j}$, $D^{0}$, and $h^{0}$. The nodes in the rightmost column (gray nodes) compute the control signals $c3^{i}$, $c4^{i}$, and $c5^{i}$. These signals are broadcast to all nodes in the same row $i$. The red (slanted) lines denote the updated bits $u^{i}_{j-1}$ and $r^{i}_{j-1}$. The bit $p^{i-1}_{j}$ is assigned to all nodes in column $j$. The output bits at the top of the DG are the final bits of the variable $V$, $v^{2m-1}_{j}$.
The systolic array architecture can be extracted by choosing the scheduling vector $S = [1\ 0]$ and the projection vector $P = [0\ 1]^{T}$ for the DG. These vectors are obtained by applying the approach previously reported in [14, 17, 20, 21, 22]. Fig. 2 shows the resulting systolic array after applying these mapping vectors to the DG. It consists of $m + 1$ PEs. $PE_0$ produces the control signals $c3^{i}$, $c4^{i}$, and $c5^{i}$ in each iteration $i$. It also updates the variables $D^{i}$ and $h^{i}$ and the LSB of $V^{i}$, $v^{i}_{0}$. The bits $s^{i}_{0}$ and $r^{i}_{0}$ are not processed in $PE_0$ because $s^{i}_{0}$ is always equal to '1' and $r^{i}_{0}$ is shifted out [9]. $PE_j$ updates $r^{i}_{j-1}$, $u^{i}_{j-1}$, $s^{i}_{j}$, and $v^{i}_{j}$ based on Equations (1)-(4). $PE_m$ is a simplified form of $PE_j$ and computes $u^{i}$ … Fig. 3(a) to set the shift direction (right/left). The shifter hardware details are shown in Fig. 3(b).
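The effect of the two mapping vectors can be sketched in a few lines of Python. The sketch assumes a DG with rows $i = 1, \dots, 2m - 1$ and columns $j = 0, \dots, m$ (inferred from the final outputs $v^{2m-1}_{j}$ and the $m + 1$ PEs); the function name and the dictionary layout are hypothetical.

```python
def map_dependency_graph(m: int) -> dict:
    """Map each DG node (i, j) to (time step, PE index) with S = [1 0], P = [0 1]^T."""
    schedule = (1, 0)      # scheduling vector S: time depends only on the row i
    projection = (0, 1)    # projection vector P: PE index depends only on the column j
    mapping = {}
    for i in range(1, 2 * m):          # assumed iteration range 1 .. 2m - 1
        for j in range(0, m + 1):      # assumed column range 0 .. m
            time_step = schedule[0] * i + schedule[1] * j
            pe_index = projection[0] * i + projection[1] * j
            mapping[(i, j)] = (time_step, pe_index)
    return mapping


mapping = map_dependency_graph(5)
num_pes = len({pe for _, pe in mapping.values()})
num_steps = len({t for t, _ in mapping.values()})
print(num_pes, num_steps)
```

With $m = 5$, all nodes of a row execute in the same time step and all nodes of a column land on the same PE, which gives $m + 1 = 6$ PEs and $2m - 1 = 9$ time steps; no two nodes share both a time step and a PE, so the mapping is conflict-free.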
The operation of the systolic array can be summarized as follows:
1. At time $n = 1$, the external control signal $c2$ is set to '1' to pass the initial inputs to the D-FFs and registers located inside the PEs.
2. At time $n > 1$, the control signals $c3^{i}$, $c4^{i}$, and $c5^{i}$ are produced inside $PE_0$ and transmitted to the remaining PEs. Also, the control signal $c2$ is set to '0' to latch the updated values of the intermediate variables.
3. At the last time step, $n = 2m - 1$, the systolic array produces the output bits of $V$, $v_0$ and $v_j$, from $PE_0$ and $PE_j$, $1 \le j \le m - 1$, respectively.
Table I compares the proposed systolic design with the related work [5, 10, 12] in terms of area requirements and delay. In this table, $T_A$, $T_X$, and $T_{MUX}$ denote the delays of a 2-input AND gate, a 2-input XOR gate, and a 2-input multiplexer, respectively. The total gate count is estimated in terms of 2-input NAND gates based on the NanGate (15 nm, 0.8 V) Open Cell Library, which is based on the NCSU FreePDK15 process kit. The area and delay information of the standard cells is given in Table II. We notice from Table I that the area requirements and latency of the proposed design are lower than those of the compared systolic designs. The systolic arrays presented in [5, 10] consist of $2m$ and $2m - 1$ PEs, respectively. Besides the large number of PEs used in each systolic array, each PE consists mainly of two parts: a datapath part and a control part. The control part is repeated in each PE, which increases the area complexity of each PE and hence the total area complexity of the systolic arrays. Also, more flip-flops (FFs) are added inside the PEs to reduce the critical path delay, which increases both the area complexity and the latency of these designs. The systolic array presented in [12] consists of $m + 1$ PEs, like the proposed systolic array, but it employs a pipeline interleaving technique to improve the utilization of the PEs.
This considerably increases the number of FFs used in the systolic array and hence increases the latency of the resulting systolic array to $5m - 2$. For this reason, and because of the longer critical path delay of that design, its area and time complexities exceed those of the proposed one. The total gate count (area) of the proposed design is lower than that of the other compared designs because fewer FFs are required to store the intermediate variables. The reduced number of FFs also reduces the latency of the proposed design to $2m - 1$. The lower critical path delay, together with the reduced latency, makes the time complexity of the proposed design lower than that of the compared designs.
Performance analysis and comparison
To quantify the analytical results obtained in Table I, Table III shows the estimated values of total gate count ($A$), latency ($L$), critical path delay (CPD) or clock period, total computation time ($T$), and AT complexity for $m = 233$. The AT complexity is defined as the area ($A$) multiplied by the total computation time ($T$). The clock period is the sum of the critical path delay, the setup delay ($T_{setup}$), and the propagation delay ($T_{Clk \to Q}$) of the D-FF. From Table III, we notice that the proposed design outperforms the other compared designs in terms of area (at least 32% reduction) and speed (at least 60% reduction in total computation time), and has the lowest AT complexity, with reductions ranging from 80% to 94%.
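The metrics defined above can be written out as a short Python sketch. The latency expressions $2m - 1$ (proposed design) and $5m - 2$ (design of [12]) come from the text; the function names are hypothetical, and no gate counts or delay values from Tables I to III are reproduced here, so any numbers passed to the area and delay parameters are the caller's own.

```python
def latency_proposed(m: int) -> int:
    """Latency of the proposed design in clock cycles: 2m - 1."""
    return 2 * m - 1


def latency_interleaved(m: int) -> int:
    """Latency of the pipeline-interleaved design of [12]: 5m - 2."""
    return 5 * m - 2


def clock_period(cpd: float, t_setup: float, t_clk_to_q: float) -> float:
    """Clock period = critical path delay + D-FF setup delay + D-FF clock-to-Q delay."""
    return cpd + t_setup + t_clk_to_q


def total_time(latency: int, period: float) -> float:
    """Total computation time T = latency (cycles) x clock period."""
    return latency * period


def at_complexity(area: float, time: float) -> float:
    """AT complexity = area (gate count) x total computation time."""
    return area * time


def reduction_percent(ours: float, theirs: float) -> float:
    """Relative reduction of 'ours' with respect to 'theirs', in percent."""
    return 100.0 * (theirs - ours) / theirs


print(latency_proposed(233), latency_interleaved(233))
```

For $m = 233$, the two latency expressions evaluate to 465 and 1163 clock cycles, respectively; the reported percentage reductions additionally depend on the clock periods and gate counts listed in the tables.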
Conclusion and future work
This article explained the details of a new systolic array architecture for the extended Stein's division algorithm over $GF(2^m)$. We used a previously reported systematic approach to extract the proposed design. The approach starts by obtaining the dependency graph of the algorithm and assigning a time value to each node in the DG using a scheduling function, and ends by projecting several nodes of the DG onto each processing element to form the systolic array. The resulting structure requires fewer flip-flops to store the intermediate variables of the algorithm and hence achieves a significant reduction in total gate count compared to the related designs. The analytical results showed that the proposed design outperforms the previously reported related designs in terms of area and speed. As future work, the proposed design and the other related designs will be described in VHDL and implemented on FPGA and ASIC platforms to obtain accurate implementation results. Because of the lower area and delay achieved by the proposed design, we expect its total power and energy consumption to be lower than those of the previously reported related designs. This would make the proposed design more suitable for resource-constrained applications with tight restrictions on area and power.
