Abstract-This paper describes an architecture for the Fast Fourier Transform over the Galois field $GF(2^4)$, based on cyclotomic decomposition. Cyclotomic Fast Fourier Transforms (CFFTs) are preferred because of their low multiplicative complexity. The approach decomposes an arbitrary polynomial into a sum of linearized polynomials. In addition, a Common Subexpression Elimination (CSE) algorithm is used to reduce the additive complexity of the architecture, yielding a design with reduced operational complexity.
I. INTRODUCTION
Fourier analysis converts a signal from its original domain into the frequency domain and vice versa. The Fast Fourier Transform (FFT) algorithm designed for the complex field is not well suited to finite fields. The FFT in the complex field has applications throughout signal processing [3], whereas the FFT over finite fields is widely used in cryptography and in error-correcting codes. This paper presents a method for the Fast Fourier Transform over a finite field (i.e., a Galois field) [2] [3], together with its architecture. A Galois field is a field that contains a finite number of elements.
The method decomposes the original polynomial into a sum of linearized polynomials and evaluates them at a set of basis points [2]. The architecture designed in this paper is for $GF(2^4)$. The Cyclotomic Fast Fourier Transform (CFFT) is useful in Reed-Solomon (RS) decoders, where it reduces the complexity of the decoder [4], because Reed-Solomon codes are cyclic [15]. The CFFT proposed in [2] has low multiplicative complexity but high additive complexity. The FFT described in this paper can be used to perform RS decoding, which involves two time-consuming steps: syndrome computation and Chien search. Chien search is a fast algorithm for determining the roots of polynomials defined over a finite field. RS codes can correct random errors and multiple burst errors. The architecture can also be used to implement the Gao algorithm [14], which includes operations based on the Fourier transform.
The design of the architecture follows several steps, which are explained in a simplified manner in this paper. The architecture is designed in four stages and is then modified by applying the Common Subexpression Elimination (CSE) algorithm, which reduces its additive complexity. The design is written in Verilog and implemented in the Xilinx ISE Design Suite.
The paper proceeds as follows. Section II covers the basic notions and definitions of the Fourier transform and the method to determine cyclotomic cosets, along with the basic theory of Galois fields. Section III focuses on linearized polynomials and the generation of the matrix. Section IV describes the hardware architecture. Section V explains the Common Subexpression Elimination algorithm, and Section VI illustrates the architecture after applying CSE. Section VII compares the two architectures, and Section VIII concludes the paper.
II. DEFINITIONS
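2.1. The Fourier transform of a polynomial $f(x)$ over $GF(2^m)$ is the collection of elements $F_j$, $j \in [0, n-1]$. The polynomial is written as [2] [10]:

$$f(x) = \sum_{i=0}^{n-1} f_i x^i \qquad (1)$$

where $\deg f(x) = n-1$ and $n \mid (2^m - 1)$.

2.2. The elements can be evaluated through:

$$F_j = f(\alpha^j) = \sum_{i=0}^{n-1} f_i \alpha^{ij} \qquad (2)$$

Here, $j \in [0, n-1]$ and $\alpha$ is an element of order $n$ in $GF(2^m)$.

2.3. The cyclotomic coset modulo $n$ over $GF(2)$ that contains $k$ is:

$$C_k = \{k,\ 2k,\ 2^2 k,\ \dots,\ 2^{m_k - 1} k\} \bmod n \qquad (3)$$

where $m_k$ is the smallest positive integer such that $2^{m_k} k \equiv k \pmod{n}$.

2.4. A linearized polynomial over $GF(2^m)$ is a polynomial represented as:

$$L(y) = \sum_{i} L_i\, y^{2^i}, \qquad L_i \in GF(2^m) \qquad (4)$$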
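The property that makes linearized polynomials useful here is their linearity over $GF(2)$. The short derivation below is a sketch of that standard fact; the symbols $a$, $b$, $a_s$ and $\beta_s$ are introduced only for illustration and are not taken from the original text.

```latex
% Linearized polynomial over GF(2^m):
L(y) = \sum_{i} L_i\, y^{2^i}, \qquad L_i \in GF(2^m)

% In characteristic 2, (a + b)^{2^i} = a^{2^i} + b^{2^i}, hence
L(a + b) = \sum_{i} L_i\,(a + b)^{2^i}
         = \sum_{i} L_i\, a^{2^i} + \sum_{i} L_i\, b^{2^i}
         = L(a) + L(b)

% Consequently, if y = \sum_{s} a_s \beta_s with a_s \in GF(2), then
L(y) = \sum_{s} a_s\, L(\beta_s)
```

So once $L$ has been evaluated at the points of a basis, its value at any other field element follows using additions only.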
One useful representation of the elements of a Galois field is the m-tuple representation. Let $a_0 + a_1\alpha + a_2\alpha^2 + \dots + a_{m-1}\alpha^{m-1}$ be the polynomial representation of a field element $\beta$. Then $\beta$ can be represented by an ordered sequence of m components, called an m-tuple, as follows [7]:

$$(a_0, a_1, a_2, \dots, a_{m-1})$$

where the m components are simply the m coefficients of the polynomial representation of $\beta$.
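As an illustration of the m-tuple representation, the following minimal Python sketch enumerates the 4-tuples of the nonzero elements of $GF(2^4)$, assuming the primitive polynomial $p(X) = X^4 + X + 1$ used in this paper. The names `power_table`, `as_tuple` and `EXP` are hypothetical helpers, not part of the original design.

```python
# Sketch: 4-tuples (a0, a1, a2, a3) of alpha^i in GF(2^4).
# Assumption: the field is generated by p(X) = X^4 + X + 1.
M = 4
POLY = 0b10011  # X^4 + X + 1; bit k holds the coefficient of X^k


def power_table(m=M, poly=POLY):
    """Return [alpha^0, ..., alpha^(2^m - 2)] as integers (bit k = a_k)."""
    table = []
    x = 1
    for _ in range((1 << m) - 1):
        table.append(x)
        x <<= 1               # multiply by alpha
        if x & (1 << m):      # degree reached m: reduce modulo p(X)
            x ^= poly
    return table


def as_tuple(x, m=M):
    """Convert an integer to the m-tuple (a0, a1, ..., a_{m-1})."""
    return tuple((x >> k) & 1 for k in range(m))


EXP = power_table()
for i, x in enumerate(EXP):
    print(f"alpha^{i:<2} -> {as_tuple(x)}")
```

For example, $\alpha^4 = 1 + \alpha$ appears as the tuple (1, 1, 0, 0).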
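For $GF(2^4)$, the 4-tuple representation generated by $p(X) = X^4 + X + 1$ is given in Table 1.
The elements of $GF(2^m)$ form all the roots of $X^{2^m} + X$. For an element $\beta$ of $GF(2^m)$, let $\phi(X)$ be the polynomial of the smallest degree over $GF(2)$ such that $\phi(\beta) = 0$. This polynomial $\phi(X)$ is called the minimal polynomial of $\beta$, and it must be irreducible. The minimal polynomials of the elements in $GF(2^4)$ generated by $p(X) = X^4 + X + 1$ are:
$0$: $X$
$1$: $X + 1$
$\alpha, \alpha^2, \alpha^4, \alpha^8$: $X^4 + X + 1$
$\alpha^3, \alpha^6, \alpha^9, \alpha^{12}$: $X^4 + X^3 + X^2 + X + 1$
$\alpha^5, \alpha^{10}$: $X^2 + X + 1$
$\alpha^7, \alpha^{11}, \alpha^{13}, \alpha^{14}$: $X^4 + X^3 + 1$
III. CYCLOTOMIC FAST FOURIER TRANSFORM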
Based on formula (3) in Section II, the cyclotomic cosets modulo $n = 2^m - 1 = 15$ for $GF(2^4)$ (i.e., $m = 4$) are as follows:
$C_0 = \{0\}$
$C_1 = C_2 = C_4 = C_8 = \{1, 2, 4, 8\}$
$C_3 = C_6 = C_9 = C_{12} = \{3, 6, 9, 12\}$
$C_5 = C_{10} = \{5, 10\}$
$C_7 = C_{11} = C_{13} = C_{14} = \{7, 11, 13, 14\}$
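The cosets listed above can be reproduced with a few lines of Python. This is only a sketch: the function name `cyclotomic_cosets` and its parameters are illustrative assumptions. Each coset is generated by repeatedly doubling its leader modulo n.

```python
def cyclotomic_cosets(n, q=2):
    """Partition {0, ..., n-1} into cosets {k, k*q, k*q^2, ...} mod n.

    Returns a dict mapping each coset leader (its smallest element)
    to the coset, listed in the order k, k*q, k*q^2, ...
    """
    seen = set()
    cosets = {}
    for k in range(n):
        if k in seen:
            continue
        coset = []
        j = k
        while j not in coset:
            coset.append(j)
            j = (j * q) % n
        cosets[k] = coset
        seen.update(coset)
    return cosets


for leader, coset in cyclotomic_cosets(15).items():
    print(f"C_{leader} = {coset}")
```

For n = 15 this yields five cosets with leaders 0, 1, 3, 5 and 7, of sizes 1, 4, 4, 2 and 4.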
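An irreducible polynomial $p(X)$ of degree m is said to be primitive if the smallest positive integer n for which $p(X)$ divides $X^n + 1$ is $n = 2^m - 1$. The polynomial $p(X) = X^4 + X + 1$ divides $X^{15} + 1$ but does not divide any $X^n + 1$ for $1 \le n < 15$. Hence, $X^4 + X + 1$ is a primitive polynomial for $GF(2^4)$. Let $\alpha$ be a root of this polynomial. Then $f(\alpha^j)$ can be developed using:

$$f(\alpha^j) = \sum_{i=0}^{l} L_i(\alpha^{j k_i}) = \sum_{i=0}^{l} \sum_{s=0}^{m_i - 1} a_{ijs}\, L_i(\beta_{i,s}), \qquad j \in [0, n-1]$$

where $k_i$ is the leader of the $i$-th cyclotomic coset, $m_i$ is its size, $L_i(y) = \sum_{s=0}^{m_i - 1} f_{k_i 2^s \bmod n}\, y^{2^s}$, and $\alpha^{j k_i} = \sum_{s} a_{ijs} \beta_{i,s}$ with $a_{ijs} \in GF(2)$ for a basis $(\beta_{i,0}, \dots, \beta_{i,m_i-1})$ of $GF(2^{m_i})$.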
These binary coefficients $a_{ijs}$ are used to form the matrix A. For $GF(2^4)$, the basis for $C_1$, $C_3$ and $C_7$ is the normal basis $(\beta, \beta^2, \beta^4, \beta^8)$, where $\beta = \alpha^3$ and $\alpha$ is a primitive element of $GF(2^4)$. The rest of the equations are developed in the same way. A four-point cyclic convolution [6] is obtained for cosets $C_1$, $C_3$ and $C_7$, whereas the same interpretation for coset $C_5$ gives a two-point cyclic convolution.
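The decomposition into linearized polynomials can be checked numerically. The sketch below assumes $GF(2^4)$ with $p(X) = X^4 + X + 1$ and $n = 15$, and compares a direct evaluation of $F_j = f(\alpha^j)$ with the sum of one linearized polynomial per cyclotomic coset, $L_k(y) = \sum_s f_{k 2^s \bmod n}\, y^{2^s}$. All function names are hypothetical. The sketch verifies only the decomposition identity and does not model the normal-basis or cyclic-convolution stages.

```python
import random

M, POLY, N = 4, 0b10011, 15   # GF(2^4), p(X) = X^4 + X + 1, n = 2^4 - 1

# Antilog / log tables for the primitive element alpha (a root of p(X)).
EXP, x = [], 1
for _ in range(N):
    EXP.append(x)
    x <<= 1
    if x & (1 << M):
        x ^= POLY
LOG = {v: i for i, v in enumerate(EXP)}


def gf_mul(a, b):
    """Multiply two field elements through the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % N]


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    if e == 0:
        return 1
    if a == 0:
        return 0
    return EXP[(LOG[a] * e) % N]


def direct_dft(f):
    """F_j = sum_i f_i * alpha^(i*j); addition in GF(2^m) is XOR."""
    out = []
    for j in range(N):
        acc = 0
        for i, fi in enumerate(f):
            acc ^= gf_mul(fi, EXP[(i * j) % N])
        out.append(acc)
    return out


# Coset leaders and cosets modulo 15, ordered as k, 2k, 4k, 8k (mod 15).
COSETS = {0: [0], 1: [1, 2, 4, 8], 3: [3, 6, 12, 9],
          5: [5, 10], 7: [7, 14, 13, 11]}


def linearized(f, k, y):
    """L_k(y) = sum_s f_{k*2^s mod n} * y^(2^s) over the coset of leader k."""
    acc = 0
    for s, idx in enumerate(COSETS[k]):
        acc ^= gf_mul(f[idx], gf_pow(y, 1 << s))
    return acc


def cyclotomic_dft(f):
    """F_j = sum over coset leaders k of L_k(alpha^(j*k))."""
    out = []
    for j in range(N):
        acc = 0
        for k in COSETS:
            acc ^= linearized(f, k, EXP[(j * k) % N])
        out.append(acc)
    return out


random.seed(0)
f = [random.randrange(16) for _ in range(N)]
print(direct_dft(f) == cyclotomic_dft(f))
```

The two results agree because $(\alpha^{jk})^{2^s} = \alpha^{j (k 2^s \bmod 15)}$, so each term $f_i \alpha^{ij}$ of the direct sum appears exactly once in the coset-wise sum.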
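The complete architecture can be computed as [2]:

$$F = AQ\,(C \cdot (Pf)) \qquad (6)$$

where $f$ is the vector of polynomial coefficients, $F$ is the vector of transform values, Q is the binary block diagonal matrix of combined post-additions, C is the combined vector of constants, P is the binary block diagonal matrix of combined pre-additions, and $\cdot$ denotes componentwise multiplication.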
IV. HARDWARE ARCHITECTURE
The design of the FFT architecture starts with the design of the GF multiplier, which is used for the multiplication of polynomials [7]. The GF multiplier design can be represented as:
[Figure: GF multiplier design]
The above design requires 16 AND gates and 8 XOR gates.
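A bit-level software model can be useful for checking a gate-level multiplier against a reference. The following Python sketch assumes $GF(2^4)$ with $p(X) = X^4 + X + 1$ and forms the 16 AND partial products before reducing with $X^4 = X + 1$, $X^5 = X^2 + X$ and $X^6 = X^3 + X^2$. It is a straightforward model written for illustration, not the paper's Verilog design, so its XOR count is not meant to match the figure quoted above. All function names are hypothetical.

```python
def gf16_mul_bits(a, b):
    """Multiply two GF(2^4) elements given as 4-tuples (a0, a1, a2, a3)."""
    # 16 AND terms a_i & b_j, accumulated (XOR) by degree i + j.
    d = [0] * 7
    for i in range(4):
        for j in range(4):
            d[i + j] ^= a[i] & b[j]
    # Reduction: X^4 = X + 1, X^5 = X^2 + X, X^6 = X^3 + X^2.
    c0 = d[0] ^ d[4]
    c1 = d[1] ^ d[4] ^ d[5]
    c2 = d[2] ^ d[5] ^ d[6]
    c3 = d[3] ^ d[6]
    return (c0, c1, c2, c3)


def gf16_mul_ref(x, y, poly=0b10011):
    """Reference: carry-less multiply of integers, then reduce modulo p(X)."""
    r = 0
    for k in range(4):
        if (y >> k) & 1:
            r ^= x << k
    for deg in (6, 5, 4):
        if (r >> deg) & 1:
            r ^= poly << (deg - 4)
    return r


def to_bits(x):
    """Integer -> (a0, a1, a2, a3)."""
    return tuple((x >> k) & 1 for k in range(4))


def from_bits(t):
    """(a0, a1, a2, a3) -> integer."""
    return sum(bit << k for k, bit in enumerate(t))


# alpha * alpha^3 = alpha^4 = 1 + alpha -> (1, 1, 0, 0)
print(gf16_mul_bits((0, 1, 0, 0), (0, 0, 0, 1)))

# Exhaustive comparison of the bit-level model with the reference.
ok = all(
    from_bits(gf16_mul_bits(to_bits(x), to_bits(y))) == gf16_mul_ref(x, y)
    for x in range(16) for y in range(16)
)
print(ok)
```

The exhaustive check over all 256 operand pairs is cheap for a field this small, and the same pattern can be used to generate test vectors for a hardware test bench.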
The complete FFT architecture [3] can be represented as follows:
[Figure: complete FFT architecture]
V. COMMON SUB-EXPRESSION ELIMINATION ALGORITHM
Multiple constant multiplication is widely used in many Digital Signal Processing applications. In VLSI design for high-level synthesis, proper optimization of multiple constant multiplication is effective in improving parameters such as area and power consumption. To optimize multiple constant multiplications, the Common Subexpression Elimination (CSE) algorithm is used in this paper [5]. The approach first identifies the identical terms, i.e., the common subexpressions present in the equations, and then replaces them with a single variable. Because each common term is computed only once, the hardware required by the VLSI design is significantly reduced.
In a Galois field, addition is performed by XOR-ing, but there are several methods to perform multiplication. In this paper, the multiplication is a linear transform of the form C = AB, where C and B are m- and n-dimensional column vectors, respectively, and A is an m x n constant binary matrix. Here, B represents the input variables and C represents the output variables. In terms of (6), the column vector B corresponds to $(C \cdot (Pf))$ and the constant matrix A corresponds to the product AQ. The CSE algorithm is applied to this matrix-vector multiplication.
Some general steps are involved in carrying out the CSE algorithm [5]. By applying the CSE algorithm to the matrix-vector multiplication, the number of XOR gates is significantly reduced.
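To make the idea concrete, the following Python sketch applies a simple greedy pairwise elimination to a constant binary matrix. It repeatedly finds the pair of input variables that occurs together in the most rows, replaces it with a new intermediate variable, and stops when no pair is shared by at least two rows. This is a common heuristic chosen for illustration and is not necessarily the procedure of [5]. The example matrix and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def xor_count(rows):
    """XORs needed to evaluate each row on its own (terms - 1 per row)."""
    return sum(len(r) - 1 for r in rows if r)


def greedy_cse(matrix):
    """Greedy pairwise CSE over GF(2).

    Returns (definitions, rows): definitions maps each new variable index
    to the pair of variables it XORs; rows are the rewritten outputs,
    each a set of variable indices to XOR together.
    """
    rows = [{j for j, bit in enumerate(r) if bit} for r in matrix]
    next_var = len(matrix[0]) if matrix else 0
    definitions = {}
    while True:
        pairs = Counter()
        for r in rows:
            pairs.update(combinations(sorted(r), 2))
        if not pairs:
            break
        (u, v), freq = pairs.most_common(1)[0]
        if freq < 2:              # no pair is shared by two or more rows
            break
        definitions[next_var] = (u, v)
        for r in rows:
            if u in r and v in r:
                r -= {u, v}       # in-place: drop the pair ...
                r.add(next_var)   # ... and use the shared variable instead
        next_var += 1
    return definitions, rows


def evaluate(definitions, rows, x):
    """Evaluate the rewritten XOR network on an input bit vector x."""
    vals = list(x)
    for var in sorted(definitions):
        u, v = definitions[var]
        vals.append(vals[u] ^ vals[v])
    out = []
    for r in rows:
        acc = 0
        for j in r:
            acc ^= vals[j]
        out.append(acc)
    return out


# Hypothetical 4 x 4 binary matrix, used only to exercise the sketch.
A = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]
direct = xor_count([{j for j, b in enumerate(r) if b} for r in A])
defs, new_rows = greedy_cse(A)
shared = len(defs) + xor_count(new_rows)
print(direct, shared)
```

Each intermediate variable costs one XOR but saves one XOR in every row that uses it, which is why a pair is extracted only when it appears in at least two rows.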
The method suggested in [8] reduces the additive complexity of the Cyclotomic Fast Fourier Transform using a weighted sum of the numbers of multiplications and additions. The work in [12] focuses on both area and delay optimization in hardware implementations over $GF(2^m)$.
VI. REDUCED FFT ARCHITECTURE
To reduce the additive complexity of the FFT architecture, the CSE algorithm described in Section V is used in this paper. The basic stages of the architecture described in Section IV remain the same; the only difference is that the CSE algorithm is applied to stage 4. After applying the CSE algorithm to the matrix, the matrix size increases from 15 x 31 to 47 x 63. As a result, the number of LUTs increases in the final architecture, but the additive complexity, i.e., the number of XOR gates, is reduced significantly.
We have written synthesizable Verilog code for the different stages of the architecture. The first three stages of the architecture remain the same, whereas the design changes at stage 4. With these changes, the two architectures, i.e., without CSE and with CSE, can be compared in terms of LUTs, the number of XOR gates required, and delay.
I.J. Image, Graphics and Signal Processing, 2016, 10, 35-41
VII. RESULTS
The complexity of the proposed architecture has been evaluated from the synthesis reports generated by the Xilinx ISE Design Suite for two FPGA devices, namely Spartan 6 and Virtex 5. The GF multiplier used in the architecture requires 9 LUTs on Spartan 6 and 8 LUTs on Virtex 5. The number of IOBs is the same for both FPGA devices. The maximum combinational path delay is the maximum delay that would occur for the complete architecture.
Since the design is the same for cosets $C_1$, $C_3$ and $C_7$, their implementation results are identical, whereas the results for coset $C_5$ are different. For cosets $C_1$, $C_3$ and $C_7$, the number of XOR gates required is 37 for Spartan 6 and 52 for Virtex 5.
Stage 3 of the architecture is formed by the multiplication of matrix A with Q. Since only matrix multiplication is involved, no XOR gates are required in this stage.
Finally, consider the architecture without CSE. The paper mainly focuses on reducing the additive complexity of the FFT architecture. Before the architecture is modified, the number of XOR gates required is 121 for Spartan 6 and 170 for Virtex 5.
After modifying the architecture with CSE, the number of XOR gates is reduced to 54 for Spartan 6 and 73 for Virtex 5.
VIII. CONCLUSION
The above results show that applying CSE reduces the additive complexity of the architecture considerably. The area of the architecture is also reduced. The number of XOR gates falls from 121 to 54 for Spartan 6 (a reduction of about 55%) and from 170 to 73 for Virtex 5 (a reduction of about 57%). Thus, the additive complexity of the FFT architecture is reduced.
Graphically, the comparison between the two architectures can be plotted as:
[Figure: comparison between the two architectures]
