This paper analyzes the time domain Reed Solomon decoder with FPGA implementation. Data throughput and area are carefully evaluated in comparison with a typical frequency domain Reed Solomon decoder. Three hardware architectures that enhance the data throughput, namely the pipelined architecture, the parallel architecture, and the truncated arrays, are evaluated as well. The evaluation reveals that the number of consumed resources of the time domain decoder for RS (255, 239) is about 20% smaller than that of the frequency domain decoder, although its data throughput is less than 10% of that of the frequency domain decoder. The number of consumed resources of the pipelined architecture is 28% smaller than that of the parallel architecture at the same data throughput, because the pipelined architecture requires less extra logic than the parallel architecture. For higher data throughput, the pipelined architecture is therefore better than the parallel architecture from the viewpoint of consumed resources.
Introduction
Error control coding has become an essential means of ensuring data integrity in a variety of applications, including satellite and mobile communications and the storage of data on magnetic and optical media. Among the various codes available for correcting multiple errors, the Reed-Solomon code is one of the most important. The notation RS(N, k) is commonly used to denote a Reed-Solomon code of block size N symbols, with each block containing k information symbols. Each symbol is an n-bit word, which is usually interpreted as an element of GF(2^n). The code can correct up to t symbol errors, where t = (N − k)/2. The time domain algorithm was developed by Blahut [1] by taking the inverse DFT of all the sequences and operators of the Berlekamp-Massey (BM) algorithm. It operates directly on the received data and generates the error sequence in the domain in which the data is received. Hence it eliminates the need for transform and inverse transform operations such as syndrome computation and Chien search. As a result, the complete algorithm can be implemented by the repeated application of a single operation, which is important for hardware implementation. However, the main drawback of the time domain algorithm is its high computation count. The time domain key equation solving algorithm has to operate on the complete data sequence of length N, while the frequency domain algorithm needs to work only on the syndrome sequence of length N − k. The computation count can be reduced by modifying the algorithm or by choosing a suitable hardware architecture [2].
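The relation between N, k, n, and t can be made concrete with a minimal sketch. The function name `rs_parameters` and the returned dictionary keys are hypothetical, and the checks assume a conventional (non-extended) code with N ≤ 2^n − 1 and an even number of parity symbols.

```python
def rs_parameters(N, k, n):
    """Basic parameters of an RS(N, k) code with n-bit symbols.

    Assumptions (not taken from the paper): N <= 2^n - 1 and N - k even,
    so that t = (N - k) / 2 is an integer.
    """
    if not 0 < k < N <= 2**n - 1:
        raise ValueError("need 0 < k < N <= 2^n - 1")
    if (N - k) % 2:
        raise ValueError("N - k must be even")
    return {
        "t": (N - k) // 2,        # correctable symbol errors per block
        "parity_symbols": N - k,  # 2t redundant symbols
        "info_bits": n * k,       # information bits per block
        "block_bits": n * N,      # transmitted bits per block
    }


# Block size 255 with the information symbol counts used in the evaluation.
for k in (249, 243, 239, 233, 223):
    print(k, rs_parameters(255, k, 8)["t"])
```

For RS (255, 239) this gives t = 8, so the time domain algorithm iterates 2t = 16 times per block.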
It is known that the time domain Reed Solomon decoder generally has lower area and longer computation time. However, this has not been well analyzed quantitatively at the circuit level from a practical point of view.
This paper analyzes the time domain Reed Solomon decoder with FPGA implementation. Data throughput and area are carefully evaluated in comparison with a typical frequency domain Reed Solomon decoder. In addition to the normal time domain Reed Solomon decoder, three hardware architectures that enhance the data throughput, namely the pipelined architecture, the parallel architecture, and the truncated arrays, are carefully evaluated.
The contributions of this paper are as follows.
• We are the first to analyze, at the circuit level, the performance of the three architecture styles of the time domain Reed Solomon decoder presented in [2].
• We are the first to show quantitatively that the pipelined architecture of the time domain Reed Solomon decoder is better than the parallel architecture from the viewpoint of area cost.
• We provide data for the design of practical time domain Reed Solomon decoders.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 explains the details of the partially parallel time domain Reed Solomon decoder. Section 4 gives the evaluation results. Finally, Sect. 5 concludes the paper.
Related Works
Among the codes employed to ensure data integrity, RS codes are preferred for burst error correction, whereas convolutional codes are good for random error correction. The emergence of convolutional codes and LDPC codes has led to intensive interest in iterative decoding. The Koetter-Vardy (KV) algorithm, a soft-input algebraic decoding method [3], is well known. Inspired by the research published by Yedidia et al. [4], Jiang et al. [5] proposed a stochastic shifting based iterative decoding (SSID) scheme for decoding RS codes. They showed that the belief propagation algorithm (BPA) can be used to decode RS codes. It should be noted that BPA and SPA [6], [7] are commonly not practical for high density parity check codes such as the RS codes in [5]. Despite the extensive computation caused by Gaussian elimination (GE), Jiang's scheme outperforms hard decision decoding (HDD) by about 2 dB (FER, at 200 outer rounds and 50 SPA). However, the benefit of their method diminishes as the codeword length becomes long. In this regard, Bellorado et al. [8] proposed an iterative RS decoder based on a reduced-density binary parity check matrix. The computational complexity is reduced notably while the error rate is improved. They state that the algorithm appears to be implementable in hardware. However, for applications in which the FER need not be that low and energy saving is the greater concern, traditional HDD is still relevant.
Copyright © 2017 The Institute of Electronics, Information and Communication Engineers
Architectures for higher performance Reed Solomon decoders have been researched for a long time and remain a hot research topic. A syndrome-based RS decoder generally consists of three main blocks: a syndrome calculation (SC) block, a key equation solver (KES) block, and a Chien search and error evaluation (CSEE) block [9]. Dayal et al. [10] proposed an RS (255, 239) decoder for 802.16 wireless networks that introduces pipelining in the Chien search component of the decoder to improve the maximum operating frequency. Jiang et al. [11] proposed a multigigabit Reed-Solomon (RS) and convolutional code (CC) decoder architecture for 60 GHz systems. They introduced the reformulated inversionless Berlekamp-Massey (RIBM) algorithm with a double-clock method to improve the decoding speed of the RS decoder part. Lee et al. proposed a low-complexity, high-speed RS (255, 239) decoder architecture using the Modified Euclidean (ME) algorithm for high-speed optical communication systems [12]. These days, communication systems are often implemented on FPGAs for their lower development cost and reconfigurability [13]. RS decoders for several communication systems have been implemented on FPGAs [14], [15]. However, because FPGAs are vulnerable to soft errors, soft error tolerant system design is important for communication systems on FPGAs [16].
Time Domain Reed Solomon Decoder
This section explains the target time domain Reed Solomon decoder [2]. Either the Modified Euclidean (ME) algorithm or the BM algorithm can be used to solve the key equation for the error locator polynomial and the error evaluator polynomial [9]. The target decoder uses the BM algorithm. Section 3.1 briefly explains the decoding algorithm. Section 3.2 describes the hardware architecture.
Decoding Algorithm
The target algorithm is Algorithm D proposed in [2]. The algorithm consists of 3 phases. In the first phase, the 9 parameters of the algorithm are initialized. In the 2nd phase, the parameters are updated following Eqs. (1a) and (1b), and this update is repeated 2t times. The algorithm does not require transform and inverse transform operations such as syndrome computation and Chien search. As a result, the complete algorithm can be implemented by the repeated application of a single operation, which is important for hardware implementation. Finally, in the 3rd phase, the error magnitude at each location i is calculated by Eq. (1c). Algorithm D:
where λ_i = 0 (1c)
Hardware Architecture
This section explains the hardware architecture of the time domain Reed Solomon decoder, which is designed to reduce the computation time. The processing module performs the arithmetic formulated in (1a), (1b), and (1c). The data are decoded by applying the processing module 2t times. First, the processing module, which is the basic unit of the time domain decoder, is described in Sect. 3.2.1. Choomchuay proposed three hardware architectures to make the computation time practical, namely the pipelined architecture, the parallel architecture, and the truncated arrays [2]. Sections 3.2.2, 3.2.3, and 3.2.4 explain these architectures, respectively.
Processing Module
Figure 1 depicts the block diagram of the processing module. The processing module calculates Δ_r and updates the 6 parameters. The left side is the input side and the right side is the output side. The module consists of a Δ_r computation unit for calculating Δ_r and the units for updating the 6 parameters, matrix coeff generation, iteration decision, and vector modification unit. A shift register D(N) of length N is connected to v_i and to each input line of the 6 parameters. The processing module calculates the N elements of each parameter sequentially using the shift operation of D(N). In addition to these N clock cycles, 1 clock cycle for initialization and 1 clock cycle for updating Δ_r are needed. Therefore, N + 2 clock cycles are required for the calculation. The waveforms of the logic simulation of a processing module for RS (15, 9) are shown in Fig. 3.
Figure 2 shows the sequential decoding architecture using a single processing module. The time domain architecture decodes the encoded codeword by repeating the computation of (1a) and (1b) 2t times. The precomputed Δ_r and the 8 parameters are applied to the processing module, and the processing module generates the updated parameters. The updated parameters are looped back to the inputs of the processing module via a MUX. This process is repeated 2t times. The value of Δ_{r+1} is calculated simultaneously. This calculation requires N clock cycles. Here, let PM(N) be the processing module including shift registers whose common length is N.
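In this case, each of the 2t iterations takes N + 2 clock cycles and one block carries nk information bits, so the data throughput TH (bit/s) of RS (N, k) is expressed by the following formula.
$$TH = \frac{f n k}{2t (N + 2)} \quad (2)$$
where f is the clock frequency of the decoder, n is the bit width of a symbol, k is the number of information symbols, and N is the block size.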
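The cycle count of the sequential architecture can be sketched as follows. This is a simple model built only from the figures stated above (N + 2 clock cycles per iteration, 2t iterations per block); the function names and the 100 MHz clock are hypothetical and are not taken from the synthesis results.

```python
def sequential_cycles(N, k):
    """Clock cycles to decode one block with a single processing module.

    Model assumption: 2t iterations, each taking N + 2 cycles
    (1 initialization + N shifts + 1 update of the discrepancy).
    """
    two_t = N - k
    return two_t * (N + 2)


def sequential_throughput(f_hz, n, N, k):
    """Information bits per second: n*k bits delivered every block."""
    return f_hz * n * k / sequential_cycles(N, k)


if __name__ == "__main__":
    f_hz = 100e6  # hypothetical clock frequency, for illustration only
    for k in (249, 243, 239, 233, 223):
        th = sequential_throughput(f_hz, 8, 255, k)
        print(f"RS(255, {k}): {sequential_cycles(255, k)} cycles/block, "
              f"{th / 1e6:.1f} Mbit/s")
```

Under this model RS (255, 239) needs 16 × 257 = 4112 cycles per block, which makes the cost of the 2t iterations explicit.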
Pipelined Architecture
As shown in Fig. 4, the fully pipelined architecture consists of 2t processing modules connected in series. Each processing module carries out one of the iterations 1 to 2t of the algorithm. The Δ_r computation unit of each stage calculates the value of Δ_{r+1} used in the next stage. In this case, the data throughput TH (bit/s) of RS (N, k) is 2t times that of the sequential architecture. Accordingly, the data throughput is expressed by the following formula.
$$TH = \frac{f n k}{N + 2} \quad (3)$$
This architecture requires 2t processing modules. Therefore, the required hardware resources increase linearly as the number of iterations increases.
Parallel Architecture
For the p parallel architecture, the data throughput TH (bit/s) is expressed by Eq. (4).
The p parallel architecture requires p processing modules and p MUXs, which causes area overhead. In theory, the total number of registers for storing the parameters does not change, because the number of parameter values is the same even when the architecture is modified for parallel processing.
Truncated Arrays
The pipelined architecture can be truncated. An array consisting of only t processing modules can be used, with the data recycled twice through the array, as shown in Fig. 6. Both the data throughput and the hardware are halved in this case. The array can be truncated further.
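The trade-off between array length and throughput can be sketched with a small model. It assumes that an array of m processing modules is traversed 2t/m times per block and that each traversal occupies the array for N + 2 clock cycles; the function name `array_throughput`, the restriction that m divides 2t, and the normalized clock are assumptions made for illustration.

```python
def array_throughput(f_hz, n, N, k, modules):
    """Throughput of an array of `modules` processing modules.

    modules = 2t  -> fully pipelined array (one pass per block)
    modules = t   -> truncated array, data recycled twice
    modules = 1   -> sequential architecture (2t passes)
    """
    two_t = N - k
    if modules < 1 or two_t % modules:
        raise ValueError("modules must divide 2t in this simple model")
    passes = two_t // modules
    return f_hz * n * k / (passes * (N + 2))


if __name__ == "__main__":
    f_hz = 1.0  # normalized clock, so results are in bits per cycle
    for m in (16, 8, 4, 2, 1):
        print(m, array_throughput(f_hz, 8, 255, 239, m))
```

In this model, halving the number of modules halves the throughput, which matches the statement that both throughput and hardware are halved for the array of t modules.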
Evaluation
The decoder of [17] is used as the frequency domain Reed Solomon decoder for this evaluation. Here, the two parameters for the evaluation, the ratio of data throughput R_TH and the ratio of consumed resources R_N, are defined as follows.
$$R_{TH} = \frac{TH_T}{TH_F}$$
where TH_T and TH_F are the data throughput of the evaluated time domain and frequency domain Reed Solomon decoders, respectively.
$$R_N = \frac{N_T}{N_F}$$
where N_T and N_F are the numbers of logic elements used for the synthesis of the time domain and frequency domain Reed Solomon decoders, respectively. R_TH indicates the ratio of the data throughput of the time domain decoder to that of the frequency domain decoder.
Sequential Architecture
This section evaluates the basic sequential architecture. In this evaluation, the block size is fixed to 255. First, the data throughput and the number of consumed resources are evaluated for information symbol counts k = 249, 243, 239, 233, and 223. Data throughput is calculated from Eq. (2). The number of consumed resources is obtained from the synthesis result. Table 1 shows the evaluation result. As the number of information symbols increases, data throughput increases. The column CIII gives the results for Cyclone III, and the column CIV gives the results for Cyclone IV. According to the result, the data throughput of the time domain decoder is lower than that of the frequency domain decoder. The frequency domain decoder requires no iterative calculation, whereas the time domain decoder requires 2t iterations. Therefore, as the number of information symbols decreases, the data throughput becomes lower. The value of R_TH is 0.08 when k = 243, which indicates that the data throughput of the time domain decoder is less than 10% of that of the frequency domain decoder. The throughput on Cyclone IV is twice that on Cyclone III, because the clock frequency of Cyclone IV is twice that of Cyclone III.
Table 2 shows the number of resources consumed by the implementation. The rows lelm, comb, reg, and mem give the numbers of used logic elements, combinational functions, logic registers, and memory bits, respectively. The number of logic elements used by the frequency domain decoder increases as the number of information symbols decreases, which indicates that a larger amount of resources is required as t increases. On the other hand, the number of logic elements of the time domain decoder is constant, because the time domain decoder uses the same processing module even when t changes. The value of R_N is larger than 1 when k = 249, which means that the time domain decoder requires more resources than the frequency domain decoder in that case.
In the case of Cyclone III, R_N is smaller than 1 when k is smaller than 243, which indicates that the area of the processing module is smaller than that of the frequency domain decoder. On Cyclone III, the area of the time domain decoder is 65% of that of the frequency domain decoder when k = 239. On Cyclone IV, the number of logic elements used by the time domain decoder is 94% of that of the frequency domain decoder when k = 239. Thus R_N depends on the target device. As a whole, however, the number of logic elements used by the time domain decoder tends to be smaller than that of the frequency domain decoder according to the evaluation result. The number of memory bits used in the synthesis result for Cyclone III is larger than that for Cyclone IV. On the other hand, the number of logic registers used in the synthesis result for Cyclone IV is larger than that for Cyclone III. This is because the shift registers D(N) of Fig. 1 are constructed from memory bits on Cyclone III and from logic registers on Cyclone IV.
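The throughput ratio quoted above for the sequential architecture can be cross-checked with a short calculation. This sketch assumes that both decoders run at the same clock frequency, that the frequency domain decoder delivers f nk/N bit/s, and that the sequential time domain decoder needs 2t iterations of N + 2 cycles per block; the function name is hypothetical and the result is an idealized ratio, not the measured table data.

```python
def ideal_throughput_ratio(N, k):
    """Idealized R_TH = TH_T / TH_F for the sequential architecture.

    Assumptions: same clock for both decoders,
    TH_F = f*n*k / N and TH_T = f*n*k / (2t * (N + 2)).
    The factors f, n, and k cancel in the ratio.
    """
    two_t = N - k
    return N / (two_t * (N + 2))


if __name__ == "__main__":
    for k in (249, 243, 239, 233, 223):
        print(k, round(ideal_throughput_ratio(255, k), 3))
```

For k = 243 the model gives 255 / (12 × 257), which rounds to 0.08 and agrees with the reported R_TH; the ratio is essentially 1/(2t), so it shrinks as k decreases.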
Pipelined Architecture
This section evaluates the number of consumed resources of the fully pipelined architecture for information symbol counts k = 249, 243, and 239. Theoretically, the data throughput of a typical single input channel Reed Solomon decoder, including the Altera Reed Solomon II IP Core, is expressed by f nk/N (bit/s), where f is the clock frequency, n is the bit width of a symbol, k is the number of information symbols, and N is the block size. On the other hand, the data throughput of the pipelined architecture of the time domain Reed Solomon decoder is expressed by Eq. (3). When N is sufficiently larger than 2, we can approximate N + 2 ≈ N. Because 257 ≈ 255, the data throughput of the time domain Reed Solomon decoder can be approximated as equal to that of the frequency domain Reed Solomon decoder. Table 3 shows the result of the evaluation. The number of all required resources increases as k decreases. When k = 239, R_N is 5.2 for Cyclone III and 9.7 for Cyclone IV. This indicates that the area of the fully pipelined time domain decoder is 5.2 times that of the frequency domain decoder on Cyclone III and 9.7 times that of the frequency domain decoder on Cyclone IV.
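The size of the error introduced by the approximation N + 2 ≈ N can be worked out directly. The following short derivation assumes that both decoders run at the same clock frequency f and that Eq. (3) has the form f nk/(N + 2), as the approximation above implies.

```latex
\frac{TH_{\mathrm{pipelined}}}{TH_{\mathrm{frequency}}}
  = \frac{f n k / (N + 2)}{f n k / N}
  = \frac{N}{N + 2}
  = \frac{255}{257}
  \approx 0.992
```

Under these assumptions the pipelined time domain decoder falls short of the frequency domain throughput by less than 1% for N = 255.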
Parallel Architecture
This section evaluates the number of consumed resources of the parallel architecture for RS (255, 239). Table 4 shows the evaluation result of data throughput. The data throughput increases linearly with p because k is fixed to 239. Therefore, we can conclude that the data throughput is increased p times when the p parallel architecture is applied. Table 5 shows the consumed resources. The data throughput of the p parallel architecture of RS (255, 239) can be approximated as equal to that of the pipelined architecture of RS (255, 239). The resources required by the p parallel architecture are larger than those of the pipelined architecture, because the p parallel architecture requires extra control logic, which increases the consumed resources. The number of consumed resources of the pipelined architecture is 35% and 22% smaller than that of the 16 parallel architecture when Cyclone III and Cyclone IV are targeted, respectively.
Equation (3), which gives the data throughput of the fully pipelined architecture, can be approximated by f nk/N when N is sufficiently larger than 2. On the other hand, Eq. (4) becomes f nk/(N + 1 + 1) when 2t is substituted for p, which is likewise approximated by f nk/N when N is sufficiently larger than 2. Accordingly, the data throughput of both the fully pipelined architecture and the 2t parallel architecture can be approximated as equal to that of the frequency domain decoder. Table 6 shows the consumed resources of the fully pipelined architecture and the 2t parallel architecture of RS (255, 253), RS (255, 251), RS (255, 247), and RS (255, 239). The ratio R_N2 of the number of resources used for synthesis of the fully pipelined architecture to that of the 2t parallel architecture is defined as follows.
$$R_{N2} = \frac{N_{FP}}{N_P}$$
where N_FP is the number of resources used for synthesis of the fully pipelined architecture and N_P is that of the 2t parallel architecture. R_N2 of lelm and comb is smaller than 1, which indicates that the numbers of logic elements and combinational functions used for synthesis of the pipelined architecture are smaller than those of the 2t parallel architecture. This is because the p parallel architecture requires extra control logic, which increases the consumed resources. On average, the number of logic elements used by the fully pipelined architecture is 28% smaller on Cyclone III and 18% smaller on Cyclone IV than that of the 2t parallel architecture. On the other hand, the number of logic registers used for synthesis of the fully pipelined architecture is larger than that of the 2t parallel architecture. The number of registers for D(N) is proportional to the number of pipeline stages of the fully pipelined architecture, whereas it does not depend on p in the parallel architecture. The result is affected by this difference. The number of memory bits is constant. Figure 7 plots R_N2 of lelm, comb, reg, and mem against t. Although R_N2 of the logic registers increases as t increases, R_N2 of the logic elements decreases.
Truncated Arrays
Table 7 shows the result of the evaluation of the consumed resources of the truncated arrays of RS (255, 239). The number of consumed resources is about half of that of the fully pipelined architecture. Concretely, the numbers of consumed resources of the truncated arrays of RS (255, 239) are 43% and 46% smaller than those of the fully pipelined architecture. However, the data throughput is halved, too.
Discussion
The merits of the time domain Reed Solomon decoder compared with a typical frequency domain Reed Solomon decoder are as follows.
Universality:
The time domain Reed Solomon decoder is a universal Reed Solomon decoder [1]. Unlike the frequency domain Reed Solomon decoder, the universal decoder can decode an encoded codeword with an arbitrary number of information symbols. In particular, the sequential architecture shown in Fig. 2 can decode arbitrary encoded codewords with a single processing module. We can conclude that it is area efficient compared with a typical frequency domain Reed Solomon decoder. The evaluation result of the area efficiency of the sequential architecture of the time domain Reed Solomon decoder is shown in Table 2.
Simplicity of hardware architecture:
Because the algorithm can be implemented by the repeated application of a single operation, it is easy to implement in hardware. In particular, devices with a regular structure, such as FPGAs and GPGPUs [18], are suitable for the implementation.
Ease to apply techniques for dependable design:
Because the hardware architecture consists of a simple controller and simple, regular processing modules, it is easy to apply fault tolerant techniques such as dual modular redundancy (DMR) or triple modular redundancy (TMR) [19]. This is especially advantageous for implementation on SRAM-based FPGAs, which are vulnerable to soft errors occurring during normal operation [16]. Figure 8 depicts the pipelined architecture of the time domain Reed Solomon decoder with DMR and TMR applied.
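The principle behind the two redundancy schemes can be illustrated with a generic sketch. This is not the circuit of Fig. 8: it only models the textbook behavior of a bitwise TMR majority voter and a DMR comparator on integer-coded module outputs, and the function names and the example symbol value are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant outputs.

    Each output bit takes the value held by at least two of the copies,
    so a fault confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


def dmr_mismatch(a, b):
    """DMR detects (but cannot correct) a fault: True if the copies differ."""
    return a != b


if __name__ == "__main__":
    golden = 0b10110010          # hypothetical 8-bit symbol from a module
    upset = golden ^ 0b00001000  # same symbol with one bit flipped
    print(bin(tmr_vote(golden, upset, golden)))  # majority restores the symbol
    print(dmr_mismatch(golden, upset))           # mismatch flags the fault
```

TMR masks a single faulty copy at the cost of three modules and a voter, while DMR uses two modules and can only signal that a recomputation is needed.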
Conclusion
This paper has analyzed the time domain Reed Solomon decoder with FPGA implementation. Data throughput and area were carefully evaluated in comparison with a typical frequency domain Reed Solomon decoder. Three hardware architectures that enhance the data throughput, namely the pipelined architecture, the parallel architecture, and the truncated arrays, were evaluated as well. The evaluation reveals that the number of consumed resources of the time domain decoder for RS (255, 239) is about 20% smaller than that of the frequency domain decoder, although its data throughput is less than 10% of that of the frequency domain decoder. The number of consumed resources of the pipelined architecture is 28% smaller than that of the parallel architecture at the same data throughput, because the pipelined architecture requires less extra logic than the parallel architecture. For higher data throughput, the pipelined architecture is therefore better than the parallel architecture from the viewpoint of consumed resources.
