Abstract-Reed-Solomon (RS) codes play an important role in providing error correction and data integrity in various communication/storage applications. For high-speed applications, most RS decoders are implemented as dedicated application-specific integrated circuits (ASICs) based on parallel architectures, which can deliver a high data throughput rate. For lower-speed applications, the RS decoding operations are usually performed by fine-grained processing elements (PEs) controlled by a programmable digital signal processing (DSP) core, which provides high flexibility. In this paper, we propose a novel $m$-PE multisymbol-sliced (MSS) RS datapath structure. The $m$-PE RS architecture is a highly scalable design and can be dynamically reconfigured at 1-PE, 2-PE,. . .
Fig. 1. Spectrum of VLSI implementations in terms of flexibility, power efficiency, and silicon cost.
Since Flash memory and micro hard drives are manufactured at the nanometer scale, structural and manufacturing faults become inevitable. Currently, an error control coding (ECC) [2] unit is a common scheme in DRAM and Flash memory systems to provide data integrity. However, its correcting capability is limited, since most such units still use a BCH codec as the forward error correcting (FEC) processing engine. As manufacturing defects of memory/storage become inevitable and the data amount and throughput rate increase, the reliability of data transmission on mobile devices has become more and more important. Hence, a better and more robust FEC mechanism should be employed in the ECC unit of the storage system in future portable/hand-held devices. The Reed-Solomon (RS) codec [3]-[5] is a widely adopted FEC technique that has an excellent error correction capability against burst errors. It has been adopted by many communication/storage systems, such as computer memory storage, magnetic and optical recording, and wireless mobile and satellite communications. Hence, it is a good candidate for the design of ECC units.
In portable/hand-held devices, power efficiency always plays the key role in extending the standby/operating period. Currently, the hardware solutions can be categorized into three types: application-specific integrated circuits (ASICs), reconfigurable designs, and digital signal processing (DSP)-type solutions [12]. From Fig. 1, we can see that they provide a tradeoff among flexibility, power efficiency, and silicon area. In general, for low-speed applications (e.g., error correction for MP3/voice decoding), existing ASIC-type RS solutions [13], [14] provide much more computing power than required. Hence, using an ASIC-type design in a low-speed application can be considered an over-design. On the other hand, for high-speed applications (e.g., error correction for movie decoding), the DSP-type solutions [15]-[17] need many more cycles at a very high operating frequency, which can lead to high power consumption. That is, the two design styles sit at the two extremes of Fig. 1, and they usually do not have runtime-controlled power-saving mechanisms due to the constraints of their VLSI architectures.
In this paper, we propose a novel $m$-PE multisymbol-sliced (MSS) RS datapath structure, which is classified as the reconfigurable Silicon IP (SIP) in Fig. 1. The $m$-PE RS architecture is a highly scalable design and can be dynamically reconfigured at 1-PE, 2-PE, and larger modes up to the $m$-PE mode. That is, it can deliver different data throughput rates by activating a different number of PEs, based on the design flowchart of Fig. 1. The applications of the proposed design are illustrated in Fig. 2. We have one-PE, partial-PE, and full-PE modes to meet the different throughput rates required by various target applications. Meanwhile, the idle PEs can be turned off through gated-clock schemes to save energy. Hence, the proposed runtime-configurable ASIC design provides a good tradeoff between the data throughput rate and the power dissipation. Therefore, the proposed MSS-RS design is a good candidate to extend the battery life of portable devices.
Finally, we demonstrate a prototype VLSI design with four processing elements (a 4-PE MSS-RS data-path architecture) using the UMC 0.18-μm CMOS standard-cell library. The design can be dynamically reconfigured to operate in 1-PE, 2-PE, and 4-PE modes, with performance of 140 Mb/s at 18.91 mW, 280 Mb/s at 28.77 mW, and 560 Mb/s at 48.47 mW, respectively. Compared with existing RS designs, the proposed $m$-PE RS decoder has better performance indices, such as normalized area efficiency (NAE) and normalized power efficiency (NPE), than most DSP- and ASIC-type RS designs.
The rest of this paper is organized as follows. In Section II, we investigate the syndrome-based RS decoding algorithm and design three different basic finite-field operators (FFOs). In Section III, we derive a new coarse-grained finite-field MSS-PE from the FFOs. Moreover, we propose a dynamically reconfigurable RS decoding architecture. The detailed operations of the proposed MSS-RS decoder are discussed in Section IV. Section V shows the performance and comparisons. Finally, we conclude our work in Section VI.
II. PROPOSED MSS PROCESSING ELEMENT
The syndrome-based RS decoding process consists of syndrome calculation (SC), key equation solving (KES), and error correction (EC), as shown in Fig. 3 [4] - [6] . In this work, we employ the Modified Euclidean (ME) algorithm [18] to solve the key equation. The advantage of the ME-based RS decoder is its pipelinability for high-throughput decoding. In this section, we analyze the three major DSP modules of Fig. 3 . By the algorithmic analysis of each module, we derive three new finite-field operators (FFO) that correspond to the basic operation kernels of these three modules. The FFOs will be applied to our MSS Processing Element (PE) design in Section III.
A. Syndrome Calculation
The encoded RS codeword consists of $n$ finite-field symbols, $c_{n-1}, c_{n-2}, \ldots, c_0$, and $C(x) = \sum_{j=0}^{n-1} c_j x^j$ denotes the polynomial representation of this codeword. We also define the received code polynomial and the error value polynomial as $R(x) = \sum_{j=0}^{n-1} r_j x^j$ and $E(x) = \sum_{j=0}^{n-1} e_j x^j$, respectively. Then, $R(x) = C(x) + E(x)$. The syndrome calculation (SC) module receives the RS code from the channel and computes the syndrome values of the $(2t-1)$-degree syndrome polynomial

$$S(x) = \sum_{i=0}^{2t-1} S_i x^i \tag{1}$$

where $S_i$ are the syndrome values and $t$ denotes the error correcting capability. Then, $S_i$ can be calculated by

$$S_i = R(\alpha^i) = \sum_{j=0}^{n-1} r_j \alpha^{ij}, \quad 0 \le i \le 2t-1 \tag{2}$$

where $\alpha^i$ are the roots of the generator polynomial in the RS encoder. Equation (2) can be represented in a recursive format as

$$S_i = \left(\cdots\left(\left(r_{n-1}\alpha^i + r_{n-2}\right)\alpha^i + r_{n-3}\right)\alpha^i + \cdots + r_1\right)\alpha^i + r_0 \tag{3}$$

Equation (3) is an iterative process and can be implemented with the hardware architecture of Fig. 4(a). We can see that the basic operator of the syndrome calculation consists of one finite-field multiplier (FFM) and one finite-field adder (FFA). The basic finite-field operator (FFO) corresponding to the syndrome calculation module (we call it SC-FFO) is shown in Fig. 4(b). Each iteration of the SC module takes $2t$ SC-FFO operations.
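As a rough illustration of the recursive syndrome computation, the following Python sketch evaluates the $2t$ syndromes with Horner's rule, applying one multiply-then-add operation (the role played by the SC-FFO) per syndrome register for every received symbol. The field GF(2^8), the primitive polynomial 0x11D, the convention that the $i$th syndrome is the received polynomial evaluated at $\alpha^i$, and all function names are assumptions made for this example; the text does not specify them.

```python
# Assumed field: GF(2^8) with primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D).
PRIM = 0x11D
EXP = [0] * 510   # EXP[k] = alpha^k, doubled in length to avoid a modulo in gf_mul
LOG = [0] * 256   # LOG[alpha^k] = k

_x = 1
for _k in range(255):
    EXP[_k] = _x
    LOG[_x] = _k
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _k in range(255, 510):
    EXP[_k] = EXP[_k - 255]


def gf_mul(a, b):
    """Finite-field multiplier (FFM); the finite-field adder (FFA) is XOR."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(received, t):
    """Horner-style syndrome calculation.

    received[0] is the highest-order symbol r_{n-1}, i.e. the first one
    arriving from the channel. Returns [S_0, ..., S_{2t-1}].
    """
    regs = [0] * (2 * t)
    for r in received:                  # one iteration per received symbol
        for i in range(2 * t):          # 2t multiply-and-add operations
            regs[i] = gf_mul(regs[i], EXP[i]) ^ r
    return regs


def syndromes_direct(received, t):
    """Reference: evaluate sum_j r_j * alpha^(i*j) term by term."""
    n = len(received)
    out = []
    for i in range(2 * t):
        s = 0
        for k, r in enumerate(received):
            j = n - 1 - k               # power of x carried by this symbol
            s ^= gf_mul(r, EXP[(i * j) % 255])
        out.append(s)
    return out
```

Both functions should agree on any input; the Horner form is the one that maps onto a register updated once per incoming symbol.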
B. Modified Euclidean Algorithm for Solving Key Equation
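The key equation solving (KES) module is the most critical and complicated part of the RS decoder. Assume that there are $v$ error symbols in the received code block with $v \le t$. We can define the error location polynomial as

$$\sigma(x) = \sum_{i=0}^{v} \sigma_i x^i \tag{4}$$

If there is an error at the $j$th coefficient of the received code polynomial $R(x)$, we have $\sigma(\alpha^{-j}) = 0$. We can also define the error magnitude polynomial as

$$\omega(x) = \sum_{i=0}^{v-1} \omega_i x^i \tag{5}$$

The error location polynomial $\sigma(x)$ and the error magnitude polynomial $\omega(x)$ can be obtained by solving the key equation below

$$\sigma(x) S(x) = \omega(x) \bmod x^{2t} \tag{6}$$

In this paper, we employ the modified Euclidean (ME) algorithm [18] to solve the key equation. The ME algorithm can be explained by the following iteration equations.

1) Initial conditions:

$$R_0(x) = x^{2t}, \quad Q_0(x) = S(x), \quad L_0(x) = 0, \quad U_0(x) = 1 \tag{7}$$

2) Updating equations in each iteration:

$$R_i(x) = \left[s_{i-1} b_{i-1} R_{i-1}(x) + \bar{s}_{i-1} a_{i-1} Q_{i-1}(x)\right] - x^{|l_{i-1}|}\left[s_{i-1} a_{i-1} Q_{i-1}(x) + \bar{s}_{i-1} b_{i-1} R_{i-1}(x)\right], \quad Q_i(x) = s_{i-1} Q_{i-1}(x) + \bar{s}_{i-1} R_{i-1}(x) \tag{8}$$

$$L_i(x) = \left[s_{i-1} b_{i-1} L_{i-1}(x) + \bar{s}_{i-1} a_{i-1} U_{i-1}(x)\right] - x^{|l_{i-1}|}\left[s_{i-1} a_{i-1} U_{i-1}(x) + \bar{s}_{i-1} b_{i-1} L_{i-1}(x)\right], \quad U_i(x) = s_{i-1} U_{i-1}(x) + \bar{s}_{i-1} L_{i-1}(x) \tag{9}$$

where $l_{i-1} = \deg(R_{i-1}(x)) - \deg(Q_{i-1}(x))$, $\bar{s}_{i-1} = 1 - s_{i-1}$, and

$$s_{i-1} = \begin{cases} 1, & \text{if } l_{i-1} \ge 0 \\ 0, & \text{if } l_{i-1} < 0 \end{cases} \tag{10}$$

3) Stop condition:

$$\deg(R_i(x)) < t \tag{11}$$

4) Output assignments:

$$\omega(x) = R_i(x), \quad \sigma(x) = L_i(x) \tag{12}$$

where "$a_{i-1}$" and "$b_{i-1}$" denote the leading coefficients of $R_{i-1}(x)$ and $Q_{i-1}(x)$, respectively. Since (8) and (9) are calculated concurrently, we use four FFMs and two FFAs to construct the basic finite-field operator for computing the ME algorithm (we call it ME-FFO), as shown in Fig. 5(c). Each iteration of the ME algorithm takes $2t$ ME-FFO operations.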
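To make the iteration concrete, here is a minimal Python sketch of a cross-multiplying (division-free) Euclidean key-equation solver in the spirit of the ME algorithm. It is an illustration under assumptions, not the hardware schedule of the paper: the field GF(2^8) with primitive polynomial 0x11D, syndromes taken as the received polynomial evaluated at $\alpha^0, \ldots, \alpha^{2t-1}$, the working polynomial names `R`, `Q`, `L`, `U`, and every function name are hypothetical choices. Polynomials are coefficient lists with index equal to the power of $x$.

```python
# Assumed field: GF(2^8) with primitive polynomial 0x11D.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _k in range(255):
    EXP[_k] = _x
    LOG[_x] = _k
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _k in range(255, 510):
    EXP[_k] = EXP[_k - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def degree(p):
    """Degree of a coefficient list; -1 for the zero polynomial."""
    for k in range(len(p) - 1, -1, -1):
        if p[k]:
            return k
    return -1


def poly_comb(p, c, q, d, shift):
    """Return c*p(x) + d*x^shift*q(x); two multiplies and one add per term."""
    out = [0] * max(len(p), len(q) + shift)
    for k, v in enumerate(p):
        out[k] ^= gf_mul(v, c)
    for k, v in enumerate(q):
        out[k + shift] ^= gf_mul(v, d)
    return out


def poly_eval(p, x):
    acc = 0
    for v in reversed(p):
        acc = gf_mul(acc, x) ^ v
    return acc


def modified_euclid(S, t):
    """Return (locator, magnitude) polynomials, each up to a common scalar."""
    if degree(S) < 0:                       # all-zero syndromes: no errors
        return [1], [0]
    R, Q = [0] * (2 * t) + [1], list(S)     # R = x^(2t), Q = S(x)
    L, U = [0], [1]
    while degree(R) >= t:                   # stop once deg(R) < t
        dr, dq = degree(R), degree(Q)
        a, b = R[dr], Q[dq]                 # leading coefficients
        l = dr - dq
        if l >= 0:                          # eliminate the leading term of R
            R = poly_comb(R, b, Q, a, l)
            L = poly_comb(L, b, U, a, l)
        else:                               # swap roles, then eliminate
            R, Q = poly_comb(Q, a, R, b, -l), R
            L, U = poly_comb(U, a, L, b, -l), L
    return L, R
```

Because no division is performed, the returned pair carries a common nonzero scale factor; this cancels when the error value is formed as a ratio of the two polynomials.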
C. Error Correction (EC)
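The error correction (EC) module finds the locations of errors by checking if $\sigma(\alpha^{-j}) = 0$, for $0 \le j \le n-1$. This is called Chien's search, which evaluates $\sigma(\alpha^{-j})$ where

$$\sigma(\alpha^{-j}) = \sum_{i=0}^{v} \sigma_i \alpha^{-ij} \tag{13}$$

Since the maximum number of errors equals the error correcting capability $t$, we can replace $v$ with $t$ and set $\sigma_i = 0$ for $v < i \le t$ in (13). It can be represented in an iterative form as

$$\sigma(\alpha^{-j}) = \sum_{i=0}^{t} \sigma_i^{(j)} \tag{14}$$

where $\sigma_i^{(j)} = \sigma_i \alpha^{-ij}$. Then, the values of $\sigma_i^{(j)}$ can be calculated sequentially by the following iterative operations.

1) Initial conditions:

$$\sigma_i^{(0)} = \sigma_i, \quad 0 \le i \le t \tag{15}$$

2) Iterative operations:

$$\sigma_i^{(j)} = \sigma_i^{(j-1)} \alpha^{-i} \tag{16}$$

where $1 \le j \le n-1$. To calculate the error value, the Forney algorithm is used. If there is an error at the $j$th coefficient in $R(x)$, the error value $e_j$ can be calculated by

$$e_j = \frac{\omega(\alpha^{-j})}{\sum_{i\ \mathrm{odd}} \sigma_i^{(j)}} \tag{17}$$

The denominator sums half of the $\sigma_i^{(j)}$, namely those belonging to the odd-indexed coefficients of $\sigma(x)$. This is part of the computation in Chien's search. With hardware sharing, we do not need additional hardware to evaluate this denominator. The numerator evaluates the error magnitude polynomial by setting $x = \alpha^{-j}$. Similarly to (14), we can evaluate the error magnitude polynomial by

$$\omega(\alpha^{-j}) = \sum_{i=0}^{t-1} \omega_i^{(j)} \tag{18}$$

where $\omega_i^{(j)} = \omega_i \alpha^{-ij}$. The values of $\omega_i^{(j)}$ can be calculated sequentially by the following iterative operations.

1) Initial conditions:

$$\omega_i^{(0)} = \omega_i, \quad 0 \le i \le t-1 \tag{19}$$

2) Iterative operations:

$$\omega_i^{(j)} = \omega_i^{(j-1)} \alpha^{-i} \tag{20}$$

where $1 \le j \le n-1$. Finally, with both Chien's search and the Forney algorithm, the EC module can be implemented using the hardware architecture of Fig. 6(a). Apart from the finite-field divider (FFD), the basic operator consists of one FFM and one FFA. The basic FFO corresponding to the EC module (we call it EC-FFO) is shown in Fig. 6(b). Each iteration of EC takes $2t$ EC-FFO operations.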
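The register-update view of Chien's search and the Forney algorithm can be sketched in Python as follows. Each coefficient register is multiplied by a fixed power of $\alpha^{-1}$ per step and the registers are summed; the sum over odd-indexed locator registers is reused as the Forney denominator. This is a sketch under assumptions: GF(2^8) with primitive polynomial 0x11D, syndromes defined at $\alpha^0, \ldots, \alpha^{2t-1}$ (which makes the error value a plain ratio), and hypothetical helper names, including `build_sigma_omega`, which only fabricates consistent test inputs from a chosen error pattern.

```python
# Assumed field: GF(2^8) with primitive polynomial 0x11D.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _k in range(255):
    EXP[_k] = _x
    LOG[_x] = _k
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _k in range(255, 510):
    EXP[_k] = EXP[_k - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Finite-field divider (FFD)."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for k, b in enumerate(q):
            out[i + k] ^= gf_mul(a, b)
    return out


def build_sigma_omega(errors):
    """Test helper: errors maps position j -> value e_j (at least one entry)."""
    sigma = [1]
    for j in errors:
        sigma = poly_mul(sigma, [1, EXP[j % 255]])      # (1 + alpha^j x)
    omega = [0] * len(errors)
    for j, e in errors.items():
        term = [e]
        for m in errors:
            if m != j:
                term = poly_mul(term, [1, EXP[m % 255]])
        for k, v in enumerate(term):
            omega[k] ^= v
    return sigma, omega


def chien_forney(sigma, omega, n):
    """Return {position: error value} for positions 0..n-1."""
    s_regs, w_regs = list(sigma), list(omega)
    corrections = {}
    for j in range(n):
        s_sum = odd_sum = 0
        for i, v in enumerate(s_regs):
            s_sum ^= v
            if i & 1:
                odd_sum ^= v            # shared with the Forney denominator
        if s_sum == 0:                  # locator vanishes at alpha^(-j)
            w_sum = 0
            for v in w_regs:
                w_sum ^= v
            corrections[j] = gf_div(w_sum, odd_sum)
        # register update: multiply coefficient i by alpha^(-i)
        s_regs = [gf_mul(v, EXP[(255 - i) % 255]) for i, v in enumerate(s_regs)]
        w_regs = [gf_mul(v, EXP[(255 - i) % 255]) for i, v in enumerate(w_regs)]
    return corrections
```

The corrected symbol at a reported position is the received symbol XORed with the returned error value.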
III. DYNAMICALLY RECONFIGURABLE REED-SOLOMON DECODER BASED ON PROPOSED MSS DATA-PATH ARCHITECTURE

A. Unified Finite-Field PE Design
By investigating the three major modules of the RS decoding procedure (SC, ME, and EC), we can modify each of them into an iterative form. Then, using the three basic operators (SC-FFO, ME-FFO, and EC-FFO), we can construct the system architecture of a parallel RS decoder as shown in Fig. 7(a). All three stages use iterative computation. In each iteration, the RS decoder reads in one symbol for syndrome calculation and outputs one symbol by error correction.
Since the number of SC-FFOs, the number of ME-FFOs, and the number of EC-FFOs in Fig. 7(a) are all $2t$, the parallel RS decoder can be divided into $2t$ horizontal hardware slices. By applying the bit-sliced datapath concept [24], we can define a symbol-sliced datapath. Each slice consists of one SC-FFO, one ME-FFO, and one EC-FFO. Fig. 7(b) shows the unified processing element (PE) design; the PE can perform the SC, ME, and EC operations simultaneously. We can use $2t$ PEs to construct the parallel RS decoding architecture in Fig. 7(a).
B. Dynamically Reconfigurable RS Decoder Design Based on Unified PE
Based on the newly derived Unified-PE, we propose a scalable design methodology to construct a dynamically reconfigurable RS decoder. Fig. 8 shows the system block diagram of a generalized reconfigurable RS decoder. It consists of the $m$-PE array, the data register bank, the datapath units, and the controller. The detailed operations are given below.
1) $m$-PE array: This array consists of $m$ MSS-PEs, which can perform SC operations, ME operations, and EC operations simultaneously. Based on the required system data throughput rate, we can decide the number of PEs within the chip. With more PEs running in parallel, the decoding throughput can be higher.
2) Data register bank: The data register bank includes $2t$ register blocks, based on the error correcting capability $t$. Each register block consists of six symbols of data registers: one for the SC operation, four for the ME operation, and one for the EC operation.
3) Datapath units: The datapath units read the data of the selected register blocks as the input signals of the working PEs, and write the output signals of those working PEs back to the selected register blocks.
4) Controller: The controller is in charge of the address generator and the coefficient generator, which control the datapath access and generate the multipliers' coefficients.
C. Direct-Mapping Data Path RS Architecture
Assume that the number of processing elements is $m$. A direct-mapping reconfigurable architecture of the RS decoder design is shown in Fig. 9. For each PE, this design uses $2t$-to-1 multiplexers to select one register block from all $2t$ register blocks as the inputs of the PE. For each register block, this design uses demultiplexers to select one signal block from the outputs of the $m$ PEs to refresh the data of the register block. The delay of this data-path control is counted in 2-to-1 multiplexer delays, and its size in 2-to-1 symbol-wise multiplexers. In single-PE mode, all register blocks are processed sequentially by one PE, which uses its own $2t$-to-1 symbol-wise multiplexers to choose the data register block to process and writes the result back to the same register block through the output data-path symbol-wise demultiplexers. In 2-PE mode, one group of register blocks is processed sequentially by one PE, and the other register blocks are processed sequentially by a second PE. Other cases are similar to the single-PE and 2-PE modes. This is a very straightforward way to reconfigure the PE array. However, the load is not balanced across the PEs, and the datapath design is complicated. This raises the hardware cost and slows down the processing speed.
D. Proposed Multisymbol-Sliced (MSS) Data-Path Architecture
The direct-mapping architecture always uses the same PEs to process all data. Hence, each PE must be able to choose its input from all $2t$ register blocks at all times. As a result, each PE needs $2t$-to-1 multiplexers at its input, and each register block needs demultiplexers to choose one output signal block from all the processing elements. To simplify the datapath design, we propose the multisymbol-sliced (MSS) datapath architecture. By splitting all $2t$ data register blocks into $m$ groups, each consisting of $2t/m$ data register blocks, we can modify the direct-mapping architecture into a more efficient form: the MSS data-path RS architecture. Fig. 10 shows the proposed dynamically reconfigurable MSS-RS decoder design. The $2t$ register blocks are sliced into $m$ multisymbol groups. For each data register group, we assign one PE to be responsible for all the operations performed on the data register blocks in this group. In single-PE mode, only one PE is in operation at a time, which is similar to the direct-mapping architecture. However, we do not reuse the same PE to process all data. Instead, the PEs operate one after another, and each PE operates $2t/m$ times to process its own data register blocks. In 2-PE mode, only two PEs are active at a time, and every PE still operates $2t/m$ times to process its own data register blocks. Other cases are similar to these two operating modes. The difference between the operating modes is the timing with which the PEs process data, which will be discussed in detail in the next section. For any run-time mode, each PE always operates $2t/m$ times to process its own data register blocks. As a result, each PE only needs $(2t/m)$-to-1 multiplexers to select one register block from its $2t/m$ register blocks as its input signals. The output signals go through a 1-to-$(2t/m)$ demultiplexer to one of the PE's data registers.
Compared with the direct-mapping architecture, the proposed MSS data-path architecture shortens the critical path of the data-path control, counted in 2-to-1 multiplexer delays, and reduces the hardware area cost, counted in 2-to-1 symbol-wise multiplexers, as shown in Table I.

TABLE I. Comparison between the direct-mapping and the proposed MSS data-path architecture.
There are two extreme cases of the MSS data-path architecture. When $m = 2t$, each data group consists of only one register block. The datapath multiplexers are then unnecessary, and the hardware architecture becomes a fully expanded parallel architecture. The other special case happens when $m = 1$. The MSS data-path architecture with only one PE is the same as the single-PE data-multiplexing architecture.
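The effect of slicing on the input-selection logic can be illustrated with a small Python model. It relies only on the general fact that an $N$-to-1 selector built as a balanced tree of 2-to-1 multiplexers needs $N-1$ multiplexers and $\lceil \log_2 N \rceil$ levels. The model covers just the input multiplexer tree of a single PE and is not a reproduction of Table I; the function names and the example values $t = 8$, $m = 4$ are assumptions for illustration.

```python
def mux_tree(n_inputs):
    """(number of 2-to-1 muxes, tree depth) of a balanced n-to-1 selector."""
    if n_inputs < 1:
        raise ValueError("need at least one input")
    return n_inputs - 1, (n_inputs - 1).bit_length()


def per_pe_input_mux(t, m, sliced):
    """Input selector of one PE.

    sliced=False: direct mapping, the PE can reach all 2t register blocks.
    sliced=True:  MSS, the PE reaches only its own 2t/m register blocks.
    """
    blocks = 2 * t
    if blocks % m:
        raise ValueError("m must divide 2t")
    return mux_tree(blocks // m if sliced else blocks)


if __name__ == "__main__":
    for m in (1, 4, 16):
        print(m, per_pe_input_mux(8, m, sliced=False),
              per_pe_input_mux(8, m, sliced=True))
```

In this simplified model the two extreme cases behave as described: with one PE the sliced selector is as large as the direct-mapping one, and with as many PEs as register blocks it disappears.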
IV. OPERATIONS OF THE DYNAMICALLY RECONFIGURABLE REED-SOLOMON DECODER
The proposed MSS data-path structure can be dynamically reconfigured to operate in different modes to meet the requirements of various RS specifications. In this section, we show the detailed operations of the proposed dynamically reconfigurable RS decoding architecture.
A. Dynamically Reconfigurable Property for $p$-PE Mode
Let $p$ denote the number of PEs that are active at the same time in a given runtime mode. Assume the $m$ embedded PEs are $PE_1, PE_2, \ldots, PE_m$, and the $2t$ register blocks are $RB_1, RB_2, \ldots, RB_{2t}$. For any value of $p$, every PE operates just $2t/m$ times to process its own data register blocks. In the $p$-PE runtime mode, it takes $m/p$ steps to complete one iteration, and each step takes $2t/m$ clock cycles. In the first step, $p$ PEs work in parallel, and each of them spends $2t/m$ clock cycles to complete the processing of its own data register blocks. Similarly, in the second step, the next $p$ PEs each spend $2t/m$ clock cycles to complete the processing of their own data register blocks. The steps continue until the $(m/p)$th step, in which the last $p$ PEs each spend $2t/m$ clock cycles to complete the processing of their own data register blocks. Hence, $p$ register blocks are processed by $p$ PEs in every clock cycle, and $2tp/m$ register blocks are processed during each step. Therefore, it takes $(m/p)(2t/m) = 2t/p$ clock cycles to finish processing all $2t$ register blocks in one iteration. After each iteration, the RS decoder delivers one symbol of output data. Thus, the data throughput rate of the $p$-PE mode is $p/(2t)$ symbols per clock cycle.
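The stepwise operation can be sketched as a small scheduling model in Python. Here `t` is the error correcting capability, `m` the number of embedded PEs, and `p` a hypothetical name for the number of PEs active at the same time; the assignment of consecutive register blocks to each PE and the order in which groups of PEs take their turns are assumptions, since only the cycle counts matter for the throughput.

```python
from fractions import Fraction


def mss_schedule(t, m, p):
    """List of clock cycles; each cycle is a list of (pe, register_block).

    Assumes 0-based indices, PE k owning blocks k*(2t/m) .. (k+1)*(2t/m)-1,
    and PEs taking turns in consecutive groups of p.
    """
    blocks = 2 * t
    if blocks % m or m % p:
        raise ValueError("need m | 2t and p | m")
    per_pe = blocks // m                    # operations per PE and iteration
    cycles = []
    for step in range(m // p):              # m/p steps per iteration
        active = range(step * p, (step + 1) * p)
        for k in range(per_pe):             # 2t/m clock cycles per step
            cycles.append([(pe, pe * per_pe + k) for pe in active])
    return cycles


def throughput(t, m, p):
    """Symbols per clock cycle: one output symbol per iteration."""
    return Fraction(1, len(mss_schedule(t, m, p)))


if __name__ == "__main__":
    for p in (1, 2, 4):
        print(p, len(mss_schedule(8, 4, p)), throughput(8, 4, p))
```

For `t = 8` and `m = 4`, the model gives 16, 8, and 4 clock cycles per iteration for 1, 2, and 4 active PEs, so the throughput scales linearly with the number of active PEs.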
B. Operations of Dynamically Reconfigurable RS Decoder Using Four PEs
To demonstrate our design, we use $m = 4$ and $t = 8$ in the following discussion. With four PEs, our design can be dynamically reconfigured to operate in 1-PE (single-PE), 2-PE, and 4-PE (full-run) modes. In the full-run mode, the RS decoder achieves the best data throughput rate. For lower-speed system requirements, we can use the 2-PE or 1-PE mode to deliver the necessary data throughput rate.
1) Full-Run (4-PE) Mode:
Assume the four embedded PEs are $PE_1$, $PE_2$, $PE_3$, and $PE_4$, and the register blocks are $RB_1, RB_2, \ldots, RB_{16}$. In full-run (4-PE) mode, all four PEs are active at the same time. Every PE still needs to operate $2t/m = 4$ times to process its own data register blocks. As shown in Fig. 11, to complete one iteration, which outputs one symbol, the decoder needs to spend $2t/m = 4$ clock cycles processing all the data register blocks. The data throughput rate in full-run mode is 1/4 symbol per clock, which is the best rate that a 4-PE decoder can deliver.
2) Half-Run (2-PE) Mode: In half-run (2-PE) mode, only two PEs operate at the same time, and it takes two steps to complete one iteration and output one symbol. As in the full-run mode, every PE still needs to operate four times to process its own four data register blocks. As shown in Fig. 12, to complete one iteration, the decoder needs to spend eight clock cycles processing all data register blocks. The data throughput rate in half-run mode is 1/8 symbol per clock.
3) Single-PE Mode: In single-PE mode, only one PE is active at a time, and it takes four steps to complete one iteration and output one symbol. As in the full-run mode, every PE still needs to operate $2t/m = 4$ times to process its own data register blocks. As shown in Fig. 13, to complete one iteration, the decoder needs to spend 16 clock cycles processing all data register blocks. The data throughput rate in 1-PE mode is 1/16 symbol per clock, which is the lowest rate the decoder can deliver.
V. PERFORMANCE AND COMPARISON
A. Decision of $m$-PE MSS Data-Path Architecture
1) Theoretical Performance Analysis of $m$-PE Design in Different Run-Time Modes:
Based on the data-path controlling method discussed above, we can analyze the theoretical performance of the $m$-PE design in different run-time modes, as listed in Table II. The data throughput rate of the proposed dynamically reconfigurable RS decoder is directly proportional to the number of PEs active in the selected run-time mode. The latency is inversely proportional to this number. The hardware utilization rate, which is related to the dynamic power consumption, is also proportional to it.
2) Synthesized Data Throughput Rate Analysis: Based on synthesis results using the UMC 0.18-μm CMOS standard-cell library, we can obtain the critical path delay of the proposed MSS data-path architecture. It consists of the register delay, the datapath delay, and the PE delay. Fig. 14 shows the data throughput rates that the $m$-PE RS decoder can achieve for varying numbers of PEs and error correcting capabilities. According to Fig. 14, we can decide a suitable number of PEs for an RS specification with a given error correcting capability.
B. Prototyping RS Decoder Implementation With 4-PE MSS Data-Path Architecture
We implement a dynamically reconfigurable RS decoder with four PEs for $t = 8$. Using the UMC 0.18-μm CMOS standard-cell library, this prototype RS decoder design requires only 24 000 gates and can achieve 560 Mb/s in full-run mode. The layout and specifications are summarized in Fig. 15.
1) Dynamic Power Analysis:
The 4-PE RS decoder prototype can be dynamically reconfigured to the low-power single-PE mode if the system needs a data throughput rate below 140 Mb/s. For systems with higher symbol rates, we can dynamically reconfigure the 4-PE RS decoder to 2-PE mode or 4-PE mode, and spend more power to deliver the higher data throughput rate that the system requires. Fig. 16 shows the data throughput rates and the dynamic power consumption of the 4-PE RS decoder prototype in 1-PE, 2-PE, and 4-PE operating modes. We can see that the dynamic power consumption is a linear function of the data throughput rate and the number of PEs used. Furthermore, by using the gated-clocking scheme, we can enhance the dynamic power saving by 28%-56% in the different PE modes.
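The linear relation between power and throughput can be checked with a few lines of Python arithmetic on the three operating points reported for the prototype (140 Mb/s at 18.91 mW, 280 Mb/s at 28.77 mW, and 560 Mb/s at 48.47 mW). The sketch is only a reading aid; the function names are hypothetical, and the derived quantities (energy per bit, incremental slope) are computed here rather than taken from the paper.

```python
# Reported operating points of the 4-PE prototype:
# active PEs -> (throughput in Mb/s, power in mW)
MODES = {1: (140.0, 18.91), 2: (280.0, 28.77), 4: (560.0, 48.47)}


def energy_per_bit_nj(pe_count):
    """Energy per decoded bit; mW divided by Mb/s equals nJ per bit."""
    rate, power = MODES[pe_count]
    return power / rate


def incremental_slope(pe_low, pe_high):
    """Extra mW spent per extra Mb/s between two modes."""
    (r0, p0), (r1, p1) = MODES[pe_low], MODES[pe_high]
    return (p1 - p0) / (r1 - r0)


if __name__ == "__main__":
    for pe in sorted(MODES):
        print(pe, round(energy_per_bit_nj(pe), 4), "nJ/bit")
    print(round(incremental_slope(1, 2), 4), round(incremental_slope(2, 4), 4))
```

The two incremental slopes come out nearly equal (about 0.070 mW per Mb/s), which is what a linear power-versus-throughput relation predicts, and the energy per bit falls as more PEs are activated because a roughly constant share of the power is amortized over more bits.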
2) Applications of the Prototyping Design: For Digital Video Broadcasting-Terrestrial (DVB-T), Digital Versatile Disc (DVD)-10X, and Compact Flash (CF) Card-100X systems, the required data throughput rates are lower than 140 Mb/s (see the table in Fig. 17). In these cases, we can configure the prototype RS decoder to the 1-PE operating mode to achieve low power consumption. On the other hand, the data throughput rates for Memory Stick (MS) Card and Extreme Digital (XD) Card systems are 160 and 256 Mb/s, respectively. Hence, the 2-PE mode (280 Mb/s) is enough to support the MS/XD card rates. A Fast Flash Disk (FFD) system needs a data throughput rate of 360 Mb/s. We can dynamically reconfigure the RS decoder to the 4-PE operating mode (560 Mb/s) to support this high-performance requirement. Fig. 17 summarizes the data throughput rates of the 4-PE RS decoder in the 1-PE, 2-PE, and 4-PE operating modes, and the requirements of some systems using an RS code with $t = 8$.
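The mode choice described above amounts to picking the smallest number of active PEs whose throughput covers the system requirement. A minimal Python sketch, using the three prototype throughput figures and the MS, XD, and FFD rates quoted in the text (the function and constant names are hypothetical):

```python
# Throughput of the 4-PE prototype per operating mode, in Mb/s.
MODE_RATE_MBPS = {1: 140, 2: 280, 4: 560}


def select_mode(required_mbps):
    """Smallest number of active PEs that meets the required throughput."""
    for pe_count in sorted(MODE_RATE_MBPS):
        if MODE_RATE_MBPS[pe_count] >= required_mbps:
            return pe_count
    raise ValueError("requirement exceeds the full-run throughput")


if __name__ == "__main__":
    for name, rate in (("MS", 160), ("XD", 256), ("FFD", 360)):
        print(name, rate, "Mb/s ->", select_mode(rate), "PE(s)")
```

Choosing the smallest sufficient mode keeps the remaining PEs idle, so their clocks can be gated.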
3) Performance Comparison:
The NPE value describes how much data throughput rate a VLSI design can deliver per unit of power consumption. From Table III, we can see that the NPE of our design is also higher than those of other existing programmable or reconfigurable solutions in [19], [20]. Note that [20] has a better NAE value than our design, but its NPE value is only 10% of that of our design. On the other hand, the ASIC-type pipelined recursive RS architecture [22], [23] is very area-efficient with a high data throughput rate (6160 Mb/s). Hence, it has a very high NAE value. However, this solution is not a reconfigurable design (also, we cannot obtain its NPE value for a fair comparison). Such a high-speed design may be an over-design for low-data-rate applications, such as voice/MP3 decoding, where power saving matters. In contrast, our design can be reconfigured at run time to modes with fewer PEs to deliver the necessary (no more, no less) throughput rate at high power efficiency and area efficiency.
VI. CONCLUSION
In this paper, we have developed a dynamically reconfigurable RS decoding architecture based on the multisymbol-sliced processing element. Our proposed $m$-PE dynamically reconfigurable RS decoder has good flexibility to trade off between the data throughput rate and the dynamic power consumption. For that reason, the proposed architecture is very suitable for the ECC unit of the memory system in portable devices. Moreover, we implement a 4-PE prototype RS decoder to demonstrate the effectiveness of our MSS-RS VLSI architecture.
