Reed-Solomon (RS) codes are widely used in a variety of communication systems. Continual demand for ever higher data rates makes it necessary to devise very high-speed implementations of RS decoders. In this paper, a uniform comparison is drawn between the various algorithms and architectures proposed in the literature, which helped in selecting the appropriate architecture for the intended application. The dual-line architecture of the modified Berlekamp-Massey algorithm was chosen for the final design. Using PcCMOS12corelib, the design has an area of $0.22\,\mathrm{mm}^2$ and a throughput of 1.6 Gbps. The design dissipates only 17 mW of power in the worst case, including memory, when operating at a 1.0 Gbps data rate.
Introduction to Reed Solomon
Reed-Solomon codes are perhaps the most commonly used codes for forward error correction (FEC) in all forms of transmission and data storage. The basic idea of FEC is to systematically add redundancy at the end of each message so that the message can be retrieved correctly despite errors in the received sequence. This eliminates the need to retransmit messages over a noisy channel. RS codes are linear block codes and a subset of the Bose-Chaudhuri-Hocquenghem (BCH) codes. Reference [1] is one of the best references for RS codes.
An $RS(n, k)$ code implies that the encoder takes in $k$ symbols and adds $n - k$ parity symbols to make an $n$-symbol code word. Each symbol has at least $m$ bits, where $2^m > n$. Conversely, the longest code word for a given symbol size $m$ is $2^m - 1$ symbols. For example, the $RS(255, 239)$ code takes in 239 symbols and adds 16 parity symbols to make 255 symbols overall, of 8 bits each. When a code word arrives at the receiver, it is often not the same as the one transmitted, since noise in the channel introduces errors. If $r(x)$ is the received code word, we have

$$r(x) = c(x) + e(x)$$

where $c(x)$ is the original code word and $e(x)$ is the error introduced in the system. The aim of the decoder is to find the vector $e(x)$ and then subtract it from $r(x)$ to recover the code word that was transmitted. There are two aspects of decoding: error detection and error correction. Errors can be corrected only if at most $t = (n - k)/2$ symbols are in error. However, the Reed-Solomon algorithm still allows one to detect that there are more than $t$ errors. In such cases, the code word is declared uncorrectable.
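The relationships between the code parameters above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `rs_parameters` and the dictionary keys are illustrative choices, not part of the original design.

```python
def rs_parameters(n, k):
    """Derive basic quantities of an RS(n, k) code from n and k."""
    parity = n - k                 # number of parity symbols added by the encoder
    t = parity // 2                # number of symbol errors that can be corrected
    m = n.bit_length()             # smallest symbol size m (in bits) with 2^m > n
    return {
        "parity_symbols": parity,
        "t": t,
        "symbol_bits": m,
        "max_codeword_length": 2 ** m - 1,
    }


params = rs_parameters(255, 239)
print(params)
```

For $RS(255, 239)$ this gives 16 parity symbols, $t = 8$ and 8-bit symbols, and the code word length equals the maximum $2^m - 1 = 255$.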
The basic decoder structure is shown in Figure 2. A detailed explanation of Reed-Solomon decoders can be found in [1] and [2]. The decoder essentially consists of four modules. The first module computes the syndrome polynomial from the received sequence. The second module uses the syndromes to solve a key equation, which yields two polynomials for determining the locations and values of the errors in the received code word. The third module, the Chien search, uses the Error Locator Polynomial obtained from the second module to compute the error locations, while the fourth module employs the Forney algorithm to determine the error values. The correction block merely adds the values obtained from the output of the Forney block to the delayed code word from the FIFO block. Note that in Galois field arithmetic, addition and subtraction are equivalent.
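The first and third modules described above can be illustrated in software. The sketch below is not the hardware design of this paper: it assumes the field $GF(2^8)$ is built from the primitive polynomial 0x11D and that the generator polynomial has roots $\alpha^1, \dots, \alpha^{2t}$, neither of which is stated in the text. All function names are hypothetical.

```python
PRIM_POLY = 0x11D   # assumed primitive polynomial: x^8 + x^4 + x^3 + x^2 + 1
N = 255             # code word length for 8-bit symbols

# EXP[i] = alpha^i; the table is doubled so that products need no modulo.
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
value = 1
for power in range(N):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(N, 2 * N):
    EXP[power] = EXP[power - N]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, point):
    """Horner evaluation; coeffs[j] is the coefficient of x^j."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, point) ^ c   # field addition is XOR
    return acc


def syndromes(received, two_t=16):
    """Syndromes S_i = r(alpha^i) for i = 1..2t (assumed root convention)."""
    return [poly_eval(received, EXP[i]) for i in range(1, two_t + 1)]


def chien_search(locator):
    """Return positions j for which alpha^(-j) is a root of the locator."""
    return [j for j in range(N) if poly_eval(locator, EXP[(N - j) % N]) == 0]


received = [0] * N        # the all-zero word is a code word of any linear code
received[10] ^= 0x37      # inject one symbol error; adding the error is an XOR
synd = syndromes(received)
print(any(synd))                      # non-zero syndromes flag the error
print(chien_search([1, EXP[10]]))     # locator 1 + alpha^10 x has its root at position 10
```

Because addition and subtraction coincide in $GF(2^m)$, the correction step is the same XOR that injected the error: `received[10] ^= 0x37` applied a second time restores the original symbol.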
Channel Model
Before we proceed to the actual decoder implementation, it is important to look at the channel model itself. Since UWB (Ultra Wide Band) is not yet well explored, it is important to analyse how the channel would behave at the frequency and data rate under consideration. One of the most common models for transmission over land mobile channels is the Gilbert-Elliott (GE) model. In this model, a channel can be in either a good state or a bad state, depending on the signal-to-noise ratio (SNR) at the receiver, and the probability of error differs between the two states. In [3], Ahlin presented a way to match the parameters of the GE model to the land mobile channel, an approach that was generalized in [4]. Figure 3 shows the GE channel model. The two states, $G$ and $B$, indicate the good and the bad state respectively. The figure also shows the transition probability from the good state to the bad state and that from the bad state to the good state. The probability of error in states $G$ and $B$ is denoted by $P(G)$ and $P(B)$ respectively. A detailed analysis can be found in [5] and [6].
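The two-state behaviour described above can be sketched as a small simulation. This is only an illustration of the Gilbert-Elliott idea, not the Mathematica model used in this work; the parameter names `p_good_to_bad`, `p_bad_to_good`, `err_good` and `err_bad` are hypothetical, and any values passed to them are placeholders rather than fitted channel parameters.

```python
import random


def simulate_gilbert_elliott(n_symbols, p_good_to_bad, p_bad_to_good,
                             err_good, err_bad, start_bad=False, seed=0):
    """Return a list of booleans, True where a symbol is received in error."""
    rng = random.Random(seed)
    bad = start_bad
    errors = []
    for _ in range(n_symbols):
        # The error probability depends only on the current state.
        p_err = err_bad if bad else err_good
        errors.append(rng.random() < p_err)
        # Decide whether the channel changes state before the next symbol.
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        elif rng.random() < p_good_to_bad:
            bad = True
    return errors


# Placeholder parameters, chosen only to exercise the function.
pattern = simulate_gilbert_elliott(255, 1e-4, 1e-3, 0.001, 0.1)
print(sum(pattern), "symbol errors in one 255-symbol code word")
```

With very small transition probabilities, a code word almost always stays in the state in which it began, which is why errors under this model tend to arrive in bursts.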
Simulation
The following parameters were set for the simulation of the Ultra Wide Band channel: carrier frequency = 4.0 GHz and information rate = 480 Mbps. Two sets of simulations were run with different threshold settings. The threshold here signifies the SNR level at which the channel changes state. The first set was run with the threshold 5 dB below the average SNR and the second with the threshold 10 dB below the average. Because of the very high bit rate involved, the transition probabilities are very small. Channel transitions therefore become very rare events, and the simulations determined the error probabilities for code words beginning in a given state. These were then weighted by the steady-state probability of the corresponding state and added together to obtain the overall error probability. Two measures, the bit error rate and the symbol error rate, are computed and plotted. The simulation was run for 10,000 code words to get a good estimate for each state. Mathematica was used to solve the mathematical equations and obtain the channel model parameters for the physical quantities under consideration.

Figure 4: The symbol error rate and bit error rates for different thresholds.
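The weighting step described above can be written out explicitly. The sketch below uses the standard steady-state probabilities of a two-state Markov chain; the function and argument names are hypothetical and the numbers are placeholders, not results from the simulations reported here.

```python
def steady_state(p_good_to_bad, p_bad_to_good):
    """Steady-state probabilities (good, bad) of a two-state Markov chain."""
    total = p_good_to_bad + p_bad_to_good
    return p_bad_to_good / total, p_good_to_bad / total


def overall_error_probability(p_good_to_bad, p_bad_to_good,
                              err_start_good, err_start_bad):
    """Weight the per-state error probabilities by the steady-state probabilities."""
    pi_good, pi_bad = steady_state(p_good_to_bad, p_bad_to_good)
    return pi_good * err_start_good + pi_bad * err_start_bad


# Placeholder values: the channel spends 75% of the time in the good state.
print(steady_state(0.01, 0.03))
print(overall_error_probability(0.01, 0.03, 0.001, 0.1))
```

Here `err_start_good` and `err_start_bad` stand for the error probabilities estimated from code words that begin in the good and the bad state respectively.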
Simulation Results
As can be seen from Figure 4, the error probabilities decrease as the SNR increases. The figure shows the observed symbol and bit error probabilities. As expected, on a logarithmic scale the error rates fall linearly with increasing SNR. At an average SNR of around 20 dB, the symbol error rate for both thresholds is about 0.02, which corresponds to an average of 5 symbol errors in a code word of 255 symbols. From these results, an error correction capability of 8 is seen as a good choice, since when the SNR is above 20 dB, the likelihood of more than 8 errors in a code word of 255 symbols is very low.
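The arithmetic behind this choice can be explored with a short calculation. The sketch below assumes that symbol errors are independent with a fixed probability, which is a simplification: the Gilbert-Elliott channel is bursty, so the figures it produces are only a rough guide and are not results from this paper. The function name is hypothetical.

```python
from math import comb


def prob_more_than(t, n, p):
    """P(more than t symbol errors among n symbols), assuming independent errors."""
    within = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(t + 1))
    return 1.0 - within


n = 255
for p in (0.02, 0.01, 0.005):
    mean_errors = n * p
    print(f"p = {p}: mean errors = {mean_errors:.2f}, "
          f"P(> 8 errors) = {prob_more_than(8, n, p):.3e}")
```

At a symbol error rate of 0.02 the mean is $255 \times 0.02 \approx 5.1$ errors per code word, and the tail probability falls quickly as the symbol error rate drops below that value.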
Architecture Design Options
Having decided on the code word, an investigation was carried out to determine an appropriate algorithm and architecture. Figure 5 shows the various architectures available.
Design Decisions
In order to choose a good architecture for the application, several criteria have to be taken into account.
Gate count: Determines the silicon area needed for the design. This is a one-time production cost, but it can be critical if it is too high.
Latency: Latency is defined as the delay between the received code word and the corresponding decoded code word. The lower the latency, the smaller the FIFO buffer required; latency therefore also determines the silicon area to a large extent.
Critical path delay: Determines the minimum clock period, i.e. the maximum frequency at which the system can be operated.
Table 1 shows a summary of all the above-mentioned parameters. For our intended UWB application, speed is of prime concern, as the decoder has to support data rates as high as 480 Mbps, and perhaps even 1 Gbps in the near future. At the same time, power has to be kept low, as the decoder is also to be used in portable devices. This implies that the amount of hardware active at any time should be kept low. The overall latency and the gate count of the computational elements should also be low, since they determine the total silicon area of the design.
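The link between latency and FIFO size noted in the criteria above can be made concrete. The sketch below assumes that one 8-bit symbol enters the decoder per clock cycle, so the FIFO must hold every symbol that arrives while a code word is being decoded. The latency figures of 284 and 355 cycles are the ones quoted in the Benchmarking section; the function name is hypothetical.

```python
def fifo_size_bits(latency_cycles, symbol_bits=8, symbols_per_cycle=1):
    """Storage needed to delay the received symbols by the decoder latency."""
    return latency_cycles * symbols_per_cycle * symbol_bits


for cycles in (284, 355):
    print(cycles, "cycles ->", fifo_size_bits(cycles), "bits of FIFO storage")
```

Under this assumption, every cycle of latency saved removes one symbol of buffer storage.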
Key Equation Solver
The reformulated inversion-less and the dual-line implementations of the modified Berlekamp-Massey (BM) algorithm have the smallest critical path delay among all the alternatives for the Key Equation Solver. Of these two, the dual-line implementation is a good compromise between latency and the number of computational elements needed. Its latency is one of the lowest, and it has the smallest critical path delay of all the architectures summarized. Thus, the dual-line implementation of the BM algorithm was chosen for the key equation solver. Another benefit of this architecture is that the design is very regular and hence easy to implement.
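To illustrate what the key equation solver computes, the following is a software sketch of an inversion-less Berlekamp-Massey iteration. It is not the dual-line hardware architecture chosen here: it only shows how the error locator can be obtained without any field division, at the cost of the result being scaled by a non-zero constant (which leaves its roots unchanged). The primitive polynomial 0x11D, the syndrome convention $S_i = r(\alpha^i)$ and all names are assumptions.

```python
PRIM_POLY = 0x11D   # assumed primitive polynomial for GF(2^8)
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements via log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, point):
    """Horner evaluation; coeffs[j] is the coefficient of x^j."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, point) ^ c
    return acc


def error_syndromes(errors, two_t=16):
    """Syndromes S_1..S_2t produced by a dict {position: error value}."""
    out = []
    for i in range(1, two_t + 1):
        s = 0
        for pos, val in errors.items():
            s ^= gf_mul(val, EXP[(i * pos) % 255])
        out.append(s)
    return out


def inversionless_bm(synd):
    """Return (locator, L); locator is a non-zero scalar multiple of the ELP."""
    cur, prev = [1], [1]                 # current and previous candidates
    length, shift, prev_disc = 0, 1, 1
    for n in range(len(synd)):
        # Discrepancy between the predicted and the actual syndrome.
        disc = 0
        for i in range(min(length, len(cur) - 1) + 1):
            disc ^= gf_mul(cur[i], synd[n - i])
        if disc == 0:
            shift += 1
            continue
        # new = prev_disc * cur + disc * x^shift * prev   (no division needed)
        new = [0] * max(len(cur), shift + len(prev))
        for i, c in enumerate(cur):
            new[i] = gf_mul(prev_disc, c)
        for i, c in enumerate(prev):
            new[i + shift] ^= gf_mul(disc, c)
        if 2 * length <= n:
            prev, prev_disc = cur, disc
            length, shift = n + 1 - length, 1
        else:
            shift += 1
        cur = new
    return cur[:length + 1], length


errors = {3: 0x12, 200: 0xAB}
locator, num_errors = inversionless_bm(error_syndromes(errors))
roots = [j for j in range(255) if poly_eval(locator, EXP[(255 - j) % 255]) == 0]
print(num_errors, roots)
```

The final list comprehension plays the role of the Chien search: position $j$ is reported when $\alpha^{-j}$ is a root of the locator, so the scaling constant introduced by avoiding division has no effect on the positions found.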
RS Code
As we can see from Table 1, the hardware requirement for the entire block is a function of $t$, the error correction capability, and the latency is a function of both $n$ and $t$. Thus, while we want a code with a high error correction capability, we cannot have a very high value of $t$, as the hardware needed is proportional to it. The value of $n$ determines the bit-width of the symbol and therefore the hardware needed, but only logarithmically. However, one would want to have $n = 2^m - 1$, to derive maximum benefit from the hardware. The value of $t$ is often chosen to be a power of 2 in order to maximise hardware utilisation. Taking into account the results of the channel model simulation, $RS(255, 239)$ is chosen, since it has an error correction capability of 8.
Highlights

Design Flow
The first step was to develop a C model of the decoder. The gcc compiler was used to compile the code and to check that it worked correctly. With the aid of an example, the output of each intermediate stage was compared with the output expected according to the algorithm.
Once the algorithm was fully developed and tested in C, the VHDL code was developed. The VHDL code was structured so that it could be synthesized easily. A wrapper was written around it in order to test it. This VHDL code was compiled and tested using Cadence tools. Ncsim was used to simulate the system and generate the output stream for the same input tests as were used for testing the C code. The output streams from VHDL and C were then compared.
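The comparison of the two output streams can be automated with a small script. The sketch below is an illustration only: the file format (one hexadecimal symbol per line) and the function names `read_symbols` and `first_mismatch` are assumptions, not a description of the scripts actually used in this work.

```python
from itertools import zip_longest


def read_symbols(path):
    """Read one hexadecimal symbol per non-empty line (assumed file format)."""
    with open(path) as handle:
        return [int(line, 16) for line in handle if line.strip()]


def first_mismatch(reference, candidate):
    """Return the index of the first differing entry, or None if the streams match."""
    for index, (ref, cand) in enumerate(zip_longest(reference, candidate)):
        if ref != cand:
            return index
    return None


# In-memory example; in practice both lists would come from read_symbols().
c_model_output = [0x00, 0x1F, 0xA5, 0x3C]
vhdl_output = [0x00, 0x1F, 0xA4, 0x3C]
print(first_mismatch(c_model_output, vhdl_output))
```

Reporting the index of the first mismatch, rather than a plain pass or fail, makes it easier to trace a discrepancy back to the decoder stage that produced it.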
Once the outputs were found to match for various input test cases, synthesis experiments were started. Ambit from Cadence was used to analyse the hardware usage and the frequency of operation under various optimisation settings.
The design flow used for verification of the synthesized design and for power estimation is shown in Figure 6. As shown in the figure, the core VHDL modules were optimised and synthesized using Ambit. The synthesized model was written out as a Verilog netlist by Ambit itself. Once the netlist was obtained, it was compiled using ncvlog into the work library together with the technology library. The library used was for the same technology as the one used for synthesis. The wrapper modules were written in VHDL, while the compiled core came from Verilog. Thus, to allow interaction between the two, the top interface of the work library was extracted into a VHDL file and then compiled into the work library. This was done using ncshell and ncvhdl respectively. After this, the wrapper modules were compiled into the work library.
From this point onwards, two approaches were used. Ncelab and ncsim were used purely for simulating the synthesized design, while dncelab and dncsim were used to obtain the power estimate. The latter are essentially the same tools, but include the DIESEL routines for estimating the power dissipated in the design. DIESEL is an internal tool developed within Philips that estimates the power for the simulated design, and hence the accuracy of the results depends on the input provided.
Results
This section covers the results of various synthesis experiments conducted. Resource utilization, timing analysis and the power consumption were used as benchmarking parameters. 
Area Analysis
Ambit was run with the libraries PcCMOS12corelib and PcCMOS18corelib. The silicon area required was analysed for various timing constraints. A comparison of the decoder area is shown in Table 3. This table shows the area requirement when the timing constraint was set to 5 ns, which supports a clock frequency of 200 MHz, i.e. 1.6 Gbps. The total number of design cells used, including the memory, was 12,768 for PcCMOS18corelib and 12,613 for PcCMOS12corelib.
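The relation between the timing constraint, the clock frequency and the data rate quoted above can be reproduced with a short calculation. The sketch assumes that the decoder processes one 8-bit symbol per clock cycle, which is consistent with the figures in the text; the function names are hypothetical.

```python
def max_clock_mhz(clock_period_ns):
    """Maximum clock frequency supported by a given clock period."""
    return 1000.0 / clock_period_ns


def throughput_gbps(clock_mhz, symbol_bits=8):
    """Data rate when one symbol is processed per clock cycle."""
    return clock_mhz * symbol_bits / 1000.0


def required_clock_mhz(rate_mbps, symbol_bits=8):
    """Clock frequency needed to sustain a given data rate."""
    return rate_mbps / symbol_bits


print(max_clock_mhz(5.0), "MHz for a 5 ns constraint")
print(throughput_gbps(max_clock_mhz(5.0)), "Gbps at that clock")
print(required_clock_mhz(480.0), "MHz needed for 480 Mbps")
```

The same relation gives the 1 Gbps operating point at 125 MHz that is used for the power estimates.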
Power Analysis
The power estimates provided in this section are for operation at 125 MHz, which translates to a data rate of 1 Gbps. Figure 7 shows the variation of power with the number of errors found in the code word for PcCMOS12corelib. As can be seen from the graph, the power dissipated in the FIFO and in the syndrome computation block is independent of the number of errors, as expected. For the block that computes the ELP (Error Locator Polynomial) and the EEP (Error Evaluator Polynomial), the power dissipated increases linearly with the number of errors. The Chien search block also shows a linear increase in the power dissipated. The behaviour of the Forney evaluator differs somewhat from that of the other modules. The power dissipated for a code word with an even number of errors is not significantly larger than that for a code word with one error fewer. The reason is that the degree of the EEP for a code word with one error is often the same as for one with two errors, and so on. As a general rule, however, the power dissipation still increases, because some computation is done for each error found. Figure 8 shows the distribution of power when the received code word contains the maximum number of correctable errors, while Figure 9 shows the distribution when the code word is received intact. In the case of no errors, the bulk of the power, apart from the memory, is consumed in computing the syndromes. When the maximum number of errors is detected, the Forney block consumes the most power.
Variation With Number of Errors
Distribution of Power in Different Modules
Benchmarking
Please note that the $RS(255, 239)$ code has been used for benchmarking all the designs. The design using the modified Euclidean algorithm is very hardware intensive. The design proposed in [7] uses roughly 115K gates, excluding memory, in $0.13\,\mu\mathrm{m}$ CMOS technology operating at 6 Gbps. The proposed design uses only 12K cells, including memory, in both $0.12\,\mu\mathrm{m}$ and $0.18\,\mu\mathrm{m}$ technology. The results are better even when they are normalised for throughput and technology. The latency of the design is only 284 cycles, compared to 355 cycles in [7].
In terms of power, a low-power design was proposed by Chang in [9]. That design uses 62 mW of power in the best case, including memory, in $0.2[\ldots]\,\mu\mathrm{m}$ CMOS technology, and consumes 100 mW in the worst case. Our design uses only 17 mW of power in the worst case in $0.12\,\mu\mathrm{m}$ technology. The area of the chip proposed in [9] using $0.2[\ldots]\,\mu\mathrm{m}$ CMOS technology is $[\ldots]\,\mathrm{mm}^2$, while the area of the proposed design is $0.22\,\mathrm{mm}^2$ in $0.12\,\mu\mathrm{m}$ technology.
Conclusions
A uniform comparison was drawn between the various algorithms that have been proposed in the literature. This helped in selecting the appropriate architecture for the intended application. The modified Berlekamp-Massey algorithm was chosen for the VHDL implementation. The dual-line architecture was used, which is as fast as the serial approach and has a latency as low as that of the parallel approach.
