Abstract
Introduction
This work demonstrates the implementation of two decoders applying the Soft Output Viterbi Algorithm (SOVA) [1]. These decoders can be employed as soft-input-soft-output (SISO) decoders for a Turbo coded system. Figure 1 shows an example that comprises a serial concatenation of an 8-state (11,13) convolutional encoder with an enhanced partial response class-4 (EPR4) channel.
To achieve throughputs in line with current trends in magnetic recording systems, a fully unrolled and pipelined architecture [2] is needed. Complexity then grows linearly with the number of iterations, which limits practical systems to about three or four iterations. Each SOVA decoder outputs 7-bit sign-magnitude values that consist of one decoded bit (hard output) and a 6-bit unsigned log-likelihood ratio (soft output). The soft output is based on the difference in path metric between the two most-likely (ML) paths, α and β, that trace back to complementary bit decisions, x and x̄, as shown in Figure 2. Both decoders have the same architecture, but are matched to different generator polynomials. The inner and outer decoders are named SOVA_EPR4 and SOVA_11_13, respectively, to indicate the particular type of convolutional code that each is used to decode.
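The 7-bit output word described above can be illustrated with a minimal sketch. The bit layout (hard decision in bit 6, magnitude in bits 5 to 0), the saturation at 63, and the function names are assumptions made for illustration; the text does not specify them.

```python
def pack_soft_output(hard_bit: int, reliability: int) -> int:
    """Pack a hard decision and a reliability into a 7-bit sign-magnitude word.

    Assumed layout (not given in the text): bit 6 holds the hard decision,
    bits 5..0 hold the unsigned reliability, saturated to 6 bits.
    """
    if hard_bit not in (0, 1):
        raise ValueError("hard_bit must be 0 or 1")
    magnitude = max(0, min(reliability, 63))  # saturate to the 6-bit range
    return (hard_bit << 6) | magnitude


def unpack_soft_output(word: int) -> tuple:
    """Split a 7-bit word back into (hard_bit, reliability)."""
    return (word >> 6) & 1, word & 0x3F
```

For example, `unpack_soft_output(pack_soft_output(1, 10))` recovers the pair `(1, 10)`, while any reliability above 63 is clipped to 63.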
The arithmetic computation and system architecture of the SOVA decoder are discussed in Section 2. Section 3 presents the micro-architectural analysis of the add-compare-select (ACS) structures, which are traditionally the speed bottleneck of Viterbi decoder designs. Continuing the emphasis on high throughput rates, Section 4 describes the use of deeply pipelined mechanisms for the traceback, equivalence detection, and comparison of competing path metrics. Finally, Section 5 discusses the design flow as well as testing methodologies and results.
SOVA Decoder Architecture
The SOVA decoder outputs the log-likelihood of a correctly decoded bit. This value is given by the difference between the path metrics of the two most-likely (ML) paths that trace back to complementary bit decisions, x and x̄. Figure 2 shows that the ML path, α, is determined using the Viterbi algorithm with an L-step traceback. This is followed by another M-step traceback that resolves the next ML path, β, based on maximal probability of a deviation from α.
It is assumed that the absolute values of the path metrics, M_α and M_β, dominate over those of other paths, such that the probability of selecting β over α (i.e. the wrong decision) is given by (1). The log-likelihood of a correct output by the SOVA decoder is given in (2).
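The bodies of (1) and (2) are not reproduced in this text. For orientation only, a common textbook form of the two-path approximation is sketched below. It assumes that path metrics are negative log-probabilities (smaller is better), so that M_β ≥ M_α; the sign convention and notation are assumptions and may differ from the authors' equations.

```latex
\begin{align*}
P(\text{wrong decision}) &\approx \frac{e^{-M_\beta}}{e^{-M_\alpha} + e^{-M_\beta}}
  = \frac{1}{1 + e^{M_\beta - M_\alpha}} \\
\log\frac{1 - P(\text{wrong decision})}{P(\text{wrong decision})} &= M_\beta - M_\alpha
\end{align*}
```

Under this convention the reliability of a decision is simply the path metric difference, which is why the decoder only needs to track differences rather than probabilities.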
The system architecture of this implementation is shown in Figure 3. The branch metric generator, compare-select-add (CSA), and L-step survivor memory unit (SMU) form the building blocks of a conventional Viterbi decoder. The CSA is a retimed and transformed version of the more common add-compare-select (ACS) structure, and provides a higher throughput rate at a lower area cost than traditional loop unrolling methods [5] [6] [8]. There are eight copies of the CSA, in accordance with the targeted eight-state convolutional codes.
In addition to providing a path decision at each iteration, the CSA of the SOVA decoder is also required to output the difference in path metric between the two ML paths. The path decisions and the metric differences are cached into an array of L-step FIFO buffers. The delayed signals are used in the M-step path-equivalence detector (PED) to determine the similarity between each pair of competing decisions obtained through a j-step traceback, for j = 1, 2, ..., M. Finally, the output decisions from the SMU are used to select the delayed metric difference and the equivalence outputs corresponding to the most-likely state. These signals are input to a reliability measure unit (RMU), which outputs the minimum path metric difference reflecting complementary bit decisions, x and x̄.
Add-Compare-Select Structures
The throughputs of SOVA decoders have traditionally been limited by the implementation of the add-compare-select (ACS) structure, due to a single-step recursion that prevents pipelining.
Previous high-throughput implementations of the Viterbi decoder [5] [6] [8] resorted to unrolling of the ACS loop. These methods increase the critical path delay, but improve the overall throughput. However, in a soft-output Viterbi decoder implementation, loop unrolling incurs additional overhead: the register-exchange and reliability measure units must be modified to handle the doubled symbol rate.
These considerations render loop unrolling an unsuitable technique.
In this design, the critical path constraint was eased through a process of retiming and transformation of the ACS recursions. The sequence of operations in a single stage of the pipeline is reordered as compare-select-add (CSA). This is followed by a transformation [3] [4] that moves the add operation ahead of the select operation, such that the compare and add operations are executed in parallel. This modification decreases the critical path delay at the cost of doubling the number of adders and multiplexers.
The adders in the transformed CSAs have flat input data-arrival profiles, and permit datapath synthesis to achieve the fastest adder implementation. The critical delay of the resulting structure (Figure 4) is reduced by 42% at the expense of a 22% increase in the overall area of the decoder.
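The retiming and transformation described above can be checked with a small behavioral model. The sketch below is not the authors' circuit: it assumes an eight-state shift-register trellis (next state = ((s << 1) | u) & 7), random integer branch metrics, and a minimum-metric convention, all chosen for illustration. It shows that a CSA stage, which compares the two registered candidates while speculatively adding the next branch metrics to both, produces the same path metrics as a conventional ACS stage.

```python
import random

N_STATES = 8  # eight-state trellis, as in the text


def acs_step(pm, bm):
    """Conventional ACS: add, then compare, then select (all in one stage)."""
    new_pm = []
    for ns in range(N_STATES):
        u = ns & 1                        # input bit that leads into state ns
        p0, p1 = ns >> 1, (ns >> 1) | 4   # the two predecessor states
        c0 = pm[p0] + bm[(p0, u)]         # add
        c1 = pm[p1] + bm[(p1, u)]
        new_pm.append(c1 if c1 < c0 else c0)  # compare and select
    return new_pm


def csa_step(cand, bm_next):
    """Transformed CSA: compare the registered candidates while adding the
    next branch metrics to BOTH of them, then select the surviving sums."""
    pm = []
    nxt = [[0, 0] for _ in range(N_STATES)]
    for s in range(N_STATES):
        c0, c1 = cand[s]
        take1 = c1 < c0                   # compare (parallel with the adds)
        pm.append(c1 if take1 else c0)    # selected path metric of state s
        for u in (0, 1):
            s0 = c0 + bm_next[(s, u)]     # speculative add for candidate 0
            s1 = c1 + bm_next[(s, u)]     # speculative add for candidate 1
            ns = ((s << 1) | u) & 7
            nxt[ns][s >> 2] = s1 if take1 else s0  # select after the add
    return pm, nxt


def run_both(n_steps, seed=0):
    """Run ACS and CSA on the same random branch metrics; return both traces."""
    rng = random.Random(seed)
    bms = [{(s, u): rng.randint(0, 15) for s in range(N_STATES) for u in (0, 1)}
           for _ in range(n_steps + 1)]
    pm, acs_trace = [0] * N_STATES, []
    for n in range(n_steps):
        pm = acs_step(pm, bms[n])
        acs_trace.append(pm)
    # Initial CSA candidates: zero path metrics plus the first branch metrics.
    cand = [[bms[0][(ns >> 1, ns & 1)], bms[0][((ns >> 1) | 4, ns & 1)]]
            for ns in range(N_STATES)]
    csa_trace = []
    for n in range(n_steps):
        pm_sel, cand = csa_step(cand, bms[n + 1])
        csa_trace.append(pm_sel)
    return acs_trace, csa_trace
```

In this model each CSA stage performs four additions per state instead of two, which mirrors the doubling of adders mentioned above, while the compare no longer waits for an addition to finish.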
SMU, PED and RMU structures
The two ML paths are determined by a dual-traceback function, achieved by cascading the SMU with a combination of the PED and RMU.
The SMU is implemented with a high-speed pipelined register exchange, shown in Figure 5. It outputs a decision relating to the most likely state î(n) after a latency of L cycles. This method avoids the complexity of designing specialized SRAM blocks with complex read/write control mechanisms, which was required by previous implementations [5], [7]. It has been shown in studies of the Viterbi decoder that SRAM-based traceback has a costly throughput overhead due to the need to access multiple memory pointers [8]. The larger memory blocks required for storage of the 7-bit soft metrics, as opposed to single-bit decisions in the Viterbi decoder, result in further delays for memory access. In addition, the retiming and transformation of the ACS unit, traditionally the speed bottleneck of Viterbi decoder designs, has improved the critical delay by 42%. This imposes a more stringent throughput requirement on the traceback operation.
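A register exchange can be modeled behaviorally in a few lines. The sketch below is an illustration under stated assumptions, not the pipelined circuit of Figure 5: it assumes an eight-state shift-register trellis in which the bit associated with state ns is its least-significant bit, and it stores each survivor as a fixed-length list with the oldest bit first. The function name and data layout are hypothetical.

```python
N_STATES = 8  # eight-state trellis, as in the text


def register_exchange_step(regs, decisions):
    """Advance every survivor register by one trellis step.

    regs[s]       -- survivor bits for state s, oldest bit first (fixed length)
    decisions[ns] -- 0 selects predecessor ns >> 1, 1 selects (ns >> 1) | 4
    """
    new_regs = []
    for ns in range(N_STATES):
        p0 = ns >> 1
        p1 = p0 | 4
        survivor = regs[p1] if decisions[ns] else regs[p0]
        # Copy the chosen survivor, drop its oldest bit, append the new bit.
        new_regs.append(survivor[1:] + [ns & 1])
    return new_regs
```

With registers of length L, the oldest entry `regs[s][0]` of the most likely state plays the role of the decision that emerges after L cycles. Every register is rewritten each step, which trades storage and wiring for the absence of any memory pointer accesses.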
The PED is a modified register exchange (Figure 6) that provides Boolean outputs, EQ_{i,j}(n) for j = 1, 2, ..., M, indicating the equivalence between the two competing decisions obtained through a j-step traceback from state i.
From CSA_i, the difference between the two path metrics, ∆_i(n), arriving at time n, state i, is retained. The output from the SMU selects ∆_î(n) and EQ_{î,j}(n), which correspond to the values along the ML path, as inputs to the RMU.
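The equivalence test performed by the PED can be illustrated with a simplified behavioral sketch. It assumes that a survivor bit register is available for every state (oldest bit first) and that the two competing paths into state ns arrive from predecessors ns >> 1 and (ns >> 1) | 4; it ignores the L-step delay and the pipelining of the actual design. The function name and the mapping of list positions to the traceback depth j are hypothetical.

```python
def path_equivalence(regs, ns):
    """Return one flag per stored bit: 1 where the two paths competing at
    state ns carry the same decision, 0 where their decisions differ."""
    p0 = ns >> 1          # first competing predecessor (assumed trellis)
    p1 = p0 | 4           # second competing predecessor
    return [int(a == b) for a, b in zip(regs[p0], regs[p1])]
```

Positions flagged 0 are the ones where choosing the competing path would have flipped the decoded bit, so only those positions need their reliability updated.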
The RMU consists of comparators and multiplexers in a pipeline that selects the minimum ∆(n) along the ML path. It is initialized with the maximum possible reliability measure, represented as "∞" in Figure 3. Based on the EQ input, each pipelined section outputs one of the following:
EQ = 1: Reliability measure from the previous step
EQ = 0: Min{∆_î, previous reliability measure}
Compared with a Viterbi decoder implementation, the total size of the SMUs is approximately doubled (L = M). The RMU overhead consists of M copies of 1 register, 2 multiplexers and a 2-input comparator. The latency through the SOVA decoder is L + M. Both decoder implementations use L = M = 15, which is five times the constraint length of the convolutional code. The additional latency remains insignificant compared to the overall latency in the Turbo-SOVA system, which is dominated by the latency through the interleavers.
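The RMU selection rule given above can be written as a short behavioral sketch. It models the chain of pipelined sections as a loop over the (∆, EQ) pairs seen by one output bit. The constant 63 is an assumed stand-in for "∞", chosen as the largest 6-bit value; the function name and argument layout are hypothetical.

```python
MAX_RELIABILITY = 63  # assumed stand-in for "∞": the largest 6-bit value


def reliability_measure(deltas, eq_flags, init=MAX_RELIABILITY):
    """Apply the RMU rule stage by stage.

    deltas[k]   -- metric difference along the ML path seen by stage k
    eq_flags[k] -- 1 if the competing decisions agree at stage k, else 0
    """
    if len(deltas) != len(eq_flags):
        raise ValueError("deltas and eq_flags must have the same length")
    reliability = init
    for delta, eq in zip(deltas, eq_flags):
        if not eq:  # EQ = 0: keep the smaller of delta and the running value
            reliability = min(delta, reliability)
        # EQ = 1: pass the previous reliability through unchanged
    return reliability
```

For instance, `reliability_measure([5, 3, 9], [1, 0, 0])` evaluates to 3, since only the second and third stages see differing decisions and the smaller of their differences is kept.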
Design Flow and Testing
The design of the SOVA decoders uses an automated design flow for direct-mapping of signal processing algorithms into integrated circuits [8] . This automated flow was further enhanced for high-speed design through customization of the clock tree to achieve low clock skews.
The functionality of the chip has been verified with a 1.8 V supply at 25°C.
Throughput rates above 500 Mb/s were achieved and power dissipation was 400 mW. The speed characterization was performed using a clock tree with a built-in delay line. The speed and power performance for one of the SOVA decoders is plotted in Figure 9. The power measurements were performed at the highest frequencies permitted by the supply voltage. Table 1 summarizes the characteristics of the decoders.
Conclusion
The design of a 500 MHz soft-output Viterbi decoder has been described. It can be employed as a soft-input-soft-output (SISO) decoder for Turbo code systems. Architectural transformation of the add-compare-select structures and modification of the register exchange allow a high throughput with small area overhead.
In addition to magnetic recording applications, the SOVA decoder is also appropriate for Turbo-coded forward error correction applications in wireless, wireline, and optical communication systems.
Figure 9. Performance of SOVA_EPR4 decoder (horizontal axis: Vdd (V)).
Acknowledgements

