In this paper we present an efficient Turbo decoder implementation with minimal performance loss. Starting from the non-optimal Max-Log-MAP algorithm, we modify the algorithm through a-priori scaling to improve BER to within 0.25 dB (in Eb/No) of the optimal decoder. All computations are performed as integers, avoiding the complexity and logic requirements of fixed and floating-point arithmetic. By combining integer computations with a design based on a single "sliding-window" decoder and a zero-delay interleaver, the final FPGA implementation achieves 1 Mbps while consuming less than 600 mW.
INTRODUCTION
Turbo coding has shown great promise, with near-channel-capacity performance achieved using long block sizes and numerous decoding iterations [1]. However, the enormous computational complexity required for the decoder has thus far limited the application of these codes. Nevertheless, several upcoming communication systems (e.g. IMT-2000 and MILSTAR Advanced EHF) have chosen to utilize the coding gain of Turbo codes.
In order to truly utilize Turbo codes, one must find a decoder implementation that is efficient in terms of both BER and computational resources. This is particularly true in wireless systems and in any portable application where size and power are key limitations. The goal of reduced size and power requires simplification of the decoding algorithm as well as a careful and compact hardware implementation.
This work is sponsored by the Department of the Air Force under contract F19628-00-C-0002. Opinions, interpretations, and conclusions are those of the author and not necessarily endorsed by the United States Government.
The optimal Turbo decoding algorithm is Log-MAP [2]. Its complexity arises from a modification of the traditional Viterbi Add-Compare-Select (ACS) operation. In Log-MAP, the Compare-Select has been replaced by $\max^*(x,y) = \ln(e^x + e^y)$. Through some manipulation, the $\max^*$ function can be expressed as $\max^*(x,y) = \max(x,y) + \ln(1 + e^{-|x-y|})$. From this alternate form of $\max^*$, two forms of simplification arise. The first form of simplification is to pre-compute the natural logarithm term and store the results in a look-up table [2]. However, many tables are required for the design to be robust over a range of SNR. This is less complex than the straightforward Log-MAP algorithm, but still too computationally intense for a low-power design. Along the same lines is the Linear-Log-MAP algorithm [3]. Here the term $\ln(1 + e^{-|x-y|})$ is approximated by a linear function. Linear-Log-MAP accounts for the fact that the entire term is approximately 0 when $|x-y|$ is greater than 3 or 4. This achieves near-optimal performance and reduced complexity.
In order to truly simplify the Turbo decoding algorithm, we come to the second class of simplifications to the $\max^*$ function. Algorithms such as Max-Log-MAP [2] approximate $\max^*(x,y)$ with $\max(x,y)$. Although this results in approximately 0.5 dB of performance loss, the computational requirements are greatly reduced. In addition, by slightly modifying the Max-Log-MAP algorithm, we are able to gain back a portion of the 0.5 dB loss while still realizing a tremendous reduction in computational complexity.
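The three variants of the max* operation discussed above can be compared side by side. The following is a minimal Python sketch, not the hardware implementation; the function names are hypothetical, and the slope and cutoff of the linear approximation are illustrative assumptions rather than the parameters used in [3].

```python
import math

def max_star(x: float, y: float) -> float:
    """Exact Log-MAP operator ln(e^x + e^y), in its max-plus-correction form."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def max_star_linear(x: float, y: float, cutoff: float = 4.0) -> float:
    """Linear approximation of the correction term (cutoff is an assumed value)."""
    d = abs(x - y)
    # Straight line from ln(2) at d = 0 down to 0 at d = cutoff, clamped to 0 beyond.
    correction = max(0.0, math.log(2.0) * (1.0 - d / cutoff))
    return max(x, y) + correction

def max_star_maxlog(x: float, y: float) -> float:
    """Max-Log-MAP approximation: the correction term is dropped entirely."""
    return max(x, y)
```

The correction term never exceeds ln 2 (about 0.69), which is reached when x equals y, so the Max-Log-MAP error per operation is bounded by that amount.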
Although many fixed and floating-point decoder implementations have been presented [3] [4] [5] [6], these implementations appear to be too complex. Although less complex than floating-point, fixed-point arithmetic still requires normalization. We constrain ourselves to 2's complement integer representations for all calculations and metric storage. While we still must consider the dynamic range of our numbers, integer normalization does not require multiplication or division.
In addition, we have further simplified our decoding process by implementing a single decoding engine rather than a traditional two-decoder design. Although this will be further explored in Section 2.4, this simplification allows us to effectively reduce the hardware resources by a factor of two.
For reference, the specifications of the coding scheme presented are as follows:
- Code Rate 1/2
- Constituent code (23,35) octal
- Odd/Even Puncturing
- Block Size 640 with no trellis termination
- S=18 S-Random Interleaver
IMPLEMENTATION SPECIFICS

Turbo Decoder Basics
Traditional Turbo decoders utilize two decoding engines separated by a set of interleavers and de-interleavers. Each decoder inputs a set of noisy codewords $\{y_k^1, \ldots, y_k^{1/R}\}$ (R is the code rate) as well as an a-priori value $La_k$. After some processing interval, the decoder produces soft decisions, $Lu_k$, and extrinsic information, $Le_k$. The extrinsic information output from one decoder is passed through an interleaver or de-interleaver and becomes the a-priori input to the next decoder. As shown in Figure 1, $\{Le\}_{D1}$, the extrinsic output from decoder 1, is passed through an interleaver and becomes $\{La\}_{D2}$, the a-priori input to decoder 2. Similarly, $\{Le\}_{D2}$ is de-interleaved and becomes $\{La\}_{D1}$ for the next decoding iteration. This process of exchanging information is what produces the coding gain present in Turbo coding. Because each decoder has a different view of the data, it can offer insight to the other decoder not otherwise possible. This exchange of information continues for a number of iterations, at which point the soft decisions are limited to form the final hard decision. More generally, this process is known as iterative decoding.
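The exchange of extrinsic information described above can be sketched as a simple loop. This is a structural illustration only: `siso` is a hypothetical soft-in/soft-out decoder callable returning a (soft decisions, extrinsic values) pair, puncturing is ignored, and the convention that a positive soft value maps to bit 1 is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (systematic, parity, a_priori) -> (soft, extrinsic)
Siso = Callable[[Sequence[int], Sequence[int], Sequence[int]],
                Tuple[List[int], List[int]]]

def turbo_decode(y_sys: Sequence[int], y_par1: Sequence[int],
                 y_par2: Sequence[int], perm: Sequence[int],
                 siso: Siso, iterations: int) -> List[int]:
    n = len(y_sys)
    # Decoder 2 sees the systematic symbols in interleaved order.
    y_sys_perm = [y_sys[perm[i]] for i in range(n)]
    la_d1 = [0] * n      # no a-priori knowledge before the first pass
    soft = [0] * n
    for _ in range(iterations):
        _, le_d1 = siso(y_sys, y_par1, la_d1)
        la_d2 = [le_d1[perm[i]] for i in range(n)]   # interleave {Le}D1 -> {La}D2
        soft_d2, le_d2 = siso(y_sys_perm, y_par2, la_d2)
        for i in range(n):                           # de-interleave D2 outputs
            la_d1[perm[i]] = le_d2[i]
            soft[perm[i]] = soft_d2[i]
    # Limit the final soft decisions to hard decisions (assumed sign convention).
    return [1 if s > 0 else 0 for s in soft]
```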
For a more complete theoretical explanation of the Max-Log-MAP algorithm, please refer to [2] or [9].
Modified Max-Log-MAP
In our modified Max-Log-MAP decoder, we carefully examine the computation of the branch metric. The equation is as follows:
In the first part of the equation, the incoming noisy codewords Yk are correlated against the j possible codewords, Xk, generated by the encoder. The second part of the equation represents the a-priori knowledge about this transition from all previous decoders. For the first decoder of the first iteration, the a-priori is set to 0, as we do not have any a-priori knowledge. Essentially we are combining past decoding results with the current decoding results to form a new result.
By adding a scaling constant in front of either term, we can control the strength of the current result against the past result. For example, if we were to multiply the codeword correlation by a number greater than 1, intuitively we are saying that the results from the previous decoders are less important than the current decoder's results. Similarly, if we multiply by a number less than 1, we are stating that the previous decoder's result should be valued more than the current result. Hence we arrive at the following equation:
We have added a scaling constant R to the a-priori information. As verified in [7], if SNR is constant and R is swept from 0 to 2, thus varying the importance of the a-priori information, a wide range of BER is observed. As Figure 2 shows, when 0.45 <= R <= 0.6 the BER is improved by an order of magnitude. By setting R = 0.5, we are able to gain back 0.35 dB of the approximately 0.5 dB loss incurred by using Max-Log-MAP instead of the optimal Log-MAP.
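The branch metric equations themselves are not reproduced in this text, so the following Python sketch rests on assumptions: it takes the two-term form described above (a correlation of the received symbols with a candidate codeword, plus an a-priori term), uses antipodal values in {-1, +1} for codeword symbols and the hypothesized information bit, and realizes R = 0.5 as an arithmetic right shift to stay in integer arithmetic. The function and parameter names are hypothetical.

```python
from typing import Sequence

def branch_metric(y: Sequence[int], x: Sequence[int], u: int, la: int,
                  scale_shift: int = 1) -> int:
    """Integer branch metric with a scaled a-priori term (assumed form).

    y: received integer symbols for this trellis step
    x: candidate codeword symbols, each -1 or +1
    u: hypothesized information bit, -1 or +1
    la: integer a-priori value from the previous decoder
    scale_shift: right shift applied to la; 1 corresponds to R = 0.5
    """
    correlation = sum(yj * xj for yj, xj in zip(y, x))
    scaled_la = la >> scale_shift  # arithmetic shift, rounds toward minus infinity
    return correlation + u * scaled_la
```

With `scale_shift=0` the a-priori value is used unscaled (R = 1), which corresponds to the unmodified combination of the two terms.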
Although the exact reason for the BER improvement cannot be fully explained, [7] [9] suggest that R mitigates the propagation of errors from iteration to iteration. Normally, a cluster of errors passed to the next decoder via extrinsic information dominates the branch metric calculations in that decoder. More generally, the codeword correlation is not strong enough to correct an extreme error in the a-priori information, so the error propagates.
Sliding-Window Decoding
As explained in [1] [2] [9], Log-MAP decoding is broken into two main components, forward recursion and backward recursion. Forward recursion steps through the trellis from bit 0 to bit N-1, producing node metrics representing the logarithmic probabilities that the encoder was in a given state. Backward recursion has two tasks. First, it calculates node metrics like forward recursion, but steps its way through the trellis from bit N-1 to bit 0. Second, it combines the probability of arriving at a node via forward recursion with the probability of arriving at the next node via backward recursion into a complete path metric. The complete path metrics for a bit may be combined to form a soft decision of the transmitted bit.
Unfortunately [8] .
With sliding-window processing, the backward recursion unit is split into a backward training metric generator, and a backward soft-decision generator (shown in Figure 3 ).
Figure 3. Sliding-window backward recursion (panel labels: Backward Recursion, Soft-Decision Recursion)
According to [8], after 5 to 6 times the constraint length, backward metrics are considered reliable. This allows us to decode the code block in segments rather than as a whole. Conveniently, the sliding-window block was chosen to be 32 bits long for constraint length 5. The result is that the decoder only needs to store 64 bits' worth of forward node metrics and branch metrics rather than 640 bits' worth. This reduces the memory requirements by a factor of 10.
In sliding-window processing, the backward soft-decision recursion unit is exactly the same as the backward recursion unit in traditional Log-MAP processing. The backward training metric generator is a stripped-down version of the backward soft-decision generator, as it only contains the node-metric generation logic. Figure 4 shows a brief example of how sliding-window processing works. L is equal to 5 to 6 times the constraint length. As the forward recursion unit works from L to 2L, the backward training metric generator works from 3L back to 2L. When the two units meet, the backward training metric generator passes its node metrics to the soft-decision generator. The soft-decision generator then continues from 2L back to L, producing soft decisions, while the backward training metric generator processes 4L down to 3L. This process is repeated until the entire decoder block is finished.
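The window schedule described above can be written out explicitly. The sketch below is illustrative and uses hypothetical names; it assumes the block size is a multiple of the window length, and it simply leaves the training unit idle at the end of the block, since the text does not say how the final window is initialized. Each span is given as (start, end) in processing order, so backward spans run from the higher index to the lower one.

```python
from typing import Iterator, NamedTuple, Optional, Tuple

Span = Optional[Tuple[int, int]]

class Step(NamedTuple):
    forward: Span    # forward recursion, low index to high index
    training: Span   # backward training metrics, high index to low index
    soft: Span       # backward soft-decision recursion, high index to low index

def sliding_window_schedule(block_size: int = 640, window: int = 32) -> Iterator[Step]:
    n_windows = block_size // window

    def fwd(w: int) -> Span:
        return (w * window, (w + 1) * window) if 0 <= w < n_windows else None

    def bwd(w: int) -> Span:
        return ((w + 1) * window, w * window) if 0 <= w < n_windows else None

    # While the forward unit handles window t, the training unit handles
    # window t + 1 and the soft-decision unit handles window t - 1.
    for t in range(n_windows + 1):
        yield Step(forward=fwd(t), training=bwd(t + 1), soft=bwd(t - 1))
```

For L = 32, the second step has the forward unit on 32 to 64 and the training unit on 96 back to 64, matching the L to 2L and 3L to 2L example in the text.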
Single Decoding Engine
As shown in Figure 1, the traditional Turbo decoder utilizes two decoding engines separated by interleavers and de-interleavers. Because most Turbo codes utilize some sort of random interleaver for performance reasons, we cannot make any assumption about when bits will be ready. For example, if a block or convolutional interleaver were used, we might be able to say that after D1 runs for N bits we can start D2. This is a nearly impossible assumption to make with random interleavers.
As a result, we simplify the decoder in Figure 1 to the decoder shown in Figure 5.
Zero-Delay Interleaving
Another aspect of the decoder design is its zero-delay interleaving. Turbo decoding requires interleavers on the inputs of D2 and a de-interleaver on the output of D2 in order to give D2 an alternate view of the data stream.
One way to accomplish this is to have separate interleaving and de-interleaving units that require some processing time to complete. With this approach, upon the completion of D1, its output would be interleaved and then fed into D2. Similarly, when D2 was complete, its output would be de-interleaved and fed back into D1. However, this approach would consume at least N additional clock cycles for interleaving and an additional N clock cycles for de-interleaving. Because additional processing time translates to increased power consumption, it is desirable to limit the time required by these functions.
The time required to interleave and de-interleave can effectively be reduced to zero if this process is performed by address manipulation. This is illustrated in Figure 6 .
Figure 6. Address-based Interleaving
In the address manipulation approach, the base address generator simply counts from 0 to N-1, where N is the block size of the code. If we are in the non-permuted decoder, D1, this count is simply delayed by a cycle and then used to read data out of input memory. However, if we are in D2, the permuted decoder, the count is used to read an address from an Index ROM containing the interleaver sequence. The result from the ROM is then used to read from the input memory. As for de-interleaving, we simply pass the multiplexed address to the backward soft-decision unit, where the address is used to store the result. Figure 7 illustrates the two interleaving schemes. In the first illustration, we see the traditional approach with distinct interleaving and de-interleaving stages. Although the implementation may be efficient, it will still require some finite processing interval. However, when we examine the second illustration, representing interleaving by address manipulation, we do not see any distinct interleaving blocks. Because the interleaving/de-interleaving process has been combined with the normal process of retrieving data from memory, we are able to replace at least N cycles of interleaving delay by a single clock cycle of delay.
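The addressing scheme can be modeled in a few lines of Python. This is a behavioral sketch with hypothetical names, not the RTL: `process` stands in for the whole decoding datapath and is applied per sample only to keep the example short, and the one-cycle pipeline delay is not modeled.

```python
from typing import Callable, List, Sequence

def read_address(count: int, permuted: bool, index_rom: Sequence[int]) -> int:
    """Multiplex between the raw counter (D1) and the Index ROM output (D2)."""
    return index_rom[count] if permuted else count

def decoder_pass(input_mem: Sequence[int], index_rom: Sequence[int],
                 permuted: bool, process: Callable[[int], int]) -> List[int]:
    output_mem = [0] * len(input_mem)
    for count in range(len(input_mem)):       # base address generator: 0 .. N-1
        addr = read_address(count, permuted, index_rom)
        # Reading through the ROM address interleaves the data; storing the
        # result at the same address de-interleaves it with no extra pass.
        output_mem[addr] = process(input_mem[addr])
    return output_mem
```

In permuted mode the data is consumed in interleaved order, yet the results land in natural order in the output memory, which is why no separate de-interleaving stage is needed.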
At first this modification may seem minor. However, consider the case where interleaving and de-interleaving each require N clock cycles. If the decoder runs for I iterations, the traditional approach spends I*N cycles interleaving and I*N cycles de-interleaving, whereas the address manipulation scheme spends 2*I cycles interleaving and zero cycles de-interleaving. For the decoder presented here with 5 iterations, this is a savings of approximately 6,390 clock cycles.
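The savings figure can be checked with a short calculation. The helper below is a hypothetical illustration of the cycle counts stated above, using N = 640 and 5 iterations.

```python
from typing import Tuple

def interleaving_cycles(block_size: int, iterations: int) -> Tuple[int, int, int]:
    """Return (traditional, address_based, savings) in clock cycles."""
    # Traditional: N cycles to interleave plus N to de-interleave, per iteration.
    traditional = 2 * iterations * block_size
    # Address manipulation: 2 cycles of delay per iteration, none to de-interleave.
    address_based = 2 * iterations
    return traditional, address_based, traditional - address_based

print(interleaving_cycles(640, 5))  # (6400, 10, 6390)
```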
Metric Normalization
As shown in [2] and [3], forward/backward node metrics accumulate as recursion progresses. Because finite precision is used, the node metrics can easily overflow or underflow. This is especially catastrophic with 2's complement integer representations, where an overflow results in a negative number and an underflow results in a positive number.
The overflow/underflow problem can be solved by metric normalization. For each decoded bit in both forward and backward recursion, all node metrics are compared with a threshold T. If the absolute value of any node metric is greater than T, then all node metrics are shifted towards the center, as shown in Figure 8 . Note that the shifting is done for all node metrics so that soft-decision values [10] are not affected. The threshold value is chosen such that the update (the node metric plus a branch metric) does not cause overflow and the logic required for comparison is minimized. In our implementation, the thresholds are set to be -95 and 96. With normalization the node metrics range from -128 to 127, which can be represented by 8 bits.
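A behavioral sketch of this normalization is given below. The thresholds are the ones quoted above, but the amount of the shift is an assumption: here all metrics are recentred around the midpoint of their current range, whereas a hardware design may well subtract a fixed constant instead. The function name is hypothetical.

```python
from typing import List, Sequence

def normalize(metrics: Sequence[int], lo: int = -95, hi: int = 96) -> List[int]:
    """Shift all node metrics by a common offset if any lies outside [lo, hi]."""
    if all(lo <= m <= hi for m in metrics):
        return list(metrics)              # nothing to do this trellis step
    # Assumed recentring rule: move the midpoint of the metric range to zero.
    offset = (max(metrics) + min(metrics)) // 2
    # The same offset is removed from every metric, so their differences,
    # and hence the soft decisions derived from them, are unchanged.
    return [m - offset for m in metrics]
```

For example, `normalize([100, 90, 80, 70])` returns `[15, 5, -5, -15]`: every pairwise difference is preserved while all values move back inside the 8-bit range.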
RESULTS
The Turbo Decoder was designed as a Verilog RTL description and implemented on a XILINX XCV600E-7 FPGA. The design utilized approximately 2900 out of 15550 Logic Elements (LEs) and approximately 24 out of 72 RAM blocks. The interleaver ROM is included and there are no external storage or processing components required.
The design requires 1350 clock cycles per iteration with a maximum clock rate of approximately 25 MHz. When run at its normal operating point of 7 decoding iterations, this equates to approximately 1.7 megabits per second. At 1 Mbps, the decoder consumes less than 600 mW. Figure 9 shows the BER performance of our all-integer FPGA Turbo decoder. Our 8-level integer-based Turbo decoder even outperforms floating-point Max-Log-MAP, thanks to the a-priori scaling factor. Compared to the optimal decoder (floating-point Log-MAP), the performance loss is less than 0.25 dB at BER = $10^{-5}$.
Figure 9. BER performance of the all-integer FPGA Turbo decoder (a-priori factor 0.5)
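The throughput figure follows from the clock rate, the cycles per iteration, and the block size. The check below is a hypothetical helper that assumes each decoded block delivers 640 information bits.

```python
def throughput_bps(clock_hz: float, cycles_per_iteration: int,
                   iterations: int, block_size: int) -> float:
    """Decoded bits per second for a decoder that processes one block at a time."""
    blocks_per_second = clock_hz / (cycles_per_iteration * iterations)
    return blocks_per_second * block_size

# 25 MHz clock, 1350 cycles per iteration, 7 iterations, 640-bit blocks
rate = throughput_bps(25e6, 1350, 7, 640)
print(f"{rate / 1e6:.2f} Mbps")  # 1.69 Mbps
```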
CONCLUSION
This paper presented the implementation of an integer Turbo decoder with minor performance loss. The efficient implementation comes from algorithm modification, integer arithmetic, and hardware management. Based on the Max-Log-MAP decoding algorithm, we modify the branch metric by weighting the a-priori value, resulting in a significant BER improvement.
By performing all computations as integers, the complexity of fixed and floating-point arithmetic was avoided. By analyzing the data flow through the interleavers, we were able to simplify the traditional two-decoder design to a single decoder, realizing a 45% savings in area. Interleaving was accomplished through address manipulation, eliminating latency and resulting in higher throughput and lower power consumption. Finally, a simple metric normalization scheme was implemented to eliminate the possibility of overflow or underflow.
In summary, the FPGA design utilizes approximately 2900 XILINX Logic Cells and consumes approximately 600 mW to achieve a throughput of more than 1 Mbps. With 3-bit (8-level) input, our integer-based Turbo decoder outperforms floating-point Max-Log-MAP, and is only 0.25 dB away from the optimal floating-point Log-MAP.
