Abstract-At present, the main challenge for hardware implementation of turbo decoders is to achieve the high data rates required by current and future communication system standards. To address this challenge, a low-complexity radix-16 SISO decoder for the Max-Log-MAP algorithm is proposed in this paper. Based on the elimination of parallel paths in the radix-16 trellis diagram, architectural solutions that reduce the hardware complexity of the different blocks of a SISO decoder are detailed. Moreover, two complementary techniques are introduced to overcome the BER/FER performance degradation observed when turbo decoders are built on the proposed SISO decoder. With these techniques, a penalty lower than 0.05 dB is observed for an 8-state binary turbo code with respect to a traditional radix-2 turbo decoder for 6 decoding iterations.
I. INTRODUCTION
The adoption of turbo codes [1] in the latest generation of wireless communication systems, such as 3GPP-LTE and WiMAX, has prompted research towards the design of high-throughput Turbo Decoder (TD) architectures. In next-generation standards, throughput requirements are significantly higher. For instance, the LTE-Advanced standard targets data rates around 1 Gbit/s. Thus, the design of low-complexity, high-throughput TD architectures becomes a major challenge. The decoding of turbo codes is carried out through an iterative process in which two Soft-Input Soft-Output (SISO) decoders, operating in the natural and interleaved domains, exchange extrinsic information. The inherently recursive nature of the MAP-based algorithms (Log-MAP or Max-Log-MAP [2]) implemented by each SISO decoder in the TD is an important issue that limits the TD throughput. To overcome this constraint, radix-$2^{N_T}$ architectures have been proposed [3]-[7]. In these architectures, $N_T$ transitions in the trellis diagram are performed per clock cycle.
Most of the works in the literature that propose radix-$2^{N_T}$ architectures are concerned with high throughput, and only a few explore hardware complexity reduction. A radix-16 ACS unit that decomposes the compare operation among 16 branches into two levels is presented in [6]. The hardware overhead is minimized thanks to a two-dimensional ACS unit architecture. In [7], a radix-16 modified Log-MAP algorithm and a two-stage compare-and-select architecture that further decrease the latency and processing power are detailed. The main novelty of our work is the design of a radix-16 SISO architecture based on a radix-8 ACS unit for the Max-Log-MAP algorithm. Since the ACS unit is the bottleneck of the decoder, a reduction of the SISO critical path with respect to the architectures in [6] and [7] is possible. Thus, the SISO decoder throughput can be improved. Furthermore, a hardware complexity reduction of all the block units that form the SISO decoder is proposed. Two complementary techniques are also introduced to avoid TD BER/FER performance degradation.
The rest of the paper is organized as follows. In Section II, the proposed radix-16 SISO decoder architecture is presented and each block unit of the designed SISO decoder is detailed. Then, a comparison in terms of BER/FER performance of different versions of the turbo decoder is given in Section III. Synthesis results are summarized in Section IV. Finally, Section V concludes the paper.
II. PROPOSED RADIX-16 SISO DECODER ARCHITECTURE
The 8-state binary turbo code compliant with the LTE standard has been selected to illustrate our proposals. Let $S = \{s^{(0)}, \ldots, s^{(7)}\}$ be the set of convolutional encoder states, and let $m \in \{0, 1\}$ denote the value of each of the flip-flops that form the convolutional encoder. Fig. 2 depicts the state transition $s^{(i)} \to s^{(j)}$ in the trellis diagram of a radix-16 ACS unit ($N_T = 4$). For this radix value, two paths (parallel paths) are possible from any state at time $k$ to any state at time $k + 4$. Let these two paths be indexed by $b \in \{0, 1\}$, path $b$ being due to the systematic bit sequence $d^{(b)}$.
As pointed out in [8], parallel paths should be eliminated prior to the ACS unit. Thus, it is possible to replace a radix-16 ACS unit by a less complex radix-8 ACS unit in which each pair of states $(s^{(i)}, s^{(j)})$, at time $k$ and $k + 4$ respectively, is connected by only one path. The branch metric value for this path is then $\gamma^{\max}_{i,j} = \max(\gamma^{(0)}_{i,j}, \gamma^{(1)}_{i,j})$, where $\gamma^{(b)}_{i,j}$ is the branch metric of parallel path $b$. This simplification does not affect the result of the ACS unit operation. Furthermore, since a radix-8 ACS unit has a shorter critical path than a radix-16 ACS unit, the clock frequency of the corresponding decoder can be increased. The systematic and redundant bits that define the parallel paths in the transition $s^{(i)} \to s^{(j)}$ are given by (1).
A. Branch Metrics Unit Architecture
From (1), for any pair of parallel paths, $d^{(0)}_{k+1} = d^{(1)}_{k+1}$ and $r^{(0)}_{k+2} = r^{(1)}_{k+2}$ because they are independent of $b$. Therefore, the channel values corresponding to the bits $d_{k+1}$ and $r_{k+2}$ do not affect the choice of the maximum branch metric in the radix-16 ACS unit transition $s^{(i)} \to s^{(j)}$. The maximum branch metric $\gamma^{\max}_{i,j}$ of the two parallel paths can then be computed as follows. First, a partial term $\xi_{i,j}$, due to the bits that differ between the two paths, is calculated for $b = 0$. If $\xi_{i,j} < 0$, then $\gamma^{(1)}_{i,j}$ is the maximum branch metric and the retained partial term is $-\xi_{i,j}$. Otherwise, the maximum branch metric is $\gamma^{(0)}_{i,j}$ and the retained partial term is $\xi_{i,j}$. Afterwards, $\gamma^{\max}_{i,j}$ is obtained by adding to the retained partial term, i.e. $|\xi_{i,j}|$, the terms corresponding to the bits $d_{k+1}$ and $r_{k+2}$. The proposed BMU architecture is given in Fig. 3. Note that only 16 values $\xi_{i,j}$ have to be computed for all the 64 state transitions.
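The parallel-path structure described above can be checked numerically. The following Python sketch is illustrative only and is not the BMU of Fig. 3: it assumes the LTE constituent encoder polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3), a bipolar bit mapping (0 to +1, 1 to -1) and a correlation-type branch metric, none of which are spelled out in this section. All function names are hypothetical. The input `llr` is assumed to hold four systematic values (channel plus a-priori) followed by four redundant channel values.

```python
from itertools import product

def step(state, d):
    """One trellis step of the assumed LTE constituent encoder."""
    m1, m2, m3 = state
    a = d ^ m2 ^ m3      # feedback bit (assumed g0 = 1 + D^2 + D^3)
    r = a ^ m1 ^ m3      # redundant bit (assumed g1 = 1 + D + D^3)
    return (a, m1, m2), r

def radix16_paths():
    """Map (start, end) state pairs to their 4-step paths as 8-bit tuples d + r."""
    paths = {}
    for start in product((0, 1), repeat=3):
        for d in product((0, 1), repeat=4):
            state, r = start, []
            for bit in d:
                state, rk = step(state, bit)
                r.append(rk)
            paths.setdefault((start, state), []).append(d + tuple(r))
    return paths

def metric(bits, llr, idx):
    """Correlation metric over positions idx, with 0 -> +1 and 1 -> -1 (assumed)."""
    return 0.5 * sum((1 - 2 * bits[n]) * llr[n] for n in idx)

def gamma_max(pair, llr):
    """Maximum branch metric of two parallel paths via the partial term xi."""
    path0, path1 = pair
    differ = [n for n in range(8) if path0[n] != path1[n]]
    shared = [n for n in range(8) if path0[n] == path1[n]]
    xi = metric(path0, llr, differ)   # partial term for b = 0; it is -xi for b = 1
    return abs(xi) + metric(path0, llr, shared)

def xi_patterns(paths):
    """Distinct patterns of the differing bits, up to a global complement."""
    seen = set()
    for path0, path1 in paths.values():
        bits = tuple(b0 for b0, b1 in zip(path0, path1) if b0 != b1)
        seen.add(min(bits, tuple(1 - b for b in bits)))
    return seen

PATHS = radix16_paths()
```

Under these assumptions, every state pair is connected by exactly two paths whose only shared bits are $d_{k+1}$ and $r_{k+2}$, and the six differing bits take 16 distinct patterns up to a global sign change, which agrees with the count of 16 values $\xi_{i,j}$ noted above.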
This BMU only costs 53% of the hardware resources of a conventional implementation. A low-complexity adder-sharing BMU is proposed in [7]. It requires a memory to store 128 branch metrics. With our BMU architecture, this memory is halved (64 $\gamma^{\max}_{i,j}$ values). Moreover, the BMU in [7] needs additional registers that increase the latency, since more cycles are required to share the adders. Our architecture avoids this disadvantage.
B. Radix-8 ACS Unit Architecture
Four values are then selected and sent to the radix-4 compare-select (CS) unit, which produces the final state metric value. In order to reduce the ACS unit critical path, we have applied the modulo normalization technique presented in [9]. Thus, state metric normalization blocks have been removed. In addition, we have adopted the radix-4 CS architecture presented in [5], which has been optimized for throughput.
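Modulo normalization lets state metrics wrap around in fixed-width two's-complement arithmetic, so that no explicit normalization block is needed. The Python sketch below illustrates the general principle only, not the circuit of [9]. The function names are hypothetical, and the 12-bit default width is taken from the radix-16 state metric width quoted in Section III. The comparison is correct as long as the true difference between two metrics stays below half the representable range.

```python
def wrap(x, bits=12):
    """Reduce x to a signed two's-complement value of the given width."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x & (1 << (bits - 1)) else x

def mod_greater(a, b, bits=12):
    """True if metric a exceeds metric b, both stored modulo 2**bits."""
    # The sign of the wrapped difference gives the true ordering.
    return wrap(a - b, bits) > 0

def mod_max(a, b, bits=12):
    """Compare-select on wrapped metrics: return the larger stored value."""
    return a if mod_greater(a, b, bits) else b
```

For instance, true metrics 2060 and 2040 are stored as -2036 and 2040 in 12 bits, yet `mod_max` still selects the stored value of 2060 because the wrapped difference is +20.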
C. Soft Output Unit Architecture
Since it is not possible to determine a priori which paths are eliminated in each state transition, a static comparison structure cannot be used for the SOU. Note, however, that the 64 state transitions $s^{(i)} \to s^{(j)}$ can be grouped into eight sets $Q_l$, for $l = 0, \ldots, 7$, each one having eight elements (pairs of states), such that the parallel paths of the transitions with $(s^{(i)}, s^{(j)}) \in Q_l$ correspond to only two possible systematic bit sequences.
Each Max-L block processes its inputs in three stages. In the first stage, four Switch-I blocks (Fig. 5(b)) either find in parallel the maximum values of their inputs or pass the inputs directly to the outputs. In the two other stages, Switch-II blocks are used (Fig. 5(c)). These blocks either find the maximum value or let one of the inputs pass through. Thus, at the output of the Max-L blocks, the maximum value for each d is found. These values are then compared using a fixed tree structure composed of Switch-II blocks, such that the number of comparators is minimized. Finally, the four extrinsic values are computed. Due to the elimination of paths, the values $L^e_{k+h}$, for $h = 0, 2, 3$, may not be valid. In this case, a multiplexer replaces them by $p \cdot q_{k+h}$, where $q_{k+h} = \pm 1$ and $(q_{k+h} + 1)/2$ is the hard decision corresponding to the systematic bit $d_{k+h}$. An expression for $p$ is proposed in Section III. On the other hand, the value $L^e_{k+1}$ is always valid, since the systematic bit $d_{k+1}$ does not change between the parallel paths.
III. TURBO DECODER PERFORMANCE COMPARISON
The proposed radix-16 SISO architecture significantly affects the asymptotic gain of the TD. We have observed a penalty of around 0.2 dB, for code rates R = 1/2 and 1/3, at a FER of $10^{-6}$, with an error floor that remains high compared to the error floor of a radix-2 SISO architecture. For high SNR values (error floor region), a-priori values grow fast during the iterative process, and thus it becomes more probable that all the paths for a specific bit value are eliminated. In this case, the affected extrinsic values are set to $p \cdot q_{k+h}$, and thus an undesirable correlation may appear between the hard decisions $\hat{d}_k$, $\hat{d}_{k+2}$ and $\hat{d}_{k+3}$ during the iterative process. To solve this problem, we propose two techniques: 1) select an appropriate value $p$ when there is not enough information to compute a given extrinsic value (Section III-A); 2) reduce the mutual interference between the extrinsic values in the same radix-16 trellis transition (Section III-B).
A. Expression of p for Unknown Extrinsic Value
We have established an expression for $p$ following an approach similar to the one presented in [10] for the decoding of Block Turbo Codes. In [10], when there is no competing codeword with which to compute the reliability value of a certain bit, the soft output is calculated as the sum of the magnitudes of a set of channel observations at the input of the decoder. Since there are only four systematic bits in each radix-16 trellis transition, we have replaced the summation with a minimum operation. Thus, we avoid overly optimistic extrinsic values, which are harmful during the iterative process performed by the turbo decoder. The value of $p$ then corresponds to the minimum reliability value at the SISO decoder input (absolute value of the sum of the systematic and a-priori LLR values) among the four bits in the radix-16 trellis transition, as shown by (2). Extensive Monte-Carlo simulations confirmed the suitability of this expression.
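The verbal description of $p$ can be sketched in Python as follows. This is one reading of the text rather than a transcription of (2): the function and argument names are hypothetical, and the inputs are assumed to be the four systematic channel LLRs and the four a-priori LLRs of one radix-16 trellis transition. The second function mimics the multiplexer of the SOU, which substitutes $p \cdot q_{k+h}$ when an extrinsic value is not valid.

```python
def fallback_magnitude(sys_llr, apriori_llr):
    """p: smallest input reliability |systematic + a-priori| over the four bits."""
    return min(abs(s + a) for s, a in zip(sys_llr, apriori_llr))

def extrinsic_value(l_ext, valid, hard_bit, p):
    """Keep a valid extrinsic value; otherwise substitute p * q with q = +/-1."""
    q = 2 * hard_bit - 1      # chosen so that (q + 1) / 2 equals the hard decision
    return l_ext if valid else p * q
```

Taking the minimum rather than a sum keeps the substituted magnitude no larger than the weakest input reliability in the transition, which matches the stated aim of avoiding overly optimistic extrinsic values.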
B. Reduce the Interference of Bits in the Same Trellis Transition
We emphasize that the bit $d_{k+1}$ benefits from a higher level of protection than the other bits in the radix-16 trellis transitions. Therefore, we propose to apply a shift operation on the frame processed by the SISO decoder, as presented in Fig. 6 for a frame size of 1024 bits. For this example, when no shifting is applied, bits 1, 5, 9, . . . , 1021 are protected. With a shift of one bit, the bits 0, 4, 8, . . . , 1020, 1023 are protected. With a shift of two or three bits, other bits are protected. Note also that the bits that interfere in each radix-16 trellis transition are different depending on the shift value. Thus, if we change the number of shifted bits in consecutive turbo decoder iterations, the negative effects introduced by our SISO architecture can be significantly reduced. This technique can be adapted to the sliding window process [11], and to the case where multiple SISO decoders are assigned to the decoding of a frame [12].
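The effect of the shift on the protected positions can be illustrated with a short Python sketch. It makes two assumptions that are not taken from the text: transitions are taken to start at index k = -shift and every fourth index thereafter, and the shift is cycled as iteration modulo 4. Frame-edge effects, such as bit 1023 being listed as protected for a one-bit shift, are not modelled, and both function names are hypothetical.

```python
def protected_bits(frame_len, shift):
    """In-frame indices that occupy position k + 1 of a radix-16 transition."""
    # With a shift, transitions start at k = -shift, -shift + 4, ...
    return [n for n in range(frame_len) if (n + shift) % 4 == 1]

def shift_for_iteration(iteration):
    """Hypothetical schedule: cycle through the four shifts, one per iteration."""
    return iteration % 4
```

With this schedule, every bit of the frame occupies the protected position once every four iterations.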
If the shifting technique is applied, some bits of the first and last radix-16 trellis transitions do not belong to the frame (Fig. 6). Therefore, the SISO decoder must be fed with appropriate a-priori values for these bits outside the frame, such that the α and β values at the frame limits are not altered. Moreover, if the tail-biting approach is used [13], the bits outside the frame in the first (last) transition correspond to bits of the last (first) transition, since the frame is circular. Thus, no special treatment is required in this case.
BER and FER performance after six iterations for the LTE TD with 1024 bits per frame is given in Fig. 7. Two architectures have been considered: one based on a radix-2 SISO decoder and another based on the proposed radix-16 SISO decoder. Two different code rates, R = 1/3 and 1/2, have been studied. The code rate R = 1/2 is achieved by puncturing the original code. Six and nine bits have been chosen for the representation of the channel and extrinsic values, respectively. For the radix-2 SISO decoder, 10 bits are necessary for the α and β metrics. For the radix-16 SISO decoder, 12 bits are necessary so that the modulo normalization technique can be applied. When $p$ is given by (2), an error floor appears for both code rates. However, when the shifting technique is also applied, the architecture based on the proposed radix-16 SISO decoder exhibits a negligible degradation compared with an architecture based on the radix-2 SISO decoder.
IV. SYNTHESIS RESULTS
The proposed radix-16 SISO architecture was synthesized for an STMicroelectronics 90 nm CMOS ASIC target. The hardware complexity in terms of equivalent 2-input NAND gate count is given in Table I. In order to compare the SISO decoder hardware complexity as a function of $N_T$, radix-2, radix-4 and radix-16 SISO decoders were also designed based on the scheme presented in Fig. 1. For the radix-2 and radix-4 SISO decoders, the ACS units in [9] and [5] were adopted, respectively. For the radix-16 SISO decoder, the two-dimensional ACS unit in [6] was chosen, together with a SOU implementing the static comparison tree that minimizes the number of comparisons, as presented in [7]. All SISO decoder architectures were synthesized for a 200 MHz clock frequency. The input and β buffer complexities are not included in the comparison since they depend on the window size. However, a radix-$2^{N_T}$ ACS unit reduces the β buffer size by a factor of $N_T$.
Thanks to the complexity reduction performed in all the SISO decoder blocks, our radix-16 SISO decoder is about 55% less complex than the considered radix-16 implementation. The proposed ACS unit contributes considerably to this result. Furthermore, the proposed BMU and SOU help to offset the hardware penalty introduced by a radix-16 approach. Compared to the radix-2 and radix-4 SISO decoders, the proposed architecture is about 7.8 and 2.9 times more complex, respectively. With this additional hardware complexity, the proposed SISO decoder improves the throughput of the radix-2 and radix-4 SISO decoders by a factor of 4 and 2, respectively.
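The trade-off quoted above can be expressed as throughput gain per unit of added complexity. The short Python sketch below only divides the ratios stated in this section. The function and constant names are hypothetical, and the input and β buffers, which are excluded from the reported gate counts, are ignored here as well.

```python
def throughput_per_complexity(throughput_gain, complexity_ratio):
    """Throughput gain divided by complexity ratio, relative to a baseline."""
    return throughput_gain / complexity_ratio

VS_RADIX2 = throughput_per_complexity(4, 7.8)   # about 0.51
VS_RADIX4 = throughput_per_complexity(2, 2.9)   # about 0.69
```

A value below 1 indicates that, for the logic considered, throughput grows more slowly than gate count with respect to the baseline decoder.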
V. CONCLUSION
A low-complexity radix-16 SISO decoder for the Max-Log-MAP algorithm is presented in this paper. Two complementary techniques have been proposed in order to limit the BER/FER performance degradation introduced by a TD architecture based on the proposed radix-16 SISO decoder. Moreover, the elimination of parallel paths in the trellis diagram allows a radix-8 architecture to be used for the ACS unit. The combination of these contributions enables the design of a high-speed, low-complexity radix-16 SISO decoder. The ideas presented in this paper can be applied to the design of higher-radix SISO decoders with an acceptable cost-throughput ratio.
