Abstract: Soft-Input Soft-Output (SISO) decoders that iteratively exchange intermediate results (extrinsic information) lie at the core of turbo decoder architectures. The implementation architecture can be serial, parallel, or network-on-chip (NoC) based. In this paper, we present a technique for reducing the bit width of the exchanged extrinsic information and analyze its impact on the different implementation architectures. The methodology is investigated on two turbo decoding systems, both based on the Max-Log-MAP algorithm: a serial concatenated convolutional code (SCCC) decoder and a WiMAX (IEEE 802.16e) parallel concatenated convolutional code (PCCC) decoder. For the SCCC decoder, the bit width of the extrinsic information can be reduced from 8 bits to 4 without significant bit-error-rate (BER) degradation. For the WiMAX decoder, it can be reduced from 8 bits to 5 with a BER degradation of 0.2 dB.
I. INTRODUCTION
Since turbo codes were first introduced in 1993 [1], there has been wide consensus on their importance as a major achievement in coding theory. The original turbo codes, also called Parallel Concatenated Convolutional Codes (PCCC), are based on a parallel concatenation of two recursive systematic convolutional codes separated by an interleaver. Decoding is iterative: two SISO decoders exchange intermediate results (extrinsic information) so that the error correction performance improves with the decoding iterations. Later, this iterative decoding concept was applied to other concatenations of codes separated by interleavers, such as Serial Concatenated Convolutional Codes (SCCC) [2].
Turbo codes have found application in various communication standards due to their near-capacity performance and their suitability for practical implementation. Turbo coding architectures lie at the heart of all of the third-generation (3G) wireless standards, including UMTS and cdma2000 [3], [4]. These coding architectures are called for because they allow systems to meet the tough bit-error-rate ($10^{-6}$) and low signal-to-noise-ratio (SNR) requirements placed on emerging 3G designs. More recently, these codes were also selected for the IEEE 802.16 standard (WiMAX) [5], intended for broadband connections over long distances. Thus, efficient turbo decoder implementations with emphasis on small area, low power consumption and high throughput are of growing importance. A detailed analysis of various implementation issues related to turbo decoders is presented in [6], [7].
Turbo decoder implementations can be broadly classified into serial and parallel architectures. A serial decoding architecture uses a single SISO decoder to decode a whole block of data. In a parallel implementation, the block of data to be decoded is subdivided into several sub-blocks, each processed by an independent SISO processing unit [6]. Parallel implementations are further subdivided into two classes based on the interleaver structure used. The first uses deterministic interleavers [8], and the second uses a network on chip to exchange extrinsic information across component decoders [9]. In all of these implementations, the size of the extrinsic memory, the complexity of the interleaver, and the communication resources of the network shrink as the bit width of the exchanged extrinsic information is reduced. Energy consumption is reduced as well, since it was established in [10] that 50% of the entire power consumption in a turbo decoder comes from memory power dissipation.
Extrinsic information in typical turbo decoding indicates a degree of confidence associated with each bit. There have been previous efforts to find the optimal quantization of the extrinsic information to be exchanged. In [12], [13], [14], results for optimal quantization of extrinsic information were presented with the objective of reducing the complexity and memory size of the processing element. In [15], an algorithm was proposed to decrease the bit width needed to describe the extrinsic information by employing a pseudo floating point representation. Our paper instead presents a new approach for optimal quantization of the exchanged extrinsic metrics. The underlying assumption of our approach is that an optimal fixed-point representation has already been obtained for the component decoder's internal signals. We propose a technique in which most significant bit (MSB) clipping, combined with least significant bit (LSB) drop (at the transmitter) and append (at the receiver), is used to reduce the bit width across the communication structure. To the best of our knowledge, the impact of LSB drop-append on system BER performance has not yet been investigated in the current literature.
This paper is divided into six sections. In Section II, different implementation architectures of turbo decoders are reviewed and the impact of extrinsic bit-width reduction is analyzed. Section III elaborates the proposed bit-width optimization methodology. Section IV reports on the BER performance analysis based on this optimization, while Section V presents an analysis of the memory and power consumption reduction obtained with the proposed method. Conclusions are given in Section VI.
II. TURBO DECODING ARCHITECTURES
This section presents a brief description of the three different classes of turbo decoder implementation architectures, i.e., serial implementations, parallel implementations with deterministic interleavers and network on chip based parallel implementations. The impact of extrinsic bit width optimization on architecture implementation is also analyzed.
A. Serial Architecture
The simplest architecture for decoding a block of data uses a single SISO decoder, which alternately processes both constituent codes and is generally based on an optimal MAP algorithm [16]. To reduce the MAP implementation complexity, a logarithmic domain version is used, which drastically reduces the operation complexity. In this paper we consider the implementation of the Max-Log-MAP algorithm (see [6], [17]), which is obtained by discarding the correction term. This serial architecture requires three storage units: a memory for the received data at the channel and decoder outputs (LLR in memory), a memory for extrinsic information at the SISO output (EXT memory), and a memory for the decoded data (LLR out memory), as shown in Figure 1. The extrinsic memory of Figure 1 has an implementation complexity proportional to the number of bits to be stored. Reducing the bit width of the extrinsic information therefore saves memory area and also reduces energy consumption, as memory power dissipation constitutes around half of the turbo decoder power consumption.
For high throughput applications, parallel turbo decoder architectures are required. The block of data to be decoded, of frame size K, is subdivided into P sub-blocks of size M, each processed by one of P independent SISO processing units (see Figure 2). Access to any memory bank is facilitated by its associated address generator (AG). The extrinsic information is exchanged across the multiple SISO processing units using a permutation network. An appropriate initialization is required for the forward ($f_m$) and backward ($b_m$) metrics [6], [18], [19].
Fig. 2. Turbo decoder parallel implementation
In the parallel implementation based on a deterministic permutation network, the permutation network has an implementation complexity proportional to the number of bits to be routed. Reducing the bit width of the exchanged extrinsic information therefore not only reduces the memory bank size but also significantly reduces the complexity of the permutation network.
C. Network on Chip based Parallel Architecture
In [9], a network on chip capable of resolving access conflicts at run time is detailed; it supports arbitrary interleavers without any pre-processing. Such an approach offers great flexibility in implementing various permutation patterns for standard-compliant decoders. Figure 3 shows the general NoC based parallel turbo decoder architecture, where each processing element (PE) is a SISO processing unit with its corresponding memories. The extrinsic information exchange across different PEs is facilitated by the NoC. The choice of interconnect topology determines the achievable decoding throughput. In [11] it was shown that, for a given interconnect topology, the limited routing resources (switches and interconnect wires) cause the traffic accepted by the network to saturate beyond a certain injection load, resulting in reduced decoding throughput. Reducing the bit width of the exchanged extrinsic information is an effective way to make more communication resources available and raise the accepted-traffic saturation point, thus increasing the decoding throughput.
III. BIT-WIDTH OPTIMIZATION
The problem of fixed-point implementation is important because hardware complexity increases linearly with the internal bit-width representation of the data. The trade-off between hardware complexity and error correcting efficiency leads to the minimum internal bit width that results in an acceptable degradation of performance. In this work, however, we assume that the internal bit-width representation of the data has already been derived for minimal performance degradation. We analyze the effect of MSB clipping and LSB drop-append across the communication structure and the resulting impact on the error performance. Figure 4 presents a unified modeling language (UML) activity diagram of our methodology. A parallel implementation is considered, with a network for extrinsic information exchange across multiple SISO units. After the first half iteration, extrinsic values with bit width B are generated at the transmitter SISO decoder; they are then reduced to b bits (B > b) and sent across the network. At the input of the receiver SISO decoder, the b bits are converted back to the original bit width B for SISO processing. The methodology applies equally, without loss of generality, to the serial implementation architecture of the turbo decoder described in the previous section. In this case, the network block in Figure 4 is replaced with the extrinsic memory EXT.
A. MSB Clipping
The MSBs are clipped at the transmitter component decoder, and sign extension is performed at the receiver component decoder. Let $x_B$ (with bit width B) be the extrinsic information at the output of the transmitter component decoder. Clipping is applied to cut M MSBs, mapping $x_B$ into $(x_B)_M$, which is sent across the network. At the input of the receiver component decoder, $(x_B)_M$ is transformed back to B bits as $\hat{x}_B$, which is fed to the component decoder for processing. The transformation can be formalised as follows:
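A minimal Python sketch of one way to realise this mapping is given here. It assumes that extrinsic values are signed two's complement integers and that "clipping" means saturation to the range of a (B - M)-bit word; both are assumptions of this sketch, and the function names `clip_msb` and `sign_extend` are hypothetical.

```python
def clip_msb(x: int, B: int, M: int) -> int:
    """Transmitter side: saturate a signed B-bit value to the range of a
    signed (B - M)-bit word (assumed interpretation of MSB clipping)."""
    if not 0 <= M < B:
        raise ValueError("M must satisfy 0 <= M < B")
    hi = (1 << (B - M - 1)) - 1   # largest (B - M)-bit two's complement value
    lo = -(1 << (B - M - 1))      # smallest (B - M)-bit two's complement value
    return max(lo, min(hi, x))


def sign_extend(word: int, width: int) -> int:
    """Receiver side: interpret the low `width` bits of `word` as a
    two's complement number, which is what sign extension to B bits yields."""
    word &= (1 << width) - 1
    sign_bit = 1 << (width - 1)
    return (word ^ sign_bit) - sign_bit


B, M = 8, 2
for x in (17, 100, -100, -5):
    # Only the low (B - M) bits travel across the network (or into EXT memory).
    wire = clip_msb(x, B, M) & ((1 << (B - M)) - 1)
    print(x, "->", sign_extend(wire, B - M))
```

With B = 8 and M = 2, values inside [-32, 31] pass through unchanged, while larger magnitudes are saturated to the nearest end of that range.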
B. LSB Drop-Append
A certain number of LSBs are dropped at the transmitter, and zeros are appended at the receiver component decoder. Let $x_B$ (with bit width B) be the extrinsic information at the output of the transmitter component decoder. LSB dropping is applied to remove L LSBs, mapping $x_B$ into $(x_B)_L$, which is sent across the network. At the input of the receiver component decoder, L zeros are appended to restore the original bit width B, mapping $(x_B)_L$ into $\hat{x}_B$, which is fed to the component decoder for processing. The transformation can be formalised as follows:
$$(x_B)_L = \left\lfloor \frac{x_B}{2^L} \right\rfloor, \qquad \hat{x}_B = 2^L \, (x_B)_L$$
where ⌊x⌋ means the integer part of x.
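The drop-append operation can be sketched in a few lines of Python. This sketch assumes signed two's complement integers, so dropping LSBs is an arithmetic right shift (a floor division by 2^L); a sign-magnitude representation would truncate toward zero instead. The function names `drop_lsbs` and `append_zeros` are hypothetical.

```python
def drop_lsbs(x: int, L: int) -> int:
    """Transmitter side: remove the L least significant bits.
    For Python ints, >> is an arithmetic shift, i.e. floor(x / 2**L)."""
    return x >> L


def append_zeros(word: int, L: int) -> int:
    """Receiver side: append L zero LSBs to restore the original bit width."""
    return word << L


L = 2
for x in (45, 44, -5):
    restored = append_zeros(drop_lsbs(x, L), L)
    # Under the floor assumption, the reconstruction error is x mod 2**L.
    print(x, "->", restored, "error:", x - restored)
```

Under these assumptions the reconstruction error is never negative and lies between 0 and 2^L - 1, depending on the dropped bits of each value.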
IV. ERROR PERFORMANCE ANALYSIS
Applying the methodology of Section III, extensive simulations were performed to evaluate the error correction performance of the turbo decoder with a reduced word size allocated to the exchange of extrinsic information. Different wireless communication services place different requirements on transmission quality: for speech services a BER of approximately $10^{-3}$ is sufficient, whereas for data services BERs down to $10^{-6}$ are necessary in applications where data is delay sensitive. The performance around these operating points is therefore of particular interest.
For the SCCC decoder, the serial code is taken from the implementation of a very high speed (1 Gbps) adaptive coded modulation modem for satellite applications [20]. The characteristics of the evaluation system are given in Table I. The received channel values are quantized to 6 bits, with 5 bits for the integer part and 1 bit for the sign. The no-noise signal amplitude of the input is in the range (-12, 12). To improve the performance of the Max-Log-MAP algorithm, a correction term implemented as a look-up table is used in the max function calculation [20]. Figure 5 shows the BER performance ($10^{10}$ bits simulated) for different bit-width choices at iteration 7. MSB clipping of 2 bits starts to give a performance degradation of more than 0.1 dB at a BER of $10^{-5}$ and greater. The LSB drop-append strategy is more tolerant: 2 LSBs can be dropped without significant degradation in BER performance. For SNR higher than 1.25 dB, the 2-bit LSB drop-append strategy provides a slight performance gain over the conventional system. We hypothesize that this behaviour is related to the optimality of the extrinsic information.
To decode Low Density Parity Check (LDPC) codes, the Offset Min-Sum (OMS) algorithm [21] applies a multiplicative or additive correction factor directly to the check-node output of the original Min-Sum algorithm in order to improve the decoding performance. The impact of the drop-append technique can be compared to the offset value of the OMS algorithm; in the drop-append technique, however, the offset is not fixed and depends on the value of the extrinsic information. Figure 6 shows the BER results for different combinations of MSB clip and LSB drop-append strategies. At a BER of $10^{-4}$, a 2-MSB clip combined with a 2-LSB drop-append gives less than 0.1 dB performance degradation with respect to the original 8-bit word size. At higher BER this encoding starts to give greater loss; however, the loss can be contained well within 0.1 dB by switching to a 1-MSB clip and 3-LSB drop-append strategy. Such switching can be actuated based on the SNR information available at the receiver, resulting in a 50 percent reduction of the extrinsic information bit width.
Figure 7 shows the BER performance ($10^{9}$ bits simulated) for different bit-width optimizations for the duo-binary codes used in the WiMAX standard. The characteristics of the evaluation system are given in Table I. To improve the performance of the Max-Log-MAP algorithm, a scaling factor of 0.75 is applied to the extrinsic information at each iteration. The BER values are taken at the end of iteration 7. The received channel values are quantized to 6 bits, with 3 bits for the fractional part and 2 bits for the integer part; the MSB represents the sign of the channel information. The no-noise signal amplitude of the input is in the range (-12, 12). The results show that MSB clipping is not an option for WiMAX codes, but the LSB drop-append methodology is effective. At a BER of $10^{-6}$, a 3-bit reduction in bit width is possible if a loss of 0.2 dB is tolerable.
Considering that in a duo-binary turbo decoder three extrinsic values are passed between component decoders, the combined bit-width reduction is significant.
Memory size dominates the size of the turbo decoder. The Max-Log-MAP decoder contains three main memories, dedicated respectively to the soft input information, the soft extrinsic information, and the state metrics. For K stored extrinsic values $x_B$ of bit width B, the extrinsic information memory holds K × B bits. In the SCCC decoder case, for a block size K = 1024, a reduction of 4 bits in the extrinsic bit width reduces the memory size from 1 KByte to 0.5 KByte. In the case of the WiMAX PCCC decoder, the extrinsic information vector contains three extrinsic values related to the decoded symbol, so the memory saving is more visible. For example, for a block length of 2400 couples, a reduction of 3 bits in the extrinsic bit width reduces the memory size from 7 KBytes to 4.4 KBytes.
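The combined encodings discussed for the SCCC decoder (2-MSB clip with 2-LSB drop-append, or 1-MSB clip with 3-LSB drop-append) can be illustrated with a short Python sketch. It assumes two's complement values, saturation for the MSB clip, and clipping applied before the LSB drop; the source does not state the order, so this is only one plausible arrangement, and `reduce_width` and `restore_width` are hypothetical names.

```python
def reduce_width(x: int, B: int, M: int, L: int) -> int:
    """Transmitter side: saturate to (B - M) bits, then drop L LSBs,
    leaving a signed word of (B - M - L) bits (assumed order of operations)."""
    hi = (1 << (B - M - 1)) - 1
    lo = -(1 << (B - M - 1))
    clipped = max(lo, min(hi, x))
    return clipped >> L


def restore_width(word: int, L: int) -> int:
    """Receiver side: append L zero LSBs. Sign extension back to B bits
    is implicit because Python ints are unbounded."""
    return word << L


B = 8
for M, L in [(2, 2), (1, 3)]:
    word = reduce_width(45, B, M, L)
    print(f"M={M}, L={L}: {B - M - L} bits on the wire, "
          f"45 -> {word} -> {restore_width(word, L)}")
```

Both settings send 4 of the original 8 bits, matching the 50 percent reduction. The 2 + 2 split keeps finer resolution for small values, while the 1 + 3 split keeps a wider dynamic range.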
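The memory figures quoted above can be checked with a few lines of Python. The sketch assumes 1 KByte = 1024 bytes (consistent with 1024 values of 8 bits giving 1 KByte) and one stored extrinsic value per symbol for the SCCC decoder; the function name `extrinsic_memory_kbytes` is hypothetical.

```python
def extrinsic_memory_kbytes(num_symbols: int, values_per_symbol: int, bit_width: int) -> float:
    """Extrinsic memory size in KBytes, assuming 1 KByte = 1024 bytes."""
    bits = num_symbols * values_per_symbol * bit_width
    return bits / 8 / 1024


# SCCC decoder: K = 1024, one extrinsic value per symbol (assumed).
print(extrinsic_memory_kbytes(1024, 1, 8))            # 1.0
print(extrinsic_memory_kbytes(1024, 1, 4))            # 0.5

# WiMAX duo-binary decoder: 2400 couples, three extrinsic values per couple.
print(round(extrinsic_memory_kbytes(2400, 3, 8), 1))  # 7.0
print(round(extrinsic_memory_kbytes(2400, 3, 5), 1))  # 4.4
```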
Static (leakage) power $P_{lkg}$ is becoming a dominant source of power dissipation in next generation integrated circuits. It can be defined as:
$$P_{lkg} = V_{DD} \cdot I_{lkg}$$
where $V_{DD}$ and $I_{lkg}$ denote the power supply voltage and the leakage current, respectively. Leakage power dissipation depends on the memory size and increases exponentially with submicron technologies [22]. Bit-width reduction also reduces the amount of switched capacitance inside the memory device and on the driven bus lines, resulting in reduced dynamic power dissipation. Hence a reduction of the memory size also has a significant impact on the total power consumption of the turbo decoder.
VI. CONCLUSION
A methodology for extrinsic message size reduction was proposed. It reduces the bit width from 8 to 4 bits for the SCCC code with less than 0.1 dB performance loss at a BER of $10^{-6}$. For the WiMAX CTC code, a bit-width reduction from 8 to 5 bits is possible if a loss of 0.2 dB at a BER of $10^{-6}$ is tolerable. The cost, area and energy consumption of the turbo decoder implementation scale with the bit width of the extrinsic information. The presented results show how the optimization potential of a communication-centric paradigm can be exploited without serious degradation of the bit-error performance.
Future work will focus on using a more optimal implementation of the turbo decoder, for example with received channel quantization of 3, 4 and 5 bits and with varying optimal scaling of extrinsics with iterations, for exploring the impact of our methodology.
