Turbo codes are the most recent breakthrough in coding theory. However, the decoder's implementation cost limits their incorporation in commercial systems. Although the decoding algorithm is highly data dominated, no systematic memory optimization study has been performed yet. Therefore, we have applied the IMEC Data Transfer and Storage Exploration (DTSE) methodology to the MAP (Maximum A Posteriori) class of decoding algorithms. We present an extensive overview of the optimizations and tradeoffs that arise from applying this methodology. The result is not one optimal algorithm, but a parametric family of new optimized algorithms that can be targeted towards hardware or software. The optimal choice of parameters depends on the implementation target, the selected cost function, and the specific turbo code.
I. INTRODUCTION
In 1993, Berrou et al. presented a new channel coding scheme, called turbo coding [1]. These codes have received a lot of interest from the research community, as they offer a performance unmatched by any other code so far (performance within 0.27 dB of the Shannon capacity limit has been reported [2]). Turbo codes are also on the verge of finding their way into numerous cutting-edge applications such as cellular [3] and satellite communications [4].
A turbo decoder is composed of modules that work in an iterative scheme. Different alternatives exist for the algorithm incorporated in these decoder modules. The originally developed algorithms belong to the Maximum A Posteriori (MAP) class [1]. Later another class of algorithms was proposed, called Soft Output Viterbi Algorithm (SOVA) [5]. The implementation of either class is far from trivial, and considerable effort has already been spent on it (see [6] [7] [8] for MAP and [9] [10] [11] [12] for SOVA). Since both decoding algorithms are highly data dominated, the memory organization is critical. This aspect has never been treated systematically in the literature, and we have no knowledge of any paper on in-depth memory optimizations for the MAP class. To remedy this, we have carried out a systematic exploration using the code transformation steps of the IMEC Data Transfer and Storage Exploration (DTSE) methodology [13].
Although the basic methodology is equally applicable to both classes, we have focussed on MAP algorithms since they offer a superior decoding performance [14] [15] and have an implementation cost that is comparable to that of SOVA when using sliding windows [7] . The result of our optimization study is an entire family of optimized algorithms that can be targeted towards both hardware (HW) and software (SW) systems. The optimal parameters for these algorithms greatly depend on the optimization criterion (area/energy/delay), but also on the implementation target (e.g. HW/SW), and the specific turbo code. We therefore do not present one optimal structure, but offer the designer a whole parametric family of algorithm alternatives with different tradeoffs.
The second and third sections of this paper present the general concepts of turbo coding and DTSE respectively. The first step of this methodology, preprocessing, is treated in section four. Section five deals with data flow transformations, more in particular selective recalculation. The sixth section discusses several loop and control flow transformations. Section seven presents the parameters of the new family of algorithms and illustrates the possible tradeoffs. Finally, section eight provides some conclusions.
II. TURBO CODING

A. Encoder Setup
A turbo encoder consists of a parallel concatenation of two relatively low constraint length convolutional codes as depicted in figure 1a. A pseudo random interleaver π separates these two component codes (ENC1 and ENC2). For every new input block, the encoder starts in a known initial state. In order to have a known end state, some extra input bits, called tail bits, are generated. Traditionally this is done only for ENC1 [1]. We extend this method and add tail bits after the interleaver as well, to also force ENC2 into a known end state. We refer to this scheme as full termination. Another option to force both encoders into a known end state is to use precursor bits as in [16]. Both schemes reduce the throughput only very slightly, but improve the decoding performance and allow an interesting decoder module optimization, as we will explain in section VI.
B. Decoder Setup
A maximum likelihood turbo decoder would be prohibitively complex. The iterative decoding scheme of figure 1b is a clever solution to this problem. Each decoder module (DEC1 and DEC2) produces an of a decoding module are the log likelihood ratios λ k i (for i =1,2) of the received symbols, which are defined
The decoded bits û k , calculated via formula (2), provide an approximation of the maximum likelihood decoded solution. This approximation improves with an increasing number of iterations.
The way the extrinsic information is calculated depends on the algorithm class (MAP or SOVA). As explained in the introduction, we focus on the MAP class. Even within this class, slightly different algorithms exist to which all of the optimizations presented in the subsequent sections apply. They all use a forward and a backward recursion and can be extended towards a sliding window approach. To merely illustrate the principles, we pick the log-SISO (Soft Input-Soft Output) algorithm of Benedetto et al. [17] 
These α and β metrics are obtained through a forward and a backward recursion respectively (see formulas 5 and 6). They start in a known initial state at the beginning (for α) or end (for β) of the block.
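To make the structure of the two recursions concrete, the following is a minimal sketch of a log-domain forward and backward recursion on a generic trellis. It is not a transcription of formulas 5 and 6: the `max_star` operator (the usual log-sum-exp form), the branch metric layout `gamma[k][s][t]`, and the choice of state 0 as both the known initial and the known end state are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def max_star(a, b):
    """Log-domain addition log(exp(a) + exp(b)), guarded for -inf operands."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def forward(gamma, n_states):
    """Return alpha_0 .. alpha_N; gamma[k][s][t] is the metric of branch s -> t."""
    alpha = [0.0] + [NEG_INF] * (n_states - 1)  # assumed known initial state 0
    history = [alpha]
    for g in gamma:
        new = [NEG_INF] * n_states
        for s in range(n_states):
            for t in range(n_states):
                new[t] = max_star(new[t], alpha[s] + g[s][t])
        alpha = new
        history.append(alpha)
    return history


def backward(gamma, n_states):
    """Return beta_0 .. beta_N, starting from the assumed known end state 0."""
    beta = [0.0] + [NEG_INF] * (n_states - 1)
    history = [beta]
    for g in reversed(gamma):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            for t in range(n_states):
                new[s] = max_star(new[s], beta[t] + g[s][t])
        beta = new
        history.append(beta)
    history.reverse()
    return history
```

Both recursions have the same form and differ only in direction and starting point, which is what later allows the roles of α and β to be swapped.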
III. REVIEW OF APPLIED DTSE METHODOLOGY
The IMEC Data Transfer and Storage Exploration (DTSE) methodology (see figure 3) makes it possible to systematically reduce the storage bottleneck in data dominated algorithms [13] [21] and is therefore well suited to the MAP turbo decoding algorithm presented above. The starting point of the methodology is a system specification with multi-dimensional signal accesses that can be statically ordered. The output is a system level memory organization, combined with a transformed specification. This specification can be synthesized to hardware or compiled to software. The different steps of the methodology, which do not change the global I/O behavior, are explained below.
1. Preprocessing strips down the algorithmic bit true description and provides explicit data dependency information. In addition, pruning heavily reduces the search space for the complex subsequent steps without any negative effect on the optimization.
2. Global data-flow transformations have the most crucial effect on the system exploration decisions. They either optimize the important cost factors directly (by advanced signal substitution, modifying computation order, shifting delay lines and selective recomputation) or serve as enabling transformations for the subsequent steps by removing data-flow bottlenecks (e.g. look-ahead transformations).
3. Global loop and reindexing transformations aim at improving the data access locality and at removing the system-level buffers caused by mismatches in data production and consumption ordering.
4. The data reuse decision exploits possibilities for improved use of the memory hierarchy.
5. The memory organization step allocates memory units and ports from a memory library given the cycle budget and assigns the data to them.
6. A decision is made on in-place storage of multi-dimensional signals resulting in even more reduced storage requirements.
In the next sections, we apply the first three steps to the MAP turbo decoding algorithm class. These steps focus on the high-level structure of the algorithm, which is the scope of this research. 
IV. PREPROCESSING
The pruning step in the DTSE methodology provides a stripped version of the algorithm, in addition to some other preprocessing substeps. We only focus on the iteration loop and not on the calculation of the output bits (given by formula (2)), since its influence on energy and area is negligible. Also a stop criterion, which limits the number of decoding iterations, is not considered. In order to have realistic results, the methodology is applied to a bit true version of the algorithm with 6 bit log likelihood ratios and 8 bit state metrics. dominates, but its size is fixed by the code performance. The transformations we apply are therefore mostly focussed on reducing the energy consumption.
V. GLOBAL DATA FLOW TRANSFORMATIONS
The only relevant data flow transformation in this case is based on selective recomputation. To reduce the state metric storage, only a fraction 1/θ of the metrics is stored in RAM (e.g. β_N, β_{N-θ}, β_{N-2θ}, etc. for all states). We therefore refer to this transformation as partial state metric storage. The missing state metrics are recalculated when they are needed for λ_ext. Of the different alternatives, temporarily storing the recalculated metrics in a register file (as indicated in figure 4) proves to be more efficient than recalculating each of the missing metrics separately starting from a stored metric. By scheduling the metric storage in the register file as in figure 5, only (θ-1) 8-bit registers are required for each decoder state. This approach reduces both the size and the number of transfers of the state metric RAM by a factor θ, at the cost of some extra calculations and register file storage. These costs have been incorporated in the overall cost model used in the sequel. The designer can only determine the optimal value of θ when models for each of the decoder building blocks are available. We will show the potentially large benefit of this data flow transformation in section VII.
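The partial state metric storage can be sketched as follows, assuming a backward recursion whose results are consumed in forward order. The function names, the dictionary standing in for the state metric RAM, and the two-state `toy_step` recursion are hypothetical and only serve to make the example self-contained; a real decoder would plug in its own β update.

```python
def toy_step(beta_next, k):
    """Hypothetical 2-state max-log style update beta_{k+1} -> beta_k."""
    g0 = (k * 7) % 5 - 2  # made-up integer branch metrics
    g1 = (k * 3) % 4 - 1
    return (max(beta_next[0] + g0, beta_next[1] + g1),
            max(beta_next[0] - g0, beta_next[1] - g1))


def store_partial(beta_end, n, theta, step):
    """Backward recursion from k = n down to 0, keeping every theta-th metric."""
    ram = {n: beta_end}
    beta = beta_end
    for k in range(n - 1, -1, -1):
        beta = step(beta, k)
        if (n - k) % theta == 0:
            ram[k] = beta
    return ram


def replay(ram, theta, step):
    """Return beta_0 .. beta_n in forward order and the peak register usage.

    Missing metrics are recomputed from the next stored metric into a small
    register file, which never holds more than theta - 1 metrics.
    """
    out, peak, lo = [], 0, 0
    for s in sorted(ram):
        regs = []
        beta = ram[s]
        for k in range(s - 1, lo - 1, -1):  # recompute beta_{s-1} .. beta_lo
            beta = step(beta, k)
            regs.append(beta)
        peak = max(peak, len(regs))
        out.extend(reversed(regs))  # consume in increasing k
        out.append(ram[s])          # the stored metric itself comes from RAM
        lo = s + 1
    return out, peak
```

With `n = 10` and `theta = 4`, the RAM holds three metrics (β_10, β_6, β_2) instead of eleven, and the register file peaks at three entries, in line with the (θ-1) registers mentioned above.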
VI. GLOBAL LOOP AND CONTROL FLOW TRANSFORMATIONS
A. Search Space Pruning
Loop transformations boil down to finding the optimal ordering of the algorithm operations. Several restrictions limit the vast number of possible orderings, resulting in a beneficial search space decrease.
These restrictions originate either from decoding performance considerations or from the algorithm structure itself.
The first type of restriction avoids a loss in decoding performance. First of all, state metrics are only considered valid if they are calculated from previous valid metrics, from an initial condition or after at least L dummy metrics when initialized in the middle of a block (see section II). Moreover, we do not consider the approach of [22] using partially overlapping sub-blocks. The transformations we present are however extensible to these kinds of structures.
The second type of restrictions originates from the sliding window algorithm structure itself. Figure 6 shows a simplified graphic representation of possible calculation sequencing in the case of sliding windows applied on the β metrics (like in figure 2b ). As presented in figure 6a , there can be a non-negative delay δ between the calculation of the α metrics. This delay has no advantages, but only increases the decoding time and is therefore set equal to zero. Also for the calculation of the β metrics, such a delay δ could be introduced (as in figure 6b ). Again its optimal value is equal to zero for the same reasons as before.
Furthermore, since the number of such sliding windows is potentially large, the operation sequence in each window must be the same (see figure 6c ). This symmetry results in a distance of 2L between two calculations of the same β metric: the first is a dummy metric, the second a valid one. Scheduling the calculation of the α metrics relative to that of the β metrics is the only degree of freedom left, which is indicated by the parameter ∆. This parameter can be varied continuously, but it can be shown that only three values make real sense, namely 2L, L and 0. Based on these restrictions, we explore the remaining alternatives. These can be catalogued as single and double flow structures.
B. Single Flow Structures
As formulas 5 and 6 are totally equivalent and both are always initialized due to our full termination, the role of α and β can be swapped in the decoding scheme. Applying the sliding window on the β metrics (which is done traditionally [7] [18]) or on the α metrics are both valid options, resulting in what we call single flow structures. Since both are equivalent, we only illustrate the first case, but the conclusions are similar for the other case.
As explained before, the parameter ∆ leads to the three promising alternatives (∆_2L, ∆_L and ∆_0) of figure 7. Another transformation is presented in figure 8. The value L is determined by the decoding performance, but the number of valid metrics calculated thereafter can be varied. This is denoted by parameter η, which gives the average number of valid metrics per non-valid one. Changing this parameter results in a tradeoff between the state metric storage, the state metric calculations and the number of input reads (as will be detailed in section VII).
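The effect of η on the number of operations can be illustrated with a small counting sketch for a single flow structure with a sliding window on the β metrics. It assumes that each window computes η·L valid β metrics after L dummy ones, that η·L is an integer, and that the last window starts from the known end state and therefore needs no dummy metrics; the function and key names are hypothetical.

```python
def sliding_window_counts(n, l, eta):
    """Count metric calculations and input reads for one block of n steps."""
    window = eta * l  # valid beta metrics per window (assumed integer)
    dummy = 0
    for start in range(0, n, window):
        end = min(start + window, n)
        # dummy recursion covers at most l steps beyond the window end;
        # at the block end the known end state is used instead
        dummy += min(l, n - end)
    alpha = n       # one alpha metric per trellis step
    valid_beta = n  # one valid beta metric per trellis step
    return {
        "dummy_beta": dummy,
        "valid_per_window": window,
        "input_reads": alpha + valid_beta + dummy,
    }
```

For N = 1000 and L = 50, this gives 2950 input reads for η = 1 and 2450 for η = 2, which approaches the ~ 2 + 1/η reads per trellis step listed in table 2, while the number of valid metrics to be buffered per window doubles.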
C. Double Flow Structures
We also introduce a new structure, called double flow, which reduces the decoding delay of one data block.
The α metrics are calculated for the first half of the data block starting from their initial condition while a sliding window is applied on the β metrics. At the same time, the second half of the data block is processed with the functionality of α and β swapped. As for the single flow structure, three values of parameter ∆ are promising (see figure 9). Option ∆_2L is again inferior to option ∆_0 for the same reasons as for a single flow structure. A straightforward setup for option ∆_L would also result in essentially the same drawbacks as the ones discussed for a single flow structure, and would again make it inferior to ∆_0. Offsetting the two flows by L/2 on the other hand (as in figure 9b) eliminates the need for extra data-path units compared to the other two options. This or any other offset has no beneficial effect in the case of the two other structures, since no reuse is possible there. The decoding delay of option ∆_L is still larger than that of ∆_0. An advantage is now the reuse of the state metric memory between the two flows, which is not possible for ∆_0. This results in a tradeoff between structures ∆_L and ∆_0. For these two remaining options, a parameter η is also introduced with the same meaning as for a single flow structure.
VII. TRADEOFF ANALYSIS AND SIMULATION RESULTS
The data flow and loop transformations we have discussed in the previous sections introduce a wide range of tradeoffs. We list these in table 2.
Transfers from input cache          ~ 2 + 1/η   ~ 2 + 1/η   ~ 2 + 1/η
Size state metric memory            ~ η/θ       ~ η/θ       ~ 2η/θ
Data-path hardware for dummy β      ~ 1/η
Energy ratio valid β over dummy β   ~ 1/η       ~ 2/η       ~ 2/η
The exact tradeoffs can only be evaluated when reliable cost estimates for memory and data-path operations are available [13]. Even then, an analytic optimization is not possible due to the non-linear behavior of memory area and transfer energy as a function of the memory size. The optimal structure and parameter settings therefore have to be obtained using simulations. An exact evaluation depends also on the specific implementation target (HW/SW) and turbo code chosen. The result of our optimization study is therefore a parametric family of algorithms from which a designer can choose the best one for a specific context.
To illustrate this, we discuss some particular simulation results. As an example, we pick a rate ½, 16 state sliding SISO turbo decoder with an interleaver size N of 1000 and a value L of 50. The memory models are for a 0.8 µm Motorola SRAM technology [13] . This example is therefore targeted towards a custom hardware solution, but the general tradeoffs are equally applicable for embedded processor implementations. The desired settings of the parameters that we have introduced (single or double flow, ∆, θ and η) not only depend on the models used, but also on the relative importance attributed to area, energy and throughput.
The traditional sliding window implementations use a single flow ∆_0 structure with θ = 1 and η = 1 [7] [18] [19]. Compared to this traditional structure, a double flow ∆_L structure with θ = 4 and η = 1 yields the performance gain indicated in table 3 (with the Motorola memory model). A substantial reduction in energy (factor 2.5) and delay (factor 1.7) compared to the structure used in [7] [18] [19] is obtained with only a small penalty in area.
[Table 3: Parameter | Multiplied]
The slight area increase is however not an issue when considering the entire turbo decoder. The full decoding algorithm breaks down into several iterations, where each iteration consists of four substeps: a half-iteration, an interleaving step, a half-iteration and a de-interleaving step. Physically, each such half-iteration is performed on a decoder module like the one we have optimized. All the half-iterations can run on the same hardware module, but to improve the user data rate a decoder typically consists of several identical modules [1].
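The four substeps of one iteration can be sketched as a simple loop. The `half_iteration` callback is a hypothetical stand-in for a decoder module, the permutation `pi` stands for the interleaver π, and the handling of the channel inputs is omitted to keep the control structure visible.

```python
def interleave(values, pi):
    """Reorder values so that output position i holds values[pi[i]]."""
    return [values[p] for p in pi]


def deinterleave(values, pi):
    """Inverse of interleave for the same permutation pi."""
    out = [None] * len(values)
    for i, p in enumerate(pi):
        out[p] = values[i]
    return out


def decode(pi, n_iterations, half_iteration):
    """Run n_iterations of: half-iteration, interleave, half-iteration, de-interleave."""
    extrinsic = [0] * len(pi)  # no a priori information before the first pass
    for _ in range(n_iterations):
        extrinsic = half_iteration(1, extrinsic)  # module acting as DEC1
        extrinsic = interleave(extrinsic, pi)
        extrinsic = half_iteration(2, extrinsic)  # module acting as DEC2
        extrinsic = deinterleave(extrinsic, pi)
    return extrinsic
```

Since both half-iterations call the same routine, the sketch also reflects that one physical module can serve all half-iterations when the data rate allows it.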
VIII. CONCLUSIONS
Optimizations of the original MAP turbo decoder algorithms are crucial to allow implementation of the decoder. Even though introducing sliding windows has resulted in a significant decrease in area, energy consumption and decoding delay, this setup is still highly sub-optimal. Since the MAP decoding algorithm is strongly data dominated, a systematic memory optimization is required. We have applied the first part of the IMEC DTSE methodology, more specifically a recalculation data flow transformation and several loop transformations, to the sliding window MAP algorithm. Most of these transformations are valid for the entire MAP class, not only for sliding window algorithms. They result in three interesting structures (one single flow, two double flow variants) with two additional parameters η and θ, leading to a family of new algorithms. Depending on the particular turbo code, on the implementation target (HW/SW), on the specific technology and on the optimization criterion, a selection can be made by the designer between these alternatives. As an example we have presented the area, energy and throughput tradeoffs for a particular turbo code using area/energy models. The resulting performance gains compared to published approaches are substantial.
IX. ACKNOWLEDGEMENTS
We would like to thank our colleagues Sven Wuytack and Veerle Derudder for the area/energy models they provided us with.
