Introduction
Digital communication systems, such as fiber-optic communication, wireless communication and storage applications, require very high data rates as well as powerful error correction capabilities. For the latter, performance approaching the Shannon limit is obtained with iterative decoding algorithms, such as turbo decoding [1] or LDPC decoding [2]. As those algorithms are characterized by their high complexity, achieving high data rates requires optimal exploitation of parallelism.
Parallelism in convolutional turbo decoding has been widely investigated over the last few years, either at a fine-grain level [4] [5], on the elementary symbol computations of the decoding algorithm, called the BCJR or forward-backward algorithm [3], or at a coarse-grain level [6] [7] [8] [9]. In this paper, we classify existing parallelism possibilities in convolutional turbo decoding with the BCJR algorithm. We also propose a performance analysis of parallelism efficiency related to sub-block decoding and shuffled decoding.
The rest of the paper is organized as follows. The next section presents the convolutional turbo decoding algorithm for better understanding of subsequent sections. Section 3 proposes a multi-level classification of turbo decoding parallelism. In section 4 and section 5, sub-block parallelism and component-decoder parallelism (shuffled decoding) are respectively analyzed on the basis of parallelism efficiency criteria. Finally, section 6 summarizes the results obtained and concludes the paper.
Convolutional turbo decoding
Discovered in 1993, the turbo principle [1] relies on information exchange and iterative processing between the different elementary blocks. The exchanged information is called extrinsic information.
For parallel convolutional turbo codes (Figure 1.a), the elementary blocks are the component decoders. Convolutional decoding is performed using the BCJR algorithm [3], which is the optimal algorithm for maximum a posteriori (MAP) decoding of convolutional codes. In practice, a log-domain derivation of the algorithm (log-MAP) is used. The log-MAP algorithm can be approximated by the max-log-MAP algorithm [15].
The BCJR algorithm is implemented in Soft Input Soft Output (SISO) decoders. Using input symbols and a priori extrinsic information, each SISO decoder computes a posteriori probabilities (APPs). These APPs constitute the a priori information for the other decoder and are exchanged via interleaving (Π) and deinterleaving (Π⁻¹) processes.
Figure 1.b illustrates the main steps of the BCJR algorithm. Firstly, the branch metric (or γ metric) between two states represents the probability that a transition occurs between these two states. Secondly, the forward recursion (or α recursion) computes the probability of all the states in the trellis given the past observations (eq. 1). This processing is recursive since a trellis section (i.e. the probability of all states) is computed using the previous trellis section and the branch metrics between these two sections.
Thirdly, the backward recursion (or β recursion) computes the probability of all the states in the trellis given the future observations (eq. 2). This computation is similar to the forward recursion, but the frame is processed in the backward direction.
Finally, the extrinsic information is computed from the forward recursion, the backward recursion and the extrinsic part of the branch metrics (eq. 3).
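The three steps described above can be illustrated with a minimal max-log-MAP sketch. It is not the decoder used in this paper: the trellis representation (one dictionary per section mapping a state pair to an input bit and a log-domain branch metric), the function name `max_log_map`, the known start state and the uniform end-of-frame initialization are all assumptions made for illustration. The sketch returns the full a posteriori log-likelihood ratio; a real SISO decoder would additionally remove the systematic and a priori contributions to obtain the extrinsic part.

```python
NEG_INF = float("-inf")

def max_log_map(sections, n_states, start_state=0):
    """Max-log-MAP over a frame (illustrative sketch, hypothetical data layout).

    sections[k] maps (s, s_next) -> (bit, gamma), where gamma is the
    log-domain branch metric of that transition in trellis section k.
    """
    K = len(sections)

    # Forward (alpha) recursion: each section is derived from the previous one.
    alpha = [[NEG_INF] * n_states for _ in range(K + 1)]
    alpha[0][start_state] = 0.0  # assumption: the encoder starts in a known state
    for k, sec in enumerate(sections):
        for (s, s_next), (_, g) in sec.items():
            alpha[k + 1][s_next] = max(alpha[k + 1][s_next], alpha[k][s] + g)

    # Backward (beta) recursion: same computation, frame processed backwards.
    beta = [[NEG_INF] * n_states for _ in range(K + 1)]
    beta[K] = [0.0] * n_states  # assumption: unknown final state, uniform init
    for k in range(K - 1, -1, -1):
        for (s, s_next), (_, g) in sections[k].items():
            beta[k][s] = max(beta[k][s], g + beta[k + 1][s_next])

    # Soft output: best path metric with bit 1 minus best path metric with bit 0.
    llr = []
    for k, sec in enumerate(sections):
        best = {0: NEG_INF, 1: NEG_INF}
        for (s, s_next), (bit, g) in sec.items():
            best[bit] = max(best[bit], alpha[k][s] + g + beta[k + 1][s_next])
        llr.append(best[1] - best[0])
    return llr
```

The three loops correspond to the forward recursion, the backward recursion and the soft-output computation, which is why they are natural candidates for the parallel execution discussed in the following sections.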
BCJR metric level parallelism
The BCJR metric level parallelism concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder (Figure 1). It exploits the inherent parallelism of the trellis structure, and also the parallelism of BCJR computations [4] [5].
Parallelism of trellis transitions
Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain [15], these operations are either ACS operations (Add-Compare-Select) for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [15]) for the log-MAP algorithm.
Each BCJR computation (eq. 1, 2, 3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree.
However, this parallelism implies a low area overhead, as only the ACS units have to be duplicated. In particular, no additional memories are required since all the parallelized operations are executed on the same trellis section, and consequently on the same data.
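To make the ACS and ACSO operations concrete, the following sketch updates one trellis section. The names `acs`, `acso` and `update_section`, and the `incoming` layout (two incoming transitions per state), are assumptions introduced for illustration. Each state update reads only the previous section, so the per-state operations are independent and can be mapped to duplicated ACS units.

```python
import math

def acs(x, y):
    """Add-Compare-Select core for max-log-MAP: keep the larger path metric."""
    return max(x, y)

def acso(x, y):
    """ACS with correction offset for log-MAP: exact log(exp(x) + exp(y))."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def update_section(alpha, incoming, combine=acs):
    """Compute the next state metrics from the previous ones.

    incoming[s_next] = ((p0, g0), (p1, g1)) lists the two predecessor states
    and branch metrics of state s_next (hypothetical layout). Every entry of
    the result is independent of the others, so in hardware all of them can
    be evaluated at the same time.
    """
    return [combine(alpha[p0] + g0, alpha[p1] + g1)
            for (p0, g0), (p1, g1) in incoming]
```

Passing `combine=acso` switches the same section update from max-log-MAP to log-MAP without changing its structure.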
Parallelism of BCJR computations
A second metric parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations.
Parallel execution of the backward recursion and APP computations was proposed with the original Forward-Backward scheme, depicted in Figure 1.c. In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part.
To increase this parallelism degree, several schemes are proposed [8] . backward recursion computations. This is performed without any memory increase and only BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but still limited in parallelism degree.
In conclusion, BCJR metric level parallelism achieves optimal area efficiency, as it does not affect memory size, which occupies most of the area in a turbo decoder circuit. Nevertheless, the parallelism degree is limited by the decoding algorithm and the code structure. Thus, achieving a higher parallelism degree implies exploring higher processing levels.
BCJR-SISO decoder level parallelism
The second level of parallelism concerns the SISO decoder level. It consists of the use of multiple SISO decoders, each executing the BCJR algorithm and processing a sub-block of the same frame in one of the two interleaving orders. This level of parallelism can reach a reasonable parallelism degree and preserve memory area [6] .
Two kinds of parallelism exist in this class: sub-block parallelism and component-decoder parallelism.
Sub-block parallelism
In sub-block parallelism, each frame is divided into M sub-blocks and each sub-block is then processed on a BCJR-SISO decoder using adequate initializations [6] [7] [8] [9]. A graphical formalism is proposed in [8] to compare various existing sub-block decoding schemes with respect to parallelism degree and memory efficiency.
Besides duplication of BCJR-SISO decoders, this parallelism imposes two other constraints. On the one hand, interleaving has to be parallelized in order to extend the communication bandwidth proportionally [12]. On the other hand, BCJR-SISO decoders have to be initialized adequately, as detailed in section 4.
Component-decoder parallelism
Component-decoder parallelism is a new kind of parallelism that has become operational with the introduction of the shuffled decoding technique [10]. The basic idea of shuffled decoding is to execute all component decoders in parallel and to exchange APP information as soon as it is created.
With this decoding scheme, the decoding time could theoretically be halved in comparison with the serial approach for the same number of iterations. Section 5 analyzes the performance that can be obtained with this kind of parallelism.
Turbo-decoder level parallelism
The highest level of parallelism duplicates whole turbo decoders to process iterations and/or frames in parallel.
Iteration parallelism occurs in a pipelined fashion with a maximum pipeline depth equal to the iteration number, whereas frame parallelism presents no limitation in parallelism degree.
Turbo-decoder level parallelism, however, is too area-expensive (all memories and computation resources are duplicated) and presents no gain in decoding latency.
Initialization in sub-block parallelism
As described in section 3, sub-block parallelism takes place at frame level and requires initializations. These initializations are mandatory as information on recursion metrics is available at frame ending points, but not at sub-block ending points [7] .
An estimation of this undetermined information can be obtained either by acquisition or by message passing between neighboring sub-blocks.
Initialization by acquisition
This widely used initialization method consists of estimating recursion metrics using an overlapping region called the acquisition window or prologue.
Starting from a trellis section where all the states are initialized to a uniform constant, the acquisition window is processed over its length, denoted AL, to provide reliable state metrics at the beginning of the sub-block.
This acquisition length is determined at design time so that the error-rate degradation is negligible. It is set as a function of the number of redundancies in the prologue, typically 6. Another empirical rule recommends an acquisition length of 3 to 5 times the constraint length of the code [7].
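As an illustration of how acquisition windows overlap neighboring sub-blocks, the sketch below lists, for the forward recursion, the trellis range each SISO decoder would traverse. The function name `acquisition_windows`, the assumption that the parallelism degree divides the frame length, and the clipping of the prologue at the frame start (where the state is known) are choices made for this example only.

```python
def acquisition_windows(n, d, al):
    """Forward-recursion ranges for d sub-blocks of a frame of n symbols.

    Returns one (prologue_start, start, end) tuple per sub-block: the
    recursion starts at prologue_start with uniform state metrics, and the
    metrics it produces are only kept from `start` up to `end` (exclusive).
    Assumes d divides n (hypothetical simplification).
    """
    size = n // d
    windows = []
    for i in range(d):
        start = i * size
        # The prologue covers the last `al` symbols of the previous sub-block;
        # it is clipped at 0 because the frame start needs no estimation.
        prologue_start = max(0, start - al)
        windows.append((prologue_start, start, start + size))
    return windows

# Example: 16-symbol frame, 4 sub-blocks, acquisition length 3
# -> [(0, 0, 4), (1, 4, 8), (5, 8, 12), (9, 12, 16)]
```

The backward recursion would use a mirrored prologue taken from the following sub-block.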
When all the sub-blocks are initialized with the acquisition method, the decoding time (td), the speed gain (Sg) and the additional computation ratio (R) can be expressed as:
where N represents the frame length, d the sub-block parallelism degree and it the number of iterations. Equation 4 shows clearly that the decoding time tends towards a constant value when the parallelism degree increases. Thus sub-block parallelism with initialization by acquisition encounters a throughput ceiling, and the maximum speed gain is equal to N/(AL + 1). The corresponding efficiency, which is defined as the speed gain Sg divided by the parallelism degree d, decreases to the minimum value 1/(AL + 1).
0-7803-9521-2/06/$20.00 ©2006 IEEE.
Furthermore the additional computation ratio, which concerns recursion computations and input data memory accesses, increases linearly with the parallelism degree.
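The limits stated above can be checked numerically with a simple timing model. The expressions used here are an assumption, chosen because they reproduce the stated maximum speed gain N/(AL + 1), the minimum efficiency 1/(AL + 1) and a computation overhead that grows linearly with d; they are not necessarily the exact form of the equations in this paper. Time is counted in trellis sections per recursion, and `acquisition_model` is a hypothetical name.

```python
def acquisition_model(n, d, al, it):
    """Assumed timing model for sub-block parallelism with acquisition.

    n: frame length, d: sub-block parallelism degree,
    al: acquisition length, it: number of iterations.
    """
    t_d = it * (n / d + al)        # each decoder handles n/d symbols plus a prologue
    speed_gain = (it * n) / t_d    # relative to one decoder without a prologue
    efficiency = speed_gain / d    # speed gain divided by parallelism degree
    extra_ratio = d * al / n       # prologue computations relative to useful ones
    return t_d, speed_gain, efficiency, extra_ratio

# At the extreme d = n (one symbol per sub-block) the model gives
# speed_gain = n / (al + 1) and efficiency = 1 / (al + 1).
```

Under this model the speed gain saturates once N/d becomes small compared with AL, which is the throughput ceiling discussed above.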
Initialization by message passing
The second method dynamically initializes a sub-block with recursion metrics computed during the previous iteration in the neighboring sub-blocks [9]. This technique therefore requires no additional hardware except some communication resources between BCJR SISO units.
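The following sketch illustrates the message passing principle for the forward recursion only, in the max-log domain. The names `forward` and `message_passing_forward` and the trellis layout are assumptions made for illustration. To keep the example checkable, the branch metrics are held fixed across iterations, which is a simplification: in an actual turbo decoder they change at every iteration as extrinsic information is updated.

```python
NEG_INF = float("-inf")

def forward(sections, alpha0):
    """Max-log forward recursion over a list of trellis sections.

    sections[k] maps (s, s_next) -> log-domain branch metric (hypothetical layout).
    """
    alpha = list(alpha0)
    for sec in sections:
        nxt = [NEG_INF] * len(alpha)
        for (s, s_next), g in sec.items():
            nxt[s_next] = max(nxt[s_next], alpha[s] + g)
        alpha = nxt
    return alpha

def message_passing_forward(sections, n_states, d, iterations):
    """Run d sub-block forward recursions per iteration (assumes d divides the frame).

    Each sub-block starts from the final metrics its left neighbor produced
    during the previous iteration; the first sub-block starts from the known
    initial state 0, and the other boundaries start uniform.
    """
    size = len(sections) // d
    known = [0.0] + [NEG_INF] * (n_states - 1)
    bounds = [known] + [[0.0] * n_states for _ in range(d - 1)]
    finals = []
    for _ in range(iterations):
        # All sub-blocks use boundaries from the previous iteration, so in
        # hardware they can run concurrently.
        finals = [forward(sections[i * size:(i + 1) * size], bounds[i])
                  for i in range(d)]
        bounds = [known] + finals[:-1]  # pass metrics to the right neighbor
    return finals
```

With fixed branch metrics, exact boundary values propagate by one sub-block per iteration, so the last sub-block matches the non-parallel recursion after d iterations. This gives some intuition for why higher parallelism degrees may need additional iterations.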
To evaluate this technique, bit error rate performance degradation has to be evaluated and compensated with additional iterations.
In Figure 2 , Frame Error Rate (FER) performance is represented for different parallelism degrees in function of iteration number. This figure shows that asymptotic error rate is not affected by message passing approach whatever the parallelism degree. Thus it ensures that initialization by message passing can be used without degradation. Like sub-block parallelism with initialization by acquisition, using initialization by message passing method also encounters a throughput ceiling value. The maximum speed gain is roughly equal to it/C, corresponding to an efficiency ofit/(C N).
Threshold position strongly depends on the sub-block size. It can be physically interpreted as the minimum sub-block size, which provides reliable recursion values at the end of the first iteration. Under this minimum size, recursion values have to be refined using more iterations.
Thus this threshold will change according to the frame size and the code rate, as this latter has an influence on recursion reliability.
Efficiency and performance comparison
In Figure 4 , sub-block parallelism efficiency is compared between both initialization methods. Under the presented conditions, efficiency of message passing technique and acquisition technique with 16-symbol acquisition length are quite similar at high sub-block parallelism degree. Nevertheless at low sub-block parallelism degree, message passing technique efficiency is constant and equal to maximum efficiency and thus outperforms acquisition technique whatever the acquisition length.
0-7803-9521-2/06/$20.00 §2006 IEEE. Furthermore, the initialization by acquisition degrades error-rate performance, whereas initialization by message passing induces no degradation. Figure 5 illustrates Frame Error Rate (FER) performance considering a DVB-RCS code [14] with a 47-sub-blockparallelism degree. 0.1 5dB-degradation is observed between initialization with 32-symbol acquisition length and message passing initialization or Max-log-MAP without parallelism. Comparison between both techniques tends clearly in favor of message passing technique, which enables better error rate performance without resource overhead but gives similar efficiency.
However simulations results (Figure 4 ) also show that sub-block parallelism, whatever the initialization method, becomes quite inefficient at high parallelism degrees.
Component-decoder parallelism analysis
As described in section 3, component-decoder parallelism takes advantage of the shuffled decoding technique, which executes all component decoders in parallel and exchanges APP information as soon as it is created. The following section analyzes the efficiency of this parallelism.
Shuffled decoding efficiency
Like sub-block parallelism efficiency, component-decoder parallelism efficiency is defined as the speed gain divided by the parallelism degree at equivalent error-rate performance. For shuffled decoding, the parallelism degree is limited to the number of component decoders (usually 2), and only the iteration number can vary in the speed gain (eq. 5). Thus shuffled decoding efficiency depends only on the number of iterations needed to reach the same error-rate performance as serial decoding. Simulation results demonstrate efficiency ranging from 0.6 to 0.95.
By definition, shuffled decoding efficiency is computed for a given set of component decoders, with interleavers of a defined size and at a fixed SNR. Efficiency can always be computed along the turbo decoder convergence process, as shuffled decoding and serial decoding converge to the same value. Simulations reveal that efficiency is almost invariant along turbo decoder convergence. Then, for a defined error rate at various SNRs, it is also possible to show that efficiency is SNR-invariant. So shuffled decoding efficiency can only depend on interleaving rules and BCJR-SISO decoder parallelism.
Shuffled decoding and interleaving
The dependency between shuffled decoding and interleaving has already been studied in [10]. According to the interleaving law Π, symbols belong to three different classes. The first class contains all points processed at the same time in interleaved and deinterleaved order, such that t(k) = t(Π(k)). The second class contains all points verifying t(k) < t(Π(k)) and the third all points verifying t(k) > t(Π(k)).
A symbol of the first class is processed concurrently by both component decoders. So, for such a symbol, the decoders cannot take advantage of the APPs sent by the other decoder before the next iteration.
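The three classes can be counted for a given interleaver with the short sketch below. The timing model is an assumption made for illustration: with sub-block parallelism degree d, the symbol at position i is taken to be processed at time t(i) = i mod (N/d), and the interleaver is given as a permutation list `perm` with perm[k] standing for the interleaved index of k. The function name `classify_symbols` is hypothetical.

```python
def classify_symbols(perm, d=1):
    """Count symbols in each class for an interleaver permutation.

    perm[k] is the interleaved index associated with k (hypothetical layout).
    Assumes d divides len(perm) and t(i) = i % (len(perm) // d).
    Returns (first, second, third) class sizes.
    """
    size = len(perm) // d
    first = second = third = 0
    for k, pk in enumerate(perm):
        t_nat, t_int = k % size, pk % size
        if t_nat == t_int:
            first += 1    # processed at the same time by both decoders
        elif t_nat < t_int:
            second += 1
        else:
            third += 1
    return first, second, third

# Example: the permutation [2, 3, 0, 1] has no first-class symbol with d = 1,
# but with d = 2 every symbol falls into the first class.
```

Under this model, increasing d shrinks the range of possible processing times, so coincidences between the two orders become more frequent.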
Using component-decoder and sub-block parallelism at the same time, the number of first-class symbols increases with sub-block parallelism degree. However shuffled decoding efficiency increases with sub-block parallelism degree as represented in Table 2 and Table 3 . This result can be explained by the fact that iteration number increases with sub-block parallelism degree. Thus the penalty to next iteration imposed by defined first class will be less significant in the decoding and in this way shuffled decoding efficiency is improved.
Combining component-decoder and sub-block parallelism
From these results, it makes sense to combine component-decoder parallelism and sub-block parallelism. Indeed component-decoder parallelism efficiency increases with sub-block parallelism degree and at the same time sub-block parallelism efficiency decreases.
To determine when component-decoder parallelism becomes more efficient than sub-block parallelism, sub-block parallelism with parallelism degree d should become the new reference in the efficiency computation.
Thus the efficiency of doubling sub-block parallelism degree to a degree 2d can be compared with efficiency of shuffled decoding at the reference parallelism degree d to select the most efficient parallelism.
In our examples, shuffled decoding becomes more efficient for d greater than 4 in Table 2 and 16 in Table 3. The results obtained illustrate that, beyond a certain bound, component-decoder parallelism becomes more efficient than sub-block parallelism.
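The selection criterion can be sketched as a comparison of two efficiencies at a reference degree d. The model is an assumption: decoding time is taken as proportional to the number of iterations times the sub-block length, with message passing initialization and no other overhead, so that doubling either kind of parallelism halves the time per iteration. The function names and the iteration counts in the example are hypothetical and do not come from Table 2 or Table 3.

```python
def subblock_doubling_efficiency(it_d, it_2d):
    """Efficiency of going from degree d to 2d with sub-block parallelism.

    Speed gain is 2 * it_d / it_2d (half the time per iteration, but it_2d
    iterations are needed instead of it_d); dividing by the added degree 2
    gives it_d / it_2d.
    """
    return it_d / it_2d

def shuffled_efficiency(it_d, it_shuffled):
    """Efficiency of adding shuffled decoding at the reference degree d.

    Both component decoders run concurrently, so the speed gain is
    2 * it_d / it_shuffled and the efficiency is it_d / it_shuffled.
    """
    return it_d / it_shuffled

def select_parallelism(it_d, it_2d, it_shuffled):
    """Return the option with the higher efficiency at reference degree d."""
    if shuffled_efficiency(it_d, it_shuffled) > subblock_doubling_efficiency(it_d, it_2d):
        return "shuffled"
    return "sub-block"

# Hypothetical example: 8 iterations at degree d, 10 at degree 2d,
# 9 with shuffled decoding at degree d -> shuffled decoding is selected.
```

In this simplified form the criterion reduces to comparing the iteration counts required by the two options to reach the same error rate.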
Conclusion
In this paper we analyzed and classified the various parallelism techniques that could be used in convolutional turbo decoding with the BCJR algorithm. The proposed three-level classification includes: BCJR metric level parallelism, BCJR SISO decoder level parallelism, and turbo-decoder level parallelism.
It has been shown that sub-block initialization is more efficient with message passing technique than with acquisition technique and also that sub-block parallelism becomes inefficient for high sub-block parallelism degrees. On the contrary, component-decoder parallelism, with the newly introduced shuffled decoding technique, becomes more efficient for high sub-block parallelism degrees. Efficiency of this parallelism depends only on interleaving rules.
Furthermore a criterion based on analysis of parallelism efficiency is proposed to help the selection and the use of these two parallelism techniques.
