Abstract-It has recently been shown that a sequence of R = q(M − 1) redundancy-free trellis stages of a recursive convolutional decoder can be compressed into a sequence of L = M − 1 trellis stages, where M is the number of states of the trellis and q is a positive integer. In this paper, we show that for an M-state turbo decoder, among the L compressed trellis stages, only m = 3 or even m = 2 are necessary. The so-called m-min algorithm can be used to increase the throughput when decoding a high-rate turbo-code and/or to reduce its power consumption.
I. INTRODUCTION
THE quality of an error control code design can be evaluated in terms of decoding performance, implementation cost (area, power dissipation), and decoding throughput. A general overview of error control code decoders can be found in [1]. The recent specifications of wireless systems (LTE, HSPA [2]) propose the use of turbo-codes with very high code rates (typically, between 0.8 and 0.98). In contrast to code construction, there are very few papers dedicated to decoders for such high rates. Most of the reported architectures use the basic structure of a rate-1/3 decoder, with parameters (i.e., window lengths) optimized for high rates to trade off performance and decoding throughput. In [3], a method is proposed to directly exploit the existence of long sequences (up to 100) of Redundancy Free Trellis Sections (RFTS, or sequences of bits without redundancy) to reduce the complexity of part of the decoder: during the acquisition process, any RFTS of size R is replaced by a shorter RFTS sequence of size L = M − 1, with additional (R mod L) steps of shuffling. In this paper, we show that, in the context of a Max-Log-MAP decoder [5], among the L steps of the compressed RFTS, only m = 3 or m = 2 are really useful, which allows further architectural optimization.
The remainder of the paper is divided into four sections. Section II gives enough information about the trellis compression to make the paper self-contained. Section III then presents the sub-optimal 3-min simplification, followed in Section IV by a discussion of the hardware implementation. Finally, Section V concludes the paper.
II. PRINCIPLE OF THE RFTS TRELLIS COMPACTION
In this section, we will first recall the problem of acquisition for high rate turbo-codes. Then, we will present the principle of trellis compaction at the encoding level before deriving it at the decoding level.
A. Acquisition for High Rate Turbo-Codes
The implementation of a turbo-code is a well investigated area. The standard implementation uses the Log-MAP algorithm or the Max-Log-MAP algorithm [5] along with the sliding window algorithm [6]. The latter consists in dividing the frame of length K into windows of length W and processing the forward-backward steps on a W-sized block instead of a whole K-sized block. To process the pth window, an accurate estimation of the initial forward state metrics $\alpha_{pW}$ and backward state metrics $\beta_{pW+W}$ is required. One possible scheduling is to perform the backward recursion directly from index K down to 1, which naturally provides the initial $\beta_{pW+W}$ states, and to obtain the initial $\alpha_{pW}$ states by an acquisition of size W, starting from state $\alpha^0_{pW-W}$ and ending at state $\alpha_{pW}$ (see Fig. 1). This pre-processing is called "acquisition". The initial value $\alpha^0_{pW-W}$ can be either the all-zero state vector (if there is no a priori knowledge on the initial state) or a forward state vector stored from the previous iteration. This last method is commonly called Next Iteration Initialization (NII) [7].
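The window and acquisition bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `window_schedule`, the 0-based stage indices and the value W = 128 are hypothetical, while K = 5144 is the frame length used in the simulations reported later in the paper.

```python
def window_schedule(K, W):
    """List, for each window p, the acquisition range and the decoding range.

    Ranges are half-open intervals of trellis-stage indices in [0, K).
    The acquisition of window p covers the W stages preceding the window,
    so it is empty for p = 0 (the initial state is known there).
    """
    schedule = []
    num_windows = (K + W - 1) // W          # the last window may be shorter
    for p in range(num_windows):
        start = p * W
        end = min(start + W, K)
        acq_start = max(start - W, 0)       # acquisition starts at index pW - W
        schedule.append({
            "window": p,
            "acquisition": (acq_start, start),
            "decode": (start, end),
        })
    return schedule


# Hypothetical example: frame of K = 5144 stages, windows of W = 128 stages.
sched = window_schedule(K=5144, W=128)
for entry in sched[:3]:
    print(entry)
```

Every window except the first triggers W extra forward steps, which is why the acquisition can account for a large share of the forward processing.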
For high code rates, the acquisition needs to be long enough to contain some non-RFTS (i.e., trellis sections associated with non-punctured redundancy bits). Indeed, starting from the all-zero state vector, if the acquisition processes only RFTS, the final state vector is also the all-zero state vector and the acquisition is thus useless. Simulations confirm that a long acquisition is required, even if the NII technique is used. This means that the acquisition step consumes a significant portion of the decoder's processing effort, which reduces the efficiency and throughput of the hardware implementation.

1089-7798 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
B. Compression of RFTS at the Encoding Level
Let us consider the trellis compression in the case of hard information bits (a decision is made on the value of the received bit, i.e., no soft value is available). For an 8-state convolutional encoder, the state-space representation of [8] gives

$$X_{k+1} = A X_k + B D_k \tag{1a}$$

$$V_k = C X_k + D D_k \tag{1b}$$
where $X_k$, $D_k$, and $V_k$ are respectively the state of the encoder, the input information bit and the output vector at time k. The matrices (A, B, C, D) are respectively the state matrix, the entry matrix, the output matrix and the feedforward matrix. In (1a) and (1b), arithmetic is over GF(2). Moreover, for a recursive code, the matrix A satisfies $A^L = \mathrm{Id}$, where Id is the identity matrix. Fig. 2(a) shows the structure of the encoder (the generation of output bits, corresponding to (1b), is omitted). Starting from a state $X_0$ and the bit sequence $D_k$, k = 0, 1, . . . , R − 1, the final state of the encoder is given by

$$X_R = A^R X_0 + \sum_{k=0}^{R-1} A^{R-1-k} B D_k \tag{2}$$
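Let us replace index k with qL + l, where l = k mod L and $q = \lfloor k/L \rfloor$. For the sake of simplicity, k mod L will be denoted $k_{|L}$. Then (2) can be rewritten as

$$X_R = A^R X_0 + \sum_{\substack{q \ge 0,\ 0 \le l < L \\ qL + l < R}} A^{R-1-qL-l} B D_{qL+l} \tag{3}$$

Since $A^L = \mathrm{Id}$, (3) can be expressed as:

$$X_R = A^{R_{|L}} X_0 + \sum_{\substack{q \ge 0,\ 0 \le l < L \\ qL + l < R}} A^{(R-1-l)_{|L}} B D_{qL+l} \tag{4}$$

where $R_{|L}$ = R mod L. We perform the change of variable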
Let us give an example to illustrate (5) with R = 23, L = 7, and l = 3. The left term of (5) gives
, the right term of (5) gives
and both terms are equal. Let D a h be the "aggregated bit" corresponding to the bits having the same residue h mod L:
Note:
Since (7) can be expressed simply as:
Note that (8) has the same structure as (1a). Moreover, the computation of $X'_0 = A^{R_{|L}} X_0$ can be performed in $R_{|L}$ steps of (1a) with dummy null input bits D. In that case, the last $R_{|L}$ stages are associated with dummy null bits. This solution was originally presented in [3].
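The hard-bit compression can be checked numerically with a short Python sketch. Several choices here are assumptions rather than the paper's own definitions: the encoder is taken to have the feedback polynomial 1 + D^2 + D^3 (the encoder of Fig. 2(a) may differ), the state is packed into an integer, and the residue indexing inside the hypothetical helper `compress` is one convention that satisfies the algebra, not necessarily the one used in (5) to (8).

```python
import random

L = 7  # L = M - 1 for an M = 8-state code


def next_state(s, d):
    """One step of the state recursion (1a) for the assumed encoder.

    Assumption: feedback polynomial 1 + D^2 + D^3. The integer s packs the
    three register bits, with the most recent bit in position 2.
    """
    feedback = d ^ ((s >> 1) & 1) ^ (s & 1)
    return (feedback << 2) | (s >> 1)


def run(s, bits):
    """Final state after feeding `bits` to the encoder, starting from s."""
    for d in bits:
        s = next_state(s, d)
    return s


def compress(bits):
    """Replace R hard bits by (R mod L) dummy zeros plus L aggregated bits."""
    R = len(bits)
    agg = [0] * L
    for k, d in enumerate(bits):
        # Bit k is followed by R-1-k further steps. Since A^L = Id, only
        # this count modulo L matters, so bits sharing it are XORed.
        agg[(R - 1 - k) % L] ^= d
    # The dummy zeros only shuffle the state. The aggregated bit that must
    # be followed by L-1 further steps is fed first.
    return [0] * (R % L) + agg[::-1]


# A^L = Id: L zero-input steps map every state back to itself.
assert all(run(s, [0] * L) == s for s in range(8))

random.seed(1)
bits = [random.randint(0, 1) for _ in range(23)]
for s0 in range(8):
    assert run(s0, compress(bits)) == run(s0, bits)
print(len(bits), "stages replaced by", len(compress(bits)))
```

With R = 23 and L = 7 the compressed sequence has (23 mod 7) + 7 = 9 stages, and the final state is the same for every starting state.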
C. Trellis Compaction
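Let us recall the classical forward recursion of the MAP decoder. Let $\alpha_k$ be the forward state metrics at time k and $\gamma_k$ the branch metrics at time k. The recursion is given by

$$\alpha_{k+1}(s) = \max^*_{s'} \big( \alpha_k(s') + \gamma_k(s', s) \big) \tag{9}$$

where the max* is taken over the states s' that have a branch to state s.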
When the Max-Log-MAP algorithm is used, the max* operator in (9) is simply replaced by the max operator. Fig. 2(b) shows the computation of the state metric $\alpha_{k+1}(s)$.
Thanks to the min-sum approximation, (11) can be simplified by separating the modulus of $\gamma^a_l$ and the sign of $\gamma^a_l$:
D. Example of RFTS Compaction
Let us consider the M = 8-state encoder defined in Fig. 2(a) (the MATLAB code used for this example can be downloaded at [9]). Let us assume an RFTS sequence of length R = 16. Table I shows the first forward state metrics and the last forward state metrics of the RFTS sequence. In order to bound the growth of the state metrics, at each stage, the minimum value of the state metrics is subtracted from all the state metrics. After this subtraction, the minimum metric is always equal to zero. Fig. 2(c) shows the details of the computation of $\alpha_1$ from $\alpha_0$ and $\gamma_0$. Let us compute the aggregated metrics $\gamma^a_l$. Table II shows the 9 steps of the compressed trellis. As expected, the last stage of the classical trellis (last column of Table I) and the last stage of the compressed trellis (last column of Table II) are equal. This algorithm implies the use of L minima. It is called the L-min algorithm. In terms of performance, processing the acquisition sequences of the sliding window algorithm without or with trellis compression gives identical final results. With compression, the number of clock cycles can be significantly reduced, leading to a more efficient architecture, as described briefly in [3].

Now the question arises whether it is possible to further decrease the complexity of the L-min algorithm by trading off complexity and performance.
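The equality between the classical and the compressed forward recursions can be reproduced with a small Python sketch. It rests on several assumptions that are not taken from the paper: the encoder uses the feedback polynomial 1 + D^2 + D^3, the recursion is written in a min-sum (cost) convention where following the hard decision costs 0 and contradicting it costs |γ|, and aggregation keeps the XOR of the hard decisions and the minimum magnitude of each residue class. The function names and the random test values are hypothetical.

```python
import math
import random

M, L = 8, 7


def next_state(s, d):
    """Assumed encoder: feedback polynomial 1 + D^2 + D^3."""
    feedback = d ^ ((s >> 1) & 1) ^ (s & 1)
    return (feedback << 2) | (s >> 1)


def stage(alpha, hard, mag):
    """One min-sum forward stage on a redundancy-free trellis section."""
    new = [math.inf] * M
    for s in range(M):
        for d in (0, 1):
            cost = alpha[s] + (0 if d == hard else mag)
            t = next_state(s, d)
            new[t] = min(new[t], cost)
    return new


def normalise(alpha):
    """Subtract the minimum so that the smallest metric is zero."""
    lo = min(alpha)
    return [a - lo for a in alpha]


def forward_full(alpha, hards, mags):
    """Classical recursion: one stage per received bit."""
    for h, m in zip(hards, mags):
        alpha = normalise(stage(alpha, h, m))
    return alpha


def forward_compressed(alpha, hards, mags):
    """(R mod L) shuffling stages followed by L aggregated stages."""
    R = len(hards)
    agg_hard = [0] * L
    agg_mag = [math.inf] * L
    for k in range(R):
        h = (R - 1 - k) % L                  # residue class of bit k
        agg_hard[h] ^= hards[k]              # sign: XOR of hard decisions
        agg_mag[h] = min(agg_mag[h], mags[k])  # modulus: minimum magnitude
    for _ in range(R % L):
        alpha = stage(alpha, 0, math.inf)    # dummy bit: pure permutation
    for h in reversed(range(L)):
        alpha = normalise(stage(alpha, agg_hard[h], agg_mag[h]))
    return alpha


random.seed(0)
R = 16
hards = [random.randint(0, 1) for _ in range(R)]
mags = [random.randint(1, 31) for _ in range(R)]
alpha0 = normalise([random.randint(0, 20) for _ in range(M)])
assert forward_full(alpha0, hards, mags) == forward_compressed(alpha0, hards, mags)
```

Integer metrics are used so that the comparison is exact. The compressed run uses (16 mod 7) + 7 = 9 stages instead of 16.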
The analysis of the Max-Log-MAP algorithm for a sequence of RFTS shows that a high magnitude of γ simply shuffles the state metric values α, while a low magnitude of γ significantly decreases the dynamics of the α terms (i.e., a low-reliability bit gives more uncertainty on the current state of the encoder). This indicates that it may be sufficient to consider only a small subset of the initial γ values to process the RFTS sequence. In practice, the only pertinent criterion to evaluate an algorithm is the performance loss. To this end, we have run several simulations performing all acquisitions using the $m$-min$_a$ algorithm.

We also tested a modified version of the $m$-min$_a$ method, performing the saturation before the aggregation of the bits. In this case, the R − m highest values of the RFTS sequence are saturated prior to the aggregation. This variation of the $m$-min$_a$ method is called the $m$-min$_g$ method. Note that the $m$-min$_a$ and the $m$-min$_g$ methods lead to the same result if m = 1. For m > 1, the two methods give the same result only if the m minimum values before aggregation all have distinct indices modulo L.

Table IV shows bit-true simulation results of an HSPA turbo decoder with K = 5144 using the sliding window technique associated with the NII technique. For r = 0.98, various acquisition lengths W are given to illustrate the need for high values of W at very high code rates. In all cases, the 3-min algorithm (not shown in the table) does not show significant performance degradation. The 2-min$_g$ and 1-min$_g$ algorithms degrade the performance by less than 0.02 dB and 0.2 dB respectively for all code rates.
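The difference between saturating after aggregation and saturating before aggregation can be illustrated on the magnitudes alone. The Python sketch below is an interpretation of the description above, not the authors' implementation: saturation is modelled by an infinite magnitude (hardware would use the largest representable value), aggregation is assumed to keep the minimum magnitude per residue class, and the names `m_min_a`, `m_min_g` and the sample values are hypothetical.

```python
import math

L = 7


def aggregate(mags):
    """Minimum magnitude per residue class (assumed aggregation rule)."""
    R = len(mags)
    agg = [math.inf] * L
    for k, m in enumerate(mags):
        h = (R - 1 - k) % L
        agg[h] = min(agg[h], m)
    return agg


def keep_m_smallest(values, m):
    """Saturate every value except the m smallest ones."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    keep = set(order[:m])
    return [v if i in keep else math.inf for i, v in enumerate(values)]


def m_min_a(mags, m):
    """Aggregate first, then keep m minima among the L aggregated values."""
    return keep_m_smallest(aggregate(mags), m)


def m_min_g(mags, m):
    """Keep m minima among the R received values, then aggregate."""
    return aggregate(keep_m_smallest(mags, m))


# R = 16 magnitudes; the two smallest (1 and 2) share the same residue.
mags = [9, 1, 8, 7, 6, 5, 4, 3, 2, 9, 9, 9, 9, 9, 9, 9]
print(m_min_a(mags, 1) == m_min_g(mags, 1))
print(m_min_a(mags, 2))
print(m_min_g(mags, 2))
```

In this example the two smallest received values fall in the same residue class, so for m = 2 the saturation before aggregation leaves a single finite aggregated value while the saturation after aggregation leaves two. For m = 1 both variants coincide, in line with the remark above.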
IV. ARCHITECTURE OPTIMIZATION
In [3], the authors proposed to perform the bit aggregation "on the fly" during the previous iteration. The time for processing the acquisition is thus reduced and the decoding throughput is increased. In this paper, we present new applications of RFTS compression to reduce the power consumption of the decoder. Let us focus on the 2-min$_g$ aggregation in an RFTS acquisition sequence of length R. The RFTS compression can be done on the fly in three steps.
Step one: during the first $R - R_{|L} - L$ clock cycles, the acquisition forward unit is frozen (thus saving energy).
Step two: during the next $R_{|L}$ clock cycles, the state metrics are permuted (processing of $R_{|L}$ dummy bits).
Step three: during the last L clock cycles, the forward unit processes the aggregated bits computed on the fly by Algorithm 1.
For R = 100 (code rate 0.98), this method allows the decoder to freeze the acquisition forward unit about 90% of the time while the low-complexity bit aggregation algorithm works. This method does not increase the throughput but reduces power dissipation.
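The cycle budget of the three steps follows directly from R and L. The short Python sketch below evaluates it for the R = 100 case quoted above, assuming the 8-state code (L = 7); the function name `rfts_schedule` is hypothetical.

```python
def rfts_schedule(R, L):
    """Clock cycles spent in each of the three steps for an RFTS of length R."""
    shuffle = R % L                 # step two: permutations with dummy bits
    process = L                     # step three: aggregated stages
    freeze = R - shuffle - process  # step one: forward unit frozen
    if freeze < 0:
        raise ValueError("RFTS shorter than the compressed sequence")
    return freeze, shuffle, process


freeze, shuffle, process = rfts_schedule(R=100, L=7)
print(freeze, shuffle, process, freeze / 100)
```

For R = 100 and L = 7 this gives 91 frozen cycles, 2 permutation cycles and 7 processing cycles, consistent with the figure of about 90% given in the text.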
Algorithm 1: On-the-fly aggregation of the bits for the 2-min$_g$ algorithm.
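One way an on-the-fly 2-min aggregation could be organised is sketched below in Python. This is a hypothetical illustration and should not be read as the authors' Algorithm 1: it assumes that the RFTS length R is known in advance, that one (hard decision, magnitude) pair arrives per clock cycle, that saturation is modelled by an infinite magnitude, and that the class name `TwoMinAggregator` and its methods are free choices.

```python
import math


class TwoMinAggregator:
    """Streaming aggregation that keeps only the two smallest magnitudes."""

    def __init__(self, R, L):
        self.R, self.L = R, L
        self.k = 0                                  # index of the next bit
        self.hard = [0] * L                         # XOR of hard bits per residue
        self.best = [(math.inf, None), (math.inf, None)]  # (magnitude, residue)

    def push(self, hard_bit, magnitude):
        """Consume one received bit: update the parity and the two minima."""
        h = (self.R - 1 - self.k) % self.L
        self.hard[h] ^= hard_bit
        if magnitude < self.best[0][0]:
            self.best = [(magnitude, h), self.best[0]]
        elif magnitude < self.best[1][0]:
            self.best[1] = (magnitude, h)
        self.k += 1

    def result(self):
        """Aggregated hard bits and magnitudes; other magnitudes saturated."""
        mags = [math.inf] * self.L
        for m, h in self.best:
            if h is not None:
                mags[h] = min(mags[h], m)
        return self.hard, mags


hards = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]
mags = [9, 1, 8, 7, 6, 5, 4, 3, 2, 9, 9, 9, 9, 9, 9, 9]
agg = TwoMinAggregator(R=len(mags), L=7)
for b, m in zip(hards, mags):
    agg.push(b, m)
agg_hard, agg_mag = agg.result()
print(agg_hard, agg_mag)
```

Only two comparators and L parity bits are needed per received bit in this sketch, which is the kind of low-complexity operation that can run while the forward unit is frozen.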
V. CONCLUSION
In this paper, we showed that the compression of redundancy-free trellis stages using the L-min algorithm can be further simplified by considering only 3, or even 2, minima among the L values. Simulations show no significant performance loss for 3-min and a minor performance loss (around 0.02 dB) for 2-min. The 3-min (or 2-min) algorithm opens new opportunities for architecture optimization, either to save power during the acquisition or to increase the overall decoding throughput.
