Non-recursive max* operator with reduced implementation complexity for turbo decoding / S. Papaharalabos; P.T. Mathiopoulos; G. Masera; M. Martina. - In: IET COMMUNICATIONS. - ISSN 1751-8628. - STAMPA. - 6:7(2012), pp. 702-707.
Introduction
In the past, several algorithmic approaches have been proposed to simplify the well-known max* operator [1] for decoding turbo codes [2], including Improved Max-Log-MAP, Constant Log-MAP [3], Linear Log-MAP, Average Log-MAP, etc. A detailed comparative study of these techniques was published in [4], in which the max* operator was approximated using the max operator and a small number, denoted as r, of piecewise linear (PWL) terms. As shown in [4], the implementation of the r = 3 approximation is very simple and has similar complexity to the Constant Log-MAP algorithm. The r = 4 approximation requires higher complexity, comparable to that of the most recent known algorithms, such as the Linear Log-MAP, the Average Log-MAP, etc. [4]. The penalty paid for all these well-known reduced-complexity implementation techniques is a small bit error rate (BER) performance degradation compared to the BER performance achieved by the optimal Log-MAP algorithm [5]. Currently, the best trade-off between turbo code performance and complexity is achieved by the Constant Log-MAP algorithm.
Conventionally, for Log-MAP turbo decoding the max* operator is defined only for n = 2 input values [1] and for n > 2 it is applied n − 1 times in a recursive form [5]. In this paper, we show how the max* operator with n > 2 input values can be computed effectively in a non-recursive form with the aim of reducing the implementation complexity of the Log-MAP turbo decoder. Performance evaluation results are presented for both binary and duo-binary turbo codes, showing the near-optimal and essentially optimal BER performance of the proposed method, respectively. Furthermore, hardware synthesis results indicate implementation advantages over the most recently published efficient turbo decoding algorithms, such as the r = 3 and r = 4 approximations [4], which makes the proposed method quite appealing for practical communication systems.
2 Novel Non-Recursive max* Operator
The max* operation, i.e. the Jacobian logarithm, used in Log-MAP turbo decoding is defined as [1]

$$\max{}^{*}(x_1, x_2) = \ln\left(e^{x_1} + e^{x_2}\right) = \max(x_1, x_2) + f_c(|x_1 - x_2|) \qquad (1)$$

where $f_c(|x_1 - x_2|) = \ln\left(1 + e^{-|x_1 - x_2|}\right)$ is a non-linear function referred to as the 'correction term' [5] and |.| denotes absolute value. Typically, for more than two input values, the Jacobian logarithm is applied recursively [5]. For example, considering three values, it yields

$$\max{}^{*}(x_1, x_2, x_3) = \ln\left(e^{x_1} + e^{x_2} + e^{x_3}\right) = \max{}^{*}\left(\max{}^{*}(x_1, x_2),\, x_3\right) \qquad (2)$$

For the Log-MAP algorithm, a look-up table (LUT) substitutes $f_c(|x_1 - x_2|)$, which is usually implemented with eight values [5]. If the LUT is omitted, then the Log-MAP simplifies to the Max-Log-MAP algorithm.
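The recursive use of the two-input operator described above can be sketched in a few lines of Python. This is only an illustration: the function names `max_star`, `max_star_recursive` and `max_log` are hypothetical, and the correction term is evaluated in floating point rather than through the LUT used in hardware.

```python
import math
from functools import reduce


def max_star(x1: float, x2: float) -> float:
    """Two-input Jacobian logarithm: max plus the correction term f_c."""
    return max(x1, x2) + math.log1p(math.exp(-abs(x1 - x2)))


def max_star_recursive(values):
    """Apply the two-input operator n - 1 times over n values."""
    return reduce(max_star, values)


def max_log(values):
    """Max-Log-MAP variant: the correction term is omitted."""
    return max(values)


xs = [1.2, -0.4, 0.7, 2.0]
# Direct evaluation of ln(sum(exp(x))) for comparison.
exact = math.log(sum(math.exp(x) for x in xs))
print(max_star_recursive(xs), exact, max_log(xs))
```

With exact arithmetic the recursive form reproduces ln(Σ exp(x_i)), whereas dropping the correction term always returns a value that is no larger.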
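From (2) it is evident that for n input values the max* operator is applied recursively n − 1 times. Let us consider the Chebyshev inequality [6, p. 186, Eq. 36

$$\left(\sum_{k=1}^{n} a_k\right)\left(\sum_{k=1}^{n} b_k\right) \le n \sum_{k=1}^{n} a_k b_k \qquad (3)$$

where $a_1 \ge a_2 \ge \dots \ge a_n$ and

$$b_1 \ge b_2 \ge \dots \ge b_n \qquad (4)$$

Assuming $a_1 = \exp(x_1)$, $a_2 = \exp(x_2)$, ..., $a_n = \exp(x_n)$ and $b_1 = n$, $b_2 = n - 1$, ..., $b_n = 1$, the Chebyshev inequality yields

$$\sum_{k=1}^{n} \exp(x_k) \le \frac{2}{n+1} \sum_{k=1}^{n} (n - k + 1)\exp(x_k) \qquad (5)$$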
Taking the logarithm on both sides of the above inequality and considering only two terms in the right-hand side (RHS), the following approximation is obtained

$$\max{}^{*}(x_1, \dots, x_n) \approx \ln\frac{2n}{n+1} + x_1 + \ln\left(1 + \frac{n-1}{n}\exp(-\delta)\right) \qquad (6)$$

where $\delta = x_1 - x_2 \ge 0$, $x_1$ is the maximum value, and $x_2$ is the second maximum value among n values.
Since the first term in the RHS of (6) is a positive constant it can be ignored, and because, for large values of n, (n − 1)/n ≈ 1, (6) is further simplified to

$$\max{}^{*}(x_1, \dots, x_n) \approx x_1 + \ln\left(1 + \exp(-\delta)\right) \qquad (7)$$

denoted as Log-MAP Delta. From (7) it is evident that the max* operator having n input values can be computed non-recursively, since it requires only knowledge of the maximum among n values and an additive correction term depending on the second maximum value among n values.
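A minimal Python sketch of the Log-MAP Delta idea follows. It assumes that the correction term in (7) has the same form as in the two-input case, ln(1 + exp(−δ)), with δ the difference between the maximum and the second maximum. The names `log_map_delta` and `log_sum_exp` are hypothetical, and a software sort stands in for the hardware maximum finder.

```python
import math


def log_map_delta(values):
    """Non-recursive approximation: maximum plus a correction in delta."""
    ordered = sorted(values, reverse=True)
    x1, x2 = ordered[0], ordered[1]  # maximum and second maximum
    delta = x1 - x2                  # non-negative by construction
    return x1 + math.log1p(math.exp(-delta))


def log_sum_exp(values):
    """Exact n-input max* computed directly, for comparison only."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


xs = [1.2, -0.4, 0.7, 2.0]
print(max(xs), log_map_delta(xs), log_sum_exp(xs))
```

For n = 2 the approximation coincides with the exact operator. For n > 2 it keeps only the two largest exponentials, so it lies between the Max-Log-MAP value and the exact value.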
Turbo Code Performance Evaluation Results
Performance evaluation results have been obtained by means of computer-based simulations for both binary and duo-binary turbo codes, in terms of BER against bit energy $E_b$ in an additive white Gaussian noise (AWGN) channel having one-sided power spectral density $N_0$. The performance of the newly proposed algorithm given in (7) is compared with the least complex Max-Log-MAP algorithm and also with the best performing Log-MAP algorithm.
As shown in Figs Furthermore, performance evaluation results have been obtained for the two efficient decoding algorithms proposed recently in [4], i.e. the r = 3 and r = 4 approximations, and also for the Constant Log-MAP algorithm, which currently provides the best trade-off between BER performance and complexity. Following [9], to further improve the BER performance, scaling was applied to the extrinsic information and the best performing values were found by trial and error for all investigated algorithms. The corresponding performance evaluation results for both binary and duo-binary turbo codes are summarized in Table 1, without scaling, and Table 2, using scaling. From Table 1, it is noticed that Log-MAP Delta performs similarly to the r = 3 approximation, whereas both the r = 4 approximation and the Constant Log-MAP algorithm achieve near-optimal BER performance. Furthermore, from Table 2, it is noticed that scaling improves performance and that Log-MAP Delta performs similarly to the r = 4 approximation and the Constant Log-MAP algorithm. Hence, with scaling, Log-MAP Delta achieves essentially optimal BER performance.
Hardware Architecture Description
Several architectures have been proposed for the implementation of (1) and (2) based on the two-input max* structure [4, 10]. In principle, two main implementation approaches are feasible for n > 2: serial and parallel architectures. On the one hand, the serial architecture employs only one two-input max* structure and a bank of registers, but it requires n − 1 clock cycles to complete the computation. On the other hand, to reduce the latency of a generic n-input max* structure, parallel architectures employing n − 1 two-input max* structures operating concurrently in a tree-based arrangement are usually preferred. Experimental results comparing serial and parallel architectures are discussed in Section 5. Furthermore, the computation of (7) 
Maximum Finding
The most straightforward approach to finding $x_1$ and $x_2$ among $x_i$ with i = 1, 2, ..., n is to sort the $x_i$. As suggested in [11, Chapter 28.5, pp. 1-2], a parallel sorter can be obtained by deploying a merge sort architecture. However, this approach results in an increased area overhead, because the number of comparators required for finding $x_1$ and $x_2$ is $(n/4) \cdot [\log_2(n) + 1] \cdot \log_2(n)$. An alternative solution is obtained by adapting the first-two-minimum-finder architecture (denoted as M2), which was proposed in [12] for low-density parity-check (LDPC) decoding. This architecture is based on a tree structure using Maximum-Value Generators (MVG). The structure for n inputs is derived recursively from two MVG architectures for n/2 values and a connection unit, based on Maximum-Value Units (MVU), as shown in Figs. 3 (a) and (b) where
With the M2 architecture the number of comparators required for finding x 1 and x 2 is 2n − 3.
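The comparator counts quoted above can be checked with a small Python model of a tree-based two-maximum finder. This is a sketch under stated assumptions, not the circuit of [12]: the connection unit is modelled with three comparators (one on the two sub-tree maxima and two on the candidate second maxima), which is one arrangement consistent with the 2n − 3 total; n is assumed to be a power of two; and the names `top_two` and `merge_sort_comparators` are hypothetical.

```python
def top_two(values, counter):
    """Return (maximum, second maximum) using a tree of comparators.

    counter is a one-element list that tallies the comparators used.
    len(values) is assumed to be a power of two, at least 2.
    """
    n = len(values)
    if n == 2:
        counter[0] += 1  # a single comparator orders the pair
        a, b = values
        return (a, b) if a >= b else (b, a)
    half = n // 2
    max_a, sec_a = top_two(values[:half], counter)
    max_b, sec_b = top_two(values[half:], counter)
    counter[0] += 3                      # assumed connection unit
    a_wins = max_a >= max_b              # comparator 1
    sec_if_a = max(sec_a, max_b)         # comparator 2
    sec_if_b = max(sec_b, max_a)         # comparator 3
    return (max_a, sec_if_a) if a_wins else (max_b, sec_if_b)


def merge_sort_comparators(n):
    """Comparator count (n/4) * [log2(n) + 1] * log2(n) of the sorter."""
    k = n.bit_length() - 1  # log2(n) for a power of two
    return n * (k + 1) * k // 4


xs = [3, 9, 1, 7, 4, 8, 2, 6]
count = [0]
print(top_two(xs, count), count[0], merge_sort_comparators(len(xs)))
# prints: (9, 8) 13 24
```

For n = 8 the model uses 13 = 2n − 3 comparators, against 24 for the merge sort count, and the gap widens as n grows.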
Correction Term Computation
A simple implementation of the correction term in (7) is obtained by exploiting the procedure adopted in [10] for the two-input max * approximation. Namely, the correction term is stored into a LUT accessed by δ. The size of the LUT, denoted as m, is the minimum positive integer value that satisfies
and p is the number of fractional bits used to represent δ as a fixed-point value. Then, the LUT content is obtained by computing $\ln\left[1 + \exp(-i/2^{p})\right]$, ∀i ∈ [0, m − 1], and converting the result to a fixed-point value with p fractional bits.
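The LUT construction can be illustrated with the following Python sketch. Several points are assumptions: the table size m is passed in directly rather than derived from the size condition, round-to-nearest is used for the fixed-point conversion, entries are stored as integers scaled by 2^p, and the names `build_lut` and `correction` are hypothetical. The values m = 8 and p = 3 in the usage line are illustrative only.

```python
import math


def build_lut(m: int, p: int):
    """Entries ln(1 + exp(-i / 2**p)) for i in [0, m - 1], scaled by 2**p."""
    scale = 1 << p
    return [round(math.log1p(math.exp(-i / scale)) * scale) for i in range(m)]


def correction(delta_fixed: int, lut):
    """Correction term for a non-negative fixed-point delta.

    delta_fixed is delta expressed in units of 2**-p; values beyond the
    table are treated as contributing no correction (assumed behaviour).
    """
    return lut[delta_fixed] if delta_fixed < len(lut) else 0


lut = build_lut(m=8, p=3)
print(lut, correction(0, lut), correction(100, lut))
```

The first entry corresponds to δ = 0, i.e. ln 2, and the entries decrease monotonically as δ grows.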
Hardware Synthesis Results
Post synthesis results obtained by implementing the max* operator and its approximations on 90 nm standard cell technology are shown in Table 3 Log-MAP architecture as proposed in [3]; and (v) r = 3, r = 4 architectures as proposed in [4]. As can be observed, the proposed solution with the M2 architecture leads to significantly lower complexity than JL. In particular, the area of M2 is from 50% to 70% of the area required to implement JL. Furthermore, the proposed M2 solution is simpler than the approximations recently proposed in [4] for both r = 3 and r = 4, whereas it has nearly the same complexity as Constant Log-MAP.
Since, as can be seen from Table 3, the proposed M2 and the Constant Log-MAP architectures have comparable complexity, it is interesting to further investigate these two solutions. On the one hand, the proposed Log-MAP Delta approximation is intrinsically parallel for an n-input max* operator. On the other hand, the Constant Log-MAP approximation can be employed to obtain either a serial or a parallel implementation. Serial implementation of the Constant Log-MAP approximation for an n-input max* operator has been carried out for the cases shown in Table 3 [16]. Similarly, the CCSDS SISO module requires two 16-input max* operators to compute the APO [17]. In both cases, the n-input max* operators were implemented as M2, MX, JL, Constant Log-MAP, and r = 3, r = 4 architectures. Moreover, the intrinsic and extrinsic information were represented with six and eight bits, respectively, with p = 3, whereas state metrics were represented with ten bits.
Post synthesis results for the UMTS-LTE/CCSDS turbo codes, as shown in Table 5, show that the area required to compute APO with the proposed M2 architecture is about 74% of the area occupied by the JL-based solution. If we consider the area occupied by the logic of a whole SISO module, then the proposed M2 architecture features an area saving that ranges from about 12% to 15% with respect to a JL-based SISO. We have similarly investigated the DVB-RCS/Wi-MAX duo-binary turbo code, and the post synthesis results are shown in Table 6. In this case, the M2 architecture offers 21% area savings with respect to a JL-based SISO. Furthermore, the area required to compute APO/SISO modules with the proposed M2 architecture is less than that required by both the r = 3 and r = 4 approximations [4]. Lastly, Constant Log-MAP requires the smallest area to compute APO/SISO modules. It is thus the most efficient algorithm in terms of computational complexity.
Conclusion
It has been shown how the max* operator with n input values can be approximated effectively without recursive computation, in order to reduce the implementation complexity of practical Log-MAP turbo decoders. For the case of a 16-state binary turbo code, 0.05 dB of performance degradation was observed at a BER of 10^−5, but with 15% complexity savings. In another case, for an 8-state duo-binary turbo code, negligible performance degradation was observed at a BER of 10^−6, while maintaining 21% complexity savings. If scaling is additionally used, then negligible performance degradation is observed against the Log-MAP algorithm for both binary and duo-binary turbo codes. In terms of complexity comparison with other state-of-the-art reduced-complexity algorithms, the proposed solution is simpler than the approximations recently published in [4] for both r = 3 and r = 4, and it is slightly more complex than the Constant Log-MAP algorithm.
