The symbol-by-symbol maximum a posteriori (MAP) algorithm, also known as the BCJR algorithm, is described. The logarithmic versions of the MAP algorithm, namely the Log-MAP and Max-Log-MAP decoding algorithms, are presented along with a new Simplified-Log-MAP algorithm. The bit error rate (BER) performance and computational complexity of these algorithms are compared. A new hardware architecture for implementing the MAP-based decoding algorithms, suitable for chip design, is also presented.
INTRODUCTION
The near Shannon limit error correction performance of Turbo codes [1] and serial concatenated convolutional codes [2] has raised a lot of interest in the research community in finding practical decoding algorithms for implementation of these codes. The MAP decoding algorithm, also known as the BCJR algorithm [3], is not practical for implementation in real systems. The MAP algorithm is computationally complex and sensitive to SNR mismatch and to inaccurate estimation of the noise variance [4]. It requires non-linear functions for computation of the probabilities, and both multiplications and additions are needed to compute its variables. The fixed-point representation of the MAP decoding variables usually requires between 16 and 24 bits for a QPSK signal constellation. Given these hardware requirements, the MAP algorithm is not practical to implement in a chip. The logarithmic versions of the MAP algorithm [5] [6] [7] and the Soft Output Viterbi Algorithm (SOVA) [8] [9] are the practical decoding algorithms for implementation. These algorithms are less sensitive to SNR mismatch and to inaccurate estimation of the noise variance, and the fixed-point representation of their variables requires approximately 8 bits for a QPSK signal constellation. All logarithmic versions of the MAP algorithm require only additions and a max operation, whose correction term can be obtained from a simple look-up table [7] or a threshold detector [10]. SOVA has the lowest computational complexity and the worst bit error rate (BER) performance among these algorithms, while the Log-MAP algorithm [5] has the best BER performance, equivalent to that of the MAP algorithm, and the highest computational complexity.
This paper briefly describes all versions of the MAP decoding algorithm and introduces a new logarithmic version of it. The new algorithm, the Simplified-Log-MAP algorithm, is less complex than the Log-MAP algorithm but performs very close to it. A new hardware architecture for implementation of the MAP-based decoding algorithms is also introduced.
In Section 2, the different MAP decoding algorithms are described. Section 3 introduces a new hardware architecture for MAP-based decoding algorithms, which is currently being implemented in a chip. Section 4 compares the BER performance of these algorithms for different QAM constellation sizes.
Figure 1 illustrates the structure of the Turbo encoder. The systematic data, $d_k$, and the two parity bits from the outputs of the recursive systematic convolutional (RSC) encoders, $y_k^1$ and $y_k^2$, are the outputs of the Turbo encoder. The two RSC encoders are in a parallel structure similar to [1]. The memory length of each RSC encoder is $v$ and, consequently, the total number of states for each decoder is $M = 2^v$. The received systematic data and the received parity bits from the $i$-th encoder are represented as $\hat{d}_k$ and $\hat{y}_k^i$ respectively. The Turbo block length is $N$. The two RSC encoders are separated by an $N$-bit interleaver or permuter. Thus, the input to the second encoder is the interleaved version of the input information data. If we transmit the data as shown in Figure 1, the encoder is rate 1/3. The parity bits can optionally be punctured to achieve different coding rates. The interleaver plays an important role in the performance of the Turbo code, especially when the Turbo block size $N$ is small. A discussion of the significance of the interleaver design is beyond the scope of this paper; the design of interleavers is treated in [11] [12] [13] [14] [15].
The MAP decoding algorithm is a recursive technique that computes the Log-Likelihood Ratio (LLR) of each bit based on the entire observed data block of length $N$:
$$\Lambda(d_k) = \ln \frac{\Pr(d_k = 1 \mid R_1^N)}{\Pr(d_k = 0 \mid R_1^N)} \qquad (1)$$
$\Pr(d_k = 1 \mid R_1^N)$ is the a posteriori probability (APP) that the information input data at time $k$, $d_k$, is equal to 1 given the entire received data. The observation block data sequence is $R_1^N = \{R_1, \ldots, R_k, \ldots, R_N\}$, where $R_k = \{\hat{d}_k, \hat{y}_k^i\}$. The state of the encoder $S_k$ is represented by a $v$-tuple
$$S_k = (a_k, a_{k-1}, \ldots, a_{k-v+1}) \qquad (2)$$
where $a_k$ is the output of the first shift register in the RSC encoder. The conditional joint probability $\lambda_k^j(s)$ is defined as
$$\lambda_k^j(s) = \Pr(d_k = j, S_k = s \mid R_1^N) \qquad (3)$$
The APP of $d_k$ is thus equal to
$$\Pr(d_k = j \mid R_1^N) = \sum_{s} \lambda_k^j(s), \quad j = 0, 1 \qquad (4)$$
The LLR can be rewritten by substituting (4) into (1):
$$\Lambda(d_k) = \ln \frac{\sum_{s} \lambda_k^1(s)}{\sum_{s} \lambda_k^0(s)} \qquad (5)$$
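The parallel encoder structure of Figure 1 described above can be sketched in a few lines of Python. This is only an illustrative sketch: the default taps correspond to the octal polynomials 15 (feedback) and 17 (feedforward) used later in the simulations, the bit order of the taps (from $D^0$ to $D^v$) is an assumption, and the function names `rsc_parity` and `turbo_encode` are hypothetical.

```python
from typing import List, Sequence, Tuple


def rsc_parity(bits: Sequence[int],
               feedback: Sequence[int] = (1, 1, 0, 1),
               feedforward: Sequence[int] = (1, 1, 1, 1)) -> List[int]:
    """Parity stream of one RSC encoder.

    Taps are listed from D^0 to D^v (assumed ordering). The register
    holds (a_{k-1}, ..., a_{k-v}) and starts in the all-zero state.
    """
    v = len(feedback) - 1
    state = [0] * v
    parity = []
    for d in bits:
        a = d
        for i in range(v):                 # recursive (feedback) part
            a ^= feedback[i + 1] & state[i]
        y = feedforward[0] & a
        for i in range(v):                 # feedforward part
            y ^= feedforward[i + 1] & state[i]
        parity.append(y)
        state = [a] + state[:-1]           # shift a_k into the register
    return parity


def turbo_encode(bits: Sequence[int],
                 interleaver: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Unpunctured rate-1/3 output: systematic bits and two parity streams."""
    d = list(bits)
    y1 = rsc_parity(d)
    y2 = rsc_parity([d[i] for i in interleaver])   # second encoder sees permuted data
    return d, y1, y2
```

For each information bit the sketch emits one systematic bit and two parity bits, which is the rate-1/3 case; puncturing would simply drop some entries of the two parity streams.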
TURBO DECODING ALGORITHMS
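The numerator and denominator of (5) can be multiplied by $\Pr(R_1^N)$, so that these values become joint probabilities instead of conditional joint probabilities. If the system at time $k-1$ is in state $s'$, then (5) can be written as
$$\Lambda(d_k) = \ln \frac{\sum_{s} \sum_{s'} \Pr(d_k = 1, S_k = s, S_{k-1} = s', R_1^N)}{\sum_{s} \sum_{s'} \Pr(d_k = 0, S_k = s, S_{k-1} = s', R_1^N)} \qquad (6)$$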
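The BCJR algorithm [3] defines these joint probabilities in terms of three parameters:
$$\alpha_k(s) = \Pr(S_k = s \mid R_1^k) \qquad (7)$$
$$\beta_k(s) = \frac{\Pr(R_{k+1}^N \mid S_k = s)}{\Pr(R_{k+1}^N \mid R_1^k)} \qquad (8)$$
and
$$\gamma_j(R_k, s', s) = \Pr(d_k = j, S_k = s, R_k \mid S_{k-1} = s') \qquad (9)$$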
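The LLR in (6) can now be described in terms of (7), (8), and (9):
$$\Lambda(d_k) = \ln \frac{\sum_{s} \sum_{s'} \gamma_1(R_k, s', s)\, \alpha_{k-1}(s')\, \beta_k(s)}{\sum_{s} \sum_{s'} \gamma_0(R_k, s', s)\, \alpha_{k-1}(s')\, \beta_k(s)} \qquad (10)$$
$\alpha_k(s)$ and $\beta_k(s)$ can be computed by forward and backward recursions respectively, based on $\gamma_j(R_k, s', s)$:
$$\alpha_k(s) = h_\alpha \sum_{s'} \sum_{j=0}^{1} \gamma_j(R_k, s', s)\, \alpha_{k-1}(s') \qquad (11)$$
$$\beta_k(s) = h_\beta \sum_{s'} \sum_{j=0}^{1} \gamma_j(R_{k+1}, s, s')\, \beta_{k+1}(s') \qquad (12)$$
where $h_\alpha$ and $h_\beta$ are the normalization factors. $\gamma_j(R_k, s', s)$ consists of the transition probability of the discrete Gaussian memoryless channel and the transition probabilities of the encoder trellis. From (9), $\gamma_j(R_k, s', s)$ is given by
$$\gamma_j(R_k, s', s) = \Pr(R_k \mid d_k = j, S_k = s, S_{k-1} = s') \times \Pr(d_k = j \mid S_k = s, S_{k-1} = s') \times \Pr(S_k = s \mid S_{k-1} = s') \qquad (13)$$
The first factor on the right-hand side of (13) is the transition probability of the discrete channel. The second factor is equal to 1 or 0 depending on whether $d_k = j$ is possible when the system moves from state $s'$ to state $s$. The third factor is the state transition probability; for equiprobable binary data, it is equal to 1/2.
In the first iteration, decoder 1 does not have any additional a priori information. The second decoder, however, utilizes the output information from the first decoder as a priori information. After the first iteration, each decoder utilizes the output information from the other decoder as a priori information. This output information of each decoder corresponds to the parity information of that decoder. $\hat{d}_k$ and $\hat{y}_k^i$ are two uncorrelated Gaussian variables in $R_k$ under the conditions expressed in (13). Therefore, the channel transition probability in (13) can be divided into two terms:
$$\Pr(R_k \mid d_k = j, S_k = s, S_{k-1} = s') = \Pr(\hat{d}_k \mid d_k = j, S_k = s, S_{k-1} = s') \times \Pr(\hat{y}_k^i \mid d_k = j, S_k = s, S_{k-1} = s') \qquad (14)$$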
The received signals are utilized to compute $\gamma_j(R_k, s', s)$; consequently, accurate computation of this variable is very important for the computation of the other variables of the MAP decoding algorithm.
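As a hedged illustration of (13), the following Python sketch evaluates one branch probability for an AWGN channel. Several details are assumptions made only for this example and are not stated in the text: bits are mapped to antipodal symbols ($0 \to -1$, $1 \to +1$), the systematic and parity samples see the same noise variance `sigma2`, and the names `gaussian_pdf` and `branch_gamma` are hypothetical.

```python
import math


def gaussian_pdf(r: float, x: float, sigma2: float) -> float:
    """Gaussian density of received sample r given transmitted symbol x."""
    return math.exp(-(r - x) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def branch_gamma(d_hat: float, y_hat: float, j: int, parity: int,
                 valid: bool, sigma2: float, prior: float = 0.5) -> float:
    """gamma_j for one trellis branch, following the three factors of (13).

    valid  -- whether input j can cause this s' -> s transition (the 0/1 factor)
    parity -- parity bit emitted on this branch
    prior  -- state transition probability (1/2 for equiprobable data; later
              iterations would use a priori information from the other decoder)
    """
    if not valid:
        return 0.0
    x_d = 2 * j - 1          # assumed antipodal mapping of the systematic bit
    x_y = 2 * parity - 1     # assumed antipodal mapping of the parity bit
    channel = gaussian_pdf(d_hat, x_d, sigma2) * gaussian_pdf(y_hat, x_y, sigma2)
    return channel * prior
```

With this mapping, the log-ratio of the branch values for $j = 1$ and $j = 0$ on branches with the same parity bit reduces to $2\hat{d}_k / \sigma^2$, which makes the dependence on the noise variance estimate explicit.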
The MAP decoding algorithm consists of the following steps:
1. Initialize $\alpha_0(s)$ and $\beta_N(s)$ as follows:
$$\alpha_0(s) = \begin{cases} 1 & s = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$
$$\beta_N(s) = \frac{1}{M} \quad \text{for all } s \qquad (16)$$
where $M$ is the total number of states. The above initialization is based on trellis termination of the Turbo block in an arbitrary state. The Turbo block can instead be terminated in the all-zero state, in which case $\beta_N(s)$ should be initialized accordingly [7].
2. Upon receiving each $\hat{d}_k$ and its corresponding $\hat{y}_k^i$, the decoder computes $\gamma_j(R_k, s', s)$ for $j = 0$ and 1, then computes $\alpha_k(s)$ for all values of $s$ according to (11). The computed values of $\gamma_j(R_k, s', s)$ and $\alpha_k(s)$ are stored for $1 \le k \le N$.
3. The backward recursion for $\beta_k(s)$ is performed according to (12) for $1 \le k \le N-1$, after the entire sequence of $N$ data samples and their corresponding parity bits has been received.
4. The soft output decoded bits, $\Lambda_1(d_k)$, are computed according to (10) for $1 \le k \le N$.
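The four steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the branch probabilities are assumed to be precomputed and passed in as an array `gamma` with `gamma[k-1, j, s_prev, s]` holding $\gamma_j(R_k, s', s)$, the normalization factors are chosen so that each $\alpha_k$ and $\beta_k$ sums to one, and the function name `map_decode` is hypothetical.

```python
import numpy as np


def map_decode(gamma: np.ndarray) -> np.ndarray:
    """Soft outputs (LLRs) for one block; gamma has shape (N, 2, M, M)."""
    N, _, M, _ = gamma.shape
    alpha = np.zeros((N + 1, M))
    beta = np.zeros((N + 1, M))
    alpha[0, 0] = 1.0              # step 1: encoder starts in state 0
    beta[N, :] = 1.0 / M           # step 1: block ends in an arbitrary state
    g = gamma.sum(axis=1)          # sum over j, shape (N, M, M)

    for k in range(1, N + 1):      # step 2: forward recursion (11)
        a = alpha[k - 1] @ g[k - 1]
        alpha[k] = a / a.sum()     # one possible choice of h_alpha

    for k in range(N - 1, 0, -1):  # step 3: backward recursion (12)
        b = g[k] @ beta[k + 1]
        beta[k] = b / b.sum()      # one possible choice of h_beta

    llr = np.empty(N)
    for k in range(1, N + 1):      # step 4: soft output (10)
        num = alpha[k - 1] @ gamma[k - 1, 1] @ beta[k]
        den = alpha[k - 1] @ gamma[k - 1, 0] @ beta[k]
        llr[k - 1] = np.log(num / den)
    return llr
```

Note that the sketch keeps all $N$ vectors of both recursions in memory and runs the two recursions one after the other, which is the storage and latency cost discussed in the text.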
It can be shown [1] that the soft output decoded bits, $\Lambda_1(d_k)$ or $\Lambda_2(d_k)$, can be divided into three terms. Figure 3 shows this iterative decoding scheme. The inputs to each decoder are the received input data sequence $\hat{d}_k$, the received parity bits $\hat{y}_k^1$ or $\hat{y}_k^2$, and the logarithm of the likelihood ratio (LLR) associated with the parity bits from the other decoder ($W_k^1$ or $W_k^2$), which is used as a priori information. All these inputs are utilized by the decoder to create three outputs corresponding to the weighted versions of these inputs. Figure 3 also shows the weighted version of the received input data sequence and indicates that the input data sequence is interleaved before it is fed into the second decoder. The input to each decoder from the other decoder is used as a priori information in the next decoding step and corresponds to the weighted version of the parity bits.
In order to utilize this algorithm, the $\alpha_k(s)$ variables are computed for the entire data block of length $N$, and then the $\beta_k(s)$ variables are computed. This approach requires storing all these variables for the entire Turbo block. When $N$ is large, the memory requirement of the MAP-based decoding algorithm becomes extremely large. If all the $\alpha_k(s)$ and $\beta_k(s)$ variables are computed within one clock cycle for each $k$, then this approach requires $2N$ clock cycles to compute all the variables of the MAP decoding algorithm. We will propose another hardware implementation that stores only half of the $\alpha_k(s)$ and $\beta_k(s)$ variables and computes them in the minimum number of clock cycles ($N$ clock cycles).
MAX-Log-MAP Algorithm
As described earlier, the MAP algorithm is computationally too intensive for most applications and is not suitable for chip design. There are two major problems with the MAP decoding algorithm. First, MAP requires accurate estimation of the noise variance, and its performance is very sensitive to SNR mismatch. Second, the fixed-point representation of the MAP decoding variables requires between 16 and 24 bits. These requirements are not suitable for VLSI chip design.
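To avoid these problems, we can compute the natural logarithm of all these variables, i.e., $\gamma_j(R_k, s', s)$, $\alpha_k(s)$, and $\beta_k(s)$. Since $\gamma_j(R_k, s', s)$ is the product of the three factors in (13), its logarithm, $\bar{\gamma}_j(R_k, s', s)$, is the sum of the logarithms of these three factors:
$$\bar{\gamma}_j(R_k, s', s) = \ln \gamma_j(R_k, s', s) = \ln \Pr(R_k \mid d_k = j, S_k = s, S_{k-1} = s') + \ln \Pr(d_k = j \mid S_k = s, S_{k-1} = s') + \ln \Pr(S_k = s \mid S_{k-1} = s') \qquad (17)$$
In an additive white Gaussian noise (AWGN) environment, the first probability on the right side of (17) is an exponential function, so taking its natural logarithm removes the non-linear exponent operation, e.g., $\ln(\exp(A)) = A$. For $\alpha_k(s)$ we have
$$\bar{\alpha}_k(s) = \ln(\alpha_k(s)) = \ln\Big(h_\alpha \sum_{s'} \sum_{j=0}^{1} \gamma_j(R_k, s', s)\, \alpha_{k-1}(s')\Big) \qquad (18)$$
Writing each product inside the sum as $\exp(\bar{\gamma}_j(R_k, s', s) + \bar{\alpha}_{k-1}(s'))$ and approximating the logarithm of a sum of exponentials by its largest exponent gives
$$\bar{\alpha}_k(s) \approx \max_{s', j}\big(\bar{\gamma}_j(R_k, s', s) + \bar{\alpha}_{k-1}(s')\big) + \ln h_\alpha \qquad (19)$$
with the initial condition
$$\bar{\alpha}_0(s) = \begin{cases} 0 & s = 0 \\ -\infty & \text{otherwise} \end{cases}$$
A similar approximation can be used to compute $\bar{\beta}_k(s)$:
$$\bar{\beta}_k(s) = \ln(\beta_k(s)) \approx \max_{s', j}\big(\bar{\gamma}_j(R_{k+1}, s, s') + \bar{\beta}_{k+1}(s')\big) + \ln h_\beta$$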
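The soft output of the decoded data for this approach is
$$\Lambda_1(d_k) \approx \max_{\text{all } s, s'}\big(\bar{\gamma}_1(R_k, s', s) + \bar{\alpha}_{k-1}(s') + \bar{\beta}_k(s)\big) - \max_{\text{all } s, s'}\big(\bar{\gamma}_0(R_k, s', s) + \bar{\alpha}_{k-1}(s') + \bar{\beta}_k(s)\big) \qquad (24)$$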
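The max-based recursions and the soft output (24) can be sketched as follows. This is an illustrative sketch only: the log-domain branch metrics are assumed to be supplied as an array `log_gamma` with `log_gamma[k-1, j, s_prev, s]` holding $\bar{\gamma}_j(R_k, s', s)$ (and $-\infty$ for impossible transitions), the constant normalization terms $\ln h_\alpha$, $\ln h_\beta$, and $\ln(1/M)$ are dropped because they cancel in the difference (24), and the function name `max_log_map` is hypothetical.

```python
import numpy as np


def max_log_map(log_gamma: np.ndarray) -> np.ndarray:
    """Max-Log-MAP soft outputs; log_gamma has shape (N, 2, M, M)."""
    N, _, M, _ = log_gamma.shape
    a = np.full((N + 1, M), -np.inf)
    b = np.full((N + 1, M), -np.inf)
    a[0, 0] = 0.0                   # ln(1) for the starting state, -inf elsewhere
    b[N, :] = 0.0                   # arbitrary final state; ln(1/M) constant dropped
    g = log_gamma.max(axis=1)       # max over j, shape (N, M, M)

    for k in range(1, N + 1):       # forward recursion, approximation (19)
        a[k] = np.max(a[k - 1][:, None] + g[k - 1], axis=0)
    for k in range(N - 1, 0, -1):   # backward recursion, same approximation
        b[k] = np.max(g[k] + b[k + 1][None, :], axis=1)

    llr = np.empty(N)
    for k in range(1, N + 1):       # soft output (24)
        m1 = np.max(a[k - 1][:, None] + log_gamma[k - 1, 1] + b[k][None, :])
        m0 = np.max(a[k - 1][:, None] + log_gamma[k - 1, 0] + b[k][None, :])
        llr[k - 1] = m1 - m0
    return llr
```

Only additions and comparisons appear in the loops, and each soft output is the difference between the best path metric with $d_k = 1$ and the best path metric with $d_k = 0$.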
The $\alpha_k(s)$ and $\beta_k(s)$ parameters of the MAP algorithm are approximated in the MAX-Log-MAP algorithm by a maximization operation. Therefore, there is an approximation error in the computation of these two variables. Since these two variables are computed recursively, this approximation error propagates throughout the entire block of data. If the SNR requirement for a given BER performance is very high, then this approximation error is comparable to the noise and has a significant effect on the performance of the system. On the other hand, if the SNR requirement is not high, then this approximation error is much smaller than the noise power and is not a significant factor in performance degradation. We will show this in the simulation results section. The BER performance of the MAX-Log-MAP algorithm is always worse than that of the MAP algorithm.
Log-MAP Algorithm
The Log-MAP algorithm computes the MAP parameters by utilizing a correction function to compute the logarithm of a sum of numbers. More precisely, for two log-domain values $A$ and $B$,
$$\ln(e^A + e^B) = \max(A, B) + f(|A - B|) \qquad (25)$$
where $f(|A - B|) = \ln(1 + e^{-|A - B|})$ is the correction function. $f(|A - B|)$ can be computed using either a look-up table [7] or simply a threshold detector [10]. When the logarithm of a sum of more than two terms is needed, (25) is applied recursively, two values at a time. This recursive operation is especially needed for computation of the soft output decoded bits.
At each step of the Log-MAP algorithm, the error made by replacing the logarithm of a sum of two values with a maximization operation is compensated by an additional correction value, which is provided by a look-up table or a threshold detector. The Log-MAP parameters are very close approximations of the MAP parameters and, therefore, the Log-MAP BER performance is close to that of the MAP algorithm.
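The correction in (25) and its two low-cost variants can be sketched in Python. The table step of 0.5 with 8 entries and the threshold constants (a correction of 0.375 when $|A - B| < 2$) are illustrative assumptions chosen for this example, not values taken from the text, and all function names are hypothetical.

```python
import math
from functools import reduce

STEP = 0.5                                               # assumed table resolution
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(8)]


def max_star(a: float, b: float) -> float:
    """Exact ln(e^a + e^b): max plus the correction function of (25)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_lut(a: float, b: float) -> float:
    """Correction read from a small look-up table indexed by |a - b|."""
    idx = int(abs(a - b) / STEP)
    correction = TABLE[idx] if idx < len(TABLE) else 0.0
    return max(a, b) + correction


def max_star_threshold(a: float, b: float,
                       threshold: float = 2.0, value: float = 0.375) -> float:
    """Threshold detector: one constant correction when |a - b| is small."""
    correction = value if abs(a - b) < threshold else 0.0
    return max(a, b) + correction


def log_sum(values):
    """Logarithm of a sum of exponentials by applying (25) recursively."""
    return reduce(max_star, values)
```

The correction is at most $\ln 2$ (when $A = B$) and decays quickly as $|A - B|$ grows, which is why a short table or a single threshold can stand in for the exact function.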
Simplified-Log-MAP Algorithm
Study of the MAP algorithm shows that accurate computation of the $\gamma_j(R_k, s', s)$ function is very important, since it contains all the received information data. The $\alpha_k(s)$ and $\beta_k(s)$ parameters are computed recursively; therefore, any error in the computation of these parameters can propagate and result in poor estimates. For example, if $\bar{\alpha}_k(s)$ is computed approximately using (19), then this error will result in inaccurate computation of the values of $\bar{\alpha}_{k+1}(s)$. However, after computing the logarithm of these parameters using the correction function, it is not necessary to compute $\Lambda_1(d_k)$ with high accuracy. In fact, at moderate to high SNR, often only one value in each of the numerator and denominator of (10) is dominant. Therefore, using a max operation similar to (19) to compute $\Lambda_1(d_k)$ has no significant effect on the BER performance of the Simplified-Log-MAP algorithm, while it reduces the computational complexity of the Simplified-Log-MAP algorithm compared to that of the Log-MAP algorithm. In addition, any error due to this approximation in the computation of $\Lambda_1(d_k)$ does not propagate through the data sequence. The soft output of the decoded bits is approximated as:
$$\Lambda_1(d_k) \approx \max_{\text{all } s, s'}\big(\bar{\gamma}_1(R_k, s', s) + \bar{\alpha}_{k-1}(s') + \bar{\beta}_k(s)\big) - \max_{\text{all } s, s'}\big(\bar{\gamma}_0(R_k, s', s) + \bar{\alpha}_{k-1}(s') + \bar{\beta}_k(s)\big) \qquad (29)$$
Table 1 compares the computational complexity of all the MAP-based decoding algorithms for a T1 data rate.
The total number of operations is given for only one iteration with $v = 3$ and $M = 8$. From this table, we can conclude that the total number of operations per bit for the Simplified-Log-MAP algorithm is 16 and 13 percent less than for the MAP and Log-MAP algorithms respectively.
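The split between corrected recursions and an uncorrected soft output, as in (29), can be sketched as follows. This is a sketch under assumptions: `np.logaddexp` stands in for the max operation plus look-up-table correction (it evaluates the correction exactly), the array layout `log_gamma[k-1, j, s_prev, s]` for $\bar{\gamma}_j(R_k, s', s)$ is an illustrative choice, constant normalization terms are omitted because they cancel in (29), and the function name `simplified_log_map` is hypothetical.

```python
import numpy as np


def simplified_log_map(log_gamma: np.ndarray) -> np.ndarray:
    """Corrected alpha/beta recursions, max-only soft output as in (29)."""
    N, _, M, _ = log_gamma.shape
    # Combine the j = 0 and j = 1 branches with the corrected log-sum.
    g = np.logaddexp(log_gamma[:, 0], log_gamma[:, 1])    # shape (N, M, M)
    a = np.full((N + 1, M), -np.inf)
    b = np.full((N + 1, M), -np.inf)
    a[0, 0] = 0.0                    # start in state 0
    b[N, :] = 0.0                    # arbitrary final state, constant dropped

    for k in range(1, N + 1):        # forward recursion with correction
        a[k] = np.logaddexp.reduce(a[k - 1][:, None] + g[k - 1], axis=0)
    for k in range(N - 1, 0, -1):    # backward recursion with correction
        b[k] = np.logaddexp.reduce(g[k] + b[k + 1][None, :], axis=1)

    llr = np.empty(N)
    for k in range(1, N + 1):        # soft output: plain max, no correction
        m1 = np.max(a[k - 1][:, None] + log_gamma[k - 1, 1] + b[k][None, :])
        m0 = np.max(a[k - 1][:, None] + log_gamma[k - 1, 0] + b[k][None, :])
        llr[k - 1] = m1 - m0
    return llr
```

The correction is spent only where errors would accumulate (the recursions); the final step touches each $k$ once, so its approximation error stays local to that bit.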
A NEW HARDWARE ARCHITECTURE FOR MAP-BASED DECODING ALGORITHM
The new hardware implementation of the MAP-based decoding algorithms is based on the assumption that the entire block of $N$ data samples is available at the receiver. For instance, digital subscriber loop (DSL) modems that use discrete multi-tone (DMT) technology allocate the information bits to the tones in a transmission frame. The number of transmitted bits in each frame can be equal to the Turbo block size ($N$). In other applications, where the information and parity bits are transmitted sequentially, the decoding operation must be performed for many iterations in order to take full advantage of the coding gain associated with Turbo codes. Therefore, the above assumption is reasonable.
The new approach presented here is for the MAP decoding algorithm. A generalization of these steps to logarithmic versions of the MAP algorithm is straightforward. This approach can be used for both decoders.
This approach has many advantages over the previously described technique. If there are a total of $M = 2^v$ states, the total memory requirement for the $\alpha_k(s)$ or $\beta_k(s)$ parameters in this approach is $2^{v-1} \times N$, compared to $2^v \times N$ in the previous technique. In many applications of Turbo codes, $N$ is a large number, so this approach reduces the memory requirements considerably. The computation of the parameters is also carried out much faster in this approach, with minimum hardware requirements. In fact, the soft output decoding for each iteration is available immediately after the computation of the $\alpha_k(s)$ and $\beta_k(s)$ parameters is finished. The total number of clock cycles needed to perform the operations for the entire Turbo block of length $N$ is only $N$, compared to $2N$ cycles for the previous approach.
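The savings claimed above amount to simple arithmetic, sketched here for concreteness. The sketch assumes that "half of the variables" means $2^{v-1} \times N$ stored values per recursion and that one trellis step takes one clock cycle; the function name `storage_and_cycles` is hypothetical.

```python
def storage_and_cycles(v: int, n: int) -> dict:
    """Stored values per recursion and clock cycles per block for both schemes."""
    m = 2 ** v                           # number of trellis states
    return {
        "conventional_values": m * n,    # every state metric for every k
        "proposed_values": (m // 2) * n, # half of the state metrics (assumed reading)
        "conventional_cycles": 2 * n,    # forward pass, then backward pass
        "proposed_cycles": n,            # both recursions within N cycles
    }
```

For the simulated code with $v = 3$ and $N = 1024$, this gives 8192 stored values and 2048 cycles for the conventional schedule against 4096 values and 1024 cycles for the proposed one.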
A new Turbo decoder chip based on the Simplified-Log-MAP algorithm and the proposed hardware architecture is currently being designed for DSL modems.
SIMULATION RESULTS
The BER performance of the Simplified-Log-MAP algorithm is compared to that of the MAP, Log-MAP, and Max-Log-MAP algorithms. The simulation results are for a Turbo code with code rate 1/2, $N = 1024$, $m = 3$, and feedback and feedforward generator polynomials equal to $15_{oct}$ and $17_{oct}$ respectively. Three iterations were used for the simulations, and the results in Figures 4 and 5 are for QPSK and 64-QAM constellation sizes. As the results show, the Log-MAP decoding algorithm has similar performance to the MAP algorithm (for QPSK it is exactly the same, so we did not plot it). The performance loss of the MAX-Log-MAP algorithm compared to the MAP algorithm ranges from 0.2 dB for QPSK up to 0.35 dB for 64-QAM. The SNR requirement for a given BER is higher at larger constellation sizes; therefore, the approximation of the logarithm has a more significant effect on the BER performance of the MAX-Log-MAP algorithm. The Simplified-Log-MAP algorithm has negligible performance degradation compared to the MAP algorithm for the QPSK constellation, while the performance loss is approximately 0.1 dB for 64-QAM. It can be concluded from the above results that the Simplified-Log-MAP algorithm together with the new hardware implementation is an appropriate choice for implementing Turbo decoders in practice without any significant loss in performance. In figure 6,
