Abstract: Convolutional turbo decoding requires frequent data access and large memories. To reduce the size of the metrics memory, traceback MAP decoding is introduced for double-binary convolutional turbo codes without loss of correction performance. The traceback technique reduces the metrics memory size without the additional checkers that prolong the decoding latency. The two proposed traceback structures offer a tradeoff between power and operating frequency. They achieve around 20% power reduction of the metrics memory and around 7% power reduction of the decoders for the WiMAX standard.
I. INTRODUCTION
The binary convolutional turbo code (CTC), proposed in 1993 [1], has been shown to achieve a high coding gain close to the Shannon capacity limit. In 1999, the non-binary CTC [2] was introduced to provide superior coding gain. In recent years, the double-binary CTC has been adopted in advanced wireless communication standards such as DVB-RCS [3] and WiMAX [4].
The reduction in bit error rate offered by the CTC is achieved at the expense of the intensive computations involved in the iterative turbo decoding steps. Iterative turbo decoding is built from soft-input soft-output (SISO) decoding algorithms. A powerful SISO algorithm is the maximum a posteriori probability (MAP) algorithm. Because of their additive forms, the log-MAP (L-MAP) and the Max-log-MAP (ML-MAP) are the most widely used MAP algorithms. An enhanced Max-log-MAP (EML-MAP) has been proposed to provide much better coding gain than the ML-MAP [5]. However, because it avoids heavy mathematical approximations, the L-MAP still has better correction performance than the (E)ML-MAP.
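The difference between the L-MAP and the ML-MAP can be illustrated with the two-input combining operation. The following is a minimal sketch, assuming the usual Jacobian-logarithm form of the L-MAP; the function names `max_star` and `max_log` are illustrative and do not come from the paper.

```python
import math


def max_star(a: float, b: float) -> float:
    """L-MAP combining: ln(e^a + e^b), written as max plus a corrective term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """ML-MAP combining: the corrective term is dropped."""
    return max(a, b)


def approximation_error(a: float, b: float) -> float:
    """Amount by which the ML-MAP underestimates the L-MAP result."""
    return max_star(a, b) - max_log(a, b)
```

The error is largest when the two inputs are equal, where it reaches ln 2, and it vanishes as the inputs move apart. This is why a small look-up table indexed by the difference is enough to implement the corrective term.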
The memory organization of the metrics in MAP algorithms is critical. For binary CTC decoders, several previous works reduce power consumption by decreasing the size of the memory or the number of memory accesses [6]-[7]. Reverse calculations with a flag memory and reversion checkers are proposed in [7], [8]. The reversion checkers lengthen the decoding critical path or add decoding cycles. Moreover, the reverse calculation in [8] works only in the (E)ML-MAP. Our previous work [6] proposed tracing the metrics back instead. The traceback calculation works with low logic overhead in both the L-MAP and the (E)ML-MAP.
Fig. 1 illustrates the decoding paths of the conventional and traceback computations. In the conventional path, the state metrics computed by the natural recursion processor (NRP) in the natural order are stored in the metrics memory (MM). The state metrics are then read out in the reverse order to compute the log-likelihood ratio (LLR). In the traceback path, the difference metrics are stored in the MM instead. The state metrics are then traced back in the reverse order using the stored difference metrics. Because the difference metrics are accessed instead of all the state metrics, the size of the MM can be reduced.
The double-binary CTC increases the computational complexity of SISO decoding because of its radix-4 trellis. The reverse calculation requires more complicated reversion checkers when the trellis changes from radix-2 to radix-4. Few works in the literature discuss the memory organization of double-binary CTC decoders. In this paper, the traceback technique for the double-binary CTC is introduced. Two pairs of an add-compare-select (ACS) unit and its corresponding traceback unit are demonstrated to reduce the power consumption of the MM.
II. FUNDAMENTALS OF MAP ALGORITHMS
In double-binary CTC decoding, MAP algorithms decide the binary information bits $u_k$ = (x 
where $\alpha_k$ is the forward recursion state metric and $\beta_k$ is the backward recursion state metric.
The a posteriori information
During the operations of the turbo decoding, one value has to be iteratively interleaved and fed back to the ML-MAP algorithm as a priori information. This value called extrinsic information
The intrinsic information (z) in,k is defined as
Note that the values of the
In this paper, the traceback technique is extended from the radix-2 trellis to the radix-4 trellis. Fig. 2(b) shows the traceback recursion of the radix-4 trellis for the double-binary CTC. Instead of storing all the state metrics, the difference metrics computed by the ACS units are stored in the MM while the natural recursion generates the state metrics. The traceback recursion then regenerates the state metrics from the stored difference metrics by means of the traceback units. The TRP, composed of only 2 traceback units (black states in Fig. 2(b)), recomputes the 8 state metrics. Thus, the traceback technique reduces the MM size with low logic overhead and requires neither a flag memory nor a reversion checker, which would prolong the decoding latency. The traceback calculation works in the L-MAP and the (E)ML-MAP. Note that the proposed traceback technique can also be applied to the radix-4 trellis of the binary CTC with some modifications.
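Why a stored difference metric suffices to undo one compare-select step can be sketched as follows. The sketch assumes a radix-2 ACS that combines two path metrics $A$ and $B$ (each a state metric plus a branch metric) and stores $D = A - B$; the symbols $A$, $B$, $D$ and $M$ are illustrative names rather than notation taken from the paper.

```latex
\begin{aligned}
M &= \max(A,B) + \ln\!\left(1+e^{-|A-B|}\right), \qquad D = A - B,\\
\max(A,B) &= M - \ln\!\left(1+e^{-|D|}\right),\\
(A,\,B) &=
\begin{cases}
\bigl(\max(A,B),\ \max(A,B) - D\bigr), & D \ge 0,\\
\bigl(\max(A,B) + D,\ \max(A,B)\bigr), & D < 0.
\end{cases}
\end{aligned}
```

Dropping the logarithmic term gives the (E)ML-MAP case. Subtracting the corresponding branch metric from each recovered path metric then yields the state metric of the previous trellis step, so under these assumptions only $D$ has to be kept in the MM.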
IV. ACS UNITS AND TRACEBACK UNITS
For a radix-4 trellis of the double-binary CTC, 4 state inputs are needed to compute each state output. We demonstrate 2 types of ACS unit, each with its corresponding traceback unit, for the L-MAP and the radix-4 trellis. In the L-MAP, a look-up table (LUT) implements the corrective term $\ln(1+e^{-x})$. Note that all structures described in this section perform the (E)ML-MAP if the LUT is not used.
Fig. 3 shows an example of the radix-2x2 ACS unit, which consists of 3 radix-2 ACS units and a LUT. In the traceback technique, 3 difference metrics (Diff_0, Diff_1 and Diff_2) of the radix-2x2 ACS unit are stored in the MM. Fig. 4 shows the corresponding traceback unit. With the 3 difference metrics stored in the MM, the state metrics can be recomputed by the traceback unit. Because 2 current states can trace 8 next states back, a total of 6 difference metrics are stored in the MM. The storage of the MM is thus reduced from 8 state metrics to 6 difference metrics.
The second type is the radix-4 ACS unit, which has a shorter critical path but higher complexity than the radix-2x2 ACS unit. Fig. 5 shows an example of the radix-4 ACS unit, which uses a comparator to select the maximal metric quickly. Unlike the radix-2x2 ACS unit, the radix-4 ACS unit stores 2 selective bits (S0 and S1) in the MM in addition to 3 difference metrics (Diff_0, Diff_1 and Diff_2). Fig. 6 shows the corresponding traceback unit, which is less complex than the traceback unit of the radix-2x2 ACS unit. The storage of the MM is reduced from 8 state metrics to 6 difference metrics and 4 extra bits.
Note that the sums of state metrics and branch metrics at the outputs of the traceback units (A, B, C, D in Fig. 4 and Fig. 6) can serve directly as the input values of the LLR unit when it computes the a posteriori information in (4). This approach saves 8 adders in the LLR unit.
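The radix-2x2 pair can be modelled in a few lines. The following is a minimal floating-point sketch, assuming that Diff_0 and Diff_1 are the differences inside the two first-stage radix-2 ACS units and Diff_2 is the difference in the second stage; the exact wiring of Fig. 3 and Fig. 4 is not reproduced here, and all function names are hypothetical.

```python
import math


def correction(diff: float) -> float:
    """Corrective term ln(1 + e^-|diff|); a LUT in hardware (assumed here)."""
    return math.log1p(math.exp(-abs(diff)))


def acs2(a: float, b: float, use_lut: bool = True):
    """Radix-2 ACS: return the combined metric and the difference a - b."""
    diff = a - b
    out = max(a, b) + (correction(diff) if use_lut else 0.0)
    return out, diff


def trace2(out: float, diff: float, use_lut: bool = True):
    """Undo acs2: recover (a, b) from the output and the stored difference."""
    largest = out - (correction(diff) if use_lut else 0.0)
    if diff >= 0:          # a was the larger input
        return largest, largest - diff
    return largest + diff, largest  # b was the larger input


def acs_radix2x2(a, b, c, d, use_lut: bool = True):
    """Three radix-2 ACS units in two stages; returns output and 3 differences."""
    m0, diff0 = acs2(a, b, use_lut)
    m1, diff1 = acs2(c, d, use_lut)
    out, diff2 = acs2(m0, m1, use_lut)
    return out, (diff0, diff1, diff2)


def traceback_radix2x2(out, diffs, use_lut: bool = True):
    """Recover the four path metrics A, B, C, D from one output and 3 differences."""
    diff0, diff1, diff2 = diffs
    m0, m1 = trace2(out, diff2, use_lut)
    a, b = trace2(m0, diff0, use_lut)
    c, d = trace2(m1, diff1, use_lut)
    return a, b, c, d
```

Calling `traceback_radix2x2` on the result of `acs_radix2x2` returns the four inputs again, up to floating-point rounding when the corrective term is used. Passing `use_lut=False` gives the (E)ML-MAP behaviour described above. In a fixed-point design the recovery would be exact as long as both units read the same LUT entry, which is an assumption of this sketch rather than a statement from the paper.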
V. EXPERIMENTAL RESULTS
Fast and accurate hardware evaluation and power estimation are obtained from Verilog HDL code synthesized with the standard cell library of the TSMC 0.13-µm CMOS process. Table I shows the evaluation parameters under the specification of the WiMAX standard. To support the high throughput of the WiMAX CTC, the traceback structures are evaluated in parallel-window (PW) [9] EML-MAP decoding.
Table II shows the silicon area and power consumption of the different structures. The radix-4 ACS unit and its corresponding traceback unit are referred to as the radix-4 pair, and the radix-2x2 ACS unit and its corresponding traceback unit are referred to as the radix-2x2 pair. The area results are reported by Synopsys Design Vision. The power consumption on noisy data at 2-dB SNR is estimated by Synopsys PrimePower at an operating frequency of 100 MHz. 8 ACS units (as an NRP) and 2 traceback units (as a TRP) are grouped as an example of the traceback structure. The results show that the radix-2x2 pair has lower hardware cost and power consumption, whereas the radix-4 pair has a shorter and more balanced critical path. The choice between them is a tradeoff between power consumption and maximal operating frequency.
In Table III, single-port RAMs in the TSMC 0.13-µm process are generated to evaluate the three different MM organizations. Note that 8 state metrics have to be stored in the MM in the conventional organization, regardless of whether it is composed of radix-2x2 or radix-4 units. The traceback organization of the radix-2x2 pair achieves a 24.9% area reduction of the RAM (0.011 mm²) with an 18.8% area overhead for the traceback units (0.008 mm²), and a 25% power reduction of the MM (0.55 mW) with a 3.6% power overhead for the traceback units (0.08 mW). The traceback organization of the radix-4 pair achieves a 19.6% area reduction of the MM (0.009 mm²) with a 13.2% area overhead for the traceback units (0.006 mm²), and a 19.5% power reduction of the RAM (0.43 mW) with a 2.3% power overhead for the traceback units (0.05 mW).
Table IV compares the area and power of the 10-SISO PW decoders that meet the specification in Table I. The decoders operate at a clock frequency of 100 MHz and achieve a throughput of 115.4 Mbps. Because the double-binary CTC increases the computational complexity of the decoders (e.g., the LLR units and branch metrics units), the silicon area is not reduced significantly. However, the power consumption of the 10-SISO PW decoders is noticeably lower in the traceback organizations.
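The reported MM savings can be compared with a simple count of bits stored per trellis step. The following is a rough sketch under two assumptions that are not stated in this section: each difference metric is as wide as a state metric, and that common width is 10 bits (a hypothetical value). The function names are illustrative.

```python
def mm_bits_per_step(num_metrics: int, metric_width: int, extra_bits: int = 0) -> int:
    """Bits written to the metrics memory for one trellis step."""
    return num_metrics * metric_width + extra_bits


def storage_reduction(metric_width: int) -> dict:
    """Fraction of MM bits saved by each traceback organization."""
    conventional = mm_bits_per_step(8, metric_width)            # 8 state metrics
    radix2x2 = mm_bits_per_step(6, metric_width)                # 6 difference metrics
    radix4 = mm_bits_per_step(6, metric_width, extra_bits=4)    # plus 4 selective bits
    return {
        "radix-2x2": 1 - radix2x2 / conventional,
        "radix-4": 1 - radix4 / conventional,
    }


reductions = storage_reduction(metric_width=10)  # hypothetical 10-bit metrics
```

Under these assumptions the bit count drops by 25% for the radix-2x2 pair and by 20% for the radix-4 pair, which is of the same order as the RAM reductions reported above. The measured figures also depend on the RAM periphery and on the real bit widths, so the match should be read as indicative only.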
In addition, an implementation of a 12-mode WiMAX-compliant CTC decoder with the radix-2x2 traceback pair will be presented in detail in [10].
VI. CONCLUSIONS
To reduce the MM size, traceback MAP decoding for the double-binary CTC is proposed in this paper. Two pairs of an ACS unit and its corresponding traceback unit are introduced for the radix-4 trellis. The radix-2x2 pair has low hardware cost, and the radix-4 pair has a short critical path. The experimental results in the TSMC 0.13-µm CMOS process at an operating frequency of 100 MHz show that the traceback organization of the radix-2x2 pair achieves a 25% power reduction of the MM, and that of the radix-4 pair achieves a 19.5% power reduction. The proposed traceback structures achieve around 7% power reduction of the 10-SISO PW decoders for the WiMAX CTC.
