Abstract: Memory accesses account for a large part of the power consumption in the iterative decoding of double-binary convolutional turbo code (DB-CTC). To address this, a low-memory intensive decoding architecture for DB-CTC is proposed in this paper. The new scheme is based on an improved maximum a posteriori probability algorithm in which, instead of storing all of the state metrics, only a part of them is stored in the state metrics cache (SMC), so that the memory size of the SMC is reduced by 25%. Owing to a compare-select-recalculate processing (CSRP) module in the proposed decoding architecture, the unstored state metrics are recalculated by simple operations, while near optimal decoding performance is maintained.
Introduction
In 1999, nonbinary convolutional turbo codes were introduced by Berrou and Jezequel [1], and they have been demonstrated to perform better than classical single-binary turbo codes [2]. Owing to these advantages, the double-binary convolutional turbo code (DB-CTC) has been adopted as the channel coding scheme by several radio standards, such as digital video broadcasting-return channel over satellite (DVB-RCS) [3] and worldwide interoperability for microwave access (WiMAX) [4]. More recently, to improve the error correction performance and the system throughput, DB-CTC was recommended by IEEE 802.16m as the forward error correction code [5, 6].
The typical decoder, which uses an iterative decoding algorithm for turbo-like codes, consists of 2 soft-in soft-out (SISO) constituent decoders, each of which computes its extrinsic information from the outputs of the other. However, because the decoding architecture is memory intensive, memory is accessed frequently during the iterations, and these accesses account for more than half of the total power consumption [7]. To design a power-efficient decoder for turbo-like codes, researchers have developed various techniques to reduce the size of the state metrics cache (SMC). The authors in [8] reduced the bit width of the state metrics based on their saturation, which led to a decoding scheme with less storage for the SMC. Martina et al. reduced the decoder area by employing a nonuniform quantization technique [9]; at the cost of a slight performance loss, this method achieves a 20%-50% reduction of the state metrics memory area. For DB-CTC, Kim and Park studied the encoding of border metrics, which reduced the energy consumed by the constituent decoder by approximately 26% [10]. Lin et al. introduced the traceback maximum a posteriori probability (MAP) decoding method to reduce the size of the SMC [11]. Additionally, researchers have developed low-complexity decoding algorithms [12-15] and memory-reduced techniques [16, 17] for DB-CTC. Although these schemes reduce power consumption effectively, the adopted algorithms are suboptimal; hence, performance loss is unavoidable in their hardware implementation.
Inspired by the memory-reduced techniques in [11] and [18], in this paper we propose a low-memory intensive decoding architecture that decreases the storage of the SMC. This research is based on our previous work on an efficient decoding algorithm for DB-CTC [19]. In the proposed decoding scheme, 6 of the 8 forward (or backward) state metrics are stored in the SMC at each time slot, while the 2 unstored state metrics are recalculated by a compare-select-recalculate processing (CSRP) module with simple operations. Because the proposed decoding architecture needs a smaller SMC and fewer memory accesses, the overall power consumption of the DB-CTC decoder can be decreased. Moreover, the simulation results show that the new architecture achieves near optimal performance compared with the classical MAP algorithm.
The remainder of this paper is organized as follows. Section 2 introduces an improved MAP algorithm, in which the exponential operations are performed outside of the constituent decoder. Section 3 proposes a low-memory intensive decoding scheme and shows how the unstored state metrics are recalculated by the CSRP module; the structures of the CSRP module and the constituent decoder are also described in detail. Section 4 investigates the performance of the proposed architecture, including the complexity of the recalculation, the SMC organization, the decoder timing diagram, and the bit error rate (BER). Finally, Section 5 gives the conclusion.
Improved MAP algorithm
The optimal decoding algorithm suitable for DB-CTC is the classical MAP algorithm [20]. Assuming that the code word is transmitted through an additive white Gaussian noise channel with noise variance $\sigma^2$, the branch metrics $\gamma_k^z(s', s)$ in the classical MAP algorithm are given by:
where $k$ is the time slot, $s', s \in S = \{s_0, \cdots, s_7\}$ are the trellis states, $z \in \phi = \{00, 01, 10, 11\}$, $u_k$ is the pair of information bits,
, we get a new method for the calculation of the branch metrics
where 
as the forward state metrics, the backward state metrics, the a posteriori likelihood ratio, and the extrinsic information, respectively, we have derived the following improved algorithm for the DB-CTC (please refer to [19] for details):
In the above algorithm, if $\bar{r}_k^{\lambda}$ are calculated in an exponential preprocessing module and then stored in the receive buffer, only multiply and addition operations remain in the iterations of the constituent decoder. Therefore, we get a generic decoding structure for the DB-CTC, as illustrated in Figure 1.
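As a rough illustration of why moving the exponentials into a preprocessing stage leaves only multiplications inside the iterations, consider the following Python sketch. It assumes antipodal symbols (+1 for bit 1, -1 for bit 0) and a channel term proportional to exp(sum(r * x) / sigma^2). The function names and this factorization are illustrative assumptions, not the exact formulas of [19].

```python
import math


def preprocess(received, sigma2):
    """Exponential preprocessing, done once per frame (assumed form).

    For each soft value r, store exp(+r / sigma2) and exp(-r / sigma2).
    """
    return [(math.exp(r / sigma2), math.exp(-r / sigma2)) for r in received]


def branch_factor(pre, bits):
    """Channel part of a branch metric using multiplications only.

    Assumption: bit 1 maps to symbol +1 and bit 0 maps to symbol -1.
    """
    out = 1.0
    for (e_pos, e_neg), b in zip(pre, bits):
        out *= e_pos if b else e_neg
    return out


def direct_factor(received, bits, sigma2):
    """Reference value computed with an exponential per branch."""
    corr = sum(r * (1.0 if b else -1.0) for r, b in zip(received, bits))
    return math.exp(corr / sigma2)


# Hypothetical soft values for one trellis section (4 coded bits).
received = [0.8, -1.1, 0.3, 1.4]
sigma2 = 0.5
pre = preprocess(received, sigma2)
for bits in [(1, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1)]:
    print(bits, branch_factor(pre, bits), direct_factor(received, bits, sigma2))
```

The two columns printed for each bit pattern should agree up to rounding, which is the property that lets the per-branch exponential be replaced by lookups and multiplications.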
Low-memory intensive decoding architecture

Parallel decoding structure and butterfly scheme
To increase the decoding speed, the parallel window (PW) decoding structure was proposed for the hardware implementation of turbo-like decoders [21]. Figure 2 gives a generic decoding structure for DB-CTC using the PW technique. The received soft bits of a constituent decoder are divided into N nonoverlapping windows, and all of these windows are processed independently and simultaneously. The border metrics of each window are taken from the previous iteration of its 2 neighboring windows (because of the tail-biting characteristic of DB-CTC, the first and last windows are adjacent to each other). It should also be noted that in the first iteration, all of these border metrics are initialized to the same value.
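The window partitioning and the circular exchange of border metrics described above can be sketched as follows. This is a minimal Python illustration; the function names are hypothetical, and the border metrics are represented by opaque placeholders rather than real metric vectors.

```python
def split_windows(frame_len, num_windows):
    """Divide a frame into equal, nonoverlapping [start, end) windows."""
    assert frame_len % num_windows == 0, "sketch assumes equal-sized windows"
    w = frame_len // num_windows
    return [(i * w, (i + 1) * w) for i in range(num_windows)]


def border_sources(i, num_windows):
    """Indices of the windows that supply alpha_in and beta_in to window i.

    Tail-biting makes the window sequence circular, so window 0 and
    window N-1 are neighbors.
    """
    return (i - 1) % num_windows, (i + 1) % num_windows


def exchange_borders(alpha_out, beta_out):
    """Turn the border outputs of one iteration into the next iteration's inputs."""
    n = len(alpha_out)
    alpha_in = [alpha_out[(i - 1) % n] for i in range(n)]
    beta_in = [beta_out[(i + 1) % n] for i in range(n)]
    return alpha_in, beta_in


# First iteration: every border metric starts from the same value.
num_windows = 4
alpha_in = ["init"] * num_windows
beta_in = ["init"] * num_windows

# After an iteration, each window i has produced alpha_out[i] and beta_out[i].
alpha_out = [f"a{i}" for i in range(num_windows)]
beta_out = [f"b{i}" for i in range(num_windows)]
alpha_in, beta_in = exchange_borders(alpha_out, beta_out)
print(alpha_in, beta_in)
```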
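[Figure 2: Generic decoding structure for DB-CTC using the PW technique. Recoverable block labels: interleaver/deinterleaver, memory for extrinsic information.]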
To further speed the iteration, butterfly scheduling [11, 18] is adopted in this research. In the decoding window, let W be the window length, andᾱ 
, and the stored state metrics in the SMC are used to compute extrinsic
Based on the MAP algorithm introduced in Section 2, and by adopting the PW decoding structure and the butterfly scheme, we propose a method to decrease the memory size of the SMC in the following sections.
Recalculation of the unstored forward state metrics
In Max-log-MAP and its derivatives, the Max operation is used to decrease the decoding complexity. Consequently, forward computation of the backward state metrics is considerably complicated [22]. In the forward recursion of the classical decoding architecture [15, 23], 8 forward state metrics $\bar{\alpha}_k(s)$ are calculated and then stored in SMC$_\alpha$ at every time slot $k$. With the MAP algorithm introduced in Section 2, our research shows that storing 6 of the 8 forward state metrics is enough: at the time slot where the 2 unstored state metrics are needed, they can be recomputed by a CSRP module with simple operations. For convenience of presentation, we write Eq. (3) in matrix form as below:
In the forward direction, the forward state metrics, except forᾱ
Because butterfly scheduling is adopted in our proposed decoding architecture, the $k$th decoding slot is exactly the time slot at which the backward state metrics $\bar{\beta}_k(s')$ are calculated.
Because of the recursive process, ] =γ 
Similarly, to select the suitable equation in Eq. (10) for the recalculation, we calculate the maximum value among the coefficients of $\bar{\alpha}_k(s_4)$. For instance, if we get a result by:
the third equation in Eq. (11) is used to recalculate $\bar{\alpha}_k(s_4)$. As the recalculation keeps running,
are computed in the backward direction.
Recalculation of the unstored backward state metrics
In the process of backward recursion, the 8 backward state metrics $\bar{\beta}_k(s')$ are calculated at every time slot $k$, and 6 of them are stored in SMC$_\beta$. Rewriting Eq. (4) in matrix form, we get:
In the backward direction, the backward state metrics, except for $\bar{\beta}_k(s_4)$ and $\bar{\beta}_k(s_6)$, $k \in \{W, \cdots, W/2 + 1\}$, are stored in SMC$_\beta$. By the same method described in Section 3.2, the unstored backward state metrics can also be recursively recalculated at their corresponding time slots.
Proposed decoding architecture with the CSRP module
Based on the aforementioned discussion, recalculation of the unstored state metrics consists of 3 steps, together called the CSRP operation in this paper, as shown in Figure 3:
1) Compare the coefficients of the unstored state metrics.
2) Select the appropriate equation among the 4 candidate equations.
3) Recalculate the unstored state metrics recursively in the recalculate processing module.
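The three steps above can be sketched in Python as follows. This is only an illustration under loud assumptions: the trellis `next_state` is a hypothetical toy with 8 states and 4 branches per state (it is not the standardized DB-CTC trellis), the unstored forward metrics are taken to be those of states 4 and 6, the recursion is the plain probability-domain form without normalization, and all names are invented for this example. The point is that one stored equation of the recursion, chosen by its largest coefficient, is solved for the single missing metric.

```python
import random

NUM_STATES, NUM_INPUTS = 8, 4


def next_state(s, z):
    """Hypothetical toy trellis, NOT the standardized DB-CTC trellis."""
    return (2 * s + z) % NUM_STATES


def forward_step(alpha, gamma):
    """alpha_{k+1}(t) = sum over branches s -> t of alpha_k(s) * gamma[s][t]."""
    nxt = [0.0] * NUM_STATES
    for s in range(NUM_STATES):
        for z in range(NUM_INPUTS):
            t = next_state(s, z)
            nxt[t] += alpha[s] * gamma[s][t]
    return nxt


def csrp_recalculate(u, alpha_stored, alpha_next, gamma):
    """Recover the unstored alpha_k(u).

    alpha_stored holds alpha_k with None at the unstored states;
    alpha_next holds all 8 values of alpha_{k+1}.
    """
    # 1) Compare: coefficients of alpha_k(u) in its 4 candidate equations.
    candidates = [next_state(u, z) for z in range(NUM_INPUTS)]
    # 2) Select: the equation whose coefficient of alpha_k(u) is largest.
    t = max(candidates, key=lambda c: gamma[u][c])
    # 3) Recalculate: remove the 3 known terms, then divide by the coefficient.
    others = [s for s in range(NUM_STATES) for z in range(NUM_INPUTS)
              if next_state(s, z) == t and s != u]
    residual = alpha_next[t] - sum(alpha_stored[s] * gamma[s][t] for s in others)
    return residual / gamma[u][t]


random.seed(1)
gamma = [[random.uniform(0.1, 1.0) for _ in range(NUM_STATES)]
         for _ in range(NUM_STATES)]
alpha_k = [random.uniform(0.1, 1.0) for _ in range(NUM_STATES)]
alpha_next = forward_step(alpha_k, gamma)

# Store only 6 of the 8 metrics (states 4 and 6 are dropped in this sketch).
stored = [a if s not in (4, 6) else None for s, a in enumerate(alpha_k)]
for u in (4, 6):
    print(u, alpha_k[u], csrp_recalculate(u, stored, alpha_next, gamma))
```

In this toy trellis each candidate equation contains exactly one unstored metric, which is what makes the single-equation solve possible. Selecting the largest coefficient keeps the divisor away from zero, and the solve costs 3 multiplications, 3 subtractions, and 1 division (or a multiplication by a reciprocal).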
It should be noted that state metrics 1 is $\alpha_{in}$ or $\beta_{in}$ (the border metrics delivered from the neighboring windows in the previous iteration), while state metrics 2 is $\bar{\alpha}_{W/2}(s)$ or $\bar{\beta}_{W/2}(s')$. Taking constituent decoder 1 as an example, the overall decoding architecture with the CSRP module for DB-CTC is presented in Figure 4, and the decoding procedure can be decomposed into 2 main steps:
The a posteriori likelihood ratio
The a posteriori likelihood ratio s) ) are computed by the BMU α (or BMU β ).
1.2) The values ofᾱ
are computed by the recursive calculation module, and thenᾱ
2.1) By the same operations in Step 1.1, branch metricsγ
) are computed by the recursive calculation module; instead of being stored in SMC$_\alpha$ (or SMC$_\beta$), they are used by CSRP$_\alpha$ (or CSRP$_\beta$) to initialize state metrics 2.
) are the forward (or backward) state metrics calculated in Step 2.3; taken as the border metrics $\alpha_{out}$ (or $\beta_{out}$), they are delivered to the neighboring windows and used in the next iteration.
Performance analysis and BER simulation
In this section, we analyze the proposed decoding architecture in terms of the recalculation complexity, the decoder timing diagram, and the memory organization. In addition, to show the near optimal decoding performance of the new scheme, the results of the BER simulation are presented.
Complexity of the recalculation
Based on the aforementioned discussion, the unstored state metrics are recalculated in the CSRP module. To simplify the hardware implementation, 1 Max operation with 4 operands is decomposed into 3 Max operations with 2 operands as below [17]:
Considering the structure of the equations in Eq. (8), once a suitable equation has been selected, the unstored state metric can be recalculated by 4 multiply and 3 addition operations. As seen from the description in Section 3.4, when $W/2 \le k \le W - 1$, there are 4 state metrics that must be recalculated by the CSRP$_\alpha$ and CSRP$_\beta$ modules. The overall complexity of the recalculation is summarized in Table 1.
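One common way to realize this decomposition is a two-level comparator tree, sketched below in Python. The tree arrangement is an assumption for illustration; a sequential chain of 3 two-operand comparisons would give the same result with the same operation count.

```python
def max2(a, b):
    """Two-operand Max, the basic comparator unit."""
    return a if a >= b else b


def max4(a, b, c, d):
    """Four-operand Max built from exactly 3 two-operand Max operations."""
    return max2(max2(a, b), max2(c, d))


# Hypothetical coefficient values of one unstored state metric.
coefficients = (0.12, 0.87, 0.45, 0.33)
print(max4(*coefficients))
```

In the tree form, the two first-level comparisons are independent of each other and can run in parallel in hardware, so the critical path is 2 comparators deep rather than 3.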
Decoder timing diagram
In Figure 4 , the forward state metricsᾱ k (s) are recursively computed from the beginning of the window to the end of the window, while the backward state metricsβ k (s ′ ) are recursively computed in the opposite direction. Therefore, computation of the extrinsic information is separated into 2 parts, and then they are parallel computed from the middle of the window to the beginning and the end of the window. Figure 5 describes the timing flow of the proposed decoding architecture, and shows that the proposed architecture is a bidirectional high-speed decoding scheme. 
Memory organization
Since the proposed decoding architecture uses butterfly decoding scheduling, the branch metrics are computed from the 2 opposite directions by BMU$_\alpha$ and BMU$_\beta$ simultaneously, and no memory is used for the branch metrics. More importantly, by employing the CSRP module in this architecture, only 6 forward (or backward) state metrics need to be stored at the corresponding time slot. Therefore, the memory storage of the SMC is decreased by 25% compared with the classical decoding architecture in [15] and [23], and it is also smaller than that of the decoding architectures in [10], [11], and [17]. Given the quantization $J = 10$ for each state metric, the memory organizations of the 4 decoding architectures are summarized in Table 2.
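The 25% figure can be checked with a small calculation. The Python sketch below assumes that, under butterfly scheduling, each of the two SMCs holds metrics for W/2 time slots, and it uses the window size W = 20 from the BER simulation together with J = 10; it does not reproduce the entries of Table 2, and the function name is hypothetical.

```python
def smc_bits(window_len, stored_per_slot, bits_per_metric):
    """Total SMC size for one window, forward and backward caches together.

    Assumption: each direction stores metrics for window_len / 2 slots.
    """
    depth = window_len // 2
    return 2 * depth * stored_per_slot * bits_per_metric


W, J = 20, 10
classic = smc_bits(W, 8, J)   # all 8 state metrics stored per slot
proposed = smc_bits(W, 6, J)  # 6 stored, 2 recalculated by CSRP
reduction = 1 - proposed / classic
print(classic, proposed, reduction)
```

The ratio depends only on the number of stored metrics per slot (6 versus 8), so the 25% reduction holds for any W and J under this assumption.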
Performance of the BER
To verify the effectiveness of our proposed decoding architecture, the enhanced Max-log-MAP (EML-MAP) algorithm, the classical MAP algorithm, and the proposed decoding architecture with the CSRP module are selected for comparison. In the simulation, the interleaver parameters of the rate-1/3 DB-CTC are in accordance with the requirements of 802.16m [5]. The decoder performs 8 iterations, and the signal-to-noise ratio ranges from 0.1 dB to 1.5 dB. The frame length is 800 bits and the window size is 20. The results in Figure 6 show that the BER performance of the proposed decoding architecture is only about 0.05 dB inferior to that of the classical MAP algorithm, even though our scheme performs low-memory intensive decoding.
Conclusion
Based on an improved MAP algorithm for DB-CTC, we have proposed a low-memory intensive decoding architecture for the design of a low-power decoder. Compared with the conventional decoding architecture, our scheme reduces the memory size of the LIFO SMC by 25%, and the unstored state metrics are effectively recalculated by a CSRP module. At the price of the additional computation performed by the CSRP module, the number of memory accesses is also reduced by 25%. Since the PW decoding technique and butterfly decoding scheduling are adopted, the proposed decoding architecture also achieves a high decoding speed. The simulation and comparison show that the new architecture has almost the same BER performance as the classical MAP algorithm. Therefore, the proposed decoding architecture can be applied in the very-large-scale integration implementation of a low-power, high-speed DB-CTC decoder.
