Abstract-This paper presents an energy-efficient turbo decoder based on border metric encoding, which is especially suitable for non-binary circular turbo codes. In the proposed method, the size of the branch memory is reduced to half and the dummy calculation is removed at the cost of a small-sized memory that holds encoded border metrics. Due to the small size and infrequent access to the border memory, power consumption for soft-input soft-output (SISO) decoding is reduced by 26.0%. Based on the proposed SISO decoder and the dedicated hardware interleaver, a double-binary tail-biting turbo decoder is designed for WiMAX standard using a 0.18 µm CMOS process and it can support 12.14Mbps at operating frequency of 100MHz.
INTRODUCTION
The turbo code introduced in 1993 is one of the most powerful forward error correction channel codes, and provides near optimal bit error rates (BERs), that is, within 0.5dB of Shannon's limit at BER of 10 -5 [1] . Having this remarkable performance, the turbo codes were accepted in many standardized mobile radio systems. Recently, nonbinary turbo codes have been adopted in several mobile radio systems such as DVB-RCS and IEEE 802.16 standard (WiMAX) [2] [3], as they can offer many advantages over the classical single binary turbo codes [4] . To avoid spectrum waste caused by the tail bits, circular coding technique, namely tail-biting, is also adopted in recent turbo codes.
Recently, several papers have suggested effective schemes for the soft-input soft-output (SISO) decoding of double-binary turbo codes [5] - [7] . However, there is little research yet on the hardware implementation of the nonbinary turbo codes, although the previous works on the classical single binary turbo codes can be applied to the nonbinary turbo codes. Compared with the classical single binary turbo codes, non-binary turbo codes are much more complex in hardware implementation. In addition, the initial state determination incurred by the tail-biting property leads to increased hardware complexity. This paper presents an energy-efficient SISO decoder suitable for the hardware implementation of the non-binary circular turbo decoding. Based on the proposed architecture, a double-binary tail-biting turbo decoder is implemented for WiMAX with a dedicated hardware interleaver.
II. MAX-LOG-MAP ALGORITHM
A typical turbo decoder consists of two SISO decoders serially concatenated via an interleaver. Based on the MAP algorithm [1] , how to SISO decode the single-binary turbo codes is well described in [8] . In the double-binary turbo codes, the output of the k-th symbol is expressed in LLR form and three LLR values are required as follows.
where {01,10,11} z , s k is the state of an encoder at time k, and , and are the forward, backward, and branch metrics, respectively. The metrics are calculated as expressed in equations (2), (3) and (4), where A is the set of states at time k-1 connected to state s k , and B is the set of states at time k+1 connected to state s k .
where {00, 01,10,11} z , u k is the input symbol consisting of two bits, P(u k ) is a priori probability of u k , x k and y k are transmitted and received codewords associated with u k , respectively. Superscripts p and s denote the parity bits and systematic bits, respectively. In (4), ( ) ,
is the extrinsic information received from the other SISO decoder and the code is transmitted through an AWGN channel with a noise variance 2 . Since turbo decoding based on the Maxlog-MAP algorithm is independent of SNR, 2 2 c L is usually set to a constant value, although it can be obtained from channel estimation. As expressed above, the metric calculation complexity of the non-binary turbo codes is higher than that of the single-binary turbo codes since the number of branches connected with an each trellis state is increased from two to four and the complexity of the branch metric calculation is also increased.
Due to the tail-biting property, the initial values of the forward metric and backward metric are not explicitly specified, because the initial state, namely the circular state, is not delivered to the decoder. To determine the initial values, two methods are generally preferred. The first one is to pre-compute several trellis stages to get reliable initial values [6] . Another way is to use the metric values of the prior iteration [5] [6] . It has been reported that using the information of the prior iteration shows better performance and lower computational complexity than the pre-computing method [6] . Therefore, in this paper, the initial values of the forward and backward metrics can be determined as follows. 
where N is the frame size in pairs, s denotes the encoder state and ( ) N s and 0 ( ) s is the final metrics of the prior iteration.
III. SLIDING WINDOW FOR NON-BINARY TURBO CODES
The sliding window technique is effective in reducing the memory size required to store metric values in hardware implementation. In the sliding window technique, a large frame is split into several windows and the MAP decoding is applied to each window independently [11] . Figure 1 (a) shows the conventional sliding window diagram where forward metrics are calculated prior to backward metrics [7] . As shown in Fig. 1(a) , the dummy calculation is performed for the backward metrics to obtain the reliable initial values at the border of the window. We can obtain the reliable values that do not degrade performance if the window size is sufficiently long.
There is another way to obtain the reliable border metric values of the window. For each window, the final backward metric is stored in the border metric memory. The stored border metric is loaded in the next iteration to regard it as the initial backward metric value at the border of the window as illustrated in Fig. 1(b) . Since there is no stored value in the first iteration, all states at the border of the window are assumed to be equiprobable at the first time. Although there is slight performance degradation compared to the method based on the dummy calculation, the performance degradation disappears after a few iterations. By using the metrics stored in the prior iteration, we can avoid the dummy backward metric calculation. It has been reported that we can achieve higher bandwidth efficiency for triple-binary turbo codes [9] and obtain better performance if the number of states in the trellis increases [4] . Since these two factors increase the computational complexity of the dummy backward metric calculation, the sliding window associated with no dummy calculation can be more effective for the future turbo codes as well as the double-binary turbo codes. Additionally, the memory size required for the branch metric is reduced to half since the number of processes where the branch metrics are participated is changed from four to two.
IV. BORDER METRIC ENCODING
As described, a memory that holds the border metric values of the prior iteration is needed. To achieve more energy-efficient turbo decoding, the size of the border memory should be minimized. If the maximum frame size supported in the turbo codes is N max , the number of states in trellis is K, and state metric values are represented in P bits, the size of the border memory (BM) is defined as follows. max N BM K P WS (6) where WS is the window size. Since N max and K are fixed for a standard, the border memory size depends only on the window size and the size of the state metric. Since the window size is usually set to 32 for the 8-state trellis, we should decrease the value of P in order to reduce the overall border memory size.
The reduction of the border memory can be realized by allowing a few values to represent the border metrics. Though the reliability of the border metric is slightly decreased due to the loss of accuracy, this can be totally recovered after a few trellis stages. A simple encoding with low hardware complexity is to floor the original border metrics to the closest power of two numbers. The experimental environment for the double-binary turbo decoder for WiMAX is indicated in Table I , where (q, f) denotes a quantization scheme that uses q bits in total and f bits to represent the fractional part. The quantization schemes are determined by performing several simulations. Possible values at the border are denoted in Table II . BER results with with border memory various methods are shown in Fig. 2 . From Fig. 2 , it is noticed that two methods described in Fig. 1 (a) and (b) shows almost equal performance. Also, 4-bit border metric encoding schemes shows that there is almost no performance degradation and the method with 3-bit border metric encoding works well with about 0.03dB performance degradation compared to the non-encoding method. The performance in 3-bit border metric encoding is degraded mainly at the high SNR since the values are limited at the relatively small value. The size of the border metric memory can be reduced significantly by using the proposed encoding.
V. HARDWARE INTERLEAVER DESIGN
In turbo codes, the interleaver is involved in both encoding and decoding. The most straightforward method of implementing the address interleaving is to store interleaved addresses in a memory after generating all the possible addresses. In case of WiMAX, the memory size should be large enough to cover 2400 by 12bits, leading to significant area occupation and power consumption. To achieve low power consumption, a dedicated hardware interleaver is designed to generate the address on-the-fly [10] . Fig. 3 describers how to calculate the interleaved addresses on-thefly for WiMAX, where P 0 , P 1 , P 2 and P 3 are determined according to the frame length, N [3] . The first step can be simply accomplished by switching the values according to the least significant bit (LSB) of the address. Figure 4 illustrates the hardware structure for the second step, that is, inter-symbol permutation. Since the input address increases sequentially, we can generate the permutated address by accumulating P 0 to four different initial values and selecting one of four values according to two LSBs. For modulo operation, the accumulated values are always maintained to be less than N as shown in Fig. 4 . Initial values of the accumulators are pre-calculated and maintained in a small table.
VI. IMPLEMENTATION RESULTS
With the quantization indicated in Table I , a Max-log-MAP decoder based on the proposed border metric encoding was described in Verilog-HDL and synthesized with a 0.18 µm 4-Metal CMOS standard-cell library and compiled SRAM memories. Design Compiler and Power Compiler of Synopsys were used for the synthesis and power estimation, respectively. Switching activities resulting from gate-level simulation are annotated for gate-level power estimation. The window size is set to 32 and 4-bit border metric encoding has been employed. In the hardware implementation of the SISO decoder, forward metrics and backward metrics are normalized by subtracting a value of the zero state, 0 ( ) k s and 0 ( ) k s , from other metrics at the same trellis stage in order to avoid overflow in state metrics, which also eliminates the need to store metric values at the zero state. Since the SISO decoder takes two systematic bits and two parity bits at a time, the number of possible branch metrics is 16 while the number of possible branch metrics is 4 in the classical single binary turbo codes. Among the 16 possible branch metrics, only 8 branch metrics are unique for each trellis stage and others can be calculated by these unique branch metrics. Although the number of branch metrics to be stored is reduced by half, the size of branch memory is considerable if the conventional sliding window is adopted as indicated in Table III . The total memory size in the SISO decoder is reduced by 20.5% by applying the proposed border metric encoding method, as indicated in Table III. The energy consumption of the proposed SISO decoder and the interleaver in 1.2dB SNR with 8 iterations is presented in Table IV . As indicated in Table IV , the power consumption of the SISO logic is reduced by eliminating the dummy calculation logic. Also, as shown in Fig. 1(b) , the power consumption of the border metric memory accesses is very low because of its small size and infrequent accesses. While processing a window, we need to access the border memory only two times -one for load and the other for store. For the case of the dummy calculation, however, the dummy calculation logic should operate almost all the time as described in Fig. 1(a) . Also, by using the dedicated hardware interleaver rather than the table-based interleaver, the power consumption in the interleaver is hugely reduced. Therefore, the proposed turbo decoder can reduce the power consumption by 26.0% compared to the conventional turbo decoder where the dummy calculation and table-based interleaver are adopted. The proposed double-binary tailbiting turbo decoder is based on the time-multiplex architecture consisting of only one SISO decoder and one interleaver like [10] . All the components are shared for both the first and the second SISO decoding of an iteration. The gate count of the proposed turbo decoder is 46,338 excluding memories. The buffers are implemented using single-port SRAMs, and small-sized ROM memories are replaced with logic circuitry. For processing a 2400-pair (4800-bit) frame, the proposed turbo decoder takes 39,537 cycles with eight iterations. Operating at 100 MHz, therefore, the data rate of the proposed turbo decoder is 12.14 Mbps.
VII. CONCLUSION
We have proposed an energy-efficient turbo decoding method based on border metric encoding, which is especially suitable for the non-binary circular turbo codes. By applying the proposed method, the size of branch memory is reduced by half and the dummy calculation that is complex in nonbinary turbo codes is completely removed at the cost of a small memory that holds encoded border metrics. Also, a dedicated hardware interleaver compatible with WiMAX standard is employed to reduce power and area further. Due to the proposed SISO decoder and interleaver, the memory size is reduced by 20.5% and power consumption by 26.0%. The double-binary tail-biting turbo decoder implemented for WiMAX supports 12.14 Mbps when operated at 100MHz. 
