Abstract
For high-mobility 4G applications of LTE-A and WiMAX-2 systems, this paper presents a dual-standard turbo decoder design built on three techniques. 1) Circular parallel decoding reduces decoding latency and improves the throughput rate. 2) A collision-free vectorizable dual-standard parallel interleaver enhances hardware utilization of the interleaving address generator. 3) A one-bank extrinsic buffer design with bit-level extrinsic information exchange reduces the size of the extrinsic buffer compared with the two-bank extrinsic buffer design. Furthermore, a multistandard turbo decoder chip is fabricated in a core area of 3.38 mm² using a 90 nm CMOS process. At its maximum measured operating frequency of 152 MHz, the chip achieves 186.1 Mbps for the LTE-A standard and 179.3 Mbps for the WiMAX-2 standard.
Introduction
Convolutional turbo code (CTC) has become a standard forward error correction (FEC) scheme for reliable wireless communications amid the rapid growth of multimedia services. Single-binary CTC (SB-CTC), proposed in 1993, achieves high data rates and coding gains close to the Shannon limit [1]. SB-CTC has been adopted in the 3rd Generation Partnership Project (3GPP) family of standards [2] as an FEC scheme because of its good correction performance. Non-binary CTC (NB-CTC) [3], introduced in 1999, has better coding performance than SB-CTC. In recent years, double-binary CTC (DB-CTC) was adopted in the Worldwide Interoperability for Microwave Access (WiMAX) family of standards [4] as an FEC scheme. Table 1 lists the detailed specifications and CTC schemes of the prevalent wireless standards for wide area networks (WANs).
Recently, there has been a rapidly growing demand for inexpensive and ubiquitous broadband wireless networks. Thus, the Long Term Evolution (LTE) and WiMAX standards have become prevalent for broadband wireless networks. Meanwhile, fourth generation (4G) cellular wireless communication, the term used for International Mobile Telecommunications-Advanced (IMT-Advanced) [5], is emerging in high-end broadband wireless devices. The 4G-compliant versions of LTE and WiMAX are LTE Advanced (LTE-A) and WirelessMAN-Advanced (WiMAX-2), respectively. To achieve a smooth migration between different applications, a CTC decoder that works across the two IMT-Advanced compliant standards is necessary. Hence, the goal of this work is to design a CTC decoder that can be used in high-mobility 4G communications. The features of the CTC decoder design are 1) to achieve the high-throughput requirement targeting the specifications of the LTE-A and WiMAX-2 standards shown in Table 1, 2) to achieve parallel decoding for the SB-CTC and DB-CTC schemes based on the LTE-A and WiMAX-2 standards, 3) to design a parallel and reconfigurable interleaver based on the computational similarity of the LTE-A and WiMAX-2 standards, and 4) to achieve an area-efficient and memory-efficient CTC decoder design for the LTE-A and WiMAX-2 standards.
In this paper, circular parallel maximum a posteriori probability (MAP) decoding is introduced to reduce decoding latency and hardware cost. The circular parallel MAP decoding yields a number of circular MAP decodings equal to its parallelism, at the cost of a small coding gain loss. Meanwhile, the collision-free vectorizable dual-standard parallel interleaver based on the almost regular permutation (ARP) structure and the quadratic polynomial permutation (QPP) structure is proposed to enhance hardware utilization of the interleaver. Then, the available parallelism of the collision-free CTC decoding is obtained for the LTE-A and WiMAX-2 standards. In order to increase the hardware usage of the CTC decoder, two area-efficient extrinsic buffer designs are described for the dual-standard CTC decoder. Using UMC 90 nm CMOS technology, the proposed CTC decoder chip for WiMAX-2 and LTE-A systems has been fabricated in a core size of 3.38 mm². A maximum throughput rate of 186.1 Mbps is measured at 152 MHz with a power consumption of 148.1 mW.
This paper is organized as follows. In Section 2, the fundamentals of turbo codes for the WiMAX-based and LTE-based standards are revisited. Section 3 describes the architecture design of the proposed CTC decoder for the dual standards. Section 4 presents the experimental results and comparison of the proposed CTC decoder for the LTE-A/WiMAX-2 standards. Finally, Section 5 concludes this paper.
Reviews of Turbo Codes
This section reviews the CTCs of the WiMAX-based and LTE-based standards.
Turbo Codec
The encoder and decoder of CTC are shown in Fig. 1. The encoder consists of two identical recursive systematic convolutional (RSC) encoders, which are connected in parallel by an interleaver. The RSC encoder sends a systematic symbol (u s ) and a parity symbol (u p ) to the channel. The information stream is reordered by the interleaver and then enters the second RSC encoder. The RSC encoders of SB-CTC encode 1 bit at a time, while the RSC encoders of DB-CTC encode 2 bits at a time. The RSC encoders of the LTE (SB-CTC) and WiMAX (DB-CTC) standards are shown in Fig. 2. For trellis termination, the RSC encoder in LTE uses a few redundant bits to terminate the trellis path. In contrast, the RSC encoder in WiMAX adopts two-phase circular encoding [3], in which the final trellis state is the same as the initial trellis state.
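As a concrete illustration of one constituent RSC encoder, the following minimal sketch encodes a bit stream with the 8-state single-binary code. The generator polynomials g0(D) = 1 + D^2 + D^3 (feedback) and g1(D) = 1 + D + D^3 (feedforward) are taken here as an assumption about the LTE constituent code; Fig. 2 remains the authoritative description. The function name `rsc_encode_lte` is hypothetical, and trellis termination is deliberately left out.

```python
def rsc_encode_lte(bits):
    """Encode bits with an 8-state RSC code G(D) = [1, g1(D)/g0(D)].

    Assumed polynomials: g0(D) = 1 + D^2 + D^3 (feedback) and
    g1(D) = 1 + D + D^3 (feedforward). Returns (systematic, parity).
    Trellis termination is omitted in this sketch.
    """
    s1 = s2 = s3 = 0  # shift register; s1 holds the most recent feedback bit
    systematic, parity = [], []
    for u in bits:
        a = u ^ s2 ^ s3          # feedback sum (taps D^2 and D^3)
        p = a ^ s1 ^ s3          # parity output (taps 1, D and D^3)
        systematic.append(u)
        parity.append(p)
        s1, s2, s3 = a, s1, s2   # shift the register by one position
    return systematic, parity


# Impulse response of the parity branch for a 7-bit input.
u_s, u_p = rsc_encode_lte([1, 0, 0, 0, 0, 0, 0])
print(u_s, u_p)
```

In a complete encoder, the second RSC encoder would run the same routine on the interleaved information stream.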
The decoder is composed of two soft-input soft-output (SISO) decoders that are serially concatenated by the interleaver and de-interleaver. Each SISO decoder uses the received systematic symbol and the corresponding parity symbol to compute extrinsic information, which is then iteratively fed to the other SISO decoder as the a priori information. The hard decision on the decoded symbols is made after several iterations between these two SISO decoders.
CTC Interleaver
The interleaver is one of the dominant modules of CTCs, and the correction performance of CTCs depends on the structure and length of the interleaver. The interleaver permutes the order of symbols. This can be done by using an interleaving address generator to access the symbols stored in a buffer. When c i denotes the normal-order symbol, the interleaved-order symbol c' π(i) can be represented as
where i=0, 1, …, (N -1) and N is the CTC block size. c π(i) is the normal-order symbol c i stored at interleaving pattern π(i) and π(i) is generated by interleaving address generator. The algorithms that generate π(i) are defined in each standard and described as follows. The WiMAX interleaver adopts the ARP structure. The interleaving address π W (j) is generated by switching (j) mod4 :
where j is an increasing value from 1 to N. The parameters P 0 , P 1 , P 2 , and P 3 defined in WiMAX depend on N. The function (x) mod y returns the remainder of x divided by y. The parameters P 0 , P 1 , P 2 , and P 3 can be stored in a lookup table (LUT). A stand-alone architecture of the WiMAX CTC interleaver is described in [6]. The LTE interleaver adopts the QPP structure. The interleaving address π L (j) is generated by
where j is an increasing value from 1 to N. The parameters f 1 and f 2 defined in LTE also depend on N. The parameters f 1 and f 2 can be stored in a LUT as well. A stand-alone architecture of the LTE CTC interleaver is described in [7].
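The two address generators can be sketched in a few lines. The sketch below assumes the commonly cited forms of the two permutations: for ARP, a base term P0·j + 1 plus an offset selected by j mod 4 (0, N/2 + P1, P2, N/2 + P3), and for QPP, (f1·j + f2·j²) mod N. It also assumes zero-based indexing (j = 0, …, N-1) and uses small illustrative parameter sets; the function names are hypothetical, and the standards remain the reference for the exact parameter tables.

```python
def arp_address(j, n, p0, p1, p2, p3):
    """ARP interleaving address; the offset is selected by j mod 4 (assumed form)."""
    offset = (0, n // 2 + p1, p2, n // 2 + p3)[j % 4]
    return (p0 * j + 1 + offset) % n


def qpp_address(j, n, f1, f2):
    """QPP interleaving address (f1*j + f2*j^2) mod n."""
    return (f1 * j + f2 * j * j) % n


# Illustrative parameters (assumptions): N = 24 with P0 = 5, P1 = P2 = P3 = 0
# for ARP, and N = 40 with f1 = 3, f2 = 10 for QPP.
arp = [arp_address(j, 24, 5, 0, 0, 0) for j in range(24)]
qpp = [qpp_address(j, 40, 3, 10) for j in range(40)]

# Both address sequences must visit every buffer location exactly once.
print(sorted(arp) == list(range(24)), sorted(qpp) == list(range(40)))
```

Only the per-block-size parameters differ between block lengths, which is why a LUT of (P0, P1, P2, P3) or (f1, f2) entries is sufficient to cover all block sizes.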
MAP Decoding for SB/DB CTC
The maximum a posteriori probability (MAP) algorithm [8] and its derivatives [9] [10] [11] are widely employed in CTC decoding. Furthermore, the enhanced Max-log-MAP (EML-MAP) proposed in [11] eases the hardware implementation of the MAP at the cost of a small coding gain degradation. In the following sections, we use the name MAP as an abbreviation of the EML-MAP. Given the received block sequence Y, the MAP assigns each decoded bit u k a probability that the bit is 1 or 0. This is equivalent to finding the a posteriori log-likelihood ratio (LLR),
Λ apo can be decomposed as follows
where α is the forward state metrics; β is the backward state metrics; γ is the branch metrics; k is the decoding time index; s and s' denote the state indices; Pr(y k |x k ) is the conditional received symbol probability; and Pr(u k ) is the a priori probability of decoded bit u k obtained from the other SISO decoder.
During the turbo decoding, extrinsic information is iteratively fed back to the other SISO decoder as a priori LLR. The extrinsic information Λ ex is formulated as
where Λ in and Λ apr are the intrinsic information and a priori LLR of this SISO decoding, respectively. In the EML-MAP, the extrinsic information is multiplied by a scaling factor δ (0<δ<1). For either the SB-CTC or DB-CTC decoding, the MAP is composed of branch metrics (γ), forward recursion state metrics (α), backward recursion state metrics (β), a priori LLR (Λ apr ), a posteriori LLR (Λ apo ), and extrinsic information (Λ ex ). For the dual-standard MAP decoding, the radix-4 SB MAP decoding and radix-4 DB MAP decoding are employed because of their similarity. The details of the radix-4 SB MAP decoding and radix-4 DB MAP decoding are given in [12].
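For orientation, the quantities named in this subsection can be written in the textbook single-binary Max-log-MAP form. The expressions below are a sketch under that assumption and use the symbols defined above; the exact numbered equations of this paper and the radix-4 formulation of [12] may differ in notation and grouping.

```latex
\begin{align}
\Lambda_{apo}(u_k) &= \ln\frac{\Pr(u_k = 1 \mid Y)}{\Pr(u_k = 0 \mid Y)} \\
\gamma_k(s', s) &= \ln \Pr(y_k \mid x_k) + \ln \Pr(u_k) \\
\alpha_k(s) &= \max_{s'} \bigl[\alpha_{k-1}(s') + \gamma_k(s', s)\bigr] \\
\beta_{k-1}(s') &= \max_{s} \bigl[\beta_k(s) + \gamma_k(s', s)\bigr] \\
\Lambda_{apo}(u_k) &\approx
  \max_{(s', s):\, u_k = 1} \bigl[\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\bigr]
  - \max_{(s', s):\, u_k = 0} \bigl[\alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s)\bigr] \\
\Lambda_{ex}(u_k) &= \delta \,\bigl[\Lambda_{apo}(u_k) - \Lambda_{in}(u_k) - \Lambda_{apr}(u_k)\bigr],
  \qquad 0 < \delta < 1
\end{align}
```

The scaling by δ in the last line is what distinguishes the enhanced Max-log-MAP from the plain Max-log-MAP: it compensates for the overestimation introduced by replacing the log-sum with a maximum.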
Window-Based MAP Decoding
The windowing technique proposed in [13] is used to reduce the memory cost of CTC decoders. The sliding window (SW) decoding [14] handles any CTC block size but has an intrinsically low throughput rate. Some VLSI architectures of the SW MAP decoding are described in [15] [16] [17] [18]. Figure 3 shows the timing chart of the warm-up SW MAP decoding, where the vertical and horizontal axes denote the decoding symbol and decoding time, respectively. The timing chart of MAP decoding is mainly composed of the branch metric acquisition, forward state metrics recursion, backward state metrics recursion, and a posteriori LLR acquisition. To obtain reliable state metrics at the window rim, a warm-up recursion over a basic window size W is performed. In general, W=4υ~6υ, where υ is the constraint length of the RSC encoder. The decoding latency of the warm-up SW MAP decoding is 3W. The HW MAP decoding described in [12] applies parallel decoding with parallelism P to decode one received block. Because of the warm-up processes of the forward state metrics, the decoding latency is prolonged to 4W. Nevertheless, the HW MAP decoding can shorten the decoding cycles to N/P+4W by working with several sub-blocks simultaneously.
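The effect of parallelism on the cycle count can be illustrated with a simple model. The sketch below assumes that one half-iteration takes (number of trellis steps)/P cycles plus a start-up latency expressed in window lengths, following the N/P+4W expression above; treating the SW decoding as the P=1 case with a 3W latency is an additional assumption, as are the function name and the example value W=32.

```python
def decoding_cycles(n, p, w, latency_windows):
    """Cycles per half-iteration under an assumed model:
    n/p recursion steps plus a start-up latency of latency_windows * w."""
    return n // p + latency_windows * w


n, w = 2400, 32                              # illustrative block and window size
sw_cycles = decoding_cycles(n, 1, w, 3)      # sliding window: N + 3W
hw_cycles = decoding_cycles(n, 8, w, 4)      # parallel decoding: N/P + 4W

print(sw_cycles, hw_cycles, round(sw_cycles / hw_cycles, 2))
```

The model also shows why the start-up latency matters more as P grows: the N/P term shrinks with the parallelism while the latency term stays fixed.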
Proposed CTC Decoder for LTE-A/WiMAX-2 Standards
In order to support the decoding of both WiMAX-based and LTE-based CTC schemes, our realization of the proposed dual-standard CTC decoder is based on the architecture shown in Fig. 4. The input buffers store the soft received data, including the systematic information, non-interleaved parity information, and interleaved parity information. Meanwhile, the internal buffer stores the extrinsic information and the output buffer stores the decoded hard bits. Because at most P warm-up free HW (WFHW) MAP processors work concurrently, each buffer is divided into P banks that can be accessed simultaneously. The proposed LTE/WiMAX parallel interleaver generates collision-free vectorizable addresses for the input buffer and extrinsic buffer. Therefore, the normal-order or interleaved-order data can be correctly exchanged between the WFHW MAP processors and the buffer sub-banks. When the targeted iteration number is reached or the hard bits of two half-iterations are the same, the CTC decoder finishes the decoding procedure and outputs the hard bits from the output buffer. The design techniques are presented in detail as follows.
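The stopping rule at the end of the decoding procedure can be sketched as a small control loop. The sketch assumes that the comparison is made between two consecutive half-iterations and that the hard bits are compared in the same (normal) order; `half_iteration` is a hypothetical callable standing in for one SISO pass, not part of the described hardware.

```python
def run_turbo_iterations(half_iteration, max_iterations):
    """Run half-iterations until the iteration budget is spent or two
    consecutive half-iterations agree on every hard bit.

    half_iteration(index) is a hypothetical callable that returns the hard
    bits (in normal order) produced by half-iteration number `index`.
    Returns (hard_bits, number_of_half_iterations_used).
    """
    previous = None
    for index in range(2 * max_iterations):
        current = half_iteration(index)
        if previous is not None and current == previous:
            return current, index + 1   # early stop: the decisions agree
        previous = current
    return previous, 2 * max_iterations


# Toy stand-in for the decoder: the decisions stop changing at the third pass.
history = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
bits, used = run_turbo_iterations(lambda i: history[min(i, len(history) - 1)], 6)
print(bits, used)
```

With a fixed budget of 6 iterations, the toy run stops after 4 half-iterations instead of 12, which is the power and latency saving the early-termination condition is meant to capture.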
Collision-Free Vectorizable WiMAX/LTE Parallel Interleaver
The CTC interleaving permutes symbols by means of an interleaving address generator that accesses symbols from the buffers. In parallel decoding, the P MAP processors may try to read from or write to the same memory bank simultaneously. Since a memory has a finite number of ports, such simultaneous access is not allowed. Without a careful analysis of the LTE-based and WiMAX-based CTC interleaving, memory collisions occur frequently and make parallel decoding unrealizable [19]. Finding a proper parallelism P has been discussed in [20]. The parallel interleaver is collision-free when it satisfies
where 0 ≤ t<W, and 0 ≤ j, k<P. The terms on both sides of (7) are the indices of the memory banks that are accessed by the j-th and k-th MAP processors at the t-th time instant. This inequality must hold at every time instant t for the decoding to be free of memory collisions.
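A brute-force check of the collision-free property is straightforward to write. The sketch below assumes one common formalization: processor j handles the indices jW, …, jW+W-1 with W = N/P, and the bank index of an address is the address divided by W (integer division). Under that assumption, the interleaver is collision-free if the P bank indices are pairwise different at every time instant. The QPP parameters (N=40, f1=3, f2=10) and the function names are illustrative assumptions.

```python
def qpp(i, n, f1, f2):
    """QPP interleaving address (f1*i + f2*i^2) mod n."""
    return (f1 * i + f2 * i * i) % n


def is_collision_free(pi, n, p):
    """True if, at every time t, the p processors access p different banks.

    Assumes processor j handles indices j*w .. j*w + w - 1 with w = n // p,
    and that the bank index of an address is address // w.
    """
    w = n // p
    for t in range(w):
        banks = {pi(t + j * w) // w for j in range(p)}
        if len(banks) != p:
            return False
    return True


n, f1, f2 = 40, 3, 10
print(is_collision_free(lambda i: qpp(i, n, f1, f2), n, 4))   # QPP example

# A hand-made permutation of 8 addresses that collides for P = 2:
# at t = 0 both processors address bank 0 (addresses 0 and 2).
bad = [0, 4, 1, 5, 2, 6, 3, 7]
print(is_collision_free(lambda i: bad[i], 8, 2))
```

Running such a check over all block sizes and candidate values of P is one way to tabulate the available parallelism of an interleaver.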
For an interleaver design, the complexity of the interleaving address generation is also critical. Each memory bank requires an address decoder to transform the global interleaving address into the local address of that memory bank. As the parallelism P increases, the duplication of address decoders leads to hardware inefficiency. A better solution is a vectorizable interleaver that satisfies
where 0 ≤ t<W, and 0 ≤ j<P. The equality implies that each MAP processor accesses data based on the same local address. Based on this vectorizable property, only one decoder is required. All memory banks can be merged into a single physical memory with data stored and fetched as vectors, as shown in Fig. 5.
A high-level simulation model is constructed to analyze the available parallelism for the LTE-based and WiMAX-based CTC decoding with 24 ≤ W ≤ 36. This yields P MAP processors and S decoding windows for each MAP processor. Table 2 lists the available parallelism achieving the collision-free and vectorizable interleaving. To design the proposed dual-standard CTC decoder, the parallelism P is set to 8 for the WiMAX-2 standard and 16 for the LTE-A standard.
Figure 6 shows the overall architecture of the proposed collision-free vectorizable dual-standard parallel interleaver and the CTC controller. The CTC controller provides control signals and initial parameters. To perform the radix-4 SB/DB MAP decoding, the proposed dual-standard address generators generate the WiMAX-based addresses or the LTE-based even addresses by adopting a hardware sharing technique. The additional LTE address generators generate the LTE-based odd addresses in the LTE modes. The LTE-based and WiMAX-based interleaving parameters (P0, TP1, TP2, TP2, P(0), H(0), J(0), and f 2 ) can be implemented in two parameter read-only memories (PROMs). The address decoder transforms the interleaving addresses into the collision-free addresses of the memory banks.
Table 3 shows the total gate counts of the proposed parallel dual-standard interleaver and CTC controller. Compared with the LTE address generators, the overhead of the proposed dual-standard address generators needed to support the WiMAX interleaving is about 1.2 K gates. Besides, the address decoders are a minor part of the design of the dual-standard parallel interleaver and CTC controller.
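The vectorizable property discussed in this subsection, namely that all MAP processors use the same local address at a given time instant, can be checked in the same brute-force manner as the collision-free property. The sketch assumes that the local address is the global address modulo W with W = N/P; the QPP parameters and the function names are illustrative assumptions.

```python
def qpp(i, n, f1, f2):
    """QPP interleaving address (f1*i + f2*i^2) mod n."""
    return (f1 * i + f2 * i * i) % n


def is_vectorizable(pi, n, p):
    """True if all p processors use one common local address at every time t.

    Assumes processor j handles indices j*w .. j*w + w - 1 with w = n // p,
    and that the local address inside a bank is address % w.
    """
    w = n // p
    return all(
        len({pi(t + j * w) % w for j in range(p)}) == 1
        for t in range(w)
    )


print(is_vectorizable(lambda i: qpp(i, 40, 3, 10), 40, 4))   # QPP example

# Swapping two addresses inside one bank keeps the banks distinct for P = 2
# but breaks the common local address at t = 0 (local addresses 1 and 0).
swapped = [1, 0, 2, 3, 4, 5, 6, 7]
print(is_vectorizable(lambda i: swapped[i], 8, 2))
```

When the check passes, the P words fetched at time t sit at one row of a single wide memory, which is why a single address decoder suffices.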
Circular Parallel MAP Decoding
To realize the circular parallel MAP decoding, we first introduce the methods for obtaining the initial forward and backward state metrics of each frame for the distinct trellis terminations described in Section 2.1. The initial forward and backward state metrics of each frame are obtained by two distinct methods for the WiMAX-based standard and the LTE-based standard. The method used in the WiMAX-based standard is circular encoding [21], which ensures that the ending trellis state equals the initial trellis state. Thus, the initial values of the forward state metrics α k 0 (s) and backward
where s' denotes the states that are not state 0. We then introduce the WFHW MAP decoding of the circular parallel MAP decoding. Figure 7(b) shows the overall timing chart of the WFHW MAP decoding, which is composed of the basic WFHW window shown in Fig. 7(a). The key concept is to obtain the initial rim forward and backward state metrics of the (k+1)-th iteration of the HW decoding from the final rim forward and backward state metrics of the k-th iteration. Instead of performing a warm-up recursion, each WFHW MAP processor obtains the initial rim state metrics of the current iteration by fetching the final rim state metrics of the previous iteration from the rim metrics cache (RMC). The initial forward and backward state metrics of the (k+1)-th iteration are thus determined by the final rim state metrics of the k-th iteration.
Figure 8 shows the circular trellis propagation of the backward recursion of the circular parallel MAP decoding as the iteration number increases. The forward recursion achieves the same effect in the opposite direction. Thus, the circular parallel MAP decoding results in P circular MAP decodings as the iteration number increases. Because the WFHW MAP processor fetches the initial state metrics from the RMC instead of performing a warm-up recursion, the decoding latency is reduced to 1W. The throughput rates are improved by 18.2 % for the N=2400 WiMAX-2 CTC decoding and by 15 % for the N=3072 LTE-A CTC decoding compared to the warm-up HW MAP decoding [12]. Figure 9 presents the floating-point simulation results of distinct HW MAP decodings for the N=2400 WiMAX-2 and N=6144 LTE-A CTC schemes with P=8 at fixed 6 iterations. The circular parallel MAP decoding greatly improves the coding gain compared with the no-warm-up HW MAP decoding.
Figure 12. BER performance of the two extrinsic buffer designs for WiMAX-2 CTC schemes by using the circular parallel MAP decoding at fixed 6 iterations.
Because the exchanged extrinsic information is less reliable, however, the circular parallel MAP decoding loses some coding gain, although the loss is less than 0.1 dB at a BER of 10^-5 compared with the warm-up HW MAP decoding. Figure 10 shows the block diagrams of the warm-up HW MAP processor and the WFHW MAP processor, which are based on the timing charts shown in Figs. 3 and 7, respectively. In order to achieve a high area usage, we apply the radix-4 SB/DB EML-MAP decoding modules [12] to both MAP processors. The six temporary terms of branch metrics calculated by the first-stage branch metrics unit (BMU S1) are stored in the branch metrics cache (BMC), and then the sixteen branch metrics are fetched by the second-stage branch metrics unit (BMU S2). Based on the basic window shown in Fig. 7(a), one set of BMU S1 and two sets of BMU S2 are required for the WFHW MAP decoding. The forward recursion processing element (RPA), warm-up backward recursion processing element (WRPB), and backward recursion processing element (RPB) are composed of radix-4 add-compare-select units (ACSUs). The forward traceback recursion processing element (TRPA), composed of radix-4 traceback units [23], is adopted to reduce the access power of the state metrics cache (SMC). Finally, an a posteriori LLR module (Lapo) computes the a posteriori LLR. Table 4 lists the area evaluation obtained by using the 90 nm CMOS process based on the quantization scheme given in Table 3 in [12] with W=36 and S=16. The RMC is evaluated together with the WFHW MAP processor because it is required to perform the CTC decoding correctly. The RMC is composed of two single-port SRAMs because each SRAM stores the rim state metrics of one half-iteration. Even though the hardware cost of the RMC is large, the WFHW MAP processor achieves an area reduction because the hardware costs of the WRPB and BMC are greatly reduced compared with the warm-up HW MAP processor. Thus, the proposed CTC decoder for WiMAX-2 and LTE-A standards adopts the circular parallel MAP decoding with the WFHW MAP processor shown in Fig. 10(b).
Figure 13. BER performance of the CTC decoding by using the prototyping CTC decoder chip.
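The hand-over of rim state metrics between iterations can be sketched as a simple neighbour exchange among the P sub-blocks. The indexing below is an assumption for illustration, not the actual RMC layout: `final_alpha[p]` is taken to be the forward metrics at the end of sub-block p and `final_beta[p]` the backward metrics at the start of sub-block p. With a circular (tail-biting) trellis the first and last sub-blocks are neighbours; with a terminated trellis the outer rims are left unset here so that the caller can initialize them from the known terminated states.

```python
def next_initial_metrics(final_alpha, final_beta, circular=True):
    """Derive the initial rim state metrics of iteration k+1 for each sub-block
    from the final rim state metrics of iteration k (assumed indexing).

    final_alpha[p]: forward metrics at the end of sub-block p
    final_beta[p]:  backward metrics at the start of sub-block p
    Returns (init_alpha, init_beta); entries are None where no neighbour exists.
    """
    p_count = len(final_alpha)
    init_alpha, init_beta = [], []
    for p in range(p_count):
        left, right = p - 1, p + 1
        if circular:
            # Wrap around so that sub-block 0 follows sub-block P-1.
            left, right = left % p_count, right % p_count
        init_alpha.append(final_alpha[left] if left >= 0 else None)
        init_beta.append(final_beta[right] if right < p_count else None)
    return init_alpha, init_beta


# Labels stand in for the per-state metric vectors of four sub-blocks.
alpha_k = ["a0", "a1", "a2", "a3"]
beta_k = ["b0", "b1", "b2", "b3"]
print(next_initial_metrics(alpha_k, beta_k, circular=True))
print(next_initial_metrics(alpha_k, beta_k, circular=False))
```

Because each sub-block starts from metrics that its neighbour has already refined in the previous iteration, the reliability of the rim metrics improves as the iterations proceed, without spending cycles on a warm-up recursion.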
Efficient WiMAX/LTE Extrinsic Buffer Design
Because of the radix-4 DB MAP decoding, three extrinsic information values are accessed in the extrinsic buffer in one cycle. The conventional radix-4 DB CTC decoder requires a 2400×3n ex SRAM to store the three extrinsic information values for the WiMAX standard, where n ex denotes the bit length of an extrinsic information value. Meanwhile, two extrinsic information values are accessed in the extrinsic buffer in one cycle because of the radix-4 SB MAP decoding. Thus, the conventional radix-4 SB CTC decoder requires a 3072×2n ex SRAM to store the extrinsic information values for the LTE standard. In order to increase the hardware usage of the extrinsic buffer, Fig. 11(a) shows an efficient two-bank extrinsic buffer design for the dual standards. Compared with a one-bank 3072×3n ex SRAM design, this two-bank extrinsic buffer design reduces the size of the extrinsic buffer. The two-bank extrinsic buffer design can disable the 2400×n ex SRAM in LTE mode to reduce the power consumption. To further increase the hardware usage of the extrinsic buffer, Fig. 11(b) shows an efficient extrinsic buffer design that uses the bit-level extrinsic information exchange method [22]. This method converts the extrinsic information of the radix-4 DB MAP decoding from symbol-level values to bit-level values, so that only two extrinsic information values are accessed in the extrinsic buffer. Thus, the extrinsic buffer for the dual standards can be implemented by a one-bank 3072×2n ex SRAM and achieves a 100 % utilization ratio in LTE mode.
Figure 15. Shmoo plot of chip testing, captured by an Agilent 93000 SOC Series Test System. The X-axis denotes core vdd from 0.9 V to 1.1 V, and the Y-axis denotes frequency from 70 MHz to 155 MHz. A light-gray block means that the chip passes the test, and a dark-gray block means that the chip does not pass the test.
Figure 12 illustrates the floating-point simulation results of the two extrinsic buffer designs for WiMAX-2 CTC schemes by using the circular parallel MAP decoding at fixed 6 iterations. The one-bank extrinsic buffer design loses less than 0.2 dB of coding gain at a BER of 10^-5 due to the bit-level extrinsic information exchange method for the WiMAX-2 CTC decoding. Compared to the aforementioned two-bank design, the one-bank design with the bit-level extrinsic information exchange method reduces the size of the extrinsic buffer by 28.1 % and eliminates the 2400×n ex SRAM. Table 5 lists the area evaluation obtained by using the 90 nm CMOS process with n ex =9. Compared with the two-bank extrinsic buffer design, the one-bank extrinsic buffer design achieves a low area cost and low power consumption. Hence, the proposed CTC decoder for WiMAX-2 and LTE-A standards adopts the efficient extrinsic buffer design with the bit-level extrinsic information exchange method.
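The buffer sizes quoted in this subsection can be cross-checked with a short calculation. The sketch assumes that the two-bank design consists of a 3072×2n ex bank plus the 2400×n ex bank that is disabled in LTE mode; this split is inferred from the text rather than stated explicitly, and the function name is hypothetical.

```python
def buffer_bits(n_ex):
    """Extrinsic buffer sizes in bits for the three designs (assumed split)."""
    conventional = 3072 * 3 * n_ex             # one bank sized for both standards
    two_bank = 3072 * 2 * n_ex + 2400 * n_ex   # shared bank + WiMAX-only bank
    one_bank = 3072 * 2 * n_ex                 # bit-level extrinsic exchange
    return conventional, two_bank, one_bank


conventional, two_bank, one_bank = buffer_bits(n_ex=9)
reduction = 100 * (two_bank - one_bank) / two_bank
print(conventional, two_bank, one_bank, f"{reduction:.1f} %")
```

Under this assumption the one-bank design is 28.1 % smaller than the two-bank design, which agrees with the reduction reported above; the ratio does not depend on n ex.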
CHIP Implementation and Comparisons
The design of the CTC decoder is simulated using a C-to-RTL flow with the quantization scheme given in Table 3 in [12]. The parameters of the proposed CTC decoder meet the targeted BER of 10^-5 for the WiMAX-2 and LTE-A standards. Figure 13 shows the simulated BER performance of the distinct CTC schemes decoded by the prototyping CTC decoder over additive white Gaussian noise (AWGN) channels with 6 iterations. The "ideal MAP" curves represent the floating-point CTC decoding with known initial trellis states and without the windowing technique. The "fixpoint" curves represent the fixed-point CTC decoding by the proposed CTC decoder for the LTE-A and WiMAX-2 standards.
Prototyping Chip Implementation and Measurement Results
The proposed CTC decoder has been implemented as an ASIC from Verilog HDL code synthesized with the standard cell library of the UMC 90 nm 1P9M CMOS process and packaged in a CQFP128 package. This prototyping decoder supports the WiMAX-2 and LTE-A CTC schemes. The chip implementation of the proposed CTC decoder occupies a core size of 3.38 mm² and contains 232.8 Kb of RAM. Figure 14 shows the die photo of the proposed CTC decoder and Table 6 summarizes this chip. The chip is measured by using an Agilent 93000 system-on-a-chip (SoC) Series Test System. The maximum measured operating frequency of the chip is 152 MHz. The number of active MAP processors in the distinct design modes is shown in Table 7. At 6 iterations, the chip achieves maximum throughput rates of 179.3 Mbps and 186.1 Mbps for WiMAX-2 and LTE-A, respectively. Besides, Fig. 15 shows the shmoo plot with core vdd from 0.9 V to 1.1 V and operating frequency from 70 MHz to 155 MHz. Then, Fig. 16 shows the measured power consumption at different operating frequencies with a core vdd of 1.1 V. This plot indicates that the power consumption is 148.1 mW at an operating frequency of 152 MHz and a core vdd of 1.1 V. Furthermore, to reduce the power consumption, the core supply voltage can be lowered from 1.1 V to 0.9 V. The measured maximum operating frequencies and power consumptions are shown in Fig. 17.
Comparisons
In Table 8, the proposed CTC decoder for LTE-A and WiMAX-2 standards is compared with other chip designs. The works in [24] and [26] perform the radix-2 SB MAP decoding and radix-4 DB MAP decoding for HSDPA and WiMAX systems, respectively. Since their throughput rate requirements are less than 20 Mbps, both of them employ only one MAP processor. To achieve throughput rates higher than 100 Mbps, the work in [27] employs 8 MAP processors and supports LTE CTC schemes. Our proposed CTC decoder employs 8 radix-4 SB/DB WFHW MAP processors with the proposed collision-free parallel interleaver for the dual-standard CTC schemes. It is hard to compare these chips directly since their coding parameters differ. Therefore, we use the normalized area efficiency (NAE) and normalized energy efficiency (NEE) as the performance indices. The NEE indicates how much energy a decoder chip consumes to process a hard bit in one iteration. The NAE indicates how many hard bits per mm² a decoder chip decodes for a single CTC block. To support the high-mobility 4G application of the LTE-A/WiMAX-2 CTC decoding, this chip achieves a high NAE of 0.36 bit/mm² with a low NEE of 0.13 nJ/bit/iteration.
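The two indices can be reproduced from the measured LTE-A figures under simple assumed definitions: NEE as power divided by throughput and iteration count, and NAE as decoded bits per clock cycle divided by core area. These formulas are assumptions chosen for illustration and do not include any technology scaling that the comparison in Table 8 may apply; the function names are hypothetical.

```python
def normalized_energy_efficiency(power_w, throughput_bps, iterations):
    """Energy per decoded bit per iteration in nJ (assumed definition)."""
    return power_w / (throughput_bps * iterations) * 1e9


def normalized_area_efficiency(throughput_bps, frequency_hz, area_mm2):
    """Decoded bits per clock cycle per mm^2 (assumed definition)."""
    return throughput_bps / frequency_hz / area_mm2


# Measured LTE-A operating point: 148.1 mW, 186.1 Mbps, 152 MHz,
# 6 iterations, 3.38 mm^2 core area.
nee = normalized_energy_efficiency(148.1e-3, 186.1e6, 6)
nae = normalized_area_efficiency(186.1e6, 152e6, 3.38)
print(f"NEE = {nee:.2f} nJ/bit/iteration, NAE = {nae:.2f} bit/mm^2")
```

With these definitions the measured operating point gives roughly 0.13 nJ/bit/iteration and 0.36 bit/mm², matching the values reported for this chip.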
Conclusion
In this paper, a turbo decoder chip supporting distinct block sizes of convolutional turbo code schemes for the high-mobility 4G applications of both LTE-A and WiMAX-2 systems is proposed. The circular parallel MAP decoding is introduced to achieve a high throughput rate and low hardware cost. The collision-free vectorizable dual-standard parallel interleaver is proposed to enhance the hardware usage. Two efficient extrinsic buffer designs are also described to increase the memory utilization. The CTC decoder chip for LTE-A/WiMAX-2 standards is fabricated to verify the proposed techniques. This decoder chip meets the data-rate requirements of both LTE-A and WiMAX-2 with a high area efficiency and a low energy consumption per decoded bit.
