We present an efficient VLSI architecture for a 3GPP LTE/LTE-Advance Turbo decoder that exploits the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. High-throughput 3GPP LTE/LTE-Advance Turbo codes require a highly parallel decoder architecture. The Turbo interleaver is known to be the main obstacle to decoder parallelism because of the collisions it introduces in memory accesses. The QPP interleaver solves the memory contention issues when several MAP decoders are used in parallel to improve Turbo decoding throughput. In this paper, we propose a low-complexity QPP interleaving address generator and a multi-bank memory architecture to enable parallel Turbo decoding. Design trade-offs in terms of area and throughput efficiency are explored to find the optimal architecture. The proposed parallel Turbo decoder has been synthesized, placed and routed in a 65-nm CMOS technology with a core area of 8.3 mm² and a maximum clock frequency of 400 MHz. This parallel decoder, comprising 64 MAP decoder cores, can achieve a maximum decoding throughput of 1.28 Gbps at 6 iterations.
Introduction

3GPP Long Term Evolution (LTE) [1], which is a set of enhancements to the 3G Universal Mobile Telecommunications System (UMTS) [2], has received tremendous attention recently and is considered to be a very promising 4G wireless technology. For example, Verizon Wireless has decided to deploy LTE in their next generation 4G evolution. One of the main advantages of 3GPP LTE is high throughput. For example, it provides a peak data rate of 326.4 Mbps for a 4 × 4 antenna system, and 172.8 Mbps for a 2 × 2 antenna system for every 20 MHz of spectrum. Furthermore, LTE-Advance [3], the further evolution of LTE, promises to provide up to 1 Gbps peak data rate.
The channel coding scheme for LTE is Turbo coding [4]. The Turbo decoder is typically one of the major blocks in an LTE wireless receiver. Turbo decoders suffer from high decoding latency due to the iterative decoding process, the forward-backward recursion in the maximum a posteriori (MAP) decoding algorithm, and the interleaving/deinterleaving between iterations [5] [6] [7]. Generally, the task of an interleaver is to permute the soft values generated by the MAP decoder and write them into random or pseudo-random positions.
A high throughput Turbo decoder can be realized by parallelizing several MAP decoders, where each MAP decoder operates on a segment of the received codeword [8]. Due to the randomness of the Turbo interleaver, two or more MAP decoders may access the same memory in the same clock cycle, which leads to a memory collision. As a result, the decoder has to be stalled, which delays the decoding process. The interleaver structures in the current 3G standards, such as CDMA/W-CDMA/UMTS, do not have a parallel structure. Although the memory stalls caused by the interleaver can be partially reduced by using write buffers [9], the stalls become more frequent as the parallelism degree increases. To solve this problem, the high data rate 3GPP LTE standard has adopted a contention-free, parallel interleaver called the quadratic permutation polynomial (QPP) Turbo interleaver [4]. From an algebraic-geometric perspective, the QPP interleaver allows analytical designs and simplifies hardware implementation of a parallel Turbo decoder [10]. Because it is based on permutation polynomials over integer rings, every factor of the interleaver length can serve as a contention-free parallelism degree for the decoder [10].
In the literature, many decoder architectures have been extensively investigated for 3G or 3G-like Turbo codes [11] [12] [13] [14] [15] [16] [17] [18] . Recently, several high speed Turbo decoders have been developed for 3GPP LTE standard [19] [20] [21] [22] . As a 4G candidate system, the 3GPP LTE-Advance system is pushing for 1 Gbps data rate. Thus, it is very important and challenging to design a Turbo decoder to support such a high data rate. In this paper, we propose an efficient hardware architecture for 3GPP LTE/LTE-Advance Turbo decoder. A low-complexity circuit is designed to generate the QPP interleaving addresses on the fly. By utilizing the QPP contention-free property, memory systems are partitioned into multiple banks to allow concurrent accesses by multiple MAP decoders. More than 1 Gbps data rate is feasible with the proposed decoding scheme. The rest of this paper is organized as follows. Section 2 reviews the fundamentals of Turbo codes. Section 3 describes the basic structure of the QPP interleaver and its several algebraic properties. Then we propose an online address generator for the QPP interleaver. In Section 4, two types of low-latency MAP decoder architectures are introduced and compared. By employing multiple MAP decoder cores and multiple QPP interleavers, we present a parallel Turbo decoder architecture in Section 5. Then the VLSI implementation results are summarized and compared with existing Turbo decoders.
Fundamentals of turbo codes
In order to explain the proposed parallel Turbo decoder architecture, the fundamentals of Turbo codes are briefly described in this section.
Turbo encoder structure
As shown in Fig. 1, the Turbo encoding scheme in the LTE standard is a parallel concatenated convolutional code with two 8-state constituent encoders and one quadratic permutation polynomial (QPP) interleaver [4]. The function of the QPP interleaver is to take a block of N-bit data and produce a permutation of the input data block. From the coding theory perspective, the performance of a Turbo code depends critically on the interleaver structure [23]. The basic LTE Turbo coding rate is 1/3. It encodes an N-bit information data block into a codeword with 3N+12 data bits, where 12 tail bits are used for trellis termination. The initial value of the shift registers of the 8-state constituent encoders shall be all zeros when starting to encode the input information bits. LTE has defined 188 different block sizes, 40 ≤ N ≤ 6144.
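As an illustration of the constituent encoder described above, the following minimal sketch computes the parity stream of one 8-state recursive systematic convolutional encoder. The generator polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3) are an assumption taken from the LTE specification rather than from the text above, and trellis termination is omitted; the function names are hypothetical.

```python
def rsc_parity(bits):
    """Parity stream of one 8-state recursive systematic convolutional encoder.

    Assumed generator polynomials (from the LTE specification, not from the
    surrounding text): feedback g0(D) = 1 + D^2 + D^3 and feedforward
    g1(D) = 1 + D + D^3. Trellis termination is not modelled.
    """
    s1 = s2 = s3 = 0                 # shift register starts in the all-zero state
    parity = []
    for u in bits:
        a = u ^ s2 ^ s3              # feedback taps at D^2 and D^3
        parity.append(a ^ s1 ^ s3)   # feedforward taps at 1, D and D^3
        s1, s2, s3 = a, s1, s2       # shift the register by one position
    return parity


def codeword_length(n):
    """Rate-1/3 mother code: three streams of n bits plus 12 tail bits."""
    return 3 * n + 12
```

For the largest block size, `codeword_length(6144)` gives 18444 coded bits.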
Turbo decoder structure
The basic structure of a Turbo decoder is functionally illustrated in Fig. 2. A Turbo decoder consists of two maximum a posteriori (MAP) decoders [24, 25] separated by an interleaver that permutes the input sequence. The decoding is an iterative process in which the so-called extrinsic information is exchanged between MAP decoders. Each Turbo iteration is divided into two half iterations. During the first half iteration, MAP decoder 1 is enabled. It receives the soft channel information (soft value $L_s$ for the systematic bit and soft value $L_p^1$ for the parity bit) and the a priori information $L_a^1$ from the other constituent MAP decoder through deinterleaving ($\pi^{-1}$) to generate the extrinsic information $L_e^1$ at its output. Likewise, during the second half iteration, MAP decoder 2 is enabled, and it receives the soft channel information (soft value $L_s$ for a permuted version of the systematic bit and soft value $L_p^2$ for the parity bit) and the a priori information $L_a^2$ from MAP decoder 1 through interleaving ($\pi$) to generate the extrinsic information $L_e^2$ at its output. This iterative process repeats until the decoding has converged or the maximum number of iterations has been reached. The MAP algorithm at each constituent MAP decoder computes the log-likelihood ratios (LLRs) of the a posteriori probabilities (APPs) for information bit $u_k$ as follows [7, 26]:
$$\Lambda(\hat{u}_k) = \max^*_{u:\,u_k=1}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\} - \max^*_{u:\,u_k=0}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\}$$
where $\alpha_k$ and $\beta_k$ denote the forward and backward state metrics, and are recursively computed as follows:
$$\alpha_k(s_k) = \max^*_{s_{k-1}}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\}$$
$$\beta_k(s_k) = \max^*_{s_{k+1}}\{\beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k, s_{k+1})\}$$
The $\gamma_k$ term above is the branch transition probability that depends on the trellis diagram, and is usually referred to as the branch metric. The max* operator employed in the above descriptions is defined as follows [26]:
$$\max^*(a, b) = \max(a, b) + \log(1 + e^{-|a-b|})$$
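For illustration, the max* operator used in the Log-MAP recursions is commonly written as max*(a, b) = max(a, b) + log(1 + e^(-|a-b|)), which equals log(e^a + e^b). The sketch below is a floating-point software model under that definition; the helper names are hypothetical, and a hardware decoder would typically replace the logarithmic correction term with a small lookup table.

```python
import math


def max_star(a, b):
    """Jacobian logarithm: log(exp(a) + exp(b)), computed in a stable form."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_n(values):
    """Fold max* over several metrics, e.g. all branches entering one state."""
    it = iter(values)
    acc = next(it)
    for v in it:
        acc = max_star(acc, v)
    return acc
```

Dropping the correction term turns `max_star` into a plain `max`, which gives the lower-complexity Max-Log-MAP approximation.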
QPP interleaver
Interleaving/deinterleaving of extrinsic information is a key issue that needs to be addressed to enable parallel decoding because memory access contention may occur when MAP decoders fetch/write extrinsic information from/to memory. The QPP interleaver defined in the new 3GPP LTE standard differs from previous 3G interleavers in that it is based on algebraic constructions via permutation polynomials over integer rings. It is known that permutation polynomials generate contention-free interleavers [27, 10] , i.e. every factor of the interleaver length becomes a possible parallelism degree.
Algebraic description of QPP interleaver
The QPP interleaver can be expressed via a simple mathematical formula. Given an information block length N, the x-th interleaving output position is specified by the quadratic expression [4]:
$$f(x) = (f_2 x^2 + f_1 x) \bmod N$$
where the parameters $f_1$ and $f_2$ are integers that depend on the block size N ($0 \le x, f_1, f_2 < N$). For each block size, a different pair of parameters $f_1$ and $f_2$ is defined. In LTE, all the block sizes are even numbers and are divisible by 4 and 8. Moreover, the block size N is always divisible by 16, 32, and 64 when N ≥ 512, 1024, and 2048, respectively. By definition, parameter $f_1$ is always an odd number whereas $f_2$ is always an even number. Through further inspection, we can list the following algebraic properties for the QPP interleaver. QPP interleaver algebraic property 1: f(x) has the same even/odd parity as x:
$$f(x) \bmod 2 = x \bmod 2$$
QPP interleaver algebraic property 2: The remainders of f(x)/4, f(x+1)/4, f(x+2)/4, and f(x+3)/4 are unique:
$$f(x+i) \bmod 4 \ne f(x+j) \bmod 4, \quad 0 \le i < j \le 3$$
QPP interleaver algebraic property 3:
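Property 1 can be easily verified since parameter $f_2$ is always even and parameter $f_1$ is always odd by definition. Property 2 can be shown through the following equations:
$$f(x+1) - f(x) \equiv f_2 (2x+1) + f_1 \pmod{N}, \qquad f(x+2) - f(x) \equiv 4 f_2 (x+1) + 2 f_1 \pmod{N}$$
Because N is divisible by 4, $f_1$ is odd, and $f_2$ is even, the first difference is odd and the second is congruent to 2 modulo 4, so the four remainders are distinct.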
Property 3 can be verified by
We will explain later that these algebraic properties are very useful in designing memory systems for parallel Turbo decoder.
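The QPP formula and properties 1 and 2 can be checked numerically with a short sketch. The (f1, f2) pairs below for N = 40 and N = 6144 are assumed from the LTE interleaver parameter table and are not given in the text above; the function names are hypothetical.

```python
def qpp(x, n, f1, f2):
    """QPP interleaver address f(x) = (f2*x^2 + f1*x) mod n."""
    return (f2 * x * x + f1 * x) % n


# (f1, f2) pairs assumed from the LTE interleaver parameter table.
PARAMS = {40: (3, 10), 6144: (263, 480)}


def check_properties(n):
    """Return (is_permutation, property_1_holds, property_2_holds) for block size n."""
    f1, f2 = PARAMS[n]
    addrs = [qpp(x, n, f1, f2) for x in range(n)]
    is_permutation = sorted(addrs) == list(range(n))
    # Property 1: f(x) and x have the same parity.
    prop1 = all(addrs[x] % 2 == x % 2 for x in range(n))
    # Property 2: four consecutive addresses have four distinct remainders mod 4.
    prop2 = all(
        len({addrs[(x + k) % n] % 4 for k in range(4)}) == 4 for x in range(n)
    )
    return is_permutation, prop1, prop2
```

Both properties rely only on f1 being odd, f2 being even, and N being divisible by 4, so the same check should pass for any LTE block size once its parameters are filled in.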
QPP contention-free property
In general, a Turbo interleaver/de-interleaver f(x) is said to be contention-free for a window size of L if and only if it satisfies the following constraint [10, 28, 29]:
$$\lfloor f(x + iL)/L \rfloor \ne \lfloor f(x + jL)/L \rfloor \tag{6}$$
where 0 ≤ x < L, 0 ≤ i, j < P (= N/L), and i ≠ j. The terms in (6) are essentially the memory indices that are concurrently accessed by the P MAP decoder cores. If these memory indices are unique during each read and write operation, then there are no contentions in memory accesses. Fig. 3 shows an example of the contention-free memory access scheme. It has been shown in [27, 10] that every factor of the interleaver length N becomes a possible interleaver parallelism that satisfies the contention-free requirement in (6). Table 1 summarizes the parallelism degrees (up to 64) for some of the LTE QPP interleavers.
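The contention-free condition can be tested exhaustively for a given block size and parallelism degree: for every offset x, the P concurrent addresses f(x + iL) must fall into P different memory banks, where the bank index is the address divided by L. The sketch below is a brute-force check of the interleaving direction only; the function names and the parameter pairs (assumed from the LTE parameter table) are not part of the original text.

```python
def qpp(x, n, f1, f2):
    """QPP interleaver address f(x) = (f2*x^2 + f1*x) mod n."""
    return (f2 * x * x + f1 * x) % n


def is_contention_free(n, f1, f2, p):
    """True if, for every offset x, the p concurrent accesses hit p distinct banks."""
    if n % p != 0:
        raise ValueError("p must divide the block size n")
    seg = n // p                                   # segment (window) length L
    for x in range(seg):
        banks = {qpp(x + i * seg, n, f1, f2) // seg for i in range(p)}
        if len(banks) != p:                        # two cores share a bank
            return False
    return True
```

For example, `is_contention_free(6144, 263, 480, 64)` checks the 64-way partition of the largest block.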
Hardware implementation of QPP interleaver
Based on the algebraic analysis in [27], the QPP interleaver is guaranteed to always generate a unique address, which greatly simplifies the hardware implementation. In MAP trellis decoding, the QPP interleaving addresses are usually generated in a consecutive order (with step size of d). By taking advantage of this fact, the QPP interleaving address can be computed in a recursive manner. Suppose the interleaver starts at $x_0$; we first pre-compute $f(x_0)$ as
$$f(x_0) = (f_2 x_0^2 + f_1 x_0) \bmod N$$
In the following cycles, as x is incremented by d, f(x+d) is computed recursively as follows:
$$f(x+d) = (f_2 (x+d)^2 + f_1 (x+d)) \bmod N = (f(x) + g(x)) \bmod N \tag{9}$$
where g(x) is defined as
$$g(x) = (2 d f_2 x + d^2 f_2 + d f_1) \bmod N$$
Note that g(x) can also be computed in a recursive manner:
$$g(x+d) = (g(x) + (2 d^2 f_2 \bmod N)) \bmod N \tag{12}$$
The initial value $g(x_0)$ needs to be pre-computed as
$$g(x_0) = (2 d f_2 x_0 + d^2 f_2 + d f_1) \bmod N$$
The modulo operation in (9) and (12) can be difficult to implement in hardware if the operands are not known in advance. However, by definition both f(x) and g(x) are less than N, so each sum is less than 2N and the modulo in (9) and (12) reduces to an addition followed by at most one subtraction of N. In the proposed method, three numbers need to be pre-computed: $(2 d^2 f_2) \bmod N$, $f(x_0)$, and $g(x_0)$. Fig. 4 shows a hardware architecture to compute the interleaving address f(x), where x starts from $x_0$ and is incremented by d on every clock cycle. For example, by setting d to 1, this circuit can generate interleaving addresses at each step of 1. If n consecutive interleaving addresses are required at each clock cycle, this circuit can be replicated n times with n different initial values: $x_0$, $x_0 + 1$, …, and $x_0 + n - 1$.
Fig. 3. An example of the contention-free interleaving, where a data block is divided into P = 4 segments (SEG 0-SEG 3) with equal length of L = N/P. The contention-free property requires that for a fixed offset x at each segment, the segment indices for the interleaving addresses $\lfloor f(x + iL)/L \rfloor$ (0 ≤ i ≤ P − 1) are unique so that they can be physically mapped to different memory modules.
The circuit in Fig. 4 can generate interleaving addresses in descending order as well by setting d to a negative number, e.g., d = −1. However, $g(x_0)$ needs to be recomputed for negative d. To be able to generate both forward and backward addresses using the same f(x) and g(x) functions, we now describe a method to generate the QPP interleaving address in descending order. By substituting x with x − d in (9) and reorganizing, we get
$$f(x-d) = (f(x) - g(x-d)) \bmod N \tag{14}$$
Similarly, substituting x with x − d in (12) and reorganizing, we get
$$g(x-d) = (g(x) - (2 d^2 f_2 \bmod N)) \bmod N \tag{15}$$
Based on (14) and (15), Fig. 5 shows a hardware architecture to compute the QPP address f(x) in descending order (backward generation), where x starts from $x_0$ and is decremented by d on every clock cycle. The three pre-computed values are the same as those in the forward QPP address generator (cf. Fig. 4).
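The forward and backward recursions can be modelled in software as follows. This is a behavioural sketch rather than a description of the circuits in Figs. 4 and 5: it assumes g(x) = (2·d·f2·x + d²·f2 + d·f1) mod N, which follows from expanding f(x + d) − f(x), and it replaces every modulo in the loop by one conditional correction of N. Function names are hypothetical.

```python
def qpp(x, n, f1, f2):
    """Direct evaluation of f(x) = (f2*x^2 + f1*x) mod n, used for initialisation."""
    return (f2 * x * x + f1 * x) % n


def g_init(x, n, f1, f2, d):
    """g(x) = f(x+d) - f(x) mod n = (2*d*f2*x + d^2*f2 + d*f1) mod n."""
    return (2 * d * f2 * x + d * d * f2 + d * f1) % n


def forward_addresses(n, f1, f2, x0, d, count):
    """Return [f(x0), f(x0+d), ...] using additions and conditional subtractions."""
    step = (2 * d * d * f2) % n                     # pre-computed constant
    f, g = qpp(x0, n, f1, f2), g_init(x0, n, f1, f2, d)
    out = []
    for _ in range(count):
        out.append(f)
        f += g                                      # f(x+d) = f(x) + g(x) mod n
        if f >= n:
            f -= n
        g += step                                   # g(x+d) = g(x) + step mod n
        if g >= n:
            g -= n
    return out


def backward_addresses(n, f1, f2, x0, d, count):
    """Return [f(x0), f(x0-d), ...] from the same three pre-computed values."""
    step = (2 * d * d * f2) % n
    f, g = qpp(x0, n, f1, f2), g_init(x0, n, f1, f2, d)
    out = []
    for _ in range(count):
        out.append(f)
        g -= step                                   # g(x-d) = g(x) - step mod n
        if g < 0:
            g += n
        f -= g                                      # f(x-d) = f(x) - g(x-d) mod n
        if f < 0:
            f += n
    return out
```

Note that the backward generator updates g before f, whereas the forward generator updates f first; this ordering is what allows both directions to share the same initial values.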
As can be seen from Figs. 4 and 5, the proposed QPP interleaver pattern generator consumes very few resources. The complexity of this circuit is an order of magnitude smaller than the previous 3G interleavers. For example, a circuit with about 30 K gate count is reported in [30] to generate the interleaving addresses for Turbo codes in the previous 3G standard (3GPP Release-4), and a UMTS hardware interleaver with 10.5 K gate count is presented in [31] .
MAP decoder architecture for LTE turbo codes
MAP decoder architectures have been studied by many researchers [24, 25, [32] [33] [34] [35] . Several factors, such as interleaver structure and sliding window scheme, must be considered when choosing an appropriate MAP decoder for LTE Turbo decoding. In this section we modify two low-latency MAP decoder architectures and propose a low-complexity QPP interleaving address generator to operate full-speed with the MAP decoder.
Due to the double recursion in the MAP decoding algorithm [7] , the MAP decoder suffers from high decoding latency. To reduce the decoding latency, the sliding window algorithm is often used [36] . However, the problem of the sliding window approach is the unknown backward (or forward) state metrics which are required in the beginning of the backward (or forward) recursion. We refer to the state metrics at sliding window length distance as stakes. These stakes can be estimated by using a training calculation [36] , which will result in an additional decoding delay depending on the training length. For LTE Turbo codes, we do not recommend this traditional sliding window method when the Turbo coding rate is high. Because many parity bits will be removed after the base Turbo code is punctured to a higher code rate, the training length has to be increased to accurately estimate the state metrics at those stakes which consequently delays the decoding process.
For LTE Turbo decoding, we suggest using a low-latency decoding method, referred to as the state metric propagation (SMP) method, where the state metrics at stakes are initialized with stakes from the previous iteration [37]. In the very first iteration, uniform state metrics can be used for initialization. This method avoids the training calculation by propagating the state metrics to the next iteration. This method is especially useful when the Turbo coding rate is high. Based on our simulation results, the performance degradation caused by the window truncation in the SMP method is smaller than that in the traditional training based sliding window method in the case of high Turbo code rate. To compare the decoding performance using these two sliding window algorithms for high rate LTE Turbo codes, we perform floating-point simulations using BPSK modulation over an AWGN channel. The LTE rate matching algorithm [4] is used for code puncturing. Fig. 6 shows the floating-point simulation result for a rate-0.95 Turbo code. Because of the high code rate, the maximum number of iterations is set to 10. In the figure, we show the block error rate (BLER) curves for the SMP based sliding window algorithm and the traditional training based sliding window algorithm. In the traditional training algorithm, we assume the training length is equal to the window length. As can be seen, the BLER performance of the SMP algorithm with window length W = 64 is better than that of the training algorithm with window length W = 64, and is close to that of the training algorithm with W = 96. The SMP algorithm with W = 96 and the training algorithm with W = 128 perform close to the optimal case when there is no window effect. Because of the good decoding performance and low decoding delay, we adopted the SMP algorithm in our Turbo decoder design.
The SMP based sliding window (SW) MAP algorithm (SW-MAP) has a window overhead of W (cf. Fig. 7(a)), which will lead to additional decoding delays. To eliminate this window overhead, we also consider a non-sliding window (NSW) based MAP algorithm (NSW-MAP), which is shown in Fig. 7(b). To be more general, we consider the case of decoding a segment of the code block where the segment length is L = N/P. In the SW algorithm, a sliding window is applied to the backward recursion where the stakes are initialized from the previous Turbo iteration. If the window length is W, (L/W) × 2 stakes need to be saved (note that MAP 1 can only be initialized with stakes from MAP 1, not from MAP 2, resulting in twice the amount of stake memory). In the NSW algorithm, no sliding window is applied to the backward recursions, so only the stakes at the end of the recursion need to be saved. It should be noted that the memory bandwidth of the NSW-MAP algorithm is higher than that of the SW-MAP algorithm since two LLRs are read and two LLRs are written in one cycle. When the decoder parallelism is high, i.e., P is large, the NSW-MAP algorithm has a throughput advantage over the SW-MAP algorithm. There are many other varieties of the MAP algorithm. See [38] for a thorough analysis of the MAP decoder architectures. In this paper, we primarily focus on these two simple but effective MAP algorithms, and we will present QPP interleaving address generator architectures for these two MAP algorithms. Fig. 8 shows the recommended SW-MAP decoder architecture.
QPP interleaving address generator for SW-MAP decoder
The SW-MAP decoder requires one set of $\alpha$ unit, $\beta$ unit, branch unit, and LLRC unit because of the single flow structure. It employs fully parallel add-compare-select-add (ACSA) [39] units to calculate the state metrics in the $\alpha$ and $\beta$ recursion processes. An SMP buffer is used to save the stakes for use in the next Turbo iteration. In the SW algorithm, the channel LLRs (systematic $L_s$ and parity $L_p$) are loaded from the symbol memory in sequential order. The a priori information LLR(in) is loaded from the LLR memory in sequential order for the first half iteration, and in interleaving order for the second half iteration. The soft information LLR(out) is written to the LLR memory in backward sequential order during the first half iteration, and in backward interleaving order for the second half iteration. To avoid loading the interleaved systematic LLR from the symbol memory during the second half iteration, we have modified the MAP algorithm to combine the systematic LLR with the extrinsic LLR in the first half iteration.
In this algorithm, the interleaving addresses must be generated during the second half iteration to provide read and write addresses to the LLR memory. In the SW algorithm, the read operation is in the forward direction, whereas the write operation is in the backward direction and always lags behind the read operation. Fig. 9(a) shows an example of the addressing scheme for W = 4 and $x_0 = 0$. Fig. 9(b) shows a hardware architecture for generating interleaving read/write addresses by using one forward QPP generator (cf. Fig. 4) and one last-in first-out (LIFO) buffer.
When the sliding window length is large, using a LIFO can be costly. We now propose another method to generate the interleaving write addresses. As depicted in Fig. 10(b), a forward QPP address generator and a backward QPP address generator are used to recursively generate the read address f(x) and the write address f(y), respectively. The initial values $f(x_0)$ and $g(x_0)$ for the forward QPP generator need to be pre-computed, whereas the initial values for the backward QPP address generator are obtained from (synchronized with) the forward QPP address generator every W cycles, and then a backward recursion is performed over the next W − 1 cycles to generate the next W − 1 write addresses. Fig. 10(a) gives an example of this algorithm for W = 4 and $x_0 = 0$.
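The LIFO-free scheme can be sketched in software as follows, for $x_0 = 0$ and d = 1. The model below is an assumption-laden simplification: it only reproduces the order of the read and write addresses (ascending reads, descending writes within each window) and does not model the cycle-level overlap or the write delay of the actual hardware. The function names are hypothetical, and the sequence length is assumed to be a multiple of W.

```python
def qpp(x, n, f1, f2):
    """Direct evaluation of f(x) = (f2*x^2 + f1*x) mod n, used only as a reference."""
    return (f2 * x * x + f1 * x) % n


def sw_addresses(n, f1, f2, w, length):
    """Return (reads, writes) for x = 0 .. length-1 with window length w.

    Reads follow f(0), f(1), ...; writes cover each window in descending
    order. Cycle-accurate timing and the write delay are not modelled.
    """
    step = (2 * f2) % n                  # (2*d^2*f2) mod n with d = 1
    f, g = 0, (f1 + f2) % n              # f(0) and g(0) for x0 = 0, d = 1
    reads, writes = [], []
    for x in range(length):
        reads.append(f)
        if x % w == w - 1:               # last read of a window
            bf, bg = f, g                # synchronise the backward generator
            writes.append(bf)
            for _ in range(w - 1):       # W-1 backward recursion steps
                bg = (bg - step) % n     # g(x-1) = g(x) - step
                bf = (bf - bg) % n       # f(x-1) = f(x) - g(x-1)
                writes.append(bf)
        f = (f + g) % n                  # forward recursion for the next read
        g = (g + step) % n
    return reads, writes
```

With W = 4, the write order produced is f(3), f(2), f(1), f(0), f(7), f(6), and so on, without storing any addresses in a buffer.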
QPP address generator for Radix-4 SW-MAP decoder
Radix-4 MAP decoding [13, 34] is a commonly used technique to achieve a higher trellis processing speed. For binary Turbo codes, e.g., LTE Turbo codes, the number of trellis cycles can be reduced by 50% with Radix-4 processing. In Radix-4 processing, during the second half iteration two LLRs for the information bit vector $\{u_x, u_{x+1}\}$ need to be fetched/written from/to the LLR memory at addresses f(x) and f(x+1). Thus, two read and two write interleaving addresses need to be generated in each clock cycle. Fig. 11(a) shows an example of the read/write addressing scheme where a sequence is partitioned into even and odd sub-sequences. Fig. 11(b) shows a hardware architecture to generate the interleaving read and write addresses for the Radix-4 SW-MAP decoder. Two forward QPP address generators (with step d = 2) are used to generate the interleaving read addresses, and two backward QPP address generators (with step d = 2) are used to generate the interleaving write addresses. Based on the QPP algebraic property 1, the LLR memory can be partitioned into even and odd indexed banks to avoid collisions.
QPP address generator for NSW-MAP decoder
In the NSW algorithm, forward and backward recursions are performed simultaneously by processing data from both ends of the sub-trellis. After the middle point, soft LLRs are calculated in both forward and backward directions. Fig. 12 shows the NSW-MAP decoder architecture. Note that the NSW-MAP decoder requires two branch metric calculation units and two LLR calculation (LLRC) units because of the double-direction data processing. Fig. 13(a) shows the forward/backward data flow in the NSW-MAP decoding process. Because both the forward and the backward processes need to access memory, we propose to use a two-phase memory accessing scheme to support double-direction data processing. As shown in Fig. 13(b), in phase 0, the forward MAP process is allowed to read two data at addresses f(x) and f(x+1) from the LLR memory. In the next clock cycle (phase 1), the backward MAP process is allowed to read two data at addresses f(y) and f(y−1) from the LLR memory. This process then repeats. The write operation follows the same scheme as the read operation, and the write address is just a delayed version of the read address. The number of delay cycles depends on the pipeline delay of the LLRC unit in the MAP decoder, which is typically several clock cycles. Fig. 13(c) shows a hardware architecture to implement this two-phase memory accessing algorithm, where the LLR memory is partitioned into even and odd indexed banks to avoid collisions.
QPP address generator for Radix-4 NSW-MAP decoder
The two-phase memory accessing scheme shown in Fig. 13(b) can be extended to support Radix-4 NSW-MAP decoding as well, where four addresses f(x), f(x+1), f(x+2), and f(x+3) need to be generated in each clock cycle. Based on the QPP algebraic property 2, the memory can be partitioned into four banks to allow concurrent memory accesses in each clock cycle without any collisions. Fig. 14 shows a hardware architecture for generating interleaving addresses for the Radix-4 NSW-MAP decoder. Table 2 compares the resource usage and decoding latency for a SW-MAP decoder and a NSW-MAP decoder, in which W is the sliding window length in the SW algorithm, L is the segment length L = N/P, and $B_\alpha$ and $B_\gamma$ are the total bit widths for the $\alpha$ state metrics (8 states in total) and the $\gamma$ branch metrics, respectively.
MAP decoder comparison
To compare the area for these two types of MAP decoder architectures, we have synthesized them in a TSMC 65-nm CMOS technology for a 400 MHz clock frequency. The fixed-point word lengths for the channel LLRs, extrinsic LLRs, and state metrics are 6, 7, and 10, respectively [40]. For the SW-MAP architecture, the sliding window length W is assumed to be 64. Consider decoding a segment of a code block where the code length is N = 6144 and the segment length is L = N/P. Fig. 15 shows the area cost for these two types of MAP decoders. As can be seen, as the decoder parallelism P increases, the area cost of the NSW-MAP decoder reduces quickly and comes closer to the area cost of the SW-MAP decoder.
To compare the efficiency of these two architectures, we define an efficiency metric as area × time, or AT, where area is the area of one MAP decoder and time is the processing time for a sub-trellis for half a Turbo iteration. Fig. 16 plots the AT complexities for different P, where the AT value is displayed on a logarithmic scale. Clearly, when the parallelism degree P is small, the NSW-MAP architecture has a higher AT complexity than the SW-MAP architecture because a large number of state metrics have to be buffered. On the other hand, as P increases, the NSW-MAP architecture becomes more efficient because the double-flow NSW-MAP decoding has no sliding window overhead, whereas the single-flow SW-MAP decoding has a sliding window overhead of W/(N/P + W). As a design tradeoff, we adopted the SW-MAP architecture in our final hardware implementation to save area while still achieving 1 Gbps throughput.
One observation is that the Radix-4 transform can effectively reduce the AT complexity of the NSW-MAP decoder when P is small. However, the Radix-4 transform will not necessarily reduce the AT complexity of the SW-MAP decoder. This is due to the fact that the Radix-2 decoder can run at a faster clock frequency, and has a lower complexity than the Radix-4 decoder (assuming full LogMAP implementation). We will compare the Radix-2 and the Radix-4 architectures in more detail in the next section.
Parallel turbo decoder architecture
Decoder parallelism is necessary to achieve the LTE/LTE-Advance high throughput requirement, which is up to 1 Gbps. In order to increase the throughput by a factor of P, an information block can be divided into P segments with equal length L, and then each segment is processed independently by a dedicated MAP decoder [14, 19, 33, [40] [41] [42] [43] [44] [45] [46] [47]. In this scheme, each of the P MAP cores processes the data sequentially and fetches/writes the data simultaneously, always at the same offset x within each segment. The interleaver structures in the current and previous 3G standards do not have a parallel structure, which makes it difficult to parallelize the MAP decoders. Expensive write buffers have to be used to reduce the memory collisions caused by the interleaver [9, 48]. However, when the parallelism degree increases, the collisions cannot be effectively resolved by using write buffers. The LTE QPP interleaver, in contrast, has an inherent parallel structure that supports contention-free memory accesses, which results in a large design space for the selection of appropriate levels of decoder parallelism.
In this section, we will present a scalable parallel Turbo decoder architecture and give an analysis of the complexity and the throughput. Fig. 18 illustrates the proposed parallel decoding algorithm where multiple MAP decoders are used to improve the throughput. Fig. 19 shows a hardware architecture for implementing the proposed parallel SW-MAP algorithm. In this architecture, P sets of QPP interleavers are used to generate the interleaving addresses f(x), f(x+L), …, and f(x+(P−1)L) concurrently, where L is the segment length L = N/P. Based on the QPP contention-free property, these P addresses will be mapped to different memory modules 0 to P − 1 without any collisions. Thus, no write buffers are required. A crossbar network is used to permute the data between the MAP decoders and the memory modules. Furthermore, based on the QPP interleaver algebraic property 3, this architecture can be modified to support the Radix-4 SW and NSW MAP decoding algorithms by setting the following constraints. To support the Radix-4 SW-MAP decoding, L needs to be divisible by 2, and each memory module needs to be partitioned into even and odd indexed banks. To support the Radix-4 NSW-MAP decoding, L needs to be divisible by 4, and each memory module needs to be partitioned into four banks.
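The mapping from the P concurrent interleaving addresses to memory modules can be sketched as below. It is an illustrative model, not the crossbar design of Fig. 19: the module index is assumed to be the address divided by L and the address inside the module the remainder. A side observation, which follows from the algebra (f(x + iL) − f(x) is a multiple of L) rather than from the text, is that all P cores then use the same intra-module address in a given cycle. Names and parameter values are assumptions.

```python
def qpp(x, n, f1, f2):
    """QPP interleaver address f(x) = (f2*x^2 + f1*x) mod n."""
    return (f2 * x * x + f1 * x) % n


def parallel_access(n, f1, f2, p, x):
    """For offset x, return one (module, local_address) pair per MAP core.

    Core i works on segment i and needs address f(x + i*L), with L = n // p.
    The list of module indices is the permutation a crossbar would apply.
    """
    seg = n // p
    accesses = []
    for i in range(p):
        addr = qpp(x + i * seg, n, f1, f2)
        accesses.append((addr // seg, addr % seg))
    return accesses
```

For instance, `parallel_access(6144, 263, 480, 64, 0)` lists the module selected by each of the 64 cores in the first cycle of the second half iteration.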
Throughput-area tradeoff analysis
High throughput is achieved by using multiple MAP decoders and multiple memory modules/banks. In this section, we will analyze the impact of parallelism on throughput and area. Fig. 20 plots the area and the maximum throughput for different parallelism degrees and clock rates. As can be seen, a 1 Gbps throughput is achievable with 64 Radix-2 MAP decoder cores running at 310 MHz clock frequency or 32 Radix-4 MAP decoder cores running at 250 MHz clock frequency.
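A simple throughput model that is consistent with the figures quoted in this paper can be sketched as follows. It is an assumption, not the authors' stated formula: each half iteration is taken to cost N/P + W cycles for a Radix-2 SW-MAP core (halved for Radix-4), pipeline latency is ignored, and W = 64 with 6 iterations is assumed for all operating points. The function name is hypothetical.

```python
def max_throughput_bps(n, p, w, f_clk_hz, iterations, radix=2):
    """Estimated decoded bits per second for a parallel SW-MAP Turbo decoder.

    Assumed model: one half iteration takes (n/p + w) trellis cycles for
    Radix-2 and half as many for Radix-4; a full iteration is two half
    iterations. Pipeline and memory latency are not included.
    """
    if radix not in (2, 4):
        raise ValueError("radix must be 2 or 4")
    seg = n // p                                   # segment length L
    cycles_per_half_iter = (seg + w) / (radix // 2)
    return n * f_clk_hz / (2 * iterations * cycles_per_half_iter)
```

Under these assumptions, 64 Radix-2 cores at 400 MHz give 1.28 Gbps for N = 6144, and the 310 MHz Radix-2 and 250 MHz Radix-4 operating points come out at about 1 Gbps.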
For a parallel Turbo decoder that consists of multiple MAP units, the MAP units tend to dominate the silicon area, especially when the parallelism is high. From Fig. 20, we can see that given the same throughput target, the Radix-2 architecture provides a lower area cost than the Radix-4 architecture for most of the cases and especially when P is large. This is mainly due to the fact that the Radix-2 MAP unit can run at a faster clock frequency, and has a lower complexity than the Radix-4 MAP unit (assuming full LogMAP implementation). However, it should be noted that the Radix-2 decoder may need a higher partitioning of the code block than the Radix-4 decoder to achieve the same throughput target. As a design tradeoff, we adopted the Radix-2 architecture in our final hardware implementation to save area while still meeting the 1 Gbps throughput target.
Fig. 21. VLSI layout view which shows the core area of the decoder.
VLSI implementation result
A highly-parallel 3GPP LTE/LTE-Advance Turbo decoder, which consists of 64 Radix-2 SW-MAP decoder cores, has been synthesized, placed and routed for a 1.0 V 8-metal layer TSMC 65 nm CMOS technology. The decoder has scalable parallelism. The decoder can employ 64, 32, and 16 MAP units when the block size N ≥ 2048, 1024, and 512, respectively. For small block sizes N < 496, the decoder can use up to 8 MAP cores. Fig. 21 shows the top layout view of this ASIC, which shows the core area of this decoder. The fixed-point bit precisions are as follows: the channel symbol LLRs for systematic and parity bits are represented with 6-bit signed numbers, the internal $\alpha$ and $\beta$ state metrics are represented with 10-bit unsigned numbers (modulo normalization), and the extrinsic LLRs are represented with 8-bit signed numbers. Based on the fixed-point simulation result, the finite word-length implementation leads to negligible BER performance degradation compared with the floating-point representation. The maximum achievable clock frequency is 400 MHz based on the post-layout simulation. The corresponding maximum throughput is 1.28 Gbps (at 6 iterations) with a core area of 8.3 mm².
Comparison with existing turbo decoders
In this section, we compare the proposed Turbo decoder with existing Turbo decoders from [42, 43, 19] , and [22] . In [42] , a parallel Turbo decoder based on 7 MAP decoders is presented. In order to avoid memory contention, a custom designed interleaver, which is not standard compliant, is used. In [43] , a 3G-compliant parallel Turbo decoder based on the row-column permutation interleaver is introduced. In [19] , a 188-mode Turbo decoder chip for 3GPP LTE standard is presented. In this decoder, 8 MAP units are used to achieve a maximum decoding throughput of 129 Mbps (at 8 iterations). In [22] , a Radix-4 Turbo decoder is proposed for 3GPP LTE and WiMax standards. A maximum throughput of 186 Mbps is supported by employing 8 MAP units (at 8 iterations). Table 3 summarizes the implementation result of the proposed decoder and the hardware comparison with existing decoders. As can be seen, the proposed decoder supports 3GPP LTE-Advance throughput requirement (1 Gbps) at a small area cost, and achieves a good energy efficiency.
Conclusion
We have presented a highly-parallel architecture for the decoding of 3GPP LTE/LTE-Advance Turbo codes. Based on the algebraic constructions, the QPP interleaver offers contention-free memory accessing capability, which enables parallel Turbo decoding by using multiple MAP decoders working concurrently. We proposed a low-complexity recursive architecture for generating the QPP interleaver addresses on the fly. The QPP interleavers are designed to operate at full speed with the MAP decoders. The proposed architecture has scalable parallelism and can be tailored for different throughput requirements. With this architecture, a throughput of 1.28 Gbps is achievable with a core area of 8.3 mm² in a 65-nm CMOS technology.
