Abstract-This work proposes a VLSI decoding architecture for concatenated convolutional codes. The novelty of this architecture is twofold: 1) the possibility to switch on-the-fly from the Universal Mobile Telecommunication System turbo decoder to the WiMax duo-binary turbo decoder with a limited resource overhead compared to a single-mode WiMax architecture; and 2) the design of a parallel, collision-free WiMax decoder architecture. Compared to two single-mode solutions, the proposed architecture achieves a complexity reduction of 17.1% and 27.3% in terms of logic and memory, respectively. The proposed flexible architecture has been characterized in terms of performance and complexity on a 0.13-μm standard cell technology, and sustains a maximum throughput of more than 70 Mb/s.
I. INTRODUCTION

In the last few years, several standards have been proposed for reliable transmission of data over wireless channels (e.g., [1], [2]). To cope with the severe transmission environments typical of wireless systems, channel codes ought to be adopted. Turbo codes [3] are among the best-performing channel codes, and are still a major topic of interest in the scientific literature. Recent works dealing with turbo decoder implementation mainly focus on three aspects:
1) The design of VLSI architectures for duo-binary turbo codes [4]-[6].
2) The design of flexible architectures able to support multiple codes [7]-[9].
3) The design of parallel decoders to sustain very high throughput (tens or hundreds of megabits per second), where the interleaver parallelization is particularly challenging, due to the problem of collisions in memory access [10]-[12].
Though current scaled CMOS technologies allow clock frequencies of several hundreds of megahertz to be reached, parallelization is still an effective methodology to achieve high throughputs and to approach the long-term objective of 1 Gb/s in wireless communications. Furthermore, in high-throughput application-specific integrated circuit (ASIC) design, the adoption of lower frequency parallel architectures instead of higher frequency serial ones is an effective method to combat unreliability and reduce nonrecurring costs.
This work presents a high-performance turbo decoder architecture that addresses parallelization, flexibility, and duo-binary implementation issues while keeping the complexity as low as possible, and achieves a throughput of several tens of megabits per second. Implementation results on a 0.13-μm standard cell technology show that the complexity overhead required to support both Universal Mobile Telecommunication System (UMTS) and WiMax is limited, compared to a single-mode WiMax decoder architecture. Moreover, the proposed architecture yields noteworthy complexity reductions compared to a dual-mode architecture in which no sharing technique is employed. The rest of the paper is organized as follows. In Section II the decoding algorithm is briefly recalled. Section III presents a reference architecture. In Section IV the design of the low-complexity interleaver employed in our architecture is addressed, whereas Section V deals with the employed design strategies. Finally, Section VI shows the experimental results and Section VII draws some conclusions.
II. DECODING ALGORITHMS
Both the UMTS and the WiMax turbo codes are based on the parallel concatenation of two 8-state convolutional codes (CCs). However, the constituent code used in UMTS is a single-binary systematic CC, whereas that used in WiMax is a duo-binary circular recursive systematic CC. At the decoder side, the soft-in-soft-out (SISO) module [13] executes the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [14], usually in its logarithmic form [15]. Each SISO module receives the intrinsic log-likelihood ratios (LLRs) of coded symbols from the channel and outputs the LLRs of information symbols. The two SISO modules exchange extrinsic LLRs by means of interleaving memories [Fig. 1(a)]. The $\max^*$ function [15] is implemented as a $\max$ followed by a correction term stored in a small look-up table (LUT) [16]. The correction term, usually adopted when decoding binary codes, can be omitted for duo-binary turbo codes [4].
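The max-plus-correction operator mentioned above (commonly written max*) can be illustrated with a minimal Python sketch. The exact form is the Jacobian logarithm; the table size, the quantization step, and the function names are hypothetical choices made only for illustration and are not taken from the paper or from [16].

```python
import math

def max_star_exact(a, b):
    """Jacobian logarithm: equals log(exp(a) + exp(b))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Hypothetical LUT: 8 entries, step 0.5, sampled at the center of each bin.
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-(i + 0.5) * LUT_STEP)) for i in range(8)]

def max_star_lut(a, b):
    """max followed by a table-based correction term."""
    index = int(abs(a - b) / LUT_STEP)
    # Beyond the table range the correction is small and is taken as zero.
    correction = LUT[index] if index < len(LUT) else 0.0
    return max(a, b) + correction

def max_log(a, b):
    """Correction term omitted (the option mentioned for duo-binary codes)."""
    return max(a, b)
```

With these assumed table parameters the LUT version stays within about 0.13 of the exact value, while the correction-free version never exceeds it.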
It is worth pointing out that in binary turbo codes, at each trellis step the SISO outputs only one extrinsic LLR, whereas in duo-binary turbo codes the SISO produces three extrinsic LLRs; thus, in general, the LLR terms are vectors. The term $b(e)$ in (1) is defined as
$$b(e) = \alpha_{k-1}[s^S(e)] + \gamma_k[e] + \beta_k[s^E(e)] \qquad (5)$$
where $s^S(e)$ and $s^E(e)$ are the starting and the ending states of the trellis transition $e$, and $\alpha_{k-1}[s^S(e)]$ and $\beta_k[s^E(e)]$ are the forward and backward metrics associated to $s^S(e)$ and $s^E(e)$, respectively [see Fig. 1(b)]. The term $\gamma_k[e]$ in (5) is computed as a weighted sum of the LLRs produced by the soft demodulator, as in (6), taken over the bits forming the coded symbols associated to $e$. On the other hand, the information-symbol contribution reduces to a single expression for a binary turbo code, whereas for a duo-binary turbo code the terms are piecewise functions.
For further details on the theoretical aspects, the reader can refer to [13] .
III. REFERENCE ARCHITECTURE
The throughput of a turbo decoder, defined as the number of decoded bits over the time required to perform the decoding operations, can be roughly estimated as in (8), whose parameters are the number of SISOs instantiated in the decoder, the number of iterations, the SISO latency, the clock frequency, and a term that takes different values for binary and duo-binary turbo decoders. In a windowed architecture, the SISO latency is directly related to the window size, which sets the number of clock cycles required for computing both the $\alpha$ and $\beta$ values. If boundary metrics calculated at the previous iteration on the neighboring windows are used to initialize the $\alpha$ and $\beta$ recursions, as suggested in [17], the SISO latency can be estimated accordingly. With typical parameter values, the throughput obtained with a single SISO is more than sufficient to meet the UMTS requirement. On the other hand, the WiMax standard requires a throughput close to 70 Mb/s for the maximum block length, which a single SISO does not reach with the same parameters. As a consequence, the WiMax turbo decoder ought to be implemented as a parallel architecture for the same clock frequency. A large part of the decoder area is devoted to the interleaver memory and the SISO modules, so these blocks can be effectively shared between the two decoders.

The general SISO architecture is shown in Fig. 3. The $\alpha$ processor implements (3) and the $\beta$ processor implements (4) on two consecutive windows of data. Both the $\alpha$ and $\beta$ processors compute in parallel all the new state metrics (SMs) (see the SM processing block in Fig. 3). Since the $\alpha$ processor works in direct order on the input data, whereas the $\beta$ processor computes them in reverse order, two branch metric units (BMUs) are placed before the $\alpha$ and $\beta$ processors. Each BMU combines the extrinsic and channel information and obtains in parallel the BMs associated to the current trellis section. As a consequence, a local buffer (BMU-MEM) is required to store extrinsic information and channel transition information values.
The output processor generates the extrinsic values according to (1), receiving the $\beta$ values directly from the $\beta$ processor and loading the $\alpha$ values from a local buffer ($\alpha$-MEM).
It is known that LLRs are commonly used in binary turbo code decoders. In duo-binary turbo decoders, by contrast, the use of logarithmic probabilities (LPs) instead of LLRs saves a certain amount of logic in the SISO architecture [5]. However, the use of 4 LPs instead of 3 LLRs has a negative impact on both the interleaver memory and the BMU-MEM footprint. In order to select the most suitable approach, we implemented both the LLR-based SISO (SISO-LLR) and the LP-based SISO (SISO-LP) in VHDL and synthesized them on a 0.13-μm standard cell technology. Moreover, we generated the dual-port SRAMs that implement the interleaver memory for both the SISO-LLR (2p-LLR) and the SISO-LP (2p-LP), and the single-port SRAMs that implement the BMU-MEM as a "ping-pong" buffer in both cases (1p-LLR and 1p-LP). Fig. 2 shows the complexity growth of SISO-LLR, SISO-LP, 2p-LLR, 2p-LP, 1p-LLR, and 1p-LP, in μm², as a function of the number of bits used to represent the LLRs or the LPs. Over the explored range, the SISO-LLR complexity is slightly larger than that of the SISO-LP. However, the amount of memory required by a SISO-LP-based decoder grows faster than that of a SISO-LLR-based one. In Fig. 2 the complexity of a single SISO decoder, including the interleaver, is also shown. Further experiments show that the overhead required by the LP-based decoder with respect to the LLR-based one can decrease from 7.6% to 2.2%. However, the LLR-based decoder is still less complex. As a consequence, in the following the LLR-based decoder architecture is described according to the formalism introduced in Section II. This choice implies that each BMU-MEM word is made of 4 channel LLRs and 3 extrinsic LLRs.
IV. LOW COMPLEXITY INTERLEAVER DESIGN
Since the proposed architecture achieves the throughput required by UMTS with a single SISO, the UMTS interleaver parallelization is not addressed in this work. In order to reduce as much as possible the complexity of the UMTS permutation generator, we implemented the two step architecture detailed in [18] , which is very similar to that proposed in [19] .
On the other hand, a parallel decoder is required to achieve the WiMax throughput with the assumed clock frequency; as a consequence, we designed the parallel interleaver shown in Fig. 4. The permutation algorithm specified in the WiMax standard is structured in two steps. The first step switches the LLRs of the symbols 01 and 10 stored at odd addresses, leaving the LLR of symbol 11 unmoved. The second step provides the interleaved address of each triplet as in (9), where the constants depend on the block length and are defined in [2] and [18]. The interleaver architecture can be simplified by rewriting (9) in the form detailed in [18].
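The two-step permutation described above can be sketched as follows. This is a minimal illustration, not the paper's equation (9): the address formula is written in the form commonly quoted for the WiMax convolutional turbo code interleaver, and the parameter set used in the check (24 couples, P0 = 5, P1 = P2 = P3 = 0) is an illustrative assumption that should be verified against [2]. All function names are hypothetical.

```python
def interleaved_address(j, n, p0, p1, p2, p3):
    """Step 2: address of the couple read at interleaved position j.

    Assumed form: (p0 * j + 1 + offset) mod n, with an offset that
    depends on j mod 4. Check the exact definition against the standard.
    """
    offsets = (0, n // 2 + p1, p2, n // 2 + p3)
    return (p0 * j + 1 + offsets[j % 4]) % n

def interleave_couples(couples, p0, p1, p2, p3):
    """Apply both steps to a list of (A, B) couples in natural order."""
    n = len(couples)
    # Step 1: swap the two bits of the couples stored at odd addresses.
    step1 = [(b, a) if j % 2 == 1 else (a, b)
             for j, (a, b) in enumerate(couples)]
    # Step 2: read the swapped sequence through the permuted addresses.
    return [step1[interleaved_address(j, n, p0, p1, p2, p3)]
            for j in range(n)]

# Illustrative parameters (assumed, not taken from the paper).
N_COUPLES = 24
PARAMS = (5, 0, 0, 0)
addresses = [interleaved_address(j, N_COUPLES, *PARAMS)
             for j in range(N_COUPLES)]
```

With these assumed parameters every address is produced exactly once, which is the property an address generator must preserve when (9) is rewritten in a hardware-friendly form.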
In this work we consider that the throughput sustained by the decoder scales with the block length: for short block lengths a single SISO is active, for intermediate block lengths two SISOs are active, and for the longest block lengths all four SISOs are active. Given several SISOs and as many memories to interleave extrinsic information, two different SISOs should not read from or write to the same memory at the same time, so as to avoid collisions. As detailed in [18], the resulting parallel interleaver with variable parallelism degree is a circular shifting interleaver [20], whose implementation requires nearly the same complexity as the nonparallel version. In fact, the collision-free property is achieved by making the SISOs access, at the same time, the same location of different memories.
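The collision-free condition stated above can be checked with a short sketch. It assumes a simple model that is not spelled out in the text: the block is split into equal contiguous slices, SISO p processes position p * slice_len + t at step t, and memory bank b holds addresses b * slice_len to (b + 1) * slice_len - 1. The toy permutation (5 * j + 1) mod 24 is a stand-in chosen for illustration, not the WiMax interleaver.

```python
def is_collision_free(perm, num_sisos):
    """True if, at every step, the SISOs address pairwise distinct banks."""
    n = len(perm)
    if n % num_sisos != 0:
        raise ValueError("block length must be a multiple of the SISO count")
    slice_len = n // num_sisos
    for t in range(slice_len):
        banks = {perm[p * slice_len + t] // slice_len
                 for p in range(num_sisos)}
        if len(banks) != num_sisos:
            return False
    return True

def same_location_everywhere(perm, num_sisos):
    """True if, at every step, all SISOs use the same in-bank location."""
    slice_len = len(perm) // num_sisos
    return all(
        len({perm[p * slice_len + t] % slice_len
             for p in range(num_sisos)}) == 1
        for t in range(slice_len)
    )

# Toy permutation (assumption for illustration only).
N = 24
toy_perm = [(5 * j + 1) % N for j in range(N)]

# A permutation that collides: at step 0, SISOs 0 and 1 both hit bank 0.
bad_perm = list(range(N))
bad_perm[1], bad_perm[6] = bad_perm[6], bad_perm[1]
```

For the toy permutation both checks hold with two and with four SISOs, which mirrors the behavior described for the circular shifting interleaver: one shared location, a different bank per SISO.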
V. FLEXIBLE UMTS/WIMAX SISO ARCHITECTURE
In the following paragraphs the solutions employed to share the SISO architecture between UMTS and WiMax are detailed with reference to Fig. 3. In the SM processing block, each new SM is computed from the SMs connected to the corresponding state at the current trellis step and from the corresponding BMs. In UMTS mode, each processing element (PE) combines 2 SMs and 2 BMs to produce a new SM, whereas 4 SMs and 4 BMs are required in WiMax mode. The $\max^*$ function shared by the UMTS and the WiMax decoders is implemented as a programmable two-input $\max^*$. Since the UMTS turbo code achieves excellent performance even with a 1-bit correction (3-position LUT [16]), the 1-bit correction can be exploited to replace the last adder in the standard add-compare-select-offset implementation with a simpler, programmable increment.
c) Output Processor Sharing (Top Right Side of Fig. 3): The output processor input stage is made of two normalization blocks (norm) devoted to subtracting the 0-state metric from the others, one for the $\alpha$ and one for the $\beta$ SMs. The normalized SMs, combined with the BMs, become the inputs of the $\max^*$ trees (two trees for UMTS and four trees for WiMax). The tree output referred to the reference symbol is subtracted from the others; then, subtracting the corresponding input LLR, the output extrinsic LLRs are obtained. To ease the generation of the decoded bits, the hard decision circuit is embedded into the output processor. For a binary turbo decoder it can be implemented by taking the sign of the output LLR. On the other hand, for a duo-binary turbo decoder the hard decision is selected as the couple with the maximum LLR, as shown in Fig. 3.

d) SM Exchange Network Sharing: To grant a windowed computation, the $\alpha$-MEM stores words made of 8 SMs each. As stated in Section III, the SISO complexity and latency can be reduced [17] by implementing a metrics inheritance strategy at the expense of additional memory. Given the number of windows per SISO, a local memory ($\beta$-LOC-MEM) stores the SMs at the boundary of two consecutive windows. Each word is made of 8 SMs. Moreover, every SISO requires two sets of 8 SMs to initialize its trellis portion. This architecture is suited to a single-SISO decoder, where only intra-SISO SM inheritance is required. However, in a parallel SISO decoder, inter-SISO SM inheritance is required to properly initialize the trellis slices of different SISOs. This can be achieved by inserting two 2-position shift registers ($\alpha$-EXT-MEM and $\beta$-EXT-MEM) to exchange the $\alpha$ and $\beta$ SMs with the neighboring SISOs. As depicted in Fig. 4, a simple network properly exchanges the boundary SMs among the different SISOs, considering that in UMTS mode the trellis starting and termination SMs are fixed, whereas in WiMax mode they are estimated as explained in Section III.
Depending on which SISO is the last active one, the SMs ought to be inherited from a different SISO.
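The hard decision rules described for the output processor in this section can be sketched as follows. The sketch assumes that LLRs are referenced to the all-zero symbol, so that the reference has an implicit LLR of 0 and a positive binary LLR favors bit 1; this sign convention and the function names are assumptions, not statements from the paper.

```python
def hard_decision_binary(llr):
    """Binary mode: the decision is the sign of the LLR (assumed convention)."""
    return 1 if llr > 0 else 0

def hard_decision_duobinary(llr_01, llr_10, llr_11):
    """Duo-binary mode: pick the couple with the maximum LLR.

    The reference couple (0, 0) is given an implicit LLR of 0.
    """
    candidates = [
        (0.0, (0, 0)),
        (llr_01, (0, 1)),
        (llr_10, (1, 0)),
        (llr_11, (1, 1)),
    ]
    # max() keeps the first maximum, so ties resolve toward earlier entries.
    return max(candidates, key=lambda item: item[0])[1]
```

For example, with all three LLRs negative the reference couple (0, 0) is selected, since its implicit LLR of 0 is then the largest.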
VI. IMPLEMENTATION, THROUGHPUT AND LATENCY
According to the literature [16], [21], the design parameters have been chosen as a significant, conservative case for both UMTS and WiMax. Synthesis results on a 0.13-μm standard cell technology show that the proposed flexible UMTS/WiMax architecture requires about 204 kgates. The single-mode WiMax architecture requires about 171 kgates, whereas the UMTS one described in [22], similar to the one employed in this work, requires 75 kgates. The combination of the two single-mode decoders therefore amounts to 246 kgates, so the proposed solution is 17.1% less complex. As stated in Section III, memory sharing and on-the-fly generation of scrambled addresses grant a large area saving. This is confirmed by the actual memory requirements: the WiMax decoder requires 133.6 kbits, whereas the UMTS decoder requires 70.9 kbits. As a consequence, the two architectures together require 204.5 kbits. The proposed solution with memory sharing requires only 148.6 kbits; thus it grants a memory saving of 27.3% and a total area saving of 27.7% compared to the two single-mode architectures. In Table I the proposed architecture is compared to some binary and duo-binary turbo decoder architectures. The proposed dual-mode architecture shows excellent performance and complexity figures compared both to a fixed implementation [23] and to a programmable solution [7] ([7]-I refers to the single-processor solution, whereas [7]-II is related to the 16-processor architecture).
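The logic and memory savings quoted above follow directly from the reported gate and bit counts. The short sketch below reproduces that arithmetic; the variable and function names are chosen for illustration only.

```python
def saving_percent(separate, shared):
    """Relative saving of the shared design versus two separate decoders."""
    return 100.0 * (separate - shared) / separate

# Logic complexity in kgates, as reported in the text.
logic_separate = 171 + 75          # single-mode WiMax + single-mode UMTS
logic_saving = saving_percent(logic_separate, 204)

# Memory in kbits, as reported in the text.
memory_separate = 133.6 + 70.9     # WiMax + UMTS
memory_saving = saving_percent(memory_separate, 148.6)

print(f"logic: {logic_saving:.1f}%  memory: {memory_saving:.1f}%")
```

Rounded to one decimal place, the two results match the 17.1% and 27.3% figures stated in the text.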
As can be inferred from Table I, the proposed architecture achieves a throughput higher than that specified by the WiMax standard. This implies that enough processing power is available for the concurrent execution of UMTS and WiMax decoding. Of course, external buffers must be available to receive a UMTS frame while WiMax decoding is in progress and vice versa. In the following we show that the proposed architecture can support the concurrent execution of UMTS and WiMax decoding. The time required to decode blocks with the proposed architecture is expressed by (10)-(12) for the WiMax and UMTS modes. Substituting (10) and (11) in (12) we obtain (13). The final operating point has been chosen by fixing a target throughput and solving (13) for the maximum value of the remaining parameter. Concurrent decoding also affects latency. The latency of the proposed architecture can be obtained from (8) as (14) for each of the two modes. The total latency to decode a group of WiMax blocks and 1 UMTS block is then given by (15). In the worst case, this total is a small percentage of the global latency specified by both UMTS and WiMax standards. The single- and multi-standard figures of throughput and latency offered by the presented architecture are summarized in Table II.
VII. CONCLUSION
In this paper, a flexible UMTS/WiMax turbo decoder architecture has been presented together with a parallel WiMax interleaver architecture. Compared to a single-mode, parallel WiMax architecture, the proposed one exhibits a limited complexity overhead. Moreover, compared to a separate dual-mode UMTS/WiMax turbo decoder architecture, it achieves a 17.1% logic reduction and a 27.3% memory reduction.
