Abstract-A low complexity, parallel, collision-free interleaver architecture for the WiMax duo-binary turbo decoder is presented. The proposed architecture dynamically adapts to different block sizes and it features reduced complexity resorting to parallel circular shifting interleavers. Moreover, it sustains a peak throughput of nearly 90 Mb/s with a 200 MHz clock frequency, when synthesized on a 0.13 µm standard cell technology.
I. INTRODUCTION
TURBO codes are employed in several standards for wireless communications, such as WiMax [1]. When high transmission throughputs are required, parallel decoder architectures are needed to meet application speed constraints while keeping the clock frequency limited to a few hundred MHz. A parallel turbo decoder is basically structured as $M$ processing elements (PEs) and $M$ memories. Each PE plays the role of a Soft In Soft Out (SISO) module on a given window of data, whereas the memories are used for exchanging extrinsic information among SISOs. Since WiMax resorts to a duo-binary turbo code [2], the encoder processes $N_c$ couples of information bits $u = \{A, B\}$ with $A, B \in \{0, 1\}$, whereas the decoder works on $N_c$ triplets of logarithmic likelihood ratios (LLRs) $\lambda[u] = (\lambda^{A\bar{B}}[u], \lambda^{\bar{A}B}[u], \lambda^{AB}[u])$, where $\tilde{u} = \{\bar{A}, \bar{B}\}$ is taken as the reference symbol. The decoding process is iterative: in the in-order half iteration, the $n$-th SISO accesses only the $n$-th memory, whereas in the scrambled half iteration the $n$-th SISO reads from and writes to different memories. A collision occurs when two or more SISOs try to access the same memory simultaneously.
Parallel circular shifting interleavers are intrinsically collision-free [3], so they do not require further memory and logic to avoid collisions. In this letter, we propose a collision-free, low complexity, parallel architecture supporting the interleaving laws specified for the WiMax duo-binary turbo code [1]. Collisions are avoided by means of a variable parallelism architecture where $M$ is chosen to guarantee that the resulting parallel interleaver is circular shifting. To the best of our knowledge, this is the first work concerning the VLSI implementation of a collision-free parallel interleaver for the WiMax turbo decoder.
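II. PROPOSED ARCHITECTURE
The permutation algorithm specified in [1] is structured in two steps: the first step swaps the two bits of alternate couples, and the second step computes the interleaved address of the $j$-th couple as
$$\Pi(j) = (P_0 \cdot j + K_j) \bmod N_c \qquad (1)$$
where $K_j = 1$ when $j \bmod 4 = 0$, $K_j = 1 + N_c/2 + P_1$ when $j \bmod 4 = 1$, $K_j = 1 + P_2$ when $j \bmod 4 = 2$, $K_j = 1 + N_c/2 + P_3$ when $j \bmod 4 = 3$; $P_0$, $P_1$, $P_2$, $P_3$ are constants that depend only on the number of couples $N_c$ [1]. Since the two steps can be swapped, the first step can be performed on the fly using the least significant bit of $\Pi(j)$ (LSB[$\Pi(j)$]) as a selector. The implementation of (1) can be derived as follows: if $x \in [0, 2 \cdot N_c - 1]$, $x \bmod N_c$ can be implemented by means of a subtracter and a multiplexer. Unfortunately, $P_0 \cdot j + K_j$ is in general not confined to $[0, 2 \cdot N_c - 1]$. As a consequence, several $x \bmod N_c$ blocks ought to be cascaded to obtain $\Pi(j)$. However, (1) can be rewritten as
$$\Pi(j) = [(P_0 \cdot j) \bmod N_c + (K_j \bmod N_c)] \bmod N_c \qquad (2)$$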
Given that a small Look-Up Table (LUT) is employed to store $P_0$ and the $K_j \bmod N_c$ terms, (2) can be implemented in two parts as depicted in Fig. 1 (shaded gray box). The first part accumulates $P_0$ to implement the $P_0 \cdot j$ term, and the mod $N_c$ block produces the correct modulo $N_c$ result. Since $j$ is a counter (j-cnt in Fig. 1), $(P_0 \cdot j) \bmod N_c$ is generated in one clock cycle by adding $P_0$ to $[P_0 \cdot (j - 1)] \bmod N_c$ and performing the modulo operation. The second part employs the two least significant bits of the j-cnt counter to select the proper $K_j \bmod N_c$ value, which is added to the $(P_0 \cdot j) \bmod N_c$ term. A further modulo $N_c$ operation is performed at the output. Since in this architecture both the first and the second part work on data belonging to $[0, 2 \cdot N_c - 1]$, all the mod $N_c$ operations are implemented by means of a subtracter and a multiplexer (dark-shaded gray box in Fig. 1).
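The two-part address generation described above can be checked with a short behavioural model. The following Python sketch is only an illustration under stated assumptions: the function names are ours, and the parameter values used in the usage note ($N_c = 24$, $P_0 = 5$, $P_1 = P_2 = P_3 = 0$) are illustrative and have not been checked against the table in [1]. It compares the direct evaluation of $(P_0 \cdot j + K_j) \bmod N_c$ with an accumulator-based version in which every modulo operation is a single subtract-and-select stage.

```python
def mod_nc(x, nc):
    """Return x mod nc for x in [0, 2*nc - 1] with one subtraction and one selection."""
    assert 0 <= x < 2 * nc, "operand outside the range handled by a single stage"
    diff = x - nc                        # subtracter
    return diff if diff >= 0 else x      # multiplexer driven by the sign of diff


def k_terms(nc, p1, p2, p3):
    """K_j for j mod 4 = 0, 1, 2, 3, as defined in the text."""
    return [1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3]


def addresses_direct(nc, p0, p1, p2, p3):
    """Reference model: evaluate (P0 * j + K_j) mod N_c with a full modulo."""
    k = k_terms(nc, p1, p2, p3)
    return [(p0 * j + k[j % 4]) % nc for j in range(nc)]


def addresses_incremental(nc, p0, p1, p2, p3):
    """Accumulator-based model: only additions and subtract-and-select stages."""
    lut_p0 = p0 % nc                                      # LUT entry for P0
    lut_k = [kj % nc for kj in k_terms(nc, p1, p2, p3)]   # LUT entries K_j mod N_c
    acc = 0                                               # holds (P0 * j) mod N_c
    out = []
    for j in range(nc):
        # Second part: two LSBs of the counter select K_j mod N_c.
        out.append(mod_nc(acc + lut_k[j & 3], nc))
        # First part: accumulate P0 for the next cycle.
        acc = mod_nc(acc + lut_p0, nc)
    return out
```

With the illustrative parameters, `addresses_incremental(24, 5, 0, 0, 0)` returns the same list as `addresses_direct(24, 5, 0, 0, 0)`, and the list is a permutation of 0..23. Both operands of every addition stay in $[0, N_c - 1]$, so each sum stays within the range that one subtracter and one multiplexer can reduce.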
In the WiMax HUMAN-OFDM profile for 10 MHz channelization [1], the worst case downlink throughput is $T_{dl} \approx 65$ Mb/s. The decoder throughput can be estimated as the number of decoded bits ($2N_c$) over the time required to perform the decoding operations:
$$T = \frac{2 N_c \cdot f_{clk}}{2I \cdot (N_c/M + SISO_l)} \qquad (3)$$
where $2I$ is the number of half iterations, $f_{clk}$ is the clock frequency and $SISO_l$ is the SISO latency. We adopt a sliding window based approach where boundary metrics are inherited from one iteration to the next, as proposed in [4]. This yields $SISO_l = 2W$, where $W$ is the window size. Assuming $W = 32$ [5], $I = 8$ and $f_{clk} = 200$ MHz, we estimate the throughput of the decoder for the 17 possible values of $N_c$ [1]. As shown in Fig. 2, $M = 3$ achieves $T_{dl}$ (horizontal solid line) only for $N_c \geq 1440$, whereas with $M = 4$ it can be reached for $N_c > 500$ (i.e. $N_c \geq 960$, the next specified size).
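As a quick numerical check of the throughput estimate, the sketch below evaluates it in Python. It assumes that each of the $2I$ half iterations takes $N_c/M + SISO_l$ clock cycles, which is our reading of the description above rather than a confirmed expression; the function name and default arguments are ours.

```python
def decoder_throughput(nc, m, iterations=8, window=32, f_clk=200e6):
    """Estimated decoder throughput in bit/s.

    Assumption: a half iteration lasts N_c / M + SISO_l cycles,
    with SISO_l = 2 * W as stated in the text.
    """
    siso_latency = 2 * window
    cycles = 2 * iterations * (nc / m + siso_latency)
    return 2 * nc * f_clk / cycles


if __name__ == "__main__":
    for nc, m in [(2400, 4), (1440, 3), (960, 3)]:
        print(nc, m, round(decoder_throughput(nc, m) / 1e6, 1), "Mb/s")
```

Under this assumption, $N_c = 2400$ with $M = 4$ gives roughly 90 Mb/s, in line with the peak figure quoted in the abstract, while $M = 3$ exceeds 65 Mb/s for $N_c = 1440$ but not for $N_c = 960$.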
1089-7798/08$25.00 © 2008 IEEE
[Fig. 2 caption, truncated in extraction: "... Table I for solid curve)."]
To satisfy the throughput requirement and to avoid high cost inter-SISO communication structures, a parallel collision-free interleaver is advisable. According to [3], a circular shifting interleaver is defined as $\Pi(j) = (a \cdot j + r) \bmod N$, where $N$ is the block size, $r < N$ is an offset and $a < N$ is a step size that is relatively prime to $N$. Comparing this definition with (1), it is clear that the WiMax interleaver could be circular shifting with $a = P_0$, $r = K_j$ and $N = N_c$. As detailed in [6], parallel collision-free, circular shifting interleavers are obtained by imposing
with j = 0, 1, . . . (4) we obtain the conditions required to ensure that the WiMax interleaver is collision-free for a given parallelism degree M . Given that (6) and (7) must be simultaneously satisfied:
where
Let $\mathcal{I}$ denote the set of the 17 possible $N_c$ values specified by the WiMax standard [1]. Given $N_c \in \mathcal{I}$ and the corresponding $P_0$, $P_1$, $P_2$, $P_3$, we find which $M \in \{2, 3, 4\}$ yields a parallel, collision-free interleaver. It is worth pointing out that all the possible $P_0$ specified in [1] satisfy (5) with $M \in \{2, 3, 4\}$. As a consequence, all the configurations where $(N_c/M) \bmod 4 = 0$ with $M \in \{2, 3, 4\}$ and $N_c \in \mathcal{I}$ correspond to a parallel collision-free interleaver. This is the case of $M = 3$, which leads to parallel collision-free interleavers for every $N_c \in \mathcal{I}$.
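The condition $(N_c/M) \bmod 4 = 0$ is easy to tabulate. The Python sketch below lists, for each $M$, the block sizes that do not meet it and therefore need the further conditions discussed next. The list of 17 block sizes is an assumption written from our recollection of [1] and should be verified against the standard; the helper name is ours. Note that failing this test does not by itself imply a collision.

```python
# Assumption: the 17 block sizes (in couples) as recalled from [1].
WIMAX_NC = [24, 36, 48, 72, 96, 108, 120, 144, 180,
            192, 216, 240, 480, 960, 1440, 1920, 2400]


def needs_further_check(m, sizes=WIMAX_NC):
    """Block sizes for which (N_c / M) mod 4 = 0 does not hold."""
    return [nc for nc in sizes if nc % m != 0 or (nc // m) % 4 != 0]


if __name__ == "__main__":
    for m in (2, 3, 4):
        print("M =", m, "->", needs_further_check(m))
```

With this list, no size is returned for $M = 3$, since every listed $N_c$ is a multiple of 12, whereas $M = 2$ and $M = 4$ each leave a few sizes to be examined, $N_c = 108$ among them.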
When $M = 2$, we have [...] (6) is $N_c = 108$, which leads to collisions.
When $M = 4$, we have [...] (6) is verified with $M = 4$ and $N_c \in \mathcal{I}$ [...], these configurations lead to parallel collision-free interleavers. On the other hand, when $M = 4$ and $N_c \in \mathcal{I}$ [...], both (6) and (7) must be satisfied to obtain parallel collision-free interleavers. The only $N_c \in \mathcal{I}$ leading to collisions is $N_c = 108$, which satisfies neither (6) nor (7).
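Whether a given configuration collides can also be checked by brute force. The Python sketch below assumes the usual mapping for a windowed parallel decoder, which is our assumption and not taken from the text: at time $j$, SISO-$k$ requests the couple at position $j + k \cdot N_c/M$, and memory $m$ stores addresses $[m \cdot N_c/M, (m+1) \cdot N_c/M)$. The function names are ours. The first parameter set in the test is illustrative and unchecked against [1]; the second ($P_2 = 4$) is purely hypothetical and chosen only to produce a collision.

```python
def pi(j, nc, p0, p1, p2, p3):
    """Interleaved address (P0 * j + K_j) mod N_c, with K_j as defined in the text."""
    k = [1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3]
    return (p0 * j + k[j % 4]) % nc


def first_collision(nc, m, p0, p1, p2, p3):
    """Return (j, memories) for the first cycle with a collision, or None.

    Assumption: SISO-k requests position j + k * (N_c / M) at time j, and
    the memory index of an address is its integer quotient by N_c / M.
    """
    if nc % m:
        raise ValueError("N_c must be a multiple of M")
    w = nc // m
    for j in range(w):
        memories = [pi(j + k * w, nc, p0, p1, p2, p3) // w for k in range(m)]
        if len(set(memories)) < m:      # two SISOs target the same memory
            return j, memories
    return None


if __name__ == "__main__":
    print(first_collision(24, 4, 5, 0, 0, 0))   # illustrative parameters
    print(first_collision(24, 4, 5, 0, 4, 0))   # hypothetical parameters
```

For the hypothetical set, $N_c/M = 6$ is not a multiple of 4, so SISOs whose positions differ by 6 use different $K_j$ terms and two of them end up on the same memory. With $M = 2$ or $M = 3$ the window length is a multiple of 4 and no collision is found.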
In this work, a parallel, collision-free interleaver is obtained by selecting $M$ as a function of $N_c$; in particular, $M = 2$ and $M = 4$ are excluded when $N_c = 108$. Since the resulting interleaver is a parallel circular shifting interleaver, we can write
III. RESULTS AND CONCLUSIONS
The architecture detailed in the previous paragraphs simultaneously produces $M$ addresses per cycle and is employed to implement the interleaver reading part. Since $idx^k_j$ identifies the memory accessed by SISO-$k$ at time $j$, the parallel interleaver architecture ought to signal to the memory which SISO is requesting the data. This operation is accomplished by a 4 × 4 crossbar switch (radx-switch) controlled by $idx^k_j$, with 2-bit-wide fixed inputs, as shown in Fig. 3. When the $idx^k$ memory (EI-MEM $idx^k$) is read, it sends back the corresponding $\lambda[u]$ triplet to SISO-$k$ through a 4 × 4 crossbar switch (rdata-switch). This crossbar switch is controlled by the output of the radx-switch.
Since each SISO outputs its data in reverse order, during the reading operation $idx^k_j$ and $adx_j$ are stored into a LIFO; $idx^k_j$ and $adx_j$ are then read from the LIFO during the writing operation to configure a 4 × 4 crossbar switch (wdata-switch).
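The role of the LIFO can be illustrated with a small behavioural model. The Python sketch below is not the described hardware: it assumes that the address requested by SISO-$k$ at time $j$ splits into a memory index (quotient by $N_c/M$) and a local address (remainder) shared by all SISOs, and it uses illustrative parameters ($N_c = 24$, $M = 3$, $P_0 = 5$, $P_1 = P_2 = P_3 = 0$) that have not been checked against [1]. All names are ours.

```python
def read_then_write_back(nc, m, perm, banks, update):
    """Read every location through the interleaver, then write updated values back.

    Assumption: SISO-k requests perm[j + k * W] at time j, with W = N_c / M;
    the quotient by W selects the memory and the remainder is the local address.
    """
    w = nc // m
    lifo, reads = [], []
    # Reading phase: one shared local address and M memory indices per cycle.
    for j in range(w):
        targets = [perm[j + k * w] for k in range(m)]
        adx = targets[0] % w
        idx = [t // w for t in targets]
        assert all(t % w == adx for t in targets), "local address not shared"
        assert len(set(idx)) == m, "collision"
        reads.append([banks[i][adx] for i in idx])
        lifo.append((idx, adx))            # stored for the writing phase
    # Writing phase: SISOs emit results in reverse order, so popping the
    # LIFO returns the indices and address that match each result.
    for j in reversed(range(w)):
        idx, adx = lifo.pop()
        for k, i in enumerate(idx):
            banks[i][adx] = update(reads[j][k])
    return banks


if __name__ == "__main__":
    nc, m, p0 = 24, 3, 5                   # illustrative values
    k_j = [1, 1 + nc // 2, 1, 1 + nc // 2]  # K_j with P1 = P2 = P3 = 0
    perm = [(p0 * j + k_j[j % 4]) % nc for j in range(nc)]
    banks = [[b * 8 + a for a in range(8)] for b in range(m)]
    print(read_then_write_back(nc, m, perm, banks, lambda v: v + 100))
```

In this model every location is read once and written once with its own updated value, which is the behaviour the LIFO is meant to preserve when the write order is the reverse of the read order.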
The proposed parallel interleaver has been described in VHDL and synthesized on a 0.13 µm standard cell technology [...] [7] of memory, where 57.6 kbits are devoted to the EI-MEM and 2.2 kbits to the LIFO.
A parallel decoder made of $M$ SISOs must produce $M$ addresses per cycle simultaneously. Thus, for a fair comparison, we consider a parallel interleaver obtained using four instances of the single-address-per-cycle interleaver in [5]: this solution requires about 4880 equivalent gates (1220 equivalent gates for each address generator), not including the switches and the memories, with an average power consumption of about 3.2 mW (0.8 mW for each address generator) at 200 MHz. As can be observed, the proposed interleaver is more than 50% simpler than four instances of [5].
