Introduction
WIMAX has gained wide popularity thanks to the growing interest in, and diffusion of, broadband wireless access systems. To be flexible and reliable, WIMAX adopts several channel codes, namely convolutional codes (CC), convolutional turbo codes (CTC), block turbo codes (BTC) and low-density parity-check (LDPC) codes, which can cope with different channel conditions and application needs. At the same time, high performance digital CMOS technologies have matured to the point where very complex algorithms can be implemented in low cost chips. Moreover, embedded processors, digital signal processors, programmable devices such as FPGAs, application specific instruction-set processors and VLSI technologies now provide the computing power and the memory required to execute several real time applications even in cheap portable devices. Among the application fields that have benefited from this technological progress, channel decoding is one of the most significant and interesting. Designing efficient architectures for such channel decoders is known to be a hard task, made harder by the high throughput required by WIMAX systems, which is up to about 75 Mb/s per channel. In particular, CTC and LDPC codes, whose decoding algorithms are iterative, are still a major topic in the scientific literature, and the design of efficient architectures for them still drives research efforts in both industry and academia. In this chapter, the design of VLSI architectures for WIMAX channel decoders is analyzed with emphasis on three main aspects: performance, complexity and flexibility. The chapter is divided into two main parts. The first part deals with the impact of system requirements on the decoder design, with emphasis on memory requirements, the structure of the key components of the decoders and the need for parallel architectures.
To that end, a quantitative approach is adopted to derive key architectural choices from system specifications; the most important architectures available in the literature are also described and compared. The second part concentrates on a significant case study: the design of a complete CTC decoder architecture for WIMAX, including hardware units for depuncturing (bit deselection) and external deinterleaving (subblock deinterleaver).
From system specifications to architectural choices
The system specifications, and in particular the peak throughput of about 75 Mb/s per channel imposed by the WIMAX standard, have a significant impact on the decoder architecture. The following sections analyze the most significant architectures proposed in the literature to implement CC decoders (Viterbi decoders), BTC, CTC and LDPC decoders.
Viterbi decoders
The most widely used algorithm to decode CCs is the Viterbi algorithm [Viterbi, 1967], which finds the shortest path along a graph that represents the CC trellis. As an example, Fig. 1 shows a binary 4-state CC as a feedback shift register (a), together with the corresponding state diagram (b) and trellis (c) representations. In this example, the feedback shift register implementation of the encoder generates two output bits, c_1 and c_2, for each received information bit u; c_1 is the systematic bit. The state diagram is a Mealy finite state machine that describes the encoder behaviour in a time independent way: each node corresponds to a valid encoder state, represented by the flip-flop contents e_1 and e_2, while edges are labelled with input and output bits. The trellis representation also provides time information, explicitly showing the evolution from one state to another over time steps (a single step is drawn in the figure). At each trellis step n, the Viterbi algorithm associates with each trellis state S a state metric Γ_n^S, calculated along the shortest path, and stores a decision d_n^S that identifies the entering transition on the shortest path. First, the decoder computes the branch metrics γ_n, which are the distances between the symbols labelling each edge of the trellis and the actual received soft symbols. For a binary CC with rate 0.5 the soft symbols are λ1_n and λ2_n and the branch metrics are γ_n(c_2,c_1) (see Fig. 2 (a)). Starting from these values, the state metrics are updated by selecting the larger metric among those related to the incoming edges of each trellis state and storing the corresponding decision d_n^S. Finally, decoded bits are obtained by means of a recursive procedure usually referred to as trace-back.
To estimate the sequence of bits that were encoded for transmission, a state is first selected at the end of the trellis portion to be decoded; the decoder then iteratively goes backward through the state history memory where the decisions d_n^S were previously stored. For the current state, each stored decision identifies the predecessor state listed in the state history trace. Different implementation methods are available to make the initial state choice and to size the portion of trellis over which the trace-back is performed; these methods affect both decoder complexity and error correcting capability. For further details on the algorithm the reader can refer to [Viterbi, 1967]; [Forney, 1973]. At the architecture level, the main blocks of a Viterbi decoder are the branch metric unit (BMU), which computes γ_n, the state metric unit (SMU), which calculates Γ_n^S, and the trace-back unit (TBU), which produces the decoded sequence. The BMU is made of adders and subtracters that combine the input soft symbols (see Fig. 2 (a)). The SMU is based on the so-called add-compare-select (ACS) structure shown in Fig. 2 (b). Let i denote the i-th starting state connected to an arriving state S by an edge whose branch metric is γ_{n-1}^i; then Γ_n^S is calculated as in (1).
Fig. 2. BMU and ACS architectures for a rate 0.5 CC
As can be inferred from (1), Γ_n^S is obtained by adding branch metrics to state metrics, then comparing the sums and selecting the highest one, which represents the shortest incoming path. The corresponding decision d_n^S is stored in a memory that is later read by the TBU to reconstruct the survivor path. Due to the recursive form of (1), the number of bits needed to represent Γ_n^S grows as n increases. This problem can be solved by normalizing the state metrics at each step, but the added normalization stage increases both the SMU complexity and its critical path. An effective alternative, based on two's complement representation, limits the growth of the state metrics, as described in [Hekstra, 1989]. The WIMAX standard specifies a binary 64-state CC with rate 0.5, whose shift register representation is shown in Fig. 3. Viterbi decoder architectures usually exploit the intrinsic parallelism of the trellis to compute all the branch metrics and update all the state metrics simultaneously at each trellis step. Thus, if n is the number of states of a CC, a parallel architecture employs one BMU and n ACS modules. Moreover, to reduce the decoding latency, the trace-back is performed as a sliding-window process [Rader, 1981] on trellis portions of width W. This approach reduces not only the latency but also the size of the decision memory, which usually requires 3W or 4W cells depending on the TBU radix [Black & Meng, 1992].
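The add-compare-select recursion and the trace-back described above can be sketched in a few lines of Python. This is only a minimal software illustration, not the architecture of Fig. 2: the names `acs_step`, `trace_back` and the `predecessors` table are assumptions introduced here, and, following the text, the larger metric is kept as the survivor.

```python
def acs_step(prev_metrics, branch_metrics, predecessors):
    """One trellis step of add-compare-select (hypothetical helper).

    prev_metrics[i]        : state metric of state i at step n-1
    branch_metrics[(i, s)] : branch metric of the edge i -> s
    predecessors[s]        : states i that have an edge into state s
    Returns (new_metrics, decisions); decisions[s] is the surviving
    predecessor of s, i.e. what a trace-back unit would read later.
    """
    new_metrics, decisions = [], []
    for s, preds in enumerate(predecessors):
        # add: one candidate metric per incoming edge
        candidates = [(prev_metrics[i] + branch_metrics[(i, s)], i) for i in preds]
        # compare/select: keep the larger metric, as stated in the text
        best_metric, best_pred = max(candidates)
        new_metrics.append(best_metric)
        decisions.append(best_pred)
    return new_metrics, decisions


def trace_back(decision_history, final_state):
    """Follow the stored decisions backwards to recover the survivor path."""
    path = [final_state]
    for decisions in reversed(decision_history):
        path.append(decisions[path[-1]])
    return path[::-1]


# Toy 2-state, fully connected trellis (illustrative values only).
predecessors = [[0, 1], [0, 1]]
metrics, history = [0, 0], []
for bm in ({(0, 0): 1, (1, 0): 0, (0, 1): -1, (1, 1): 2},
           {(0, 0): 0, (1, 0): 3, (0, 1): 1, (1, 1): -2}):
    metrics, decisions = acs_step(metrics, bm, predecessors)
    history.append(decisions)
survivor = trace_back(history, final_state=0)
```

In a parallel hardware decoder the loop over `s` would be unrolled into one ACS module per state, and `history` corresponds to the decision memory read by the TBU.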
To improve the decoder throughput, two [Black & Meng, 1992] or more [Fettweis & Meyr, 1989]; [Kong & Parhi, 2004]; [Cheng & Parhi, 2008] trellis steps can be processed concurrently. These solutions lead to the so-called higher radix or M-look-ahead step architectures. According to [Kong & Parhi, 2004], the throughput sustained by an M-look-ahead step architecture, defined as the number of decoded bits over the decoding time, is
where f_clk is the clock frequency, N_T is the number of trellis steps, k=1 for a binary CC, k=2 for a double binary CC, and the rightmost expression is obtained under the condition W ≪ N_T, which is a reasonable assumption in real cases. Thus, to achieve the throughput required by the WIMAX standard with a clock frequency limited to tens to a few hundreds of MHz, M=1 (radix-2) or M=2 (radix-4) is a reasonable choice. However, since CCs are widely used in many communication systems, some recent works such as [Batcha & Shameri, 2007] and [Kamuf et al., 2008] address the design of flexible Viterbi decoders that support different CCs. As a further step, [Vogt & Wehn, 2008] proposed a multi-code decoder architecture able to support both CCs and CTCs.
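The choice of M can be checked with a rough calculation. The sketch below is an illustration under a stated assumption, not the expression of [Kong & Parhi, 2004]: it assumes that an M-look-ahead architecture needs (N_T + W)/M clock cycles per frame, so that the throughput tends to k·M·f_clk when W ≪ N_T. The function names are hypothetical.

```python
def viterbi_throughput_bps(f_clk_hz, n_t, w, m=1, k=1):
    """Decoded bits per second.

    Assumption (not taken from the source): a frame of N_T trellis steps
    plus a trace-back window W is processed M steps per clock cycle.
    """
    cycles_per_frame = (n_t + w) / m
    return k * n_t * f_clk_hz / cycles_per_frame


def min_clock_hz(target_bps, m=1, k=1):
    """Clock needed under the W << N_T approximation, T ~ k * M * f_clk."""
    return target_bps / (k * m)


radix2_clock = min_clock_hz(75e6, m=1)   # 75 MHz for a binary CC
radix4_clock = min_clock_hz(75e6, m=2)   # 37.5 MHz for a binary CC
```

Under this assumption both radix-2 and radix-4 stay well within a clock budget of tens to hundreds of MHz, which is consistent with the conclusion drawn in the text.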
BTC decoders
Block turbo codes, or product codes, are serially concatenated block codes. Given two block codes C_1=(n_1,k_1,δ_1) and C_2=(n_2,k_2,δ_2), where n_i, k_i and δ_i are the code-word length, the number of information bits and the minimum Hamming distance, respectively, the corresponding product code is built according to [Pyndiah, 1998] from an array with k_1 rows and k_2 columns containing the information bits: the k_1 rows are first encoded with C_2, then the n_2 resulting columns are encoded with C_1. BTCs can be decoded iteratively, row-wise and column-wise, using the sub-optimal algorithm detailed in [Pyndiah, 1998]. The basic idea is to use the Chase search [Chase, 1972], a near-maximum-likelihood (near-ML) search strategy, to find a list of code-words and an ML decided code-word d={d_0,…,d_{n-1}} with d_j ∈ {-1,+1}. According to the notation used in [Vanstraceele et al., 2008], decision reliabilities are computed as
where r={r_0,…,r_{n-1}} is the received code-word and c^{-1(j)} and c^{+1(j)} are the code-words in the Chase list at minimum Euclidean distance from r whose j-th bit is -1 and +1, respectively. Then one decoder sends to the other the extrinsic information
where β is a weight factor increasing with the number of iterations. The decoder that receives the extrinsic information uses an updated version of r obtained as
where α is a weight factor increasing with the number of iterations. A scheme of the elementary block turbo decoder is shown in Fig. 4, where the block named "decoder" is a Soft-In-Soft-Out (SISO) module that performs the Chase search and implements (3), (4) and (5). An effective way to implement the SISO module is a three-stage pipelined architecture whose stages are identified as reception, processing and transmission units [Kerouedan & Adde, 2000]. As detailed in [LeBidan et al., 2008], during each stage the N soft values of the received word r are processed sequentially in N clock periods. The reception stage finds the least reliable bits in the received code-word. The processing stage performs the Chase search, and the transmission stage calculates λ(d_j), w_j and r_j^new. Another solution is proposed in [Goubier et al. 2008], where the elementary decoder is implemented as a pipeline resorting to the mini-maxi algorithm, namely by using mini-maxi arrays to store the best metrics of all decoded code-words in the Chase list. Several works in the literature deal with BTC complexity reduction. As an example, [Adde & Pyndiah, 2000] suggests computing β in (5) on a per-code-word basis, whereas in [Chi et al., 2004] the dependency on α in (6) is removed by replacing the term α·w_j with tanh(w_j/2). In [Le et al. 2005] both α in (6) and β in (5) are avoided by exploiting a Euclidean distance property.
Due to its row-column structure, the block turbo decoder can be parallelized by instantiating several elementary decoders that concurrently process several rows or columns, thus increasing the throughput. As a significant example, a fully parallel BTC decoder is proposed in [Jego et al., 2006]. This solution instantiates n_1+n_2 decoders that work concurrently. Moreover, by properly scheduling the decoders and interconnecting them through an Omega network, intermediate results (row decoded data or column decoded data) do not need to be stored. A detailed analysis of throughput and complexity of BTC decoder architectures can be found in [Goubier et al. 2008] and [LeBidan et al., 2008]. In particular, according to [Goubier et al. 2008], a simple one-block decoder architecture that performs the row/column decoding sequentially (interleaved architecture) requires 2(n_1+n_2) cycles to complete an iteration; as a consequence it achieves a throughput
where I is the number of iterations and f_clk is the clock frequency. The BTC specified for WIMAX is obtained by using twice a binary extended Hamming code chosen among the ones shown in Table 1.

n    k
15   11
31   26
63   57
Table 1. WIMAX binary extended Hamming codes (H(n,k)) used for BTC

Considering the interleaved architecture described in [Goubier et al. 2008], where a fully decoded block is output every 4.5 half iterations, 75 Mb/s can be obtained with a clock frequency of 84 MHz, 31 MHz and 14 MHz for H(15,11), H(31,26) and H(63,57), respectively.
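The clock frequencies quoted above can be reproduced with a short calculation. The sketch below assumes that each half iteration of the interleaved architecture takes n_1+n_2 cycles (half of the 2(n_1+n_2) cycles per iteration stated in the text) and that a decoded block carries k_1·k_2 information bits; the function name is hypothetical and the results are rounded up to the next MHz.

```python
import math


def btc_min_clock_hz(n, k, target_bps=75e6, half_iterations=4.5):
    """Clock needed for the product code H(n,k) x H(n,k).

    Assumptions: n1 = n2 = n, k1 = k2 = k, one half iteration costs
    n1 + n2 cycles, and one block is output every `half_iterations`.
    """
    cycles_per_block = half_iterations * (n + n)
    info_bits_per_block = k * k
    return target_bps * cycles_per_block / info_bits_per_block


for n, k in [(15, 11), (31, 26), (63, 57)]:
    mhz = math.ceil(btc_min_clock_hz(n, k) / 1e6)
    print(f"H({n},{k}): {mhz} MHz")
```

With these assumptions the three codes of Table 1 give 84, 31 and 14 MHz, matching the values in the text.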
CTC decoders
Convolutional turbo codes were proposed in 1993 by Berrou, Glavieux and Thitimajshima [Berrou et al., 1993] as a coding scheme based on the parallel concatenation of two CCs by means of an interleaver (Π), as shown in Fig. 5 (a). The decoding algorithm is iterative and is based on the BCJR algorithm [Bahl et al., 1974] applied to the trellis representation of each constituent CC (Fig. 5 (b)). The key idea is that the extrinsic information output by one CC is used by the other CC as an updated version of its input a-priori information. As a consequence, each iteration is made of two half iterations: in one half iteration the data are processed according to the interleaver (Π), in the other according to the deinterleaver (Π^-1). The same result can be obtained by implementing an in-order read/write half iteration and a scrambled (interleaved) read/write half iteration. The basic block of a turbo decoder is a SISO module that implements the BCJR algorithm in its logarithmic likelihood ratio (LLR) form. For a Recursive Systematic CC (RSC code), the extrinsic information λ_k(u;O) of an uncoded symbol u at trellis step k output by a SISO is
where ũ is an uncoded symbol taken as a reference (usually ũ=0), e represents a transition on the trellis and u(e) is the uncoded symbol u associated with e. The max* function is usually implemented as a max followed by a correction term [Robertson et al., 1995]; [Gross & Gulak, 1998]; [Cheng & Ottosson, 2000]; [Classon et al., 2002]; [Wang et al., 2006]; [Talakoub et al. 2007]. A scaling factor can also be applied to further improve the max or max* approximation [Vogt & Finger, 2000]. The correction term, usually adopted when decoding binary codes, can be omitted for double binary turbo codes [Berrou et al. 2001] with minor error rate performance degradation. The term b(e) in (8)
where s^S(e) and s^E(e) are the starting and ending states of e, α_k[s^S(e)] and β_k[s^E(e)] are the forward and backward state metrics associated with s^S(e) and s^E(e), respectively (see Fig. 5 (b)), and γ_k[e] is the branch metric associated with e. The π_k[c(e);I] term is computed as a weighted sum of the λ_k[c;I] produced by the soft demodulator as
where c_i(e) is one of the coded bits associated with e, n_c is the number of bits forming a coded symbol c, and π_k[c_u(e);I] in (8) is obtained as π_k[c(e);I] by considering only the systematic bits that correspond to the uncoded symbol u out of the n_c coded bits. The π_k[u(e);I] term is obtained by combining the input a-priori information λ_k(u;I) and, for a double binary code, can be written as in (14), where A and B are the two bits forming an uncoded symbol u. The CTC specified in the WIMAX standard is based on a double binary 8-state constituent CC, shown in Fig. 6, where each CC receives two uncoded bits (A, B) and produces four coded bits: two systematic bits (A, B) and two parity bits (Y, W). As a consequence, at each trellis step four transitions connect a starting state to four possible ending states. Due to the trellis symmetry, only 16 of the 32 possible branch metrics are required at each trellis step. As pointed out in [Muller et al. 2006], high throughput can be achieved by exploiting the trellis parallelism, namely by computing all the branch and state metrics concurrently. The 16 branch metrics are computed by a BMU that implements (12), as shown in Fig. 7. To reduce the latency of the SISO, decoding is usually based on a sliding-window approach [Benedetto et al., 1996]. As a consequence, at least two BMUs are required to compute the two recursions (forward and backward) of the BCJR algorithm. However, since β metrics need to be trained between successive windows, a further BMU is usually required. A solution based on inheriting the border metrics of each window [Abbasfar & Yao 2003] requires only two BMUs. Furthermore, this strategy reduces the SISO latency to the sliding window width W. The state metrics are updated according to (10) and (11) by two state metric processors, each made of a suitable number of processing elements (PE). As shown in Fig. 7, 8 PEs are required for the WIMAX CTC. It is worth pointing out that the constituent codes of the WIMAX CTC use the circulation state tail-biting strategy proposed in [Weiss et al. 2001], which ensures that the ending state of the last trellis step is equal to the starting state of the first trellis step. However, this technique requires estimating the circulation state at the decoder side. Since training operations to estimate the circulation state would increase the SISO latency, an effective alternative [Zhan et al. 2006] is to inherit these metrics from the previous iteration.
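The max* operator mentioned above, a max followed by a correction term, can be illustrated as follows. The sketch uses the standard Jacobian logarithm identity max*(a,b) = ln(e^a + e^b) = max(a,b) + ln(1 + e^-|a-b|); the function names are chosen here for illustration, and hardware implementations typically replace the logarithm with a small look-up table rather than computing it exactly.

```python
import math


def max_star(a, b):
    """Exact max*: max plus the correction term ln(1 + exp(-|a - b|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-log approximation: the correction term is dropped."""
    return max(a, b)


# The correction is largest (ln 2) when the two operands are equal
# and vanishes as they move apart.
gap_equal = max_star(0.0, 0.0) - max_log(0.0, 0.0)
gap_far = max_star(0.0, 10.0) - max_log(0.0, 10.0)
```

The small size of the correction for well separated operands is one way to see why dropping it, as the text notes for double binary codes, costs little in error rate performance.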
As in Viterbi decoder architectures, in CTC decoders the state metrics are often computed by means of the "wrapping" representation technique proposed in [Hekstra, 1989]. This solution requires a normalization stage, depicted in Fig. 7, when combining α, β and γ metrics to compute the extrinsic information as in (8). The last stage of the output processor, which computes the output extrinsic information, is a tree of max blocks for each component of the extrinsic information plus a few adders to implement (8). As highlighted in Fig. 7, this scheduling requires a buffer (BMU-MEM) to store the input LLRs that are used to compute the backward recursion. Since the output extrinsic information is computed during the backward recursion, forward recursion metrics are stored in a buffer (α-MEM). Further memory is required to implement the border metric inheritance: α-EXT-MEM, β-EXT-MEM and β-LOC-MEM. The throughput sustained by the CTC decoder, defined as the number of decoded bits over the time required for their computation, is
where f_clk is the clock frequency, N_T is the number of trellis steps, k=1 for a binary CTC, k=2 for a double binary CTC, 2I is the number of half iterations, and N_cyc^SISO and N_cyc^ID are the numbers of clock cycles required by one SISO and by the interleaving/deinterleaving structure, respectively. Optimized architectures [Masera et al., 1999]; [Bickerstaff et al., 2003]; [Kim & Park, 2008] are usually obtained with SP=1, whereas flexible architectures have higher SP values [Vogt & Wehn, 2008]; [Muller et al., 2009]. However, even with SP=1, a double binary turbo decoder architecture that achieves the throughput imposed by WIMAX with eight iterations (I=8) would require f_clk=600 MHz. A possible solution, which improves the throughput by a factor in the range [1.2, 1.9], is based on decoder level parallelism [Muller et al. 2006] and is usually referred to as "shuffling" [Zhang & Fossorier, 2005]. However, to further improve the throughput, a parallel decoder made of P SISOs working concurrently is required. As a consequence, a parallel architecture achieves a throughput
Thus, setting P=4, I=8 and SP=1, the WIMAX throughput is obtained with f_clk=150 MHz. A P-parallel CTC decoder is made of P SISOs connected to P memories that store the extrinsic information. However, in a parallel decoder collisions can occur during the scrambled half iteration, namely several SISOs may need to access the same memory during the same cycle. Since collisions increase ID_cyc^oh, several algorithmic approaches to design collision-free interleavers have been proposed [Giulietti et al. 2002]; [Kwak & Lee, 2002]; [Gnaedig et al., 2003]; [Tarable et al., 2004]. On the other hand, architectures to manage collisions in a parallel turbo decoder have also been proposed in the literature [Thul et al., 2002]; [Speziali & Zory, 2004]; [Martina et al. 2008-a]; [Martina et al., 2008-b]. In particular, [Martina et al. 2008-b] deals with the parallelization of the WIMAX CTC interleaver and avoids collisions by means of a throughput/parallelism scalable architecture that features ID_cyc^oh=0. Parallel architectures increase not only the throughput but also the complexity of the decoder, so some recent works aim at reducing the amount of memory required to implement SISO local buffers. In [Liu et al., 2007] and [Kim & Park, 2008], saturation of forward state metrics and quantization of border backward state metrics are proposed. Further studies reduce the extrinsic information bit width by using adaptive quantization [Singh et al., 2008], pseudo-floating point representation and bit level representation [Kim & Park, 2009].
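The two clock figures quoted for the CTC decoder (600 MHz with one SISO, 150 MHz with P=4) follow from a simple ratio. The sketch below is an approximation under stated assumptions: SP is interpreted as the number of clock cycles a SISO spends per trellis step, and SISO latency and interleaver overhead are neglected, so that one half iteration costs SP·N_T/P cycles. The function name is hypothetical.

```python
def ctc_min_clock_hz(target_bps, iterations, k=2, sp=1, parallel_sisos=1):
    """Clock needed so that k * P * f_clk / (2 * I * SP) reaches target_bps.

    Assumptions: SP is the number of cycles per trellis step, and the
    latency and interleaving overhead terms are negligible.
    """
    return target_bps * 2 * iterations * sp / (k * parallel_sisos)


single_siso = ctc_min_clock_hz(75e6, iterations=8)                      # 600 MHz
four_sisos = ctc_min_clock_hz(75e6, iterations=8, parallel_sisos=4)     # 150 MHz
```

The same function shows the cost of flexibility: with SP=2 and P=4 the required clock doubles to 300 MHz.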
LDPC code decoders
LDPC codes were originally introduced in 1962 by Gallager [Gallager, 1962] and rediscovered in 1996 by MacKay and Neal [MacKay, 1996]. Like turbo codes, they achieve near-optimum error correction performance and are decoded by means of high complexity iterative algorithms. An LDPC code is a linear block code defined by a C×B parity check matrix H, characterized by a low density of ones: B is the number of bits in the code (block length), while C is the number of parity checks. A one in a given cell of H indicates that the bit corresponding to the cell column is used in the calculation of the parity check associated with the row. A popular description of an LDPC code is the bipartite (or Tanner) graph, shown in Fig. 8 for a small example, where B variable nodes (VN) are connected to C check nodes (CN) through edges corresponding to the positions of the ones in H. LDPC codes are usually decoded by means of an iterative algorithm variously known as sum-product, belief propagation or message passing, reformulated in a version that processes logarithmic likelihood ratios instead of probabilities. In the first half iteration, variable nodes receive data from the adjacent check nodes and from the channel and use them to obtain updated information that is sent to the check nodes; in the second half, check nodes take the updated information received from the connected variable nodes and generate new messages to be sent back to the variable nodes. In message passing decoders, messages are exchanged along the edges of the Tanner graph, and computations are performed at the nodes. To avoid multiplications and divisions, the decoder usually works in the logarithmic domain. The message passing algorithm is described by the following equations, where k represents the current iteration, Q_ji is the message generated by VN j and directed to CN i, and R_ij is the message computed by CN i and sent to VN j.
C[j] is the whole set of incoming messages for VN j and R[i] is the whole set of the incoming messages for CN i.
Each variable node is initialized with the log-likelihood ratio (LLR) λ_j associated with the received bit. Next, messages are propagated from the variable nodes to the check nodes along the edges of the Tanner graph. At the first iteration, only the λ_j are delivered, while starting from the second iteration VNs sum up all the messages R_ij coming from the CNs and combine them with λ_j according to
The check node computes new check to variable messages as
where |R[j]| is the cardinality of the CN and Φ(x) is a non-linear function defined as
After a number of iterations that strongly depends on the addressed application and code rate (typically 5 to 40), variable nodes compute an overall estimation of the decoded bit in the form
where the sign of Λ_j can be understood as the hard decision on the decoded bit.
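The variable node and check node updates described above can be sketched in Python. This is the common textbook form of the log-domain sum-product algorithm, with Φ(x) = -log(tanh(x/2)); the sign convention, the clipping constants and the function names are assumptions made here and may differ from the chapter's own equations.

```python
import math


def phi(x):
    """Phi(x) = -log(tanh(x / 2)); the input is clipped (assumed bounds)
    to avoid log(0) for tiny x and loss of precision for large x."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))


def check_node_update(q_in):
    """Messages R_ij sent by one check node, given its incoming Q_ji."""
    r_out = []
    for j in range(len(q_in)):
        others = q_in[:j] + q_in[j + 1:]      # exclude the destination VN
        sign = 1.0
        for q in others:
            if q < 0:
                sign = -sign
        magnitude = phi(sum(phi(abs(q)) for q in others))
        r_out.append(sign * magnitude)
    return r_out


def variable_node_update(channel_llr, r_in):
    """Messages Q_ji sent by one variable node, plus its overall estimate."""
    total = channel_llr + sum(r_in)
    q_out = [total - r for r in r_in]         # exclude the destination CN
    return q_out, total


r_messages = check_node_update([2.0, -1.0, 3.0])
q_messages, estimate = variable_node_update(0.5, [1.0, -2.0])
hard_decision = 0 if estimate >= 0 else 1
```

The cost of evaluating `phi` for every edge is what motivates the reduced complexity estimations and Min-Sum style simplifications discussed in the text.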
A large implementation complexity is associated with (19), which is simplified in different ways. First of all, the function Φ(x) can be obtained by means of reduced complexity estimations [Masera et al., 2005]. Moreover, sub-optimal, low complexity algorithms have been successfully proposed to simplify (19), for example the normalized Min-Sum algorithm [Chen et al., 2005], where only the two smallest magnitudes are used. A further change is usually applied to the scheduling of variable and check nodes in order to improve communications performance. In the two-phase scheduling, variable and check nodes are updated in two separate phases. In contrast, the turbo decoding message passing (TDMP) [Mansour & Shanbhag, 2003], also known as layered or shuffled decoding, allows for overlapped update operations: messages calculated by a subset of check nodes are immediately used to update variable nodes. This scheduling has been shown to reduce the number of iterations by up to 50% at a fixed communications performance. The required number of functional units in a decoder can be estimated from the processing power P_c [Gouillod et al., 2007], which can be evaluated on the basis of the rate R_c of the code, the number K of information bits transmitted per codeword, the block size N=K/R_c, the required information throughput D, the operating clock frequency f_clk, the maximum number of iterations i_MAX and the total number of edges to be processed per iteration. This relation is expressed as
As two messages are associated with each edge (one sent from the CN to the VN and one in the opposite direction), 2P_c gives the number of messages that must be processed concurrently at each decoding iteration in order to achieve the target throughput D. Equation (22) does not consider the message exchange overhead: it assumes that all messages dispatched during a cycle are delivered within the same cycle. The P_c value must therefore be taken as a lower bound, and the actual degree of parallelism strongly depends on both the structure of the H matrix [Dinoi et al., 2006] and the interconnect architecture adopted among the processing units [Masera et al., 2007]. In fact, most of the implementation concerns come from the communication structure that must be allocated to support message passing from bit to check nodes and vice versa. Several hardware realizations proposed in the literature focus on how to pass messages efficiently between the two types of processing units. Three approaches can be followed in the high level organization of the decoder, leading to three kinds of architectures.
- Serial architectures: bit and check processors are allocated as single instances, each serving multiple nodes sequentially; messages are exchanged by means of a memory.
- Fully parallel architectures: processing units are allocated for each single bit and check node and all messages are passed in parallel on dedicated routes.
- Partially parallel architectures: several processing units work in parallel, serving all bit and check nodes within a number of cycles; suitable organization and hardware support are required to exchange messages.
For most codes and applications, the first approach results in slow implementations, while the second one has an excessive cost. As a result the only general viable solution is the third partially parallel approach, which on the other hand introduces the collision problem, already known in the implementation of parallel turbo decoders. Two main approaches have been proposed to deal with collisions:
- To design collision-free codes [Mansour & Shanbhag, 2003], [Hocevar, 2003].
- To design decoder architectures able to avoid or at least mitigate collision effects [Kienle et al., 2003], [Tarable et al., 2004].
Even if the first approach has proven to be effective, it significantly limits the supported code classes. The second approach, on the other hand, is well suited for flexible and general architectures. An even more challenging task is the design of LDPC decoders that are flexible in terms of supported block sizes and code rates [Masera et al., 2007]. In partially parallel structures, permutation networks are used to establish the correct connections between functional units. However, structured LDPC codes, such as those specified in WIMAX, allow permutation networks to be replaced by low complexity barrel shifters [Boutillon et al., 2000]; [Mansour & Shanbhag, 2003]. Early termination schemes can be adopted to improve the decoding efficiency by dynamically adjusting the number of iterations according to the SNR. The simplest approach stores the decoding decisions and compares them across two consecutive iterations: if no changes are detected, the decoding is terminated, otherwise it continues up to a maximum number of iterations. More sophisticated iteration control schemes are able to reduce the mean number of iterations, thus saving both latency and energy [Kienle & Wehn, 2005]; [Shin et al., 2007].
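The simplest early termination rule described above, stopping when the hard decisions do not change across two consecutive iterations, can be sketched as follows. The callback `run_iteration` is a hypothetical stand-in for one full decoder iteration and is not part of any specific decoder; the mapping of a non-negative soft value to bit 0 is also an assumed convention.

```python
def decode_with_early_stop(run_iteration, state, max_iterations):
    """Iterate until hard decisions repeat or max_iterations is reached.

    run_iteration(state) -> (new_state, soft_outputs) is a hypothetical
    callback that performs one decoding iteration.
    Returns (hard_decisions, iterations_used).
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        state, soft = run_iteration(state)
        hard = [0 if value >= 0 else 1 for value in soft]
        if hard == previous:          # no change: terminate early
            return hard, iteration
        previous = hard
    return previous, max_iterations


def toy_iteration(state):
    """Toy stand-in: every soft value drifts upwards by 1 per iteration."""
    new_state = [value + 1.0 for value in state]
    return new_state, new_state


bits, used = decode_with_early_stop(toy_iteration, [-1.5, 0.5], max_iterations=10)
```

In this toy run the decisions settle after the second iteration and the rule stops at the third, well before the limit of ten.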
Case study: complete WIMAX CTC decoder design
The WIMAX CTC decoder is made of three main blocks: symbol deselection (SD), subblock deinterleaver and CTC decoder, as highlighted in Fig. 9, where N represents the number of couples included in a data frame. The SD, subblock deinterleaver and CTC decoder blocks are connected through memory buffers to guarantee that the non-iterative part of the decoder (namely SD and subblock deinterleaver) and the decoding loop work simultaneously on consecutive data frames. Since the maximum decoder throughput is about 75 Mb/s and the native CTC rate is 1/3 (two uncoded bits produce six coded bits), the maximum throughput at the input of the decoding loop can rise up to 225 million LLRs per second. The same throughput has to be sustained by the subblock deinterleaver, whereas an even higher throughput has to be sustained by the SD unit in case of repetition.
Symbol deselection
Depending on the amount of data sent by the encoder (puncturing or repetition), the throughput sustained by the symbol deselection (SD) can rise up to 900 million LLRs per second (repetition 4). When the encoder performs repetition, the same symbol is sent more than once. Thus, the decoder combines the LLRs that refer to the same symbol to improve the reliability of that symbol. As shown in Fig. 9, this can be achieved by partitioning the symbol deselection input buffer into four memories, each containing up to 6N LLRs. Since the symbol deselection architecture can read up to four LLRs per clock cycle, it reduces the incoming throughput to 225 million LLRs per second. However, the symbol deselection has to compute the starting location and the number of LLRs to be written into the output buffer. The number of LLRs and the starting location are obtained as in (23) and (24) respectively, where N_SCHk, m_k and SPID_k are parameters specified by the WIMAX standard for the k-th subpacket when HARQ is enabled: N_SCHk is the number of concatenated slots, m_k is the modulation order and SPID_k is the subpacket ID. The efficient implementation of (25) A block scheme of the architecture employed to compute F_k and L_k is depicted in Fig. 10 (a). Furthermore, in order to support the puncturing mode, the output memory locations corresponding to unsent bits must be set to zero. To ease the SD architecture implementation, all the output memory locations are set to zero while L_k and F_k are computed. As a consequence, about two clock cycles per sample are required to complete the symbol deselection, namely 6N LLRs are output in 12N clock cycles, so that the symbol deselection throughput can be estimated as
$$T_{SD} = \frac{6N}{12N} f_{clk} = \frac{f_{clk}}{2} \qquad (27)$$
As can be observed, sustaining 225 million LLRs per second would require a clock frequency of 450 MHz. To overcome this problem, in addition to partitioning the input buffer into four memories, we increase the memory parallelism so that each memory location contains p LLRs. Thus, we can rewrite (27) as

$$T_{SD} = p \cdot \frac{6N}{12N} f_{clk} = \frac{p \, f_{clk}}{2} \qquad (28)$$

and, by setting p to a conservative value such as p=4, the SD architecture simultaneously processes up to sixteen LLRs with $f_{clk}$ = 113 MHz.
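The behaviour of the symbol deselection and its clock budget can be illustrated with a minimal software sketch. It is not the hardware architecture of Fig. 10 (a): the starting location and the number of LLRs are taken as inputs rather than computed, repeated LLRs are combined by plain summation, and the write address is assumed to wrap around modulo 6N. All function names are hypothetical.

```python
def symbol_deselect(llrs, n_couples, start, count):
    """Write `count` received LLRs into a zeroed buffer of 6N entries.

    Assumptions (not taken from the source): repeated LLRs are combined
    by addition, and the write address wraps around modulo 6N.
    """
    size = 6 * n_couples
    out = [0.0] * size                       # punctured positions stay at zero
    for i in range(count):
        out[(start + i) % size] += llrs[i]   # repetition: same symbol summed
    return out


def sd_clock_hz(target_llr_rate, p=1, cycles_per_llr=2):
    """Clock needed when each memory word holds p LLRs (two cycles per word)."""
    return target_llr_rate * cycles_per_llr / p


print(sd_clock_hz(225e6) / 1e6)        # single LLR per location
print(sd_clock_hz(225e6, p=4) / 1e6)   # four LLRs per location
```

With p=4 the required clock is 112.5 MHz, which the text rounds to 113 MHz.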
Subblock deinterleaver
The received LLRs belong to six possible subblocks, depending on the coded bits they refer to (A, B, $Y_1$, $Y_2$, $W_1$ and $W_2$). Each subblock is deinterleaved with the addresses generated by Algorithm 1, which produces tentative addresses $T_k$ and keeps only those satisfying $T_k < N$. As a consequence, the number of tentative addresses generated, $N_M$, can be greater than N. Exhaustive simulations, performed on the values of N specified by the standard, show that the worst case is $N_M$=191, which occurs with N=144. Since 191/144=1.326, a conservative approximation is $N_M$=4N/3. The whole subblock deinterleaver architecture is obtained with one single address generator implementing Algorithm 1 to simultaneously write one LLR from each of the six subblock memories. In particular, as imposed by the WIMAX standard, the interleaved LLRs belonging to the A and B subblocks are stored separately, whereas the interleaved LLRs belonging to $Y_1$ and $Y_2$ are stored as a symbol-by-symbol multiplexed sequence, creating a "macro-subblock" made of 2N LLRs. Similarly, a macro-subblock made of 2N LLRs is generated by storing a symbol-by-symbol multiplexed sequence of the interleaved $W_1$ and $W_2$ subblocks. Since all the subblocks can be processed simultaneously, this architecture deinterleaves six LLRs per clock cycle. As a consequence, the subblock deinterleaver sustains a throughput

$$T_{SDI} = \frac{6N}{N_M} f_{clk} \simeq \frac{6N}{4N/3} f_{clk} = 4.5 \, f_{clk} \qquad (29)$$
Thus, a throughput of 225 million LLRs per second is sustained using $f_{clk}$ = 50 MHz. To implement lines 4 and 5 of Algorithm 1, three steps are required: the calculation of $k \bmod J$ and $\lfloor k/J \rfloor$, the calculation of $2^m (k \bmod J)$ and $BRO_m(\lfloor k/J \rfloor)$, and the generation of $T_k$ while checking $T_k < N$. It is worth pointing out that $k \bmod J$ can be efficiently implemented as an up-counter followed by a mod J block. Moreover, each time the mod J block detects k=J, a second counter is incremented: the final value in the second counter is $\lfloor k/J \rfloor$. Since $m \in [3, 10]$, the $2^m (k \bmod J)$ term is implemented as a programmable shifter in the range [0, 7] followed by a hardwired three-position left shifter. The $BRO_m(\lfloor k/J \rfloor)$ term is obtained by multiplexing eight hardwired bit-reversal networks. Finally, a valid $T_k$ address is obtained with an adder and is validated by a comparator. The address generation architecture is shown in Fig. 10 (b).
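The address generation steps described above can be sketched in software as follows. This is a behavioural model, not the counter-and-shifter hardware of Fig. 10 (b). The function names are hypothetical, and the parameters m=6 and J=3 used for N=144 are an assumption recalled from the standard's parameter table and should be checked against it; with these values the sketch reproduces the worst case of 191 tentative addresses quoted in the text.

```python
def bro(y, m):
    """Bit-reversed order of the m-bit value y."""
    return int(format(y, f"0{m}b")[::-1], 2)


def subblock_addresses(n, m, j):
    """Generate N valid addresses T_k = 2^m (k mod J) + BRO_m(k // J).

    Returns the list of valid addresses and the number of tentative
    addresses that had to be generated (N_M in the text).
    """
    addresses = []
    k = 0
    while len(addresses) < n:
        t_k = (1 << m) * (k % j) + bro(k // j, m)
        if t_k < n:              # tentative addresses >= N are discarded
            addresses.append(t_k)
        k += 1
    return addresses, k


# m=6 and J=3 for N=144 are assumed values, to be checked against the standard.
addrs, n_m = subblock_addresses(144, m=6, j=3)
print(n_m, 6 * 144 / n_m)   # tentative addresses, LLRs per clock cycle
```

With six LLRs handled per tentative address, the sketch yields slightly more than 4.5 LLRs per clock cycle for N=144, consistent with the conservative $N_M$=4N/3 approximation and the 50 MHz figure.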
CTC decoder
As detailed in section 2.3, a parallel decoder architecture is required to sustain the throughput demanded by the WIMAX standard. To that purpose we set SP=1, I=8 and $f_{clk}$ = 200 MHz; then, from (17), we analyze the throughput as a function of N for W=32. As shown in Fig. 11, only P=4 achieves the target throughput (horizontal solid line) for N≥480. Moreover, the window width affects both the decoder throughput and the depth of the SISO local buffers, so a proper W value must be selected for each frame size. In particular, if N/(P·W) is an integer, SISO synchronization is simplified. However, the choice of P should minimize collisions in memory access. Exhaustive simulations show that collisions occur for P=2 and P=4 only with N=108. As a consequence, we select P as a function of N both to obtain a throughput that increases monotonically with N and to avoid collisions. It is worth pointing out that, when collisions are avoided, the resulting parallel interleaver is a circular shifting interleaver: the address generation is simplified, with all SISOs simultaneously accessing the same location of different memories. Denoting by $idx_0^t$ the memory accessed by SISO-0 at time t during a scrambled half iteration, the memory concurrently accessed by SISO-k is $idx_k^t = (idx_0^t \pm k) \bmod P$. Thus, the parallel CTC interleaver-deinterleaver system is obtained as a cascaded two-stage architecture (see Fig. 12). The first stage efficiently implements the WIMAX interleaver algorithm, whereas the second one extracts the common memory address $adx^t$ and the memory identifiers $idx_k^t$ from the scrambled address i. The CTC interleaver algorithm specified in the WIMAX standard is structured in two steps. The first step switches the LLRs referring to A and B that are stored at odd addresses. The second step provides the interleaved address i of the j-th couple as

$$i = (P_0 \cdot j + P'_j) \bmod N \qquad (30)$$
where $P_0$ and $P'_j$ are constants that depend only on N and are specified by the standard. It is worth pointing out that the two steps can be swapped; as a consequence, the first step can be performed on-the-fly, avoiding the use of an intermediate buffer to store switched LLRs. A simple architecture to implement (30) can be derived by rewriting (30) as

$$i = \left( (P_0 \cdot j) \bmod N + (P'_j \bmod N) \right) \bmod N \qquad (31)$$

A small Look-Up Table (LUT) is employed to store the $P_0 \bmod N$ and $P'_j \bmod N$ terms; then (31) is implemented in two parts, as depicted in Fig. 12. The first part accumulates $P_0$ to implement the $P_0 \cdot j$ term, and the mod N block produces the correct modulo N result. The second part employs the two least significant bits of a counter (j−cnt) to select the proper $P'_j \bmod N$ value, which is added to the $(P_0 \cdot j) \bmod N$ term. A further modulo N operation is performed at the output. The straightforward implementation of (33) needs to calculate N/P and to allocate P−2 multipliers, P−1 subtracters, a P-way multiplexer and a small amount of logic for selecting the proper $adx^t$ value. The N/P division can be simplified by choosing the possible P values as powers of two. Thus, we obtain a CTC decoder architecture that exploits throughput/parallelism scalability to avoid collisions, namely we employ P=1 when N≤180, P=2 when 192≤N≤240 and P=4 when 480≤N≤2400. Moreover, as can be inferred from Fig. 12, multiplications are avoided by resorting to simple shift operations ($x \gg i = x/2^i$). The sign of the subtractions (dashed lines in Fig. 12) allows not only the selection of the proper $adx^t$ but also the identification of $idx_0^t$. Then, with P−1 modulo P adders, the other $idx_k^t$ values are straightforwardly generated. As can be observed, choosing P as a power of two reduces the modulo P adders to simpler binary adders. The actual throughput sustained by the described throughput/parallelism scalable architecture is represented by the bold line in Fig. 11. The global architecture of the designed parallel SISO is given in Fig. 13, where each SISO contains the processors devoted to computing the different metrics required by the BCJR algorithm, as detailed in section 2.3. A simple network is used to properly connect the SISOs according to the current value of P by setting the signal last_SISO. Furthermore, one address crossbar switch (radx-switch) is used to implement the reading operation, a LIFO stores the addresses and makes them available for the writing phase, and two data crossbar switches (rdata-switch/wdata-switch) are used to send (receive) the data to (from) the memory (EI-MEM) according to the parallel interleaver $idx_k^t$ values.
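The two-stage interleaver described above can be illustrated with a behavioural sketch, under several assumptions that are not stated in the text. The four $P'_j$ values are assumed to be 1, 1+N/2+P1, 1+P2 and 1+N/2+P3 for j mod 4 = 0, 1, 2, 3. Each mod N block is modelled as a single conditional subtraction. The N couples are assumed to be split contiguously over P memories of depth N/P. The parameter set (N=24, P0=5, P1=P2=P3=0) is an illustrative value that should be checked against the standard. All function names are hypothetical.

```python
def p_prime(n, p1, p2, p3):
    """Assumed P'_j values, selected by j mod 4 (to be checked against the standard)."""
    return (1, 1 + n // 2 + p1, 1 + p2, 1 + n // 2 + p3)


def interleave_direct(j, n, p0, p1, p2, p3):
    """Interleaved address in the direct form (P0*j + P'_j) mod N."""
    return (p0 * j + p_prime(n, p1, p2, p3)[j % 4]) % n


def interleave_accumulated(n, p0, p1, p2, p3):
    """Same addresses from an accumulator and a small LUT, with no multiplier."""
    lut = [x % n for x in p_prime(n, p1, p2, p3)]   # P'_j mod N terms
    step = p0 % n                                   # P0 mod N term
    acc, out = 0, []
    for j in range(n):
        i = acc + lut[j & 3]      # two LSBs of the counter select the LUT entry
        if i >= n:                # output modulo N as one conditional subtraction
            i -= n
        out.append(i)
        acc += step               # accumulate P0 instead of multiplying by j
        if acc >= n:
            acc -= n
    return out


def split_address(i, n, p):
    """Memory identifier and local address, assuming a contiguous split."""
    depth = n // p
    return i // depth, i % depth


def is_collision_free(addresses, p):
    """True if, at every time step, the P SISOs access P distinct memories."""
    n = len(addresses)
    depth = n // p
    for t in range(depth):
        banks = {split_address(addresses[t + k * depth], n, p)[0] for k in range(p)}
        if len(banks) != p:
            return False
    return True


addrs = interleave_accumulated(24, 5, 0, 0, 0)
print(addrs[:4], is_collision_free(addrs, 2))
```

For this parameter set, the two SISOs of a P=2 decoder always read the same local address in different memories, which is the circular shifting behaviour exploited by the architecture.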
Acknowledgements
This work is partially supported by the WIMAGIC project funded by the European Community.
