This chapter describes the field programmable gate array (FPGA) implementation of a turbo decoder for the 3GPP long-term evolution (LTE) standard and for IEEE 802.16-based WiMAX systems. We first present the serial decoding architectures for the two systems. The same approach is used for both, although the WiMAX scheme implements a duo-binary code, while the LTE scheme uses a binary code. The proposed LTE serial decoding scheme is designed so that it can be readily transformed into a parallel one. Then, considering the high throughput requirements of LTE, a parallel decoding solution is proposed. With a parallelization of $N = 2^p$ levels, the parallel approach reduces the decoding latency N times compared with the serial one. The parallel approach suffers a small degradation in decoding performance, but we propose a solution that almost eliminates it by performing an overlapped data block split. Moreover, considering the native properties of the LTE quadratic permutation polynomial (QPP) interleaver, we propose a simplified parallel decoder architecture. The novelty of this scheme is that only one interleaver module is used, regardless of the value of N, by introducing an even-odd merge sorting network. We propose for it a recursive approach that uses only comparators and subtractors.
Introduction
Channel coding theory has been intensively studied during the last decades, but the interest in this topic increased even more following the pioneering work of Berrou et al. on turbo codes [1] [2] [3].
Early on, turbo codes showed excellent decoding performance, so they were included in many standards as recommendations. They became a more appealing solution once the processing capacity of field programmable gate arrays (FPGAs) and digital signal processors (DSPs) increased. Their implementation complexity was no longer prohibitive, which allowed them to become mandatory.
In this context, the Third-Generation Partnership Project (3GPP) organization adopted these novel coding techniques early. Turbo codes were introduced into the standard by the first version of the Universal Mobile Telecommunications System (UMTS) technology (in 1999). The next UMTS releases (the subsequent high-speed packet access) contributed new and interesting features, while turbo coding remained unchanged. Several modifications were then introduced by the long-term evolution (LTE) standard. Even if they were not significant in volume, they were important in terms of concept. In this framework, the 3GPP proposed a new interleaver scheme for LTE, while maintaining exactly the same coding structure as in UMTS. Turbo codes were also introduced by the Institute of Electrical and Electronics Engineers (IEEE) in the 802.16 standards, which form the basis of WiMAX systems.
In Ref. [4], a UMTS-dedicated binary turbo decoding scheme is developed, whereas for WiMAX systems a similar duo-binary architecture is presented in Refs. [5] and [6]. Thanks to the new LTE/LTE-advanced (LTE-A) interleaver, the decoding performance is improved compared with that of the UMTS standard. In addition, the new LTE interleaver has native properties suited to a parallel decoding approach inside the algorithm, thus taking advantage of the main idea behind turbo decoders (i.e., exchanging the extrinsic values between the two decoding units). In Ref. [7], a serial decoding scheme implemented on FPGA is presented. However, parallelization is still needed when high throughput is required, as in the particular case of LTE systems using diversity techniques.
In the past years, many interesting parallel decoder schemes have been studied. The obtained results are evaluated in two directions. The first is the decoding performance degradation of the parallel solution with respect to the serial one. The second is the amount of hardware resources occupied by the parallel decoder implementation. In Ref. [8], a first group of parallel decoding solutions is presented. It is based on the classical maximum a posteriori (MAP) algorithm. This method passes through the trellis twice, the first time to compute the forward state metrics (FSM) and the second time to obtain the backward state metrics (BSM) and, simultaneously, the log likelihood ratios (LLR). Following this approach, several methods were developed in order to reduce the theoretical latency of the decoding process, which is 2K clock periods for each semi-iteration (where K is the data block length).
In Refs. [9] and [10], a second set of parallel architectures is described, which take advantage of the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. In these works, efficient hardware implementations of the QPP interleaver are proposed. However, the number of interleavers used in the developed architectures is still equal to the parallelization factor N.
In Ref. [11], a third approach was reported, which consists of using a folded memory. All the data needed for parallel processing are stored at the same time. On the other hand, the main challenge of this kind of implementation is to correctly distribute the data to each decoding unit, once a memory location containing all N values is read. In order to solve this issue, an architecture based on two Batcher sorting networks was proposed. However, even in this approach, N interleavers are still needed to generate all the interleaved addresses that feed the master network.
In this chapter, we present optimized implementations of the serial architectures for the WiMAX and LTE turbo decoding schemes. Then, for LTE systems, we describe a parallel decoding architecture introduced in Refs. [12] and [13], which also relies on a folded memory-based approach. Nevertheless, the main difference compared to the existing solutions presented above is that our proposed approach includes only one interleaver. Additionally, with an even-odd merge sorting unit [14, 15], the parallel architecture maintains the same structure as the serial one, the only difference being that the soft-input soft-output (SISO) decoding unit is included N times in the scheme. The number and dimensions of the block memories remain unchanged between the two proposed decoding structures. In terms of decoding performance, the results obtained for the serial and parallel approaches are almost the same. We propose an overlapped data block split that reduces the small degradation introduced by the parallel architecture.
Finally, we present throughput and speed results obtained when targeting an XC5VFX70T [16] chip on a Xilinx ML507 [17] board. Moreover, we provide simulation curves for the three considered cases, i.e., serial decoding, parallel decoding and parallel decoding with overlap.
2. The coding scheme
2.1. WiMAX systems
Section 8.4 of the 802.16 standard [18] presents the coding scheme on which the proposed decoder is based. Figure 1 shows the duo-binary encoder. The native coding rate is 1/3. To obtain other coding rates, a puncturing block must be used; accordingly, a depuncturing block must be added to the receiver architecture. Let us define the following parameters: the coding rate $R$; the block dimension $K$ (in pairs of bits, i.e., dibits), which is computed independently of the coding rate, as a function of the uncoded block size; the number of iterations $L$; the latency $Latency$ (in clock periods); the information bit rate $R_b$ [Mbps]; and the system clock frequency $F_{clk}$ [MHz].
As mentioned in Ref. [6], the main problem of a convolutional turbo code (CTC) decoder implementation is the amount of required hardware resources. Moreover, in order to reach the targeted high data rate, the system clock has to be fast. Equation (1) presents the decoding throughput.
For a fixed latency algorithm, according to Eq. (1), the output throughput is improved when achieving a higher clock frequency. Another way is to reduce latency using a parallel architecture; however, this increases the occupied area and may lead to a smaller clock frequency due to longer routes. Moreover, another direct constraint is the significant memory needed for storing data. This issue also affects the frequency, since a large number of used memory blocks leads to a large resource spread on chip and, obviously, longer routes.
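The trade-off between clock frequency and latency can be illustrated with a small numerical sketch. It assumes that one block of K dibits (that is, 2K information bits) leaves the decoder every Latency clock periods, so that the throughput scales as 2·K·F_clk / Latency. This relation, the function name `throughput_mbps` and all the numbers below are assumptions made for illustration only; they are not taken from the implementation results.

```python
def throughput_mbps(k_dibits: int, f_clk_mhz: float, latency_cycles: int) -> float:
    """Information bit rate in Mbps under an assumed model:
    2*K information bits are delivered every `latency_cycles` clock periods."""
    return 2 * k_dibits * f_clk_mhz / latency_cycles


# Hypothetical operating points, chosen only to show the scaling.
base = throughput_mbps(k_dibits=240, f_clk_mhz=100.0, latency_cycles=4000)
faster_clock = throughput_mbps(240, 125.0, 4000)  # higher clock, same latency
half_latency = throughput_mbps(240, 100.0, 2000)  # same clock, halved latency

print(base, faster_clock, half_latency)
```

Under this model, halving the latency doubles the throughput only if the clock frequency is preserved, which is exactly the condition that a larger parallel design may fail to meet.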
Taking into account the previously mentioned aspects, we can conclude that all the parameters presented above are related, so that a global optimization is not possible. Consequently, we have chosen to balance each direction in order to meet throughput requirements.
LTE systems
A classic turbo coding scheme is presented in the 3GPP LTE specification, including two constituent encoders and one interleaver module (Figure 2). The data block $C_k$ can be observed at the input of the LTE turbo encoder. The K bits of this input data block are transferred to the output, as systematic bits, in the stream $X_k$. At the same time, the first constituent encoder processes the input data block, producing the parity bits $Z_k$, whereas the second constituent encoder processes the interleaved data block $C'_k$, producing the parity bits $Z'_k$. Combining the systematic bits and the two streams of parity bits, we obtain the following sequence (at the output of the encoder):
In order to drive the constituent encoders back to the initial state (at the end of the coding process), the switches from Figure 2 are moved from position A to position B. Since the final states of the two constituent encoders are not the same (different input data blocks produce different final states), this switching procedure generates tail bits for each encoder. These tail bits are sent together with the systematic and parity bits, resulting in the following final sequence:
As previously mentioned and discussed in Ref. [7], the LTE turbo coding scheme introduces a new interleaving structure. Thus, the input sequence is rearranged at the output using:
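The encoding and termination steps described above can be sketched for a single constituent encoder. The generator polynomials are not stated in this chapter; the sketch assumes the feedback polynomial 1 + D^2 + D^3 and the parity polynomial 1 + D + D^3 of the LTE specification. The function name `rsc_encode` and its return layout are hypothetical.

```python
def rsc_encode(bits):
    """Encode `bits` with an 8-state recursive systematic convolutional code.

    Assumed polynomials: feedback g0(D) = 1 + D^2 + D^3,
    parity g1(D) = 1 + D + D^3. Returns the parity bits, the three
    systematic tail bits, the three parity tail bits and the final state.
    """
    s1 = s2 = s3 = 0                      # shift register, s1 is the newest bit
    parity = []
    for c in bits:                        # switch in position A: normal encoding
        a = c ^ s2 ^ s3                   # bit fed into the register
        parity.append(a ^ s1 ^ s3)
        s1, s2, s3 = a, s1, s2
    tail_sys, tail_par = [], []
    for _ in range(3):                    # switch in position B: termination
        c = s2 ^ s3                       # input equals the feedback, so a = 0
        tail_sys.append(c)
        tail_par.append(s1 ^ s3)
        s1, s2, s3 = 0, s1, s2
    return parity, tail_sys, tail_par, (s1, s2, s3)


parity, tail_sys, tail_par, final_state = rsc_encode([1, 0, 1, 1, 0, 0, 1, 0])
print(parity, tail_sys, tail_par, final_state)
```

Because the tail bits depend on the register content at the end of the block, the two constituent encoders (one fed with the natural-order block, the other with the interleaved block) generally produce different tail bits, as noted in the text.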
Field -Programmable Gate Array
where the interleaving function $\pi$ applied over the output index $i$ is defined as

$$\pi(i) = (f_1 \cdot i + f_2 \cdot i^2) \bmod K. \quad (3)$$
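A minimal sketch of the QPP permutation, and of how interleaving and deinterleaving use the same index sequence, is given below. The parameter pair f1 = 3, f2 = 10 for K = 40 is assumed here to be the LTE pair for this block length; it is not given in this chapter. The function name `qpp_interleave` is hypothetical.

```python
def qpp_interleave(K: int, f1: int, f2: int) -> list:
    """Return pi(i) = (f1*i + f2*i^2) mod K for i = 0 .. K-1."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


K, f1, f2 = 40, 3, 10                    # assumed parameters for K = 40
pi = qpp_interleave(K, f1, f2)

data = list(range(100, 100 + K))         # arbitrary payload
interleaved = [data[p] for p in pi]      # output position i takes input pi(i)

# Deinterleaving writes at the interleaved addresses and reads in natural order.
deinterleaved = [None] * K
for i, p in enumerate(pi):
    deinterleaved[p] = interleaved[i]

print(pi[:8])
print(deinterleaved == data)
```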
The input block length $K$ and the parameters $f_1$ and $f_2$ are provided in
3. The decoding algorithm
WiMAX systems
The decoding architecture consists of two decoding units called constituent decoders. Each such unit receives systematic bits (in natural order or interleaved) and parity bits, as shown in Figure 1 .
The block diagram implements a maximum logarithmic maximum a posteriori (Max-Log-MAP) algorithm. For binary turbo codes, the decoder represents each binary symbol, in the log likelihood ratio (LLR) space, as a single likelihood ratio. For duo-binary turbo codes, however, the decoding unit requires three likelihood ratios in the same space. If we consider the duo-binary pair $A_k$ and $B_k$, the LLRs may be computed as:
where (a, b) is (0, 1), (1, 0), or (1, 1). The ratio set is updated by each decoding unit (constituent decoder) for each input pair, using the corresponding LLRs and parity bits, also expressed as LLRs. Then, the output LLRs minus the input LLRs provide the extrinsic values. The trellis of a duo-binary code contains eight states, each state having four inputs and four outputs, as presented in Figure 3. Using the LLRs of the systematic and parity pairs, the metric $\gamma_k(S_i \to S_j)$ is computed for each branch, i.e.,
The constituent decoder (Figure 4) performs the corresponding processing forward and backward over the trellis. When moving forward, the decoder computes the unnormalized metric $\alpha'_{k+1}(S_j)$ from the normalized metrics $\alpha_k(S_i)$ associated with the states $S_i$, using

$$\alpha'_{k+1}(S_j) = \max_{S_i \to S_j}\left[\alpha_k(S_i) + \gamma_k(S_i \to S_j)\right], \quad (6)$$

where the operator "maximum" is executed over all four branches entering the state $S_j$ at time stamp k + 1. Once the metrics for all states are updated at time stamp k + 1, the decoder normalizes them with respect to the value of state $S_0$. Analogously to the forward processing, when moving backward, the decoder computes:

$$\beta'_k(S_i) = \max_{S_i \to S_j}\left[\beta_{k+1}(S_j) + \gamma_k(S_i \to S_j)\right], \quad (7)$$
where the operator "maximum" and the normalization method are similar to Eq. (6).
All the forward and backward metrics are initialized with null values at all states. Once the new values are computed and stored, the decoding unit executes the second step of the decoding procedure, i.e., the computation of the LLRs as in Eq. (4). The decoding unit starts by computing the likelihood ratio for each branch
and continues with the value $t_k(a, b) = \max(\cdot)$,
where the operator "maximum" is computed over all eight branches generated by the pair (a, b). At the end, the output LLR is computed as
The decoding procedure is executed for a predetermined number of iterations or until a convergence criterion is reached. Then, a final decision is taken on the bits. This is achieved by computing, for each bit of the pair $(A_k, B_k)$, the corresponding LLR:
where $\Lambda^o_{0,0}(A_k, B_k) = 0$. Finally, by comparing each LLR with a null threshold, i.e., looking at the sign, the hard decision is made.
LTE systems
The decoding architecture for the LTE systems is presented in Figure 5. The two decoding units, called recursive systematic convolutional (RSC) decoders, theoretically use the MAP algorithm. The MAP solution, a classical one, ensures the best decoding performance. Unfortunately, it is also characterized by an increased implementation complexity and it may include variables with a large dynamic range. These are the reasons why the classical solution with the MAP algorithm is used only as a reference for the expected decoding performance. When it comes to real implementations, new suboptimal algorithms have been studied: Logarithmic MAP (Log MAP) [20], Max Log MAP, Constant Log MAP (Const Log MAP) [21] and Linear Log MAP (Lin Log MAP) [22].
For the LTE systems, we consider a decoding architecture based on the Max Log MAP algorithm. This suboptimal algorithm overcomes the problems of implementation complexity and dynamic range at the price of a lower decoding performance when compared with the MAP algorithm. However, this degradation can be kept within accepted limits. Starting from the Jacobi logarithm, only the first term is used by the Max Log MAP algorithm, i.e.,

$$\max{}^{*}(x, y) = \ln(e^{x} + e^{y}) = \max(x, y) + \ln\left(1 + e^{-|y - x|}\right) \approx \max(x, y).$$
The trellis diagram for the turbo decoding architecture of the LTE systems contains eight states, as presented in Figure 6. Each state of the diagram has two inputs and two outputs. The branch metric between the states $S_i$ and $S_j$ is
Figure 6. LTE turbo coder trellis.
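The size of the term dropped by the Max Log MAP approximation can be checked numerically. The sketch below compares the exact Jacobi logarithm with the plain maximum; the function names `max_star` and `max_log` are hypothetical.

```python
import math


def max_star(x: float, y: float) -> float:
    """Exact Jacobi logarithm: ln(e^x + e^y), written in its stable form."""
    return max(x, y) + math.log1p(math.exp(-abs(y - x)))


def max_log(x: float, y: float) -> float:
    """Max Log MAP approximation: the correction term is dropped."""
    return max(x, y)


for x, y in [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0), (10.0, -10.0)]:
    error = max_star(x, y) - max_log(x, y)
    print(x, y, round(error, 6))
```

The approximation error is the correction term itself: it is largest (ln 2) when the two arguments are equal and decays quickly as they move apart, which is why the resulting performance loss stays bounded.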
Efficient FPGA Implementation of a CTC Turbo Decoder for WiMAX/LTE Mobile Systems http://dx.doi.org/10.5772/67017 33
where $X(i,j)$ and $Z(i,j)$ are the data and parity bits, respectively, both associated with one branch, and $\Lambda^i(Z_k)$ is the LLR for the input parity bit. For the SISO 1 decoding unit, this input
where the "IL" operator denotes the interleaving procedure. In Figure 5, $W(X_k)$ is the extrinsic information, while $\Lambda^o_1(X_k)$ and $\Lambda^o_2(X'_k)$ are the output LLRs generated by the two SISOs.
Looking at the LTE turbo encoder trellis, one can notice that between two states, there are four possible values for the branch metrics:
The LTE decoding process follows a similar approach as for WiMAX systems, i.e., it moves forward and backward through the trellis.
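The observation that the 16 trellis branches share only four branch-metric values can be checked by enumerating the trellis. The sketch below assumes the LTE constituent code polynomials (feedback 1 + D^2 + D^3, parity 1 + D + D^3) and assumes a branch metric of the form x·L_sys + z·L_par with bits x, z in {0, 1}; neither the polynomials nor this exact metric form are stated in the text above, and the names are hypothetical.

```python
def lte_branches():
    """List the 16 branches (state, data bit, next state, parity bit)."""
    branches = []
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for x in (0, 1):
            a = x ^ s2 ^ s3               # assumed feedback 1 + D^2 + D^3
            z = a ^ s1 ^ s3               # assumed parity   1 + D + D^3
            branches.append((s, x, (a << 2) | (s1 << 1) | s2, z))
    return branches


def distinct_branch_metrics(l_sys: float, l_par: float):
    """Distinct values of the assumed metric x*l_sys + z*l_par."""
    return sorted({x * l_sys + z * l_par for _, x, _, z in lte_branches()})


branches = lte_branches()
print(len(branches))
print(sorted({(x, z) for _, x, _, z in branches}))
print(distinct_branch_metrics(1.5, -0.75))
```

With this metric form, the branch labelled (x, z) = (0, 0) always has a null metric, so only three values need to be computed per trellis stage.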
Backward recursion
The algorithm moves backward over the trellis computing the metrics. The values obtained for each node are stored in a normalized manner. They are used for the LLR computation once the algorithm starts moving forward through the trellis. We denote by $\beta_k(S_i)$ the backward metric computed at the kth stage, for the state $S_i$, where $2 \le k \le K + 3$ and $0 \le i \le 7$. For the backward recursion, the initialization $\beta_{K+3}(S_i) = 0$, $0 \le i \le 7$, is used at the stage k = K + 3. For the rest of the stages, $2 \le k \le K + 2$, the computed backward metrics are
where $S_{j1}$ and $S_{j2}$ are the two states from stage k + 1 connected to the state $S_i$ from stage k and $\hat{\beta}_k(S_i)$ represents the unnormalized metric. Once the unnormalized metric $\hat{\beta}_k(S_0)$ is computed for state $S_0$, all the backward metrics for states $S_1 \ldots S_7$ are normalized as
and then stored in the dedicated memory.
Forward recursion
When the backward recursion is finished, the algorithm moves forward through the trellis in the normal direction. This phase of the decoding is similar to the one in the Viterbi algorithm. In this case, only the metrics of the previous stage need to be stored, i.e., for computing the metrics of the current stage k, only the forward metrics from the last stage k − 1 are needed. We denote by $\alpha_k(S_i)$ the forward metric corresponding to the state $S_i$ at the stage k, where $0 \le k \le K - 1$ and $0 \le i \le 7$. For the forward recursion, the initialization $\alpha_0(S_i) = 0$, $0 \le i \le 7$, is used at the stage k = 0. For the rest of the stages, $1 \le k \le K$, the unnormalized forward metrics are computed as
where $S_{i1}$ and $S_{i2}$ are the two states from stage k − 1 connected to the state $S_j$ from stage k. Once the unnormalized metric $\hat{\alpha}_k(S_0)$ is computed for state $S_0$, all the forward metrics for states $S_1 \ldots S_7$ are normalized as
The decoding algorithm can now obtain an LLR estimate for the data bits $X_k$, since for each stage k it has the forward metrics just computed and the backward metrics stored in the memory. First, this LLR is obtained by computing the likelihood of the connection between the state $S_i$ at stage k − 1 and the state $S_j$ at stage k as
The likelihood of a bit being equal to 1 (or 0) is the Jacobi logarithm of all the branch likelihoods corresponding to 1 (or 0), and thus:
where the "max" operator is recursively computed over the branches that have at the input a bit of 1, $\{(S_i \to S_j) : X_i = 1\}$, or a bit of 0, $\{(S_i \to S_j) : X_i = 0\}$.
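The backward recursion, forward recursion and LLR computation described in this section can be put together in a compact floating-point sketch. It is an illustration under stated assumptions, not the implemented design: the code polynomials are assumed to be those of the LTE specification, the branch metric is assumed to be x·(L_sys + L_apriori) + z·L_par with bits in {0, 1}, tail bits are omitted, and all names are hypothetical. The normalization by the state S0 metric and the all-zero initialization follow the text.

```python
from math import inf


def build_trellis():
    """Branches (state, data bit, next state, parity bit) of the 8-state code."""
    branches = []
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for x in (0, 1):
            a = x ^ s2 ^ s3               # assumed feedback 1 + D^2 + D^3
            z = a ^ s1 ^ s3               # assumed parity   1 + D + D^3
            branches.append((s, x, (a << 2) | (s1 << 1) | s2, z))
    return branches


BRANCHES = build_trellis()


def max_log_map(l_sys, l_par, l_apr):
    """Return output LLRs (positive means bit 1) for one semi-iteration."""
    K = len(l_sys)

    def gamma(k, x, z):                   # assumed branch metric form
        return x * (l_sys[k] + l_apr[k]) + z * l_par[k]

    # Backward recursion: metrics are normalized by state 0 and stored.
    beta = [[0.0] * 8 for _ in range(K + 1)]
    for k in range(K - 1, -1, -1):
        raw = [-inf] * 8
        for s, x, nxt, z in BRANCHES:
            raw[s] = max(raw[s], beta[k + 1][nxt] + gamma(k, x, z))
        beta[k] = [v - raw[0] for v in raw]

    # Forward recursion: only the previous stage is kept; LLRs are produced
    # on the fly from the forward metrics and the stored backward metrics.
    alpha = [0.0] * 8
    llr = []
    for k in range(K):
        raw = [-inf] * 8
        best = [-inf, -inf]               # best path metric for bit 0 / bit 1
        for s, x, nxt, z in BRANCHES:
            m = alpha[s] + gamma(k, x, z)
            raw[nxt] = max(raw[nxt], m)
            best[x] = max(best[x], m + beta[k + 1][nxt])
        llr.append(best[1] - best[0])
        alpha = [v - raw[0] for v in raw]
    return llr


# Noiseless usage example: encode a short block by walking the same trellis.
step = {(s, x): (nxt, z) for s, x, nxt, z in BRANCHES}
bits, state, par = [1, 0, 1, 1, 0, 0, 1, 0], 0, []
for b in bits:
    state, z = step[(state, b)]
    par.append(z)
llr = max_log_map([4.0 if b else -4.0 for b in bits],
                  [4.0 if z else -4.0 for z in par],
                  [0.0] * len(bits))
print([1 if v > 0 else 0 for v in llr] == bits)
```

Subtracting the state S0 metric shifts all path metrics at a stage by the same amount, so it cancels in the LLR difference while keeping the stored values within a small range.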
Proposed serial decoding scheme
4.1. WiMAX systems
One important remark about the decoding algorithm is that the outputs of one constituent decoder represent the inputs of the other constituent decoder. At the same time, the interleaver and deinterleaver procedures apply over complete data blocks (so the complete block is needed), which means that the two decoders work in a nonoverlapping manner. This allows the usage of a single constituent decoder. This decoding unit operates in a time-multiplexed manner and the corresponding proposed scheme is presented in Figure 7.
In Figure 7, we can identify the storage requirements: the memory blocks that store data from one semi-iteration to another and the memory blocks used from one iteration to another. IL stands for the interleaver/deinterleaver procedure, while CONTROL is the management unit, controlling the decoder functionalities. This module provides the addresses used for read and write, the signals used to trigger the forward and backward movements through the trellis, the selection of one of the two SISO units and also the control of the MUX and DEMUX blocks. The input buffer is also selected, since the decoding architecture can accept a newly encoded data block while still processing the previous one. The most important module shown in Figure 7 is the SISO unit, which is the decoding structure. Figure 8 depicts the block scheme of this decoding unit. One can observe the unnormalized metric computing modules BETA (backward) and ALPHA (forward) and the module GAMMA that computes the transition metric. The latter also ensures the normalization: the metric values obtained for state $S_0$ are subtracted from the metric values obtained for the states $S_1 \ldots S_7$. The output LLRs are computed inside the L module and normalized inside the NORM module. The MUX-MAX module provides the correct inputs when moving forward or backward through the trellis. It also computes the maximum function.
The backward metrics are stored in the MEM BETA memory during the backward recursion, their values being read when executing the forward recursion, in order to compute the estimated LLRs.
It is important to mention that some studies have been conducted regarding the normalization function. To increase the system frequency (in order to reduce the decoding latency and thus increase the decoded data throughput), one may think of removing the normalization, thereby reducing the amount of logic on the critical path. This solution is not applicable because five extra bits would be needed for the metric values. This would require more memory blocks and more complex arithmetic, which would ultimately lead to a lower system frequency, so this approach brings no benefit. On the other hand, we propose a dedicated approach to implement the metric computation blocks (ALPHA, BETA and GAMMA). Based on the trellis states, we identified the relations for each metric, 32 equations being used for the transition metric computation (recall that each of the eight trellis states has four possible transitions). Moreover, only 16 of them are distinct (the other 16 are the same) and, of these 16, some are null. This approach reduces the complexity. Figure 9 depicts the timing diagram for the proposed SISO. This corresponds to the scenario with one SISO unit and some MUX and DEMUX blocks replacing the two SISO units from the theoretical decoding architecture (see Figure 7).
In Figure 9, R/W (K − 1:0) means reading/writing memory from addresses K − 1 to 0, R/W {IL (K − 1:0)} means reading/writing memory from interleaved addresses K − 1 to 0 and COMPUTE means that the block is processing the input data.
LTE systems
The same remark about the two SISO units from Figure 5 working in a nonoverlapping manner applies to LTE systems as to WiMAX ones. The same approach is used, i.e., the proposed decoding architecture includes only one SISO unit and some MUX and DEMUX blocks. Figure 10 depicts the block scheme of the proposed decoding architecture.
One can observe the memory blocks in Figure 10. Some are used to store data between two successive semi-iterations or between two successive iterations. Others, drawn with dotted lines, are virtual memories used just to clarify the introduced notations. Moreover, the interleaver and deinterleaver modules are shown separately in the scheme, but in fact they are the same. Both include a block memory called ILM (interleaver memory) and an interleaver. The novelty of this approach compared to the previous serial implementation proposed in Ref. [7] is the ILM. This memory allows a fast transition to a parallel decoding architecture. The input data memories (on the left side in Figure 10) and the ILM are switched buffers, allowing new data to be written while the previous block is still being decoded. The ILM is filled with the interleaved addresses while the new data are stored in the input memories. The saved addresses are then used as read addresses for the interleaver unit and as write addresses for the deinterleaver unit. Here, we detail the way the architecture from Figure 10 operates. SISO 1 reads the input memories and starts the decoding process, outputting the computed LLRs. With the LLRs and the extrinsic values available, the vector $V_2(X_k)$ is computed and then stored in the normal order in the memory. The ILM content, read in the normal order, provides the reading addresses for the $V_2(X_k)$ memory, emulating the interleaver process. The reordered LLRs $V_2(X'_k)$ are then available, the corresponding values for the three tail bits $X'_{K+1}$, $X'_{K+2}$ and $X'_{K+3}$ being added at the end of this sequence. The same SISO unit now acts as SISO 2, this time reading its data inputs from the other memory blocks.
The two switching mechanisms from Figure 10
The memories $\Lambda^o_2(X'_k)$ and $V_2(X_k)$ are read in the normal order to allow the computation of $W(X_k)$; $W(X_k)$ is written in the corresponding memory and, at the same time, it is used for a new semi-iteration. In other words, the memory for $W(X_k)$ is updated during a semi-iteration. The timing diagram for the proposed serial decoding architecture is presented in Figure 11, the intervals colored in gray indicating the writing periods for the $W(X_k)$ memory. As mentioned in this chapter, the input memories and the ILM (the upper four memory blocks in the image) are switched buffers and they are filled with new data while the previous coded block passes through the last phase of its decoding process. The same notations as in Figure 9 are used.
All the memory blocks in Figure 10 have 6144 locations, this being the maximum coded data block length defined by the standard. Only the memory blocks with the input data for the SISO units have 6144 + 3 locations, because they also store the tail bits. All locations contain 10 bits. Using a MATLAB simulator in finite precision, it has been observed that six bits are needed for the integer part, in order to cover the dynamic range of the variables, and three bits are needed for the fractional part, to maintain the decoding performance close to the theoretical one, with a certain accepted level of degradation. The 10th bit is the sign bit.
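The 10-bit word format (one sign bit, six integer bits, three fractional bits) can be illustrated with a small quantizer. The sketch assumes a two's complement representation with rounding to the nearest step and saturation at the range limits; the text does not specify these details, and the function name `quantize` is hypothetical.

```python
import math

TOTAL_BITS = 10   # 1 sign bit + 6 integer bits + 3 fractional bits
FRAC_BITS = 3


def quantize(value: float) -> float:
    """Map `value` to the nearest representable fixed-point number.

    Assumptions: two's complement coding, round half up, saturation.
    """
    scale = 1 << FRAC_BITS                       # 8 steps per unit
    lo = -(1 << (TOTAL_BITS - 1))                # -512, i.e. -64.0
    hi = (1 << (TOTAL_BITS - 1)) - 1             #  511, i.e. 63.875
    code = math.floor(value * scale + 0.5)
    code = max(lo, min(hi, code))
    return code / scale


for v in (1.3, -0.2, 63.99, 100.0, -100.0):
    print(v, quantize(v))
```

With three fractional bits the resolution is 0.125, and the six integer bits bound the magnitude of the stored metrics to roughly 64, which is the range that the normalization step keeps the metrics within.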
The SISO decoding unit is similar to the one depicted in Figure 8. The ALPHA and BETA modules compute the unnormalized forward metrics and the unnormalized backward metrics, respectively. The GAMMA module computes the transition metrics and also executes the normalization (the metrics for state $S_0$ are subtracted from the metrics corresponding to states $S_1, \ldots, S_7$). The output LLRs are computed inside the L module and normalized by the NORM module. The selection of the inputs for the forward and backward runs over the trellis, as well as the maximum function, are executed by the MUX-MAX module. Finally, the MEM BETA module stores the backward metrics.
Since the same approach is used for both the WiMAX and the LTE proposed serial decoding architectures, the same remarks apply. So, for the LTE turbo decoder also, the normalization function allows a reduced dynamic range for the variables. Trying to eliminate it, in order to reduce the number of logic levels on the critical path, will not lead to a higher system frequency because, again, more memory blocks are required and more complex arithmetic is used (since the variables are expressed on more bits); as an overall consequence, a lower clock frequency is obtained for the design.
For the ALPHA, BETA and GAMMA modules inside the SISO decoding unit, dedicated equations are again used to compute the metrics. Sixteen such relations are implemented for the transition metric computation (eight states in the trellis with two possible transitions each). In fact, only four equations are distinct (as indicated in Eq. (15)), and one of these four is null. In this way, the computational effort is minimized for the proposed architecture.
The interleaving and deinterleaving procedures implement the same equation. The interleaved index is computed using a modified form of Eq. (3), i.e.,

$$\pi(i) = \{[(f_1 + f_2 \cdot i) \bmod K] \cdot i\} \bmod K. \quad (22)$$
For the interleaving process, the data are written in the memory block in the natural order and then read in the interleaved order, while for the deinterleaving process the data are written in the interleaved order and then read in the natural order.
The computation in Eq. (22) is executed in three phases. First, the value $(f_1 + f_2 \cdot i) \bmod K$ is obtained. Then, this partial result is multiplied by the index i (describing the natural order) and, finally, the obtained value is passed once again through a modulo K block. As a remark on this computation, the first term increases by $f_2$ for consecutive values of the index i, so a register adds $f_2$ for each new index. If the current value of the register is higher than K, K is subtracted and the result is placed back in the register. This processing requires one system clock period, the results being generated in a continuous manner.
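The register-based computation described above can be sketched as follows and compared against the direct polynomial. The parameters K = 40, f1 = 3, f2 = 10 are assumed for illustration (they are not given in this chapter), and the function name `qpp_incremental` is hypothetical. The sketch subtracts K as soon as the register reaches K, which keeps the register strictly below K.

```python
def qpp_incremental(K: int, f1: int, f2: int) -> list:
    """Interleaved indices computed with an accumulate-and-subtract register.

    The register holds (f1 + f2*i) mod K; it is multiplied by i and the
    product is reduced modulo K, following the three phases in the text.
    """
    addresses = []
    reg = f1 % K                     # value of (f1 + f2*i) mod K for i = 0
    step = f2 % K
    for i in range(K):
        addresses.append((reg * i) % K)
        reg += step                  # prepare the register for index i + 1
        if reg >= K:                 # a single subtraction replaces the modulo
            reg -= K
    return addresses


K, f1, f2 = 40, 3, 10                # assumed parameters for K = 40
direct = [(f1 * i + f2 * i * i) % K for i in range(K)]
print(qpp_incremental(K, f1, f2) == direct)
```

Since both the register and the step are below K, their sum is below 2K, so one conditional subtraction per index is enough to keep the register reduced.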
Proposed parallel decoding scheme
The serial architecture described in Figure 10 for LTE systems can be reorganized in a parallel setup, by instantiating the RSC SISO module N times in the structure. We propose a configuration that concatenates the N values associated with the N RSCs and stores them in a single memory location, for all the memories in the scheme. The K locations with 10 bits per location (corresponding to the serial architecture) are replaced by K/N locations with 10N bits per location (for the parallel format).
The most important benefit brought by the proposed serial decoding scheme is that the interleaver module is used only once, before the decoding stage. The ILM is updated each time a new data block enters the decoder, while the previous block is still being decoded. This approach prepares a fast and simple transition to the parallel scheme. Considering that the factor N is known, the ILM will have K/N locations, with N values being written at each location (i.e., the ILM can be prepared for the parallel processing that follows). As mentioned in Ref. [16], a Virtex 5 block memory can be configured anywhere from 32k locations × 1 bit to 512 locations × 72 bits. In the costliest scenario (i.e., K = 6144), depending on the value of N and representing the stored values on 10 bits, the parallel ILM can be organized as:
• 768 locations × 80 bits (N = 8)
• 1536 locations × 40 bits (N = 4)
• 3072 locations × 20 bits (N = 2)
• 6144 locations × 10 bits (N = 1)
Only two BRAMs are used, the same as in the case of the serial ILM. Figure 12 shows the ILM working principle. As one can observe, during the writing procedure, each index i from 0 to K − 1 generates a corresponding interleaved value. All the computed values are stored in the ILM, in the same order. We consider the ILM as a matrix, the rows being the memory locations and the columns being the positions within each location. The first K/N interleaved values are placed in the first column. The second set of K/N values is stored in the second column, and so on. In order to perform the described method, a true dual-port BRAM is selected. In Figure 12, each time a new value is added on row WA at column WP (next to the already existing content in the columns up to WP − 1), the content of row WA + 1 is also read from the memory. In the next clock period, a new value is added on row WA + 1 at column WP (next to the already existing content in the columns up to WP − 1), while the content of row WA + 2 is also read, and the procedure continues in the same way. When the interleaver function is used, the ILM is read in the normal order and the N interleaved values from a row are employed as reading addresses for the $V_2(X_k)$ memory. Furthermore, the new LTE interleaver module (with the QPP algebraic properties) always places on the same row the N values that should be read in the interleaved order from the ILM. The only additional task is a reordering process needed to match the corresponding RSCs. An example is presented in Figure 13 for the values K = 40 and N = 8. On the left side, the content of the $V_2(X_k)$ memory is shown. Each column is composed of the outputs generated by one of the N RSC SISOs. On the right side, the content of the ILM is described. The minimum value from each line of the ILM represents the line address for the $V_2(X_k)$ memory (see the gray circle in the illustration). By using a reordering module, each position from the output line is directed to its corresponding SISO.
For example, position c from the first read line (index 10) is sent to SISO g, whereas position c from the second read line (index 13) is sent to SISO a. The same procedure applies also for the deinterleaving process, only that the write addresses are extracted from ILM, while the reading ones are used in the natural order.
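The folded ILM organization and the routing in the K = 40, N = 8 example can be reproduced with a short sketch. It assumes the QPP parameters f1 = 3 and f2 = 10 for K = 40 (not given in this chapter) and labels the SISOs a to h as indices 0 to 7; the function names `build_ilm` and `route_row` are hypothetical.

```python
def qpp(i: int, K: int, f1: int, f2: int) -> int:
    return (f1 * i + f2 * i * i) % K


def build_ilm(K: int, N: int, f1: int, f2: int):
    """ILM as a matrix with K/N rows; column c holds pi(c*K/N + r) on row r."""
    M = K // N
    return [[qpp(c * M + r, K, f1, f2) for c in range(N)] for r in range(M)]


def route_row(ilm_row, M: int):
    """Return the V2 memory row to read and, per V2 column, the target SISO."""
    v2_row = min(ilm_row)                       # minimum value = line address
    # QPP property: all N addresses on an ILM row share the same V2 row.
    assert all(v % M == v2_row for v in ilm_row)
    target = [None] * len(ilm_row)
    for siso, v in enumerate(ilm_row):
        target[v // M] = siso                   # V2 column v // M feeds this SISO
    return v2_row, target


K, N = 40, 8
M = K // N
ilm = build_ilm(K, N, 3, 10)                    # assumed f1 = 3, f2 = 10
for r in range(2):
    v2_row, target = route_row(ilm[r], M)
    column_c = 2                                # position c
    index = column_c * M + v2_row               # data index stored there
    print(r, ilm[r], v2_row, index, "abcdefgh"[target[column_c]])
```

For these parameters, position c of the first line that is read holds index 10 and goes to SISO g, while position c of the second line that is read holds index 13 and goes to SISO a, in agreement with the example above.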
For the reordering module, an even-odd merge sorting network is applied. The corresponding method was introduced by Batcher in Ref. [14] and is part of the sorting network group that includes several sorting approaches. One such example is bubble sorting, which repeatedly sorts the adjacent pairs of elements. Another example is shell sorting, which groups the input data into an array and then sorts the array's columns (also in a repeated manner). After each associated iteration, the array becomes one column smaller. A third example is even-odd transposition sorting, which alternately sorts the odd-indexed elements with the adjacent even-indexed ones and the even-indexed elements with the adjacent odd-indexed ones. The fourth example is bitonic sorting: the two halves of the input data are sorted in opposite directions and then jointly processed to produce one complete sorted sequence.
The even-odd merge sorting method is based on a theorem stating that any list of a = 4b (b natural) elements can be sorted if the following steps are applied. First, the two halves of the list are sorted separately. After this step, the elements with odd index and the ones with even index are sorted separately. The last step consists of a compare-and-switch procedure executed over all the element pairs 2n and 2n + 1 (n = 1, …, a/2 − 1). The demonstration of this theorem is available in Ref. [23]. An example for N = 8 is depicted graphically in Figure 14.

From a timing point of view, Figure 15 depicts the case when N = 2 is used. The same comments as the ones for Figure 11 apply.
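To make the three steps of the even-odd merge sorting method concrete, the following Python sketch generates the comparator list of a Batcher odd-even merge sorting network and applies it in software. It is a minimal model under stated assumptions: the function names are hypothetical, the input length must be a power of two, and the generator follows the commonly published recursive formulation rather than the RTL of the proposed design.

```python
def oddeven_merge(lo, hi, r):
    """Comparators that merge the two sorted halves of lo..hi (inclusive)."""
    step = 2 * r
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)       # even-indexed subsequence
        yield from oddeven_merge(lo + r, hi, step)   # odd-indexed subsequence
        # Final compare-and-switch stage over neighbouring pairs.
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Comparators sorting positions lo..hi (length must be a power of two)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)       # sort the first half
        yield from oddeven_merge_sort(mid + 1, hi)   # sort the second half
        yield from oddeven_merge(lo, hi, 1)          # merge the two halves


def sort_with_network(values):
    """Apply the fixed comparator sequence to a copy of the input."""
    data = list(values)
    for i, j in oddeven_merge_sort(0, len(data) - 1):
        if data[i] > data[j]:
            data[i], data[j] = data[j], data[i]
    return data


network = list(oddeven_merge_sort(0, 7))
print(len(network), "comparators for N = 8")
print(sort_with_network([6, 31, 36, 21, 26, 11, 16, 1]))
```

Because the comparator sequence does not depend on the data, each comparator can be mapped to a compare-and-select stage, which is the property that makes this kind of network attractive for a pipelined hardware implementation.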
In combination with the presented parallel decoding architecture, we also propose a simplified implementation for the interleaver block. As seen from Eq. (3), the arithmetic requirements for the computation of the memory addresses $\pi(i)$ consist of three multipliers, one adder and one divider (used for the extraction of the remainder associated with the modulo operation). For all possible K values associated with the division, the range of the quotients is very large, since the numerator and the denominator can take very large values (often situated in different numerical ranges, up to billions). We propose an efficient method to reduce the arithmetic complexity associated with Eq. (3).
Figure 14. Even-odd merge sorting for N = 8.

By introducing the notation
it can be observed that 
where
We can rewrite Eq. (3) using Eqs. (23) and (24):

$$\pi(i) = p(i) \bmod K = \left[\,p(i-1) + s_1 + s_2(i)\,\right] \bmod K$$
The multiplications are replaced by additions, which require less hardware resources. Nevertheless, the division is still necessary for the modulo operation. If we consider the modulo operator applied to a sum of elements expressed as
we can decrease the arithmetic effort needed to obtain $\pi(i)$ in Eq. (26). The number of modulo operations increases, but the overall complexity of the corresponding divisions is reduced since smaller quotients are used. Consequently, using Eqs. (25)-(27), one can write:
Using Eq. (29) in Eq. (26), the result is
All of the numerical values added in the last stage of Eq. (29) are lower than K and are either available recursively (during the processing of a distinct frame), such as $\pi(i-1)$ and $s_3(i-1) \bmod K$, or can be predetermined and stored, as in the case of $2f_2 \bmod K$. The overall arithmetic complexity is reduced to 2K additions and 2K simplified modulo operations (i.e., each is resolvable using a comparison and a subtraction) for the address generation module. The method improves the solutions presented in [24, 25] by eliminating any multiplications or divisions. Additionally, the lower numerical range of the operands (with values lower than 2K, i.e., values in the range of thousands) allows the usage of minimal resources for the representation of binary values.
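The recursive address generation can be illustrated with a short Python sketch. It assumes the standard LTE QPP form π(i) = (f1·i + f2·i²) mod K and tracks the increment p(i + 1) − p(i) = f1 + f2(2i + 1), which grows by the constant 2f2 at every step. The names `add_mod`, `delta` and `step` are hypothetical and do not necessarily coincide with the s-terms of the chapter; the coefficients f1 = 3, f2 = 10 for K = 40 are likewise assumed for illustration.

```python
def add_mod(a, b, K):
    """(a + b) mod K for 0 <= a, b < K: one addition, one comparison
    and one subtraction, with no divider."""
    s = a + b
    return s - K if s >= K else s


def qpp_recursive(K, f1, f2):
    """Generate pi(0..K-1) with two additions and two simplified
    modulo operations per address."""
    step = (2 * f2) % K      # constant, assumed precomputed and stored
    delta = (f1 + f2) % K    # p(1) - p(0) mod K, also precomputed
    pi = 0                   # pi(0) = 0
    addresses = []
    for _ in range(K):
        addresses.append(pi)
        pi = add_mod(pi, delta, K)        # pi(i+1) from pi(i)
        delta = add_mod(delta, step, K)   # next increment
    return addresses


def qpp_direct(K, f1, f2):
    """Reference: direct evaluation with multiplications and division."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]


K, f1, f2 = 40, 3, 10   # assumed coefficients for K = 40
assert qpp_recursive(K, f1, f2) == qpp_direct(K, f1, f2)
print(qpp_recursive(K, f1, f2)[:8])
```

All intermediate sums stay below 2K, which matches the reduced operand range mentioned above, and the loop body contains no multiplier or divider.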
6. Implementation results

6.1. WiMAX systems
The estimated system frequency when implementing the decoding structure on a Xilinx XC4VLX80-11FF1148 chip using the Xilinx ISE 11.1 tool is 125 MHz. The reserved chip area is around 3000 (8.37%) slices from a total of 35,840. The results are comparable with the assessments presented in [26] .
The decoding latency and decoding rate corresponding to the above-mentioned clock frequency (see Table 1 ) are
The implementation delay is represented by 10 clock periods per iteration and is added to the theoretical latency of the MAP algorithm (which is 4KN clock periods).
In Figure 16 , the decoding performances are presented for a quadrature phase shift keying (QPSK) modulation, ½ rate, 1-4 iterations, a block size of 6 bytes (the smallest possible) and a transmission simulated through an additive white Gaussian noise (AWGN) channel. The results are depicted for the worst case scenarios, considering that the test was performed for the smallest block size. Figures 11 and 15 show that the decoding latency is reduced in the case of parallel decoding with a factor almost equal to N. The presented implementation has an 11 clock period Delay, which is added for each forward trellis run (when the LLRs are computed). As a consequence, two such values must be considered during each iteration.
LTE systems
For serial decoding, the native latency is computed as follows: for the first semi-iteration, K clock periods are required for the backward trellis run and another (K + Delay) clock periods for the forward trellis run and LLR computation. The value is then doubled in order to take into account the second semi-iteration. By denoting L the number of executed iterations, the serial latency amounts to $L\,(4K + 2\,\mathrm{Delay})$ clock periods.

When performing tests for the parallel decoding performances, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block boundaries. In order to obtain a performance similar to the serial decoding case, a small overhead is accepted. By introducing an overlap at each parallel block border, the metrics computation gains a training phase. The minimum overlap window length is selected to cover the minimum standard-defined data block (in this case $K_{min}$ = 40 bits). Figure 17 shows this situation for the N = 2 setup. If we consider N > 2, which leads to blocks with $K_{min}$ at both the left and right sides, the corresponding latency can be expressed as:
For the even-odd merge sorting network implementation, we can study the configuration K = 40 bits and N = 8. The input is represented by the ILM content, i.e., the 40 interleaved addresses organized in five memory locations with eight addresses for each location. The minimum detected value for each ILM location (i.e., the natural-order memory location that will be accessed) is contained in the output of the sorting unit. Also, the module provides the order which will be used to send the data read from the natural-order memory location to the N decoding units. In this example, at the third clock period, ILM location number 2 is read, i.e., the addresses 6, 31, 36, 21, 26, 11, 16 and 1. The sorting module labels these addresses with an index, obtaining the pairs: (6, 0), (31, 1), (36, 2), (21, 3), (26, 4), (11, 5), (16, 6) and (1, 7). Then the addresses are arranged in increasing order: (1, 7), (6, 0), (11, 5), (16, 6), (21, 3), (26, 4), (31, 1) and (36, 2). At the same time, the minimum address found at this location (1 in this example) is sent to the output. Consequently, location number 1 is read from the natural-order data memory. The eight samples from location 1 are distributed to the eight decoding units as indicated by the output index: the first sample from this location is sent to decoder unit 7, the second sample to decoder unit 0, the third one to decoder unit 5 and so on. Figure 18 shows that, at the register transfer level (RTL), besides flip-flops, the sorting unit includes only basic selection elements.
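The worked example above can be reproduced with a small behavioural model of the reordering step. This is only a sketch: the function name `reorder` and the sample labels are hypothetical, and Python's built-in `sorted` stands in for the hardware sorting network, so only the input/output behaviour is modelled, not the pipeline.

```python
def reorder(ilm_row, memory_row):
    """Return the natural-order row address, the sorted (address, unit)
    pairs and the samples routed to each decoding unit."""
    # Label each interleaved address with its position (decoding unit),
    # then sort the pairs by address.
    pairs = sorted((addr, unit) for unit, addr in enumerate(ilm_row))
    # The minimum address selects the natural-order memory location.
    row_address = pairs[0][0]
    # The k-th sample of the read location goes to the unit stored in
    # the k-th sorted pair.
    routed = [None] * len(ilm_row)
    for sample, (_, unit) in zip(memory_row, pairs):
        routed[unit] = sample
    return row_address, pairs, routed


# Addresses read from the ILM in the example (K = 40, N = 8).
ilm_row = [6, 31, 36, 21, 26, 11, 16, 1]
# Hypothetical samples stored at natural-order location 1, i.e. the
# values for addresses 1, 6, 11, ..., 36 in column order.
memory_row = ["x1", "x6", "x11", "x16", "x21", "x26", "x31", "x36"]

row_address, pairs, routed = reorder(ilm_row, memory_row)
print("read natural-order location", row_address)
print("sorted pairs:", pairs)
print("samples per decoding unit:", routed)
```

After routing, decoding unit u receives the sample whose address is the u-th entry of the ILM row, which is the interleaved read that each SISO expects.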
It can be seen in Figure 19 that the sorting unit allows pipelined data processing. Consequently, with a certain implementation delay (7 clock periods in the proposed scheme), the module provides a value belonging to the set of sorted indexes at each clock cycle. It is important to mention that the even-odd merge sorting was selected because it allows pipelined operation, while also consuming fewer resources than the other listed methods. Some comparative results were provided in [11, 27] in terms of used resources for the application-specific integrated circuit (ASIC).
In order to evaluate the performances, the design was described in the very high speed integrated circuit hardware description language (VHDL). The code was tested using ModelSim 6.5. For the generation of RAM/ROM memory blocks, Xilinx Core Generator 14.7 was employed and the synthesis process was accomplished using Xilinx XST from Xilinx ISE 14.7. Using the above-mentioned tools, the resulting values for the decoding structure when implemented on a Xilinx XC5VFX70T-FFG1136 are the following [28]: a frequency of 310 MHz, 664 flip-flops and 568 LUTs for the sorting unit, and a frequency of 300 MHz, 1578 flip-flops and 1708 LUTs for the interleaver.
The values listed in Table 2 are obtained using Eqs. (32)-(34), when N = 8 is considered. One can observe that the overhead introduced by the overlapping split method is less important for bigger values of K, this being the scenario in which a parallel approach is usually applied. The achieved overall system frequency is 210 MHz, with the longest signal propagation time being required by the SISO unit.

Table 2. Latency values for N = 8, L = 3 or 4 and K = 1536, 4096 or 6144.
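The trend discussed for the overlapped split can be explored with a short Python sketch. The formulas below are an assumption, not the chapter's Eqs. (32)-(34): they reuse the structure of the serial latency (four trellis runs and two Delay terms per iteration) and simply replace K by the sub-block length K/N, extended by $K_{min}$ for N = 2 or by 2·$K_{min}$ for N > 2. The function name and constants are illustrative, and the printed numbers are not claimed to reproduce Table 2.

```python
K_MIN = 40   # minimum LTE block size, used as overlap window length
DELAY = 11   # implementation delay per forward trellis run (clock periods)


def parallel_latency(K, N, L, overlap=True):
    """Assumed latency model, in clock periods, for N >= 2 parallel units."""
    block = K // N
    if overlap:
        # One overlap window for N = 2; windows on both sides of the
        # inner sub-blocks when N > 2 (the longest block sets the latency).
        block += K_MIN if N == 2 else 2 * K_MIN
    return L * (4 * block + 2 * DELAY)


for K in (1536, 4096, 6144):
    plain = parallel_latency(K, 8, 3, overlap=False)
    split = parallel_latency(K, 8, 3, overlap=True)
    overhead = 100.0 * (split - plain) / plain
    print(f"K = {K}: {plain} vs {split} clock periods "
          f"({overhead:.1f}% overhead)")
```

Under this model the absolute overhead is constant, so its relative weight shrinks as K grows, which is consistent with the observation made for larger block sizes.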
As one can observe from Table 3 , the serial decoding performance is similar to the theoretical one. Let us consider, for example, the case L = 3 and K = 6144. Considering the theoretical latency of 4KL clock periods, the theoretical throughput is 17.5 Mbps. After implementation, the obtained result for the proposed serial architecture is 17.48 Mbps.
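The two throughput figures quoted above can be checked with a few lines of Python. The sketch assumes the serial latency described in the text (K clock periods for the backward run and K + Delay for the forward run, per semi-iteration), the 11 clock period Delay and the 210 MHz system frequency; the function names are hypothetical.

```python
def serial_latency(K, L, delay=11):
    """Serial decoding latency in clock periods: two semi-iterations per
    iteration, each with a backward run (K) and a forward run (K + delay)."""
    return L * (4 * K + 2 * delay)


def throughput_mbps(K, latency_clocks, f_mhz):
    """Decoded bits per second, in Mbps, for a clock frequency in MHz."""
    return K * f_mhz / latency_clocks


K, L, f_mhz = 6144, 3, 210.0
theoretical = throughput_mbps(K, 4 * K * L, f_mhz)
implemented = throughput_mbps(K, serial_latency(K, L), f_mhz)
print(f"theoretical: {theoretical:.2f} Mbps")
print(f"implemented: {implemented:.2f} Mbps")
```

The small gap between the two values comes only from the 2·Delay clock periods added per iteration, which is negligible compared with 4K for large blocks.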
The following performance graphs were obtained using a finite-precision Matlab simulator. This approach was selected because Matlab produces the same outputs as the ModelSim simulator, while the testing time is considerably shorter.
All the simulation results were generated for the Max Log MAP algorithm. The illustrations present the bit error rate (BER) versus the signal-to-noise ratio (SNR), expressed as the ratio between the energy per bit and the noise power spectral density. Figure 20 presents the attained performances for the case of K = 512, N = 2, L = 3 and QPSK modulation, using the three discussed decoding methods, i.e., the serial one, the parallel one without overlapped split and the parallel one with overlapped split. Figure 21 depicts the same performance comparison, this time for K = 1024 and N = 4. Analyzing the results presented in Figures 20 and 21, one can conclude that the decoding performance obtained when parallel decoding with the overlapped split method is used is almost the same as the one for serial decoding. In contrast, the parallel decoding without the overlapped split method generates some loss in performance when compared to the serial decoding. This degradation depends on the parallelization factor N.
Conclusions
This chapter presented the most important aspects related to the FPGA implementation of a turbo decoder for WiMAX and LTE systems. The serial turbo decoder architectures for the two systems have been developed and efficiently implemented, important results being obtained especially for the proposed architectures of the interleaver/deinterleaver. For LTE systems, the interleaver memory ILM has been introduced. In this manner, the interleaver process effectively works only outside the decoding process itself.
The ILM has been written together with the input data, while the previous block was still under decoding. It should be outlined that this solution allows the transition from the serial to the parallel decoder in an efficient manner, involving only values that are concatenated at the same memory locations. The parallel approach requires the same storing capacity (the number of BRAMs) and a single interleaver, thus adding only an even-odd merge sorting network. This unique interleaver has been implemented in an efficient configuration that uses only comparators and subtractors, with no multipliers or dividers.
The parallel decoding performances have been compared with the serial ones. In this context, certain degradation has been observed. In order to eliminate this degradation, a small overhead is accepted by the overlapping split that is applied to the parallel data blocks.
