Abstract-During the last decade, Turbo codes have gained increasing importance in channel coding due to their good error-correction performance. One key component of Turbo codes is the interleaver/deinterleaver pair, often designed as a reconfigurable coprocessor able to handle the large variability in data length found in the newest communication standards. In this work we introduce a configurable interleaver architecture for the turbo decoder of the 3rd Generation Partnership Project (3GPP) standard. It is built on the idea of "iterative modulo computation". Additionally, the presented solution not only generates the interleaved addresses, but also handles the flow of data streams through the interleaver. The architecture and FPGA implementation results are also presented.
I. INTRODUCTION
Communications systems are growing and developing very quickly. Applications in satellite communications, wireless networks, mobile telephony, internet, etc. continuously demand higher bit rates together with lower bit error rates and lower power consumption. In this context, channel coding schemes have been developed; they basically consist of adding redundancy bits to the original information, which the receiver later uses in the detection process. Among those schemes, Turbo codes, introduced in 1993 by C. Berrou in [1], have become a popular technique in communications research and a key component of recent communications standards, such as 3GPP for mobile phones (which is based on the evolution of GSM systems) and, more recently, the LTE standard, mainly because their performance is close to the Shannon coding limit.
The high error-correction performance of Turbo codes is closely related to both iterative decoding and the use of interleavers. Interleavers help to increase the minimum distance in the code distance spectrum, which is intimately related to the correction capability of the code [2]; therefore, designing a Turbo code requires a meticulous design of its constituent interleaver.
An interleaver is a device that rearranges the order of data or bit sequences in a one-to-one pseudo-random fashion. The inverse operation, called deinterleaving, restores the received sequence to its original order. The mapping rules for the interleaver/deinterleaver (I/D) processes are usually given in the form of equations, tables and special architectures [3]. In Turbo codes, the interleaver is located, at the encoder side, between the two recursive systematic convolutional (RSC) encoders, as shown in Fig. 1. At the decoder side, both interleavers and deinterleavers are required, as shown in Fig. 2, for a Log-MAP-based iterative turbo decoder such as the one presented in [4]. In this paper we focus on the design of the 3GPP standard Turbo code I/D, which must support turbo coding at data rates between 384 kbps and 2 Mbps. The main objective is to devise an architecture capable of managing every one of the 5074 different block sizes defined in the 3GPP standard, while maintaining low hardware complexity and low resource usage.
This paper is organized as follows. Section II introduces the equations defined in the 3GPP standard, on which our interleaver is based. Section III discusses the architecture presented in [5], how it should be completed to manage all data sizes specified by the standard, and how to deal with data and not only with addresses. Section IV presents the proposed configurable interleaver architecture. Section V compares the proposed architecture with related approaches and gives performance results of an FPGA implementation. Finally, Section VI presents the conclusions.
II. 3GPP INTERLEAVER ALGORITHM
The 3GPP standard [6] defines the interleaver algorithm for turbo coding/decoding. In this algorithm the input bits are written into a rectangular matrix and a series of permutations is then performed, where i and j are the row and column indices, starting from 0 and running top to bottom and left to right, respectively. The algorithm works as follows:
• Read out the addresses columnwise.
As the algorithm shows, several kinds of operations are needed to achieve the interleaving pattern, among them the multiplications, additions, comparisons and modulo computations discussed in Section III.
To develop our interleaver architecture, we based our design on a previously reported architecture [5]. Nevertheless, this architecture was improved in some relevant aspects that are highlighted in the following sections.
III. 3GPP INTERLEAVER ARCHITECTURE
The interleaving process can be separated into two main modes. The first one, the pre-computation mode, has to be performed each time the block size changes; that is, for a fixed block size K, these operations have to be performed only once. This mode computes the number of rows (R), the number of columns (C), the prime integer (p), the primitive root (v), the base sequence for intra-row permutation S(j), the sequence of minimum prime integers q(i) and the permuted prime integers r(i).
The second mode, called run time mode, calculates the permutation sequence U(i,j) and the interleaved address i_addr at which we write or read the data bits (to/from a data RAM), depending on whether interleaving or de-interleaving is performed. This mode has to be performed for each data block. Note that the set of addresses could also be computed once and saved in memory; however, the approach presented here, based on two modes and computing addresses at run time, is what allows the implementation size of the I/D algorithm to be reduced compared with a memory or LUT (Look-Up Table) based approach, where interleaving addresses are stored in memory. Next, the two modes of the algorithm and their associated architectures are presented.
A. Pre-Computation Mode
First, we calculate R with logical functions. Parameters p and v are stored in pairs in a Look-Up Table (LUT), just as in the standard, so we read this LUT with an address from a counter generated by the controller, based on equation (1).
K ≤ R×(p+1)    (1)
To perform the operations in equation (1), we use the architecture proposed in [5] and shown in Fig. 3 . This architecture is capable of performing multiplication, addition and comparison, and it is controlled by m1-m5 flags generated by the controller.
Once p and v are obtained, we can proceed to get the number of columns (C). We know that C can take one of the three values p-1, p or p+1, and should be the minimum value that satisfies R×C ≥ K. The controller generates C and the architecture shown in Fig. 3 performs the comparison.
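The parameter selection described above can be sketched in software. The following is a minimal Python sketch, not the hardware datapath. The names `precompute_params` and `primes_from_7` are hypothetical. The row-count thresholds and the special case for 481 ≤ K ≤ 530 are assumptions based on the 3GPP interleaver specification [6] and should be checked against it. The prime table is regenerated by trial division instead of being read from a LUT, and the primitive root v is omitted.

```python
def primes_from_7(limit=257):
    """Primes in [7, limit]; stands in for the (p, v) LUT of the standard."""
    return [n for n in range(7, limit + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]


def precompute_params(K):
    """Return (R, p, C) for a block size K, 40 <= K <= 5114."""
    if not 40 <= K <= 5114:
        raise ValueError("K outside the 3GPP range")
    # Row count (assumed thresholds, see the lead-in).
    if K <= 159:
        R = 5
    elif K <= 200 or 481 <= K <= 530:
        R = 10
    else:
        R = 20
    # Special case of the standard (assumption): fixed p and C.
    if 481 <= K <= 530:
        return R, 53, 53
    # Equation (1): smallest prime p with K <= R * (p + 1).
    p = next(q for q in primes_from_7() if K <= R * (q + 1))
    # Smallest C in {p-1, p, p+1} with R * C >= K.
    C = next(c for c in (p - 1, p, p + 1) if R * c >= K)
    return R, p, C


print(precompute_params(41))    # (5, 11, 10)
print(precompute_params(5114))  # (20, 257, 256)
```

For K=41 the sketch returns a 5×10 matrix, which agrees with the example used later for invalid addresses.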
According to the algorithm, we have to calculate the prime sequences q(i). However, as mentioned in [5], the q(i) sequences are almost the same in most cases and differ from one another by only one or two elements. Based on this observation, we decided to place subgroups of the q(i) sequence in a ROM and then choose one according to the value of p.
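The observation that the prime sequences barely change with p can be checked with a short sketch. It assumes the rule of the 3GPP specification [6] as we understand it: q(0) = 1 and each following element is the next prime greater than 6 that is coprime with p-1. The function names `is_prime` and `q_sequence` are hypothetical, and the code is a software illustration rather than the ROM-based selection used in the architecture.

```python
from math import gcd


def is_prime(n):
    """Trial-division primality test, adequate for these small values."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def q_sequence(p, R):
    """q(0) = 1; each q(i) is the next prime > 6 coprime with p - 1 (assumed rule)."""
    q = [1]
    candidate = 7
    while len(q) < R:
        if is_prime(candidate) and gcd(candidate, p - 1) == 1:
            q.append(candidate)
        candidate += 1
    return q


print(q_sequence(11, 5))  # [1, 7, 11, 13, 17]
print(q_sequence(29, 5))  # [1, 11, 13, 17, 19]
```

The two sequences differ only because 7 divides 28, so 7 is skipped for p=29 and the remaining primes shift by one position. This is the kind of small difference that makes storing subgroups in a ROM attractive.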
To calculate the sequence S(j) = (v×S(j-1)) mod p, the modulo operation must be performed. Fortunately, [7] presents this operation in simplified form, as an iterative numerical algorithm based on additions, shifts, comparisons and bit retrieval. A flow diagram of this algorithm is shown in Fig. 4.
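As an illustration of the recurrence, the following Python sketch generates the base sequence for a given prime p and primitive root v. It assumes the starting value S(0) = 1 and a sequence length of p-1, as we understand them from [6]. It uses the built-in `%` operator instead of the iterative algorithm of [7], which is not reproduced here. The function name `base_sequence` is hypothetical.

```python
def base_sequence(p, v):
    """S(0) = 1 (assumed) and S(j) = (v * S(j-1)) mod p for j = 1 .. p-2."""
    s = [1]
    for _ in range(p - 2):
        s.append((v * s[-1]) % p)
    return s


print(base_sequence(7, 3))  # [1, 3, 2, 6, 4, 5]
```

Because v is a primitive root of p, the sequence visits every value from 1 to p-1 exactly once, which is what makes it usable as an intra-row permutation.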
B. Run Time Mode
In run time mode we obtain the interleaved addresses by performing the inter-row T(i) and intra-row U(i,j) permutations, where the inter-row order T(i) is stored in a LUT while the intra-row order U(i,j) is stored in a RAM. With these parameters and equation (2), the interleaving addresses i_addr are finally generated.
i_addr = C×T(i) + U(i,j)    (2)
with i=0,1,…,R-1; j=0,1,…,C-1. Notice that in this mode, the architecture shown in Fig. 3 is used to compute C×T(i). The result of this computation points to the first element of each of the R rows in the rectangular matrix; adding U(i,j) with the same architecture then gives the displacement along each row. An example of this is shown in Fig. 6. To obtain U(i,j) in run time mode we have to calculate the modulo operation (j×r(i)) mod (p-1), which has the same form as (v×S(j-1)) mod p in the pre-computation mode. Since these operations are performed at different times, we can reuse the architecture of Fig. 5 for both with small modifications. As shown in Fig. 5, a "Circular Buffer" needed for iterative operations and a multiplexer that enables this buffer in run time mode were added. Another multiplexer was added to that architecture, through which signal q(i) is fed in run time mode. There exists an architecture, shown in Fig. 7, that performs the modulo operation based on subtractions; its main block is enclosed in dotted lines in Fig. 7. Although this architecture works correctly in the pre-computation mode, it fails in almost 12% of the cases in run time mode [5]; this occurs when q(i) > 2P-1. In our architecture we solved this problem by modifying Fig. 8.a to obtain Fig. 8.b, which can perform the subtractions q(i)-(P-1), q(i)-2(P-1), q(i)-3(P-1) or q(i)-4(P-1), and not only q(i)-(P-1). By substituting 8.a with 8.b in Fig. 7 we can perform all modulo operations required by the 3GPP standard.
With this modification the obtained hardware architecture works properly in both the pre-computing and run time modes.
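The address generation described in this subsection, a row base C×T(i) plus a displacement U(i,j) along the row, can be sketched as follows. The function name `interleaved_addresses` and the toy values of T and U are hypothetical and are not 3GPP patterns. The column-by-column scan order is taken from the read-out step of the algorithm in Section II.

```python
def interleaved_addresses(T, U, C):
    """Yield C*T[i] + U[i][j], scanning the matrix column by column."""
    R = len(T)
    for j in range(C):
        for i in range(R):
            yield C * T[i] + U[i][j]


T = [1, 0]                   # toy inter-row order
U = [[0, 2, 1], [1, 0, 2]]   # toy intra-row orders, one list per row
print(list(interleaved_addresses(T, U, C=3)))  # [3, 1, 5, 0, 4, 2]
```

In this toy case the six generated addresses form a permutation of 0 to 5, as expected of an interleaver.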
Using the architecture of Fig. 7 with Fig. 8.b in place of Fig. 8.a, we obtain the RAM address from which U(i,j) is read, according to equation (3).
Ram_Adr(i,j) = [Ram_Adr(i, j-1) + Q_mod(i)] mod (p-1)    (3)
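The recursion of equation (3) avoids a multiplication per address: each new address is the previous one plus a constant, reduced modulo p-1. The sketch below is a software model under two assumptions: Q_mod(i) is taken to be the row constant already reduced modulo p-1, and Ram_Adr(i,0) is taken to be 0. The name `ram_addresses` and the parameter `r` are hypothetical.

```python
def ram_addresses(p, r, count):
    """Addresses (j * r) mod (p - 1) for j = 0 .. count-1, built recursively."""
    m = p - 1
    q_mod = r % m          # assumed to be reduced once per row
    addr, out = 0, []
    for _ in range(count):
        out.append(addr)
        addr += q_mod
        if addr >= m:      # one conditional subtraction is enough here
            addr -= m
    return out


print(ram_addresses(p=11, r=7, count=10))  # [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]
```

Once the constant is reduced, the running sum never reaches 2(p-1), so a single subtraction suffices inside the loop; the multi-step subtraction is only needed for the initial reduction of the row constant.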
Furthermore, an analysis of the architecture presented in [5] shows that, like most existing designs, it does not handle data streams but only addresses. Normally this would not be a problem, but for 3GPP there are exceptions in the interleaved addresses that have to be dealt with. To solve this problem we added some blocks, the most important being a data RAM and an exceptions ROM.
IV. PROPOSED ARCHITECTURE AND COMPARISON WITH OTHER INTERLEAVERS
In the 3GPP standard's interleaving algorithm, there are some exceptions, which are reflected in the architecture shown in Fig. 10 by an "Exceptions Handling" block. There are three exceptions. The first occurs when we generate addresses beyond the block size K; these addresses are tagged as invalid. For example, for K=41 the rectangular matrix size is 5×10, so 50-41 = 9 addresses are invalid.
Because of that, we cannot generate a valid address on every clock cycle (see Table 1). The second exception occurs when C=p+1; in this case we assign 0 to U(i,p-1) and p to U(i,p). The third exception occurs when both conditions C=p+1 and K=R×C hold. In this case, besides assigning 0 to U(i,p-1) and p to U(i,p), we exchange U(R-1,0) with U(R-1,p).
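The three exceptions can be expressed compactly in software. The following Python sketch only illustrates the rules stated above; it is not a model of the "Exceptions Handling" block. The function names are hypothetical, and the intra-row patterns are assumed to be given as one list per row holding U(i,0) to U(i,p-2).

```python
def invalid_address_count(K, R, C):
    """First exception: matrix positions that fall outside the K-element block."""
    return R * C - K


def apply_column_exceptions(U, K, R, C, p):
    """Second and third exceptions, applied only when C = p + 1."""
    if C != p + 1:
        return [row[:] for row in U]
    # Append U(i,p-1) = 0 and U(i,p) = p to every row.
    out = [row[:p - 1] + [0, p] for row in U]
    if K == R * C:
        # Third exception: exchange U(R-1,0) with U(R-1,p).
        out[R - 1][0], out[R - 1][p] = out[R - 1][p], out[R - 1][0]
    return out


print(invalid_address_count(41, 5, 10))  # 9
```

For K=40 (R=5, p=7, C=8) both conditions of the third exception hold, so the last row is extended and its first and last elements are exchanged.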
Another important block in the architecture is the "Modulo Computation" block which, besides performing the modulo operations in pre-computation and run time modes, also writes the S(j) sequence into the S(j) RAM. Connected to this block is the "Circular Buffer" block, used in run time mode for recursive operations. To synchronize every block in the architecture, a "Controller" block generates all control signals; its state machine diagram is shown in Fig. 9.
Most interleavers, like the ones presented in [5], [3], [8], [9] and [10], provide the interleaved addresses and a label indicating whether or not those addresses are valid. Besides providing the addresses, our interleaver architecture also handles the input data stream. For example, if we receive an input data stream of any size between 40 and 5114, our interleaver takes it and rearranges it according to the corresponding 3GPP interleaving pattern. The architecture outputs an already processed data stream, even when exceptions are present. To manage data streams we use a data RAM, which increases the size of our design; however, since a Turbo code must store this data stream in a RAM anyway, a Turbo code architecture that includes our interleaver can omit its own RAM. Turbo code interleavers usually work with the same block size K several times before changing it. When an interleaving operation is required, the pre-computation mode is performed, in which S(j) is calculated and placed in RAM. Then the run time mode starts, in which interleaved addresses are calculated continuously as long as parameter K does not change. Only when K changes is the pre-computation mode performed again. This is shown in Fig. 9.
As we can see in Fig. 10, which depicts the complete architecture of our design, we also incorporate an extra RAM, called I/O RAM, for data movement during exception handling. In interleaver mode this RAM holds the input data while invalid addresses are being generated, until a valid address is available for writing the data to the DATA RAM. In deinterleaver mode, when garbage is read from the DATA RAM because of an invalid address, the I/O RAM ignores this garbage, waits for valid data and then outputs it. In this way, we can receive or deliver data consecutively even when invalid addresses caused by exceptions are generated. This is another main difference with respect to [5], which does not mention how this data swapping is implemented.
A maximum delay of 199 clock cycles is generated due to invalid addresses; this is the number of locations of the extra I/O RAM.
When we use our design as an interleaver, we use the i_addr signal for writing the data stream into the DATA RAM and we read it in increasing order; when we use it as a deinterleaver, we write the data stream in increasing order and use the i_addr signal for reading. To achieve this we use some multiplexers, as shown in Fig. 10. One input of these multiplexers comes from an exceptions ROM that works like a LUT and is addressed by the controller.
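The two operating modes can be illustrated with a behavioural sketch in Python. It is a simplified software model under stated assumptions, not the hardware: the DATA RAM is a list, the buffering role of the I/O RAM is reduced to skipping invalid addresses, addresses are 0-based so that an address is invalid when it is greater than or equal to K, and the address sequence is a toy pattern rather than a 3GPP one. The function names are hypothetical.

```python
def interleave(data, addresses):
    """Interleaver mode: write through i_addr, read in increasing order."""
    K = len(data)
    ram = [None] * K
    stream = iter(data)
    for addr in addresses:
        if addr < K:             # invalid address: the input symbol waits
            ram[addr] = next(stream)
    return ram


def deinterleave(data, addresses):
    """Deinterleaver mode: write in increasing order, read through i_addr."""
    K = len(data)
    ram = list(data)
    return [ram[addr] for addr in addresses if addr < K]


addresses = [3, 1, 5, 0, 4, 2]   # toy 2x3 pattern; 5 is invalid for K = 5
block = list("abcde")
mixed = interleave(block, addresses)
print(mixed)                           # ['c', 'b', 'e', 'a', 'd']
print(deinterleave(mixed, addresses))  # ['a', 'b', 'c', 'd', 'e']
```

Applying the deinterleaver to the interleaver output returns the original block, and the invalid address costs one idle step in each direction, which is the delay that the I/O RAM absorbs in hardware.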
V. PERFORMANCE RESULTS
Finally, after designing our architecture, we compiled it and downloaded it to a Cyclone II FPGA from ALTERA. Table 2 summarizes the results and compares them against other interleaver designs. As can be seen from the table, our design occupies about 16.5% (5000 out of 30216 Logic Elements, L.E.) of the available resources.
It is worth mentioning that although our design is bigger than others, it includes an exceptions ROM, an I/O RAM and a DATA RAM, as well as extra hardware to control them. In this way, our architecture is capable of working both as an interleaver and as a deinterleaver, which other designs are not capable of. Another advantage of this extra hardware is that our design can manage data streams and actually performs the interleaving/deinterleaving operation. This means that, if we receive a data block of any size, our architecture not only generates the interleaved/deinterleaved addresses but also interleaves/deinterleaves the data as it is, i.e. without any modification. In this I/D, latency and throughput vary with the block size, but for the largest block size, K=5114, the proposed architecture requires at most 5140 clock cycles, that is, 26 cycles (about 0.5%) more than the block length, which is a negligible overhead in practice.
VI. CONCLUSIONS
In this paper we presented a fully functional 3GPP Turbo code interleaver/deinterleaver architecture that receives an input data stream of any size established by the 3GPP standard and delivers this stream interleaved or deinterleaved, depending on the user requirements. In this design we used an iterative numerical algorithm to compute the modulo operation, and we took advantage of using the same hardware in pre-computation and run time modes by multiplexing it. Finally, by adding RAMs for data handling we achieved a complete architecture that can perform interleaving/deinterleaving operations as required by the 3GPP standard for Turbo codes. This can be seen as an advantage from the point of view of configurability.
