Abstract - We describe an energy-efficient approach for VLSI implementation of the 3rd Generation Partnership Project (3GPP) turbo coding interleaver algorithm. Unlike previous implementations, this interleaver uses a two-stage dedicated hardware datapath that exploits the iterative nature of the decoding process to compute addresses on the fly, eliminating the overhead associated with programmable processors and precomputed address storage. By separating the interleaving process into two stages, our architecture allows the preparatory phase to be turned off during iterations, while the decoder engages only the real-time address computation phase, further reducing power consumption.
INTRODUCTION
Turbo codes [1] have been shown to provide superior error performance amidst hostile transmission media. Third-generation systems based on the 3GPP air interface standard [2] must support turbo coding at data rates between 384 kbps and 2 Mbps, while maintaining low mobile station power. The error performance of turbo codes is determined by their distance spectrum, a parameter that is affected both by the encoder structure and by the interleaver type and depth. Whereas external interleavers are customarily used to spread out burst errors caused by a fading channel (e.g., a wireless channel), a non-uniform internal interleaver is employed in turbo codes in order to randomize the data sequence from the encoder and achieve maximum entropy. The interleaver also ensures that the parity sequences generated by the two recursive systematic convolutional encoders are scrambled to appear as uncorrelated as possible, allowing the use of iterative suboptimal decoding algorithms in the decoders.
Fig. 1 is a simplified representation of turbo decoding. A turbo decoder consists of soft-input soft-output (SISO) component decoders, which exchange information cooperatively to produce better estimates of the transmitted symbols over the noisy channel. The estimates, in the form of probabilities, are interleaved and deinterleaved between SISO computations.
While the interleavers are not the dominant power or performance consumers of a turbo decoding system, their presence impacts the throughput and memory requirements of the system significantly. By employing efficient algorithm simplification and hardware mapping, we seek to operate the system at the lowest possible frequency, thus reducing overall power consumption.
The 3GPP-defined turbo code interleaver algorithm comprises a series of mathematically complex processes that map an input sequence of length K (40 ≤ K ≤ 5114) to a scrambled sequence of the same length. The algorithm can be summarized as follows:
1. Row-wise data input to an R × C matrix, with zero padding if K < R × C.
2. Intra-row permutation of the rectangular matrix, based on recursively constructed base sequences.
3. Inter-row permutation of the rectangular matrix, based on a well-defined inter-row permutation pattern T and a permuted least primes sequence r (obtained from T and an ordered least primes sequence q).
4. Column-wise data output from the rectangular matrix, with pruning.
We separate the interleaving process into a preparatory phase and a real-time address computation stage. The preparatory phase prepares all the parameters necessary for real-time address computation, using nearly constant time for all sequence lengths. This process determines R, C, p, s, T, and r, as required by the algorithm above. At the end of the preparatory process, a DONE register signals to the real-time address generator that the various operations have been completed and the needed registers have been prepared. The real-time address generator then uses the prepared sequences to compute addresses on the fly. The two units communicate via simple register control protocols. The decoder processor interacts with the real-time address generation stage through the control signals "Interleave" and "Deinterleave". During decoding iterations, the preparatory phase can be turned off, further saving power. The generalized real-time j-th interleaved address is computed using equation (1), where the operators have their usual meanings. Fig. 2 is the general architecture of our hardware interleaver.
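The split between a one-time preparatory phase and an on-the-fly address phase can be illustrated in software. The following is a minimal sketch only, not the hardware datapath and not the 3GPP-specified permutations: the function names (prepare, addresses, interleave, deinterleave), the rule C = ceil(K / R), the reversed row order, and the per-row rotation are all hypothetical placeholders standing in for the R, C, T, s, and r values that the standard defines. What the sketch does show is the row-wise write, permuted column-wise read, and pruning of zero-padded positions, with no stored address table.

```python
def prepare(K, R, row_order=None, col_orders=None):
    """Preparatory phase: run once per frame length K.

    The default permutations are placeholders (assumptions), NOT the
    3GPP inter-row pattern T or the intra-row sequences built from s and r.
    """
    C = (K + R - 1) // R  # assumption: smallest C with R * C >= K
    if row_order is None:
        # hypothetical inter-row pattern: reverse the rows
        row_order = list(reversed(range(R)))
    if col_orders is None:
        # hypothetical intra-row pattern: rotate row i left by i columns
        col_orders = [[(c + i) % C for c in range(C)] for i in range(R)]
    return {"K": K, "R": R, "C": C,
            "row_order": row_order, "col_orders": col_orders}


def addresses(params):
    """Real-time phase: yield interleaved addresses one at a time."""
    K, R, C = params["K"], params["R"], params["C"]
    for col in range(C):                      # column-wise read-out
        for row in range(R):
            src_row = params["row_order"][row]            # inter-row permutation
            src_col = params["col_orders"][src_row][col]  # intra-row permutation
            addr = src_row * C + src_col      # index in the row-wise written input
            if addr < K:                      # pruning: skip zero-padded cells
                yield addr


def interleave(data, params):
    """Read the input in interleaved order."""
    return [data[a] for a in addresses(params)]


def deinterleave(data, params):
    """Undo interleave() by writing each element back to its address."""
    out = [None] * params["K"]
    for j, a in enumerate(addresses(params)):
        out[a] = data[j]
    return out
```

In this sketch prepare() plays the role of the stage that can be switched off during decoding iterations, while addresses() is the only part exercised on every pass; the same generator serves both the "Interleave" and "Deinterleave" directions.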
RESULTS AND DISCUSSIONS
In certain coding experiments where only a few frame lengths are allowed, it is possible to store the entire interleaved sequence for each of the possible frame lengths in a small lookup table.
For the 3GPP interleaver, the range of allowable frame lengths (40 to 5114) is quite large, and the scrambled sequences are different for different K. In this case, a brute-force approach that stores the interleaved addresses for each of the possible input sequence lengths would require roughly 25 MB of storage. It is also possible to implement the interleaver using a software-programmable processor that prepares an address memory, given a particular value of the input sequence length. This approach would require at least a 5K 13-bit word memory, ignoring the memory required to hold the program instructions. Our architecture minimizes power and area by removing this storage memory.
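The storage figures above can be checked with a short calculation. This is a sketch under one stated assumption: the text does not say how the brute-force estimate was derived, so the code assumes each stored address occupies one 16-bit word; a tightly packed 13-bit encoding is shown for comparison. The variable names are illustrative.

```python
K_MIN, K_MAX = 40, 5114  # allowable frame lengths from the text

# One stored address per symbol, for every allowable frame length.
total_addresses = sum(range(K_MIN, K_MAX + 1))

# Addresses run from 0 to K_MAX - 1, so 13 bits are enough.
bits_per_address = (K_MAX - 1).bit_length()

# Assumption: each address is stored in a 16-bit (2-byte) word.
table_bytes_16bit = total_addresses * 2

# Alternative: addresses packed at exactly 13 bits each (rounded up to bytes).
table_bytes_packed = -(-total_addresses * bits_per_address // 8)

# Single-frame address memory for the programmable approach: 5K words of 13 bits.
single_frame_bits = 5 * 1024 * bits_per_address

print(total_addresses, bits_per_address)
print(round(table_bytes_16bit / 2**20, 2), "MiB with 16-bit words")
print(round(table_bytes_packed / 2**20, 2), "MiB packed at 13 bits")
print(single_frame_bits // 8, "bytes for one 5K x 13-bit memory")
```

Under the 16-bit assumption the full table holds 13,078,275 addresses in about 24.9 MiB, which is consistent with the rough 25 MB estimate, and the 13-bit word width follows from the largest address, 5113, being below 2^13.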
We have designed the interleaver using a 0.25 µm, [illegible]-layer metal CMOS process, and verified its functionality for sequences over the required range. Fig. 3 shows a layout of the real-time interleaver component, whose performance we compare with typical 5K 13-bit ASIC memory storage. While the chip core occupies a 2.54 mm² area (a greater percentage of the chip area is attributed to an unoptimized, directly implemented divider component) at a comfortable 12.5 MHz clock, we expect even greater overall energy savings. We have estimated energy savings at greater than [value illegible] for a fixed eight-iteration decoding process in a typical optimized ASIC core in a similar CMOS process. Removal of this memory should translate to an overall reduction of at least 1 mm² in interleaver area. In estimating the total energy saving per operation, we have neglected the DC and leakage components usually associated with random access memories, basing our estimates only on the capacitance driven per operation. Our approach, which accepts a slight penalty in computational complexity, further allows portions of the datapath not in use to be turned off, resulting in a further reduction in power.
CONCLUSIONS
We have outlined associated challenges in implementing the 3GPP turbo code internal interleaver and described some techniques used to overcome them. In particular, we have described architectural innovations that have allowed us to reduce interleaver latency, power consumption, and area.
We have made our estimates based on worst-case assumptions. While our estimates suggest improved performance gains, it is possible that when the distribution of frame lengths is uniform, the actual performance penalty of the software programmable approach may not be that excessive, compared to our approach. Furthermore, a software programmable approach allows one the flexibility of algorithm improvement and executable updates, a property that is lacking in our approach. Subsequently, actual physical chip measurements will be performed to obtain more accurate performance comparisons.
Turbo Codes," Proc. 1993 IEEE Int. Conf. Comm. (Geneva, Switzerland, May 1993), pp. 1064-1070.
2.

