Abstract-Great interest has been gained in recent years by a new error-correcting code technique, known as "turbo coding," which has been proven to offer performance closer to the Shannon limit than traditional concatenated codes. In this paper, several very large scale integration (VLSI) architectures suitable for turbo decoder implementation are proposed and compared in terms of complexity and performance; the impact on the VLSI complexity of system parameters like the state number, number of iterations, and code rate is evaluated for the different solutions. The results of this architectural study have then been exploited for the design of a specific decoder, implementing a serial concatenation scheme with 2/3 and 3/4 codes; the designed circuit occupies 35 mm$^2$, supports a 2-Mb/s data rate, and for a bit error probability of $10^{-6}$, yields a coding gain larger than 7 dB, with ten iterations.
I. INTRODUCTION
FOR MANY digital communication services, bandwidth and transmission power are limited resources, and it is well known that the use of forward error-correction codes plays a fundamental role in increasing power and spectrum efficiency. However, Shannon showed in [1] and [2] that the development of error-correction techniques with increasing coding gain has a limit, arising from the channel capacity. Since [1] and [2], code designers have been looking for new codes that approach the Shannon limit as closely as possible, but each increase in coding gain comes at the expense of decoder complexity, whose practical feasibility must be evaluated for the available technologies.
Performance very close to the Shannon capacity limit is obtained by a class of new coding schemes, widely known as turbo codes [3]. This coding technique is made up of convolutional encoders connected either in parallel or in series [4], [5] to produce concatenated outputs. The bit sequences going from one encoder to another are permuted by an interleaver: in this way, low-weight codewords produced by a single encoder are transformed into high-weight codewords for the global encoder, thus achieving high coding gains.
The decoder consists of a net of interconnected deinterleavers and decoding stages, individually matched to the simple constituent codes. As an example, Fig. 1 shows the encoder and decoder for a simple parallel scheme: E1 and E2 are convolutional encoders, I an interleaver, 1/I the corresponding deinterleaver, and D1 and D2 are the decoding stages for E1 and E2. The decoding of the received sequences is performed iteratively, through an algorithm that processes the a-posteriori probability (APP) distributions associated to the information and coded symbols ($u$ and $c$, respectively). The decoding stages exchange with each other the obtained soft output information, which is a set of new probability distributions for the $u$ and $c$ symbols; the decoding process is repeated iteratively until a satisfactory estimate of the transmitted information sequence has been achieved.
Manuscript received July 9, 1997; revised February 21, 1998. This work was supported in part by the European Space Agency under Contract 12146/96, and in part by the Consiglio Nazionale delle Ricerche under the Programma di Ricerca Applicata Microelettronica-Legge 95/95 Project.
The authors are with the Dipartimento di Elettronica, Politecnico di Torino, 10129 Turin, Italy (e-mail: masera@polito.it).
Publisher Item Identifier S 1063-8210(99)04571-0.
A first algorithm for trellis codes optimizing the APP's was introduced in 1974 [6] (referred to as the Bahl, Cocke, Jelinek and Raviv (BCJR) algorithm, in the following), but it did not find practical applications until recently because of its implementation complexity. A simplified APP algorithm has been recently proposed for the decoding of parallel concatenated codes [7] , [8] , while several studies currently in progress are aimed at finding the best convolutional encoders and interleaver structures [9] - [11] .
Although some very large scale integration (VLSI) implementations have already been obtained for particular turbo decoders, systematic architectural studies with complexity analysis and cost/performance comparisons are not available in the literature. This paper is aimed at exploring the possible architectural solutions for a wide spectrum of turbo decoders and at evaluating their cost as a function of the key system parameters, like the number of states of the constituent codes, the code rate, and the number of quantization bits used. This paper also presents the VLSI design of a complete turbo decoder for deep space transmission applications, assumed as a case study. The results of the architectural study presented in the first part of this paper have been used to select the best solution in terms of cost and performance; then, the VLSI implementation has been obtained for a 0.5-μm technology.
After a brief review of the decoding APP algorithm in Section II, the general architecture of a turbo decoder is discussed in Section III; the proposed VLSI architectures are detailed in Section IV and then compared in terms of required silicon area, latency, and throughput in Section V. The decoder design case study is presented in Section VI, while the obtained results are summarized in Section VII.
II. THEORY OF OPERATION
The presentation of turbo code theory and the description of the APP decoding algorithm are beyond the scope of this paper, as several tutorial papers have been published on the subject [12]-[14]. This section summarizes the most critical computations to be implemented in a turbo decoder according to the algorithm described in greater detail in [7] and [8].
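A generic encoder and corresponding decoding stage are shown in Fig. 2: the encoder maps the input symbols $u$ into the output symbols $c$; the decoding block receives the current estimates of the probability distributions of the encoder input and output symbols [$P_k(u;I)$ and $P_k(c;I)$, respectively] and returns new refined values for these distributions [$P_k(u;O)$ and $P_k(c;O)$]. As the decoding stage manages probability distributions and gives at each iteration reliability information instead of a hard decision, it is usually called a soft-input soft-output (SISO) decoder. According to the original BCJR algorithm, the probability distributions obtained as SISO outputs, referred to in the turbo code literature as "extrinsic information," can be evaluated as

$$P_k(c;O) = H_c \sum_{e:\,c(e)=c} A_{k-1}[s^S(e)]\, P_k[u(e);I]\, B_k[s^E(e)]$$

$$P_k(u;O) = H_u \sum_{e:\,u(e)=u} A_{k-1}[s^S(e)]\, P_k[c(e);I]\, B_k[s^E(e)]$$

where the same notation adopted in [7] is used as follows.
1) Index $k$ indicates the time step and runs on the whole transmission length, while symbol $s$ represents a code state.
2) $e$ is a generic trellis edge, while $c(e)$ and $u(e)$ are the output and input coder symbols associated to the edge $e$.
3) $s^S(e)$ and $s^E(e)$ indicate the starting and ending states for the generic trellis edge $e$.
4) $H_c$ and $H_u$ are normalization constants.
$A_k(s)$ and $B_k(s)$ are equivalent to the path metrics in the Viterbi algorithm: they are probability distributions accumulated in the forward and backward directions along the trellis according to the following updating relations:

$$A_k(s) = \sum_{e:\,s^E(e)=s} A_{k-1}[s^S(e)]\, P_k[u(e);I]\, P_k[c(e);I]$$

$$B_k(s) = \sum_{e:\,s^S(e)=s} B_{k+1}[s^E(e)]\, P_{k+1}[u(e);I]\, P_{k+1}[c(e);I]$$

Several modifications to the algorithm are required for a practical implementation. First of all, instead of applying the recursive relations given above to the whole transmitted sequence, they are limited to a finite trellis depth, called sliding window [new data path (NDP)]. The second main drawback of the described algorithm is the large number of required multiplications, which are eliminated in the additive version of the algorithm [7], introducing the following definitions:

$$\alpha_k(s) = \log A_k(s), \quad \beta_k(s) = \log B_k(s), \quad \pi_k(\cdot\,;I) = \log P_k(\cdot\,;I), \quad \pi_k(\cdot\,;O) = \log P_k(\cdot\,;O), \quad h_c = \log H_c, \quad h_u = \log H_u.$$

With these definitions, the updating equations given above for path and branch metrics take the form of logarithms of sums of exponentials, which can be approximated as

$$a = \log\left[\sum_{i=1}^{L} e^{a_i}\right] \approx \max_i a_i = a_M \tag{1}$$

which gives results very close to $a$ only when $a_M \gg a_i$ for all $i \neq M$, that is, when the signal-to-noise ratio is quite high. A recursive correction algorithm [15] is used for improving the performance in the presence of low signal-to-noise ratios (SNR's)

$$a^{(l)} = \max\left(a^{(l-1)}, a_l\right) + \log\left[1 + \exp\left(-\left|a^{(l-1)} - a_l\right|\right)\right] \tag{2}$$

for $l = 2, \ldots, L$ with $a^{(1)} = a_1$ and $a = a^{(L)}$. The algorithm requires the execution of two types of operations: a comparison with maximum selection and the evaluation of the following logarithm:

$$\log\left[1 + \exp\left(-|\Delta|\right)\right]$$

which is easily implemented as a look-up table. Introducing the operation $\max^*$ for indicating algorithm (2), the basic APP relations are simplified as follows:

$$\alpha_k(s) = \max^*_{e:\,s^E(e)=s} \left\{ \alpha_{k-1}[s^S(e)] + \pi_k[u(e);I] + \pi_k[c(e);I] \right\} \tag{3}$$

$$\beta_k(s) = \max^*_{e:\,s^S(e)=s} \left\{ \beta_{k+1}[s^E(e)] + \pi_{k+1}[u(e);I] + \pi_{k+1}[c(e);I] \right\} \tag{4}$$

$$\pi_k(c;O) = \max^*_{e:\,c(e)=c} \left\{ \alpha_{k-1}[s^S(e)] + \pi_k[u(e);I] + \beta_k[s^E(e)] \right\} + h_c \tag{5}$$

$$\pi_k(u;O) = \max^*_{e:\,u(e)=u} \left\{ \alpha_{k-1}[s^S(e)] + \pi_k[c(e);I] + \beta_k[s^E(e)] \right\} + h_u \tag{6}$$

These are the basic relations to be implemented in the decoding stage of a turbo decoder (SISO stage).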
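As an illustration of the recursive correction algorithm (2), the following minimal Python sketch evaluates the logarithm of a sum of exponentials pairwise, taking the correction term either exactly or from a small look-up table. It assumes the usual correction term log(1 + exp(-|delta|)); the table size and step (`LUT`, `LUT_STEP`) and the function names are hypothetical choices made for the example, not values taken from the described design.

```python
import math

# Hypothetical table: 8 entries, quantization step 0.5 in the log domain.
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]


def correction(delta):
    """Look-up-table approximation of log(1 + exp(-|delta|))."""
    index = int(abs(delta) / LUT_STEP)
    # Beyond the end of the table the correction is treated as zero.
    return LUT[index] if index < len(LUT) else 0.0


def max_star(values, exact=False):
    """Recursive evaluation of log(sum(exp(a_i))), two operands at a time."""
    acc = values[0]
    for a in values[1:]:
        delta = acc - a
        if exact:
            corr = math.log1p(math.exp(-abs(delta)))
        else:
            corr = correction(delta)
        acc = max(acc, a) + corr  # comparison with maximum selection, then correction
    return acc
```

With `exact=True` the recursion reproduces the logarithm of the sum of exponentials up to rounding; with the table, the result differs by at most the quantization error of the table at each step, and it is never smaller than the plain maximum of approximation (1).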
III. GENERAL DECODER ARCHITECTURE
A complete turbo decoder mainly includes two kinds of blocks: SISO stages, implementing the APP algorithm described in the previous section, and deinterleavers, which scramble the processed data according to the interleaving laws used in the encoder. Other blocks are required for the implementation of the decoder, such as random-access memory (RAM) for storing data through the iterations or synchronization circuits. These blocks can be interconnected in the decoder in many topologies [8], depending on the encoder structure. As an example, an encoder obtained as the serial concatenation of two convolutional codes and an interleaver is described in Fig. 3: E23 is a four-state 2/3 rate recursive systematic convolutional encoder, and E34 is a similar encoder with rate 3/4, so that the whole encoder has rate 1/2; the interleaver I permutes the incoming bits in blocks of length 12k. E23 takes two information bits at a time (sequence u23) and returns three bits (c23). Each trellis transition is associated to the couple of input and output code symbols u23 and c23. The interleaver I reads a block of k bits and outputs them in a scrambled order with a latency . Finally, E34 takes three bits (u34) and encodes them in four output bits (c34); also for this code, trellis transitions are associated to couples u34 and c34. As the interleaver operates on the bit sequence processed by E23, a block of interleaved bits corresponds to 2/3 k information bits. The performance of this concatenated code and the VLSI implementation of the decoder are presented in Section VI.
The encoded bits are then carried to a quaternary phase shift keying (QPSK) modulator and transmitted on the channel. At the receiving side, a demapper evaluates the reliability values associated to the E34 trellis edges, indicated in (3)-(6) as $\pi_k[c(e);I]$ for the generic transition $e$; 16 probability values are obtained, one for each of the 16 possible output symbols of E34. A block of these reliabilities must be stored in a memory to be reused in later iterations.
After demapping, the SISO34 stage of Fig. 4(a) executes the APP algorithm described in Section II, returning the probabilities $\pi_k(u;O)$ associated to the E34 input symbols [see (6)]: as three bits are encoded by E34, eight probabilities must be computed. With this encoding scheme, the reliabilities of the E34 output symbols do not need to be evaluated; thus, (5) is not implemented. The deinterleaver 1/I performs the inverse of the permutation used at the transmitter side so that the decoding stage SISO23 can operate on a sequence of reliability values corresponding to the symbol sequence c23; note that the interleaver works on bits; thus, the reliabilities associated to the encoder symbols must be converted into bit probabilities.
When a whole block of 12k reliability values has been reordered, SISO23 starts decoding, evaluating the probabilities associated to both the input and output E23 symbols [$\pi_k(u;O)$ and $\pi_k(c;O)$]; the first ones can be returned as reliabilities of the decoded information bits. The decoder performance is strongly improved if more iterations are carried out; this implies that the $\pi_k(c;O)$ values from SISO23 are interleaved, obtaining the $\pi_k(u;I)$ input values for SISO34, which repeats its work: it reads a complete block of $\pi_k(c;I)$ probabilities from a memory and computes new $\pi_k(u;O)$ values. This decoder architecture makes use of a single instance of each of the two SISO stages and the required iterations are performed serially (serial decoding): this means that the overall decoding speed is lower than the speed of the SISO by a factor equal to the number of iterations. As an alternative, a parallel decoding can be adopted, where two SISO stages are allocated for each iteration and the global decoding speed is equal to that of the SISO. In this case, the feed-forward structure of Fig. 4(b) is used; the architecture is a chain of blocks that process the data in a pipeline, and the last SISO23 stage in the chain returns the final outputs. The obvious drawback of the parallel solution is the cost coming from resource duplication. The serial and parallel decoding strategies presented here with reference to a specific example can also be adopted for other decoder topologies, including the structure of Fig. 1, which can be implemented with single or multiple instances of D1 and D2.
An important role is played in the decoder by interleavers and deinterleavers, which usually add a large contribution to the whole cost, mainly for the following reasons.
1) The error rate of the code is inversely proportional to the interleaving length [16], [17]; thus, the interleaver implementation involves a large RAM.
2) The size of the memory devoted to the storage of the reliabilities computed by the demapper is also related to the interleaving length.
3) It has been proven that, at least for high SNR's, the best performance is obtained with randomly generated interleaving patterns: this means that the required sequence of addresses cannot be obtained through simple computations, but quite large read-only memories (ROM's) must be allocated.
The straightforward implementation of an interleaver requires two RAM's, each holding one block of data, plus a ROM for storing the interleaving pattern: one memory is filled with the incoming data, which are then read out in the interleaved order, while a new block of data is written in the second memory. The read-write operations are then repeated, alternating between the two memories.
The complexity can be strongly reduced if a single RAM is allocated. In this case, the operations of writing a new value in one memory and reading an old value from the second one are replaced with a single read-modify-write access to the unique RAM. This implies that consecutive blocks of data are processed alternating between the sequential and interleaved orders, e.g., even blocks are written sequentially into the RAM and later read in the scrambled order, while odd blocks are written in the interleaved order and read sequentially.
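The single-RAM scheme can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: the class name `SingleRamInterleaver`, the list used as RAM, and the tiny four-entry pattern are hypothetical, and each loop iteration stands for one read-modify-write access.

```python
class SingleRamInterleaver:
    """Behavioral model of an interleaver built around a single RAM."""

    def __init__(self, pattern):
        self.pattern = list(pattern)        # ROM content: interleaving addresses
        self.ram = [None] * len(pattern)    # the unique RAM
        self.block_index = 0                # parity selects the address order

    def process_block(self, block):
        """Write a new block while reading out the previously stored one."""
        use_rom = self.block_index % 2 == 1
        out = []
        for i in range(len(self.pattern)):
            # Even blocks use sequential addresses, odd blocks the ROM pattern.
            addr = self.pattern[i] if use_rom else i
            out.append(self.ram[addr])      # read the old value ...
            self.ram[addr] = block[i]       # ... and overwrite it in the same access
        self.block_index += 1
        return out


interleaver = SingleRamInterleaver([2, 0, 3, 1])
interleaver.process_block(["a", "b", "c", "d"])              # fills the RAM
first_out = interleaver.process_block(["e", "f", "g", "h"])
second_out = interleaver.process_block(["i", "j", "k", "l"])
```

In this model a block written sequentially comes out as `block[pattern[i]]`, while a block written through the pattern and read sequentially comes out permuted by the inverse pattern. The two orders coincide only for a self-inverse pattern, so a real design has to account for the alternation when matching interleaver and deinterleaver.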
IV. SISO ARCHITECTURES
If the main relations of the additive version of the APP algorithm are considered [see (3)-(6)], the same computational structure of the metric updating in the Viterbi algorithm can be easily observed; $\alpha$ and $\beta$ play the same role as the path metrics in the traditional Viterbi algorithm, while $\pi(u;I)$ and $\pi(c;I)$ correspond to the branch metrics. The implementation of (3) and (6) implies the addition of three values, the comparison of a number of results equal to the number of edges entering a trellis state, and the selection of the greatest result [according to the algorithm in (2)], which must be propagated as the updated state metric; the SISO stage must, therefore, include add, compare, select (ACS) structures as main building blocks for the $\alpha$ and $\beta$ updating, and similar blocks are required for the calculation of the output probabilities $\pi(u;O)$ and $\pi(c;O)$. These blocks will be referred to as $\alpha$, $\beta$, $\pi(u;O)$, and $\pi(c;O)$ ACS sections. In this section, some architectures are proposed for the SISO implementation, corresponding to different cost-performance tradeoffs.
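A minimal Python sketch of one forward ACS step may help to fix ideas. The two-state trellis in `EDGES` is a hypothetical toy example, not one of the constituent codes of this paper, and the compare-select step uses the exact two-operand form of the logarithmic correction rather than a look-up table.

```python
import math


def max_star(a, b):
    """Two-operand compare-select with logarithmic correction."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


# Hypothetical 2-state trellis; each edge is (start_state, end_state, u, c).
EDGES = [(0, 0, 0, 0), (0, 1, 1, 3), (1, 0, 1, 1), (1, 1, 0, 2)]


def acs_alpha(alpha_prev, pi_u, pi_c, edges=EDGES, n_states=2):
    """One forward update: add three values per edge, then compare and select."""
    alpha = [None] * n_states
    for s_start, s_end, u, c in edges:
        candidate = alpha_prev[s_start] + pi_u[u] + pi_c[c]    # add
        if alpha[s_end] is None:
            alpha[s_end] = candidate
        else:
            alpha[s_end] = max_star(alpha[s_end], candidate)   # compare-select
    return alpha


# Uniform inputs: every state collects two equally likely edges.
alpha_next = acs_alpha([0.0, 0.0], pi_u=[0.0, 0.0], pi_c=[0.0, 0.0, 0.0, 0.0])
```

Adding the same constant to every entry of `alpha_prev` shifts every updated metric by that constant and leaves their differences unchanged, which is the property exploited when normalization is discussed.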
A. The Problem of Metric Normalization
Before examining the proposed SISO architectures, the choice of the number of bits for the $\alpha$ and $\beta$ representation and the normalization of these metrics are discussed.
At each updating step, the input probabilities $\pi(u;I)$ and $\pi(c;I)$ (which are positive) are added to the current metrics and the largest values for each state are selected as new metrics: as a consequence, $\alpha$ and $\beta$ increase their values from step to step. To avoid an arithmetic overflow, the direct solution is to find the smallest metric after each updating step and to subtract it from all the metrics; this method implies a high cost, as one subtracter per state, plus the comparators needed to find the minimum, are additionally required for each $\alpha$ or $\beta$ ACS processor. Moreover, the normalization increases the latency of the decoder.
The normalization of the $\alpha$ and $\beta$ metrics at each updating step can be avoided by adopting a proper numerical representation of the metric differences. In fact, from (3) and (4), it is evident that metrics are updated after relative comparisons performed by the ACS sections and the result does not change if the same constant is added to each $\alpha$ or $\beta$ metric. Moreover, it can be proven [18] that all possible differences between pairs of path metrics are upper bounded: in other words, path metrics grow from one step to the other, but their relative differences have an upper limit , which depends on the free distance of the code. Now assume that path metrics are represented as positive integers using bits, where is given by
As a consequence of this choice, the metric difference has a range from to so that it can be represented as a 2's complement value with bits and the sign bit has weight . This implies that the computation can be performed using modulo arithmetic without affecting the correctness of the result. Examples of difference calculation between metric pairs and are given in Fig. 5 for the case and . 
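The wrap-around argument can be illustrated with a short Python sketch. It assumes metrics stored in registers of a hypothetical width `n_bits` and true metric differences smaller in magnitude than half the register range; the function names and the four-bit example are illustrative and are not taken from Fig. 5.

```python
def wrap(value, n_bits):
    """Keep only the n_bits least significant bits, as a finite register does."""
    return value & ((1 << n_bits) - 1)


def signed_difference(m1, m2, n_bits):
    """Difference of two wrapped metrics, read as a 2's complement number."""
    diff = (m1 - m2) & ((1 << n_bits) - 1)
    if diff >= 1 << (n_bits - 1):   # sign bit set: the value is negative
        diff -= 1 << n_bits
    return diff


# True metrics 17 and 14 overflow a 4-bit register differently (1 and 14),
# yet their wrapped difference still equals the true difference.
stored_a = wrap(17, 4)
stored_b = wrap(14, 4)
difference = signed_difference(stored_a, stored_b, 4)
```

The comparison performed by an ACS section only needs the sign of this difference, so the metrics can be left to overflow freely as long as the bound on their differences holds.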
B. Full-Speed SISO Architecture
From the computational cost point of view, the most expensive part of the SISO algorithm is the updating of the $\beta$ values: for each $\alpha$ updating, a number of $\beta$ calculations equal to the sliding window length (NDP) is required.
For this reason, a full-speed architecture must include one ACS section for the $\alpha$ processing and a pipeline of NDP $\beta$ ACS sections. Two additional $\pi(u;O)$ and $\pi(c;O)$ ACS sections are in charge of the output probabilities calculation; moreover, two shift registers of length NDP are used to continuously feed the $\beta$ ACS with delayed versions of the input probabilities: this architecture allows a new backward ACS iteration to start at each step of the $\alpha$ updating, without waiting for the conclusion of the previous iteration. A view of the general SISO architecture is given in Fig. 6. The architecture is completely parallel and each ACS section requires adders and comparators, where is the number of encoder input symbols.
C. Modifications to the Full-Speed Architecture
The architectural cost can be reduced at the expense of lower performance if hardware resources are shared.
A first possibility for decreasing the SISO complexity comes from the consideration that the input probabilities $\pi(u;I)$ and $\pi(c;I)$ are added in each ACS section: the number of adders can be halved if the input probabilities are summed once and for all at the beginning of the ACS pipeline. This way, the number of adders per section becomes , while the number of comparators remains the same. On the other hand, the cost of the two shift registers increases, as all required combinations of the input probabilities must be calculated and properly delayed.
A relevant cost reduction can be obtained if a reduced number of ACS processors is shared among states or consecutive iterations. In the first case, each ACS section will contain ACS processors, each one updating in sequence different states of the code. The complexity of the section is reduced by a factor near to , not considering the cost of the additional multiplexers required for feeding the ACS processors with the correct path and branch metrics; the same reduction factor holds for the throughput. A similar result is obtained when a single ACS section is used for performing consecutive steps in the backward iteration. A view of the SISO architecture with shared ACS sections is given in Fig. 7.
D. Memory Architectures
The basic idea of this solution is to double the extension of the backward recursion [see (4)] from NDP to 2 NDP: in this way, after the initial NDP steps of the backward recursion, the computed $\beta$ values have a memory longer than NDP (in the sense that they include the contribution of more than NDP branch metrics through the trellis) and they can, therefore, be used to directly feed the output ACS. This solution has already been introduced in [14] as an alternative to the storage of the entire state metric history: in the following, a detailed architecture is proposed for this idea, comparing its complexity with the previously reported solutions.
To describe the proposed memory architecture, let us define as S the operation of storing a block of NDP input branch probabilities in a RAM; according to the APP algorithm, this block of written values must be read three times for performing the following operations:
1) operation A, which is the updating of NDP $\alpha$ metrics in the forward direction [see (3)];
2) operation B, which is the updating of NDP $\beta$ metrics in the backward direction [see (4)];
3) operation P, which is the computing of NDP values of the output probability distributions [see (5), (6)].
These three operations are performed in parallel; in order to avoid the use of a multiple-port RAM, three separate memories of depth NDP are used so that the defined operations operate on them in a cyclic way. One additional memory is required for temporarily storing the computed $\alpha$'s.
In Fig. 8, the operations performed on the three memories are indicated for a sequence of five phases, where a phase indicates the processing of a whole block of NDP input metrics. In the figure, the B operation is performed twice for each block: the first time, $\beta$ values are updated for the first NDP steps, but not used for the SISO output calculation [only (4) is evaluated]; the second time, the updating is continued for the following NDP steps and NDP SISO outputs are evaluated at the same time [(4), (5), and (6) are computed in parallel].
Step 1) In phase 1, the first NDP long block of branch metrics is stored in RAM 1 (S1).
Step 2) The second block of branch metrics is stored in RAM 2 (S2).
Step 3) The third block of branch metrics is stored in RAM 3 (S3), while the $\beta$'s are calculated on block 2 (B2 unit) and the values obtained at the last iteration are taken as the initial values for the subsequent backward recursion. At the same time, $\alpha$'s are calculated for block 1 (A1) and stored in a separate memory for subsequent use.
Step 4) In phase 4, the RAM 1 content is read in the reverse order for sequentially calculating the current $\beta$'s (B1) and the associated output probabilities (P1); this calculation also makes use of the $\alpha$'s stored in the separate RAM, which can be reused for the writing of the new $\alpha$'s evaluated on block 2 (A2). At the same time, the $\beta$ initialization is performed on block 3 (B3 unit) and new NDP branch metrics are stored in RAM 1 (S4): read-modify-write access is, therefore, required for the memory.
Step 5) Phase 5 and the following ones repeat the same operations of phase 4, but the role of the three RAM's is cyclically shifted.
As a result of this modification to the algorithm, a couple of $\beta$ ACS sections are allocated instead of the NDP long pipeline of Fig. 6, thus achieving a drastic cost reduction with respect to the previously reported architectures. A top view of the architecture implementing this version of the algorithm is given in Fig. 9. SISO outputs are obtained in the reversed order, but the correct order can be restored by properly changing the interleaving law of the decoder. Alternatively, a last in first out (LIFO) block must be included in the decoding stage.
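The phase-by-phase rotation of the three RAM's can be reproduced with a small Python sketch. It only models which operation touches which memory in each phase, following the steps listed above; the function names and the textual labels (for example `"B2 init"` for the first, warm-up pass of the backward recursion) are hypothetical.

```python
def ram_of(block):
    """RAM (1..3) holding a given block; blocks rotate over three memories."""
    return (block - 1) % 3 + 1


def memory_schedule(n_phases):
    """Map each phase to {ram: [operations]} for the memory architecture."""
    schedule = {}
    for phase in range(1, n_phases + 1):
        ops = {}
        if phase >= 4:
            # Second backward pass with output computation, read in reverse order.
            done = phase - 3
            ops.setdefault(ram_of(done), []).append(f"B{done}+P{done}")
        # The incoming block is stored; from phase 4 on it reuses the RAM just read.
        ops.setdefault(ram_of(phase), []).append(f"S{phase}")
        if phase >= 3:
            ops.setdefault(ram_of(phase - 1), []).append(f"B{phase - 1} init")
            ops.setdefault(ram_of(phase - 2), []).append(f"A{phase - 2}")
        schedule[phase] = ops
    return schedule


for phase, ops in memory_schedule(5).items():
    print(phase, ops)
```

In every phase from the fourth on, each RAM is accessed by a different operation, and the memory being read for the output calculation is the same one receiving the new block, which is why a read-modify-write access is sufficient.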
E. The Output ACS Sections
The $\pi(c;O)$ and $\pi(u;O)$ ACS sections implement (5) and (6), receiving $\alpha$ and $\beta$ metrics, plus the input probabilities $\pi(u;I)$ or $\pi(c;I)$. Branch and path metrics must be summed and the results must be compared for selecting the greatest values so that ACS structures are again required. The operations are extended to all $u$ and $c$ symbols in the encoder.
The operations performed in the output ACS sections raise a numerical problem, which has a strong influence on the final performance of the decoder. In fact, the output probability distributions calculated by the SISO will be used as the new input probabilities in the next SISO iteration; (7) implies that several bits used for representing the obtained output probabilities must be dropped before the next SISO processing. For example, if the branch metrics are represented as positive numbers on five bits and the path metrics in 2's complement, the SISO outputs will be represented with nine bits: as a consequence, they must be converted to the positive range and the number of bits reduced from 9 to 5, with the best possible resolution and without changing the ordering of the obtained values.
Generally speaking, in case of a high level of noise on the channel, the probability distributions at the SISO outputs tend to be uniform, while in case of a low noise level, a wider range of values will be obtained: in the first situation, it is more convenient to drop the most significant bits, while in the second one, it is mandatory to drop the least significant bits.
The simplest solution is to find the lowest output probability [$\pi(u;O)$ or $\pi(c;O)$] and to subtract it from all the others: the resulting values will occupy a range going from zero to the maximum probability difference. The number of bits can be reduced by first dropping all the most significant bits, which are equal for all the obtained differences, and then dropping least significant bits until the required bit length is reached: this operation can be viewed as the multiplication by a proper power-of-two coefficient, followed by a truncation. The SISO operates independently of the value of this coefficient, provided that it is the same for all processed data, so that it can only be modified from one interleaving block to the other.
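A minimal Python sketch of this reduction follows. It assumes integer SISO outputs and a target of five bits, as in the example given earlier in this section; the function names and the sample values are hypothetical, and the shift is chosen once and then applied unchanged, in the spirit of keeping the coefficient constant within an interleaving block.

```python
def choose_shift(max_difference, target_bits):
    """Number of least significant bits to drop so the range fits target_bits."""
    return max(0, max_difference.bit_length() - target_bits)


def renormalize(outputs, shift):
    """Subtract the lowest value, then scale by 2**(-shift) with truncation."""
    lowest = min(outputs)
    return [(value - lowest) >> shift for value in outputs]


raw = [-37, 12, 100, 45]        # hypothetical 9-bit 2's complement SISO outputs
# The largest difference is 137, which needs 8 bits, so 3 LSBs are dropped.
shift = choose_shift(max(raw) - min(raw), target_bits=5)
scaled = renormalize(raw, shift)
```

The result is non-negative, fits in five bits, and keeps the ordering of the original values, although values that are close to each other may become equal after truncation.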
V. SISO COMPLEXITY EVALUATION AND COMPARISON
The cost in terms of silicon area has been estimated for the proposed architectures, starting from the synthesis results obtained for the constituent building blocks, which have been synthesized using a 0.5-μm ST CMOS technology and the standard-cell style of implementation; the resulting figures have then been composed to evaluate the described SISO architectures for different numbers of states and for different numbers of bits used to represent the branch metrics. It should be noted that, once the number of states and the number of branch-metric bits are fixed, the backward recursion depth, NDP, and the number of bits for the $\alpha$ and $\beta$ representation are also fixed. The resulting area occupation with and five bits are given in Figs. 10 and 11. These figures also show the complexity of a "radix-4" implementation of the SISO, where two consecutive steps in the code trellis are collapsed to obtain a new trellis with a double number of transitions; this technique is adopted in some Viterbi decoders for relaxing the memory constraints and improving performance [19], [20], but it implies a higher complexity. For codes with a high number of states, the architecture with pre-summed metrics performs better than the full-speed one, although they offer the same throughput; on the contrary, for codes with a small state number, the additional cost in the shift registers is not compensated by the saved adders in the ACS sections. Structures with shared ACS processors are compared in these figures for the case of four iterations. This means that each ACS processor is shared among either four states or four consecutive sections. The value of four iterations has been chosen as it offers the best results among the evaluated cases (two, four, and eight iterations). Finally, the memory architecture saves between 70% and 90% of silicon area with respect to the full-speed architecture.
The memory architecture has a latency equal to NDP, while all the other solutions imply a lower latency, closely related to NDP. However, it is worth considering that the latency of the whole decoder is dominated by the interleaver length (typically 4k-16k), which is normally much larger than NDP.
The architectures compared in Figs. 10 and 11 do not offer the same throughput, and a cost-throughput comparison requires that the actual working frequency of the different solutions be evaluated. In order to obtain consistent cost and throughput estimations, modified versions of the proposed architectures have been obtained, introducing additional pipelining levels and adopting the fastest implementation styles for the basic building blocks. The modified architectures have been synthesized and compared in terms of both cost and offered throughput. Fig. 12 presents the obtained results for different architectures implementing a 1/2 rate SISO with four states: the memory structure, full-speed scheme, and two shared architectures, with two (ACS shar. 2) and one (ACS shar. 4) ACS processors shared among the code states. The axis ticks in the figure are throughput values normalized with respect to the clock frequency so that the value 0.25 means that four clock cycles are required for a valid output data. In the cost evaluation of each considered architecture, the best implementation has been taken for each throughput value; for example, the cost increment in the memory architecture at the normalized throughput 0.1 is mainly due to the change from ripple-carry to carry-lookahead adders. Fig. 12 shows the throughput limits of the memory architecture for the considered technology; the throughput can be improved by adopting the full-speed architecture or its modifications, although only low-state number codes are worth considering with the 0.5-μm technology. Figs. 13 and 14 give the cost-throughput comparisons for 2/3 and 3/4 rates; the area saving granted by the memory architecture is larger for lower rates and the best throughput can only be offered by the full-speed solution.
VI. VLSI DECODER IMPLEMENTATION
In this section, the design of a turbo decoder for deep space applications is described as a case study. The aim of the work was to demonstrate the feasibility, as a single chip, of a high-performance receiver for serially concatenated convolutional codes, including both the decoding part and the synchronization circuits.
The performance target of the transmission system implies an SNR lower than 1.4 dB for a bit error rate (BER) of 10 and lower than 1.6 dB for ; the transmission data rate is 2 Mb/s and the modulation is QPSK. The adopted coding scheme is that of Fig. 3, with a four-state rate 3/4 inner code and a four-state rate 2/3 outer code; the interleaver length is 12k. The two constituent codes and interleaver size are the best possible choice meeting the BER constraints [22] and they have been selected among a large number of possibilities with the evaluation method described in [4] and [17].
Due to the specified throughput, the general structure of Fig. 4(a) has been adopted, but in order to fully exploit the allocated hardware resources, two blocks of data are processed concurrently: while SISO34 is processing a block of data, SISO23 processes the previous block. The architecture also requires a double-input memory [$\pi(c;I)$ data] and two separate structures for the interleaver and deinterleaver.
As two blocks of data are iteratively decoded at the same time, all decoder sub-parts must be properly synchronized according to the operation timing described in Fig. 15 : being the interleaving length and , the latencies of the SISO's, the decoder operates cyclically on periods of length . The first received block of data (A1) is processed by SISO34, which returns the outputs with a delay equal to its latency ; the SISO outputs are written in the deinterleaving memory and then read in the scrambled order at the next cycle (time ). Block A1 is processed by SISO23 and written in the interleaving memory after a delay ; at the beginning of the next cycle (time ), this first block of data is read from the interleaver and processed for the second time by SISO34 (now block A2). As the latency of the two SISO's is negligible with respect to the interleaving length (NDP , k), the four main units of the decoder actually operate in parallel for most of the time.
The decoder architecture has been implemented as a structural very high-speed integrated-circuit hardware description language (VHDL) code and synthesized for a CMOS 0.5-μm ST technology. The Cadence environment has been used for both simulation and synthesis.
A. Interleaver and Deinterleaver
The interleaving length strongly affects both the error-correction capabilities and the decoder silicon area; thus, it is usually set at the minimum value compatible with the required performance; in the present case, 12k [22].
Interleaver and deinterleaver have the same structure, which basically consists of a RAM memory with k words, to be read and written according to predefined sequences of addresses, stored in a ROM: for even data blocks, values are written sequentially and read in the interleaved order, while for odd blocks, the interleaved addresses are used for writing and the sequential order for reading.
Read accesses are always generated in even clock periods from an bit counter, while write accesses are generated in the odd periods; the same addresses of the read accesses are used for the writing with a delay equal to the SISO latency.
B. SISO23 (Outer)
For the SISO implementation, the memory architecture has been adopted. The $\alpha$ and $\beta$ sections are implemented as shown in Fig. 16. The indicated tree structure performs the sums of the branch and path metrics in the first stage; then the comparison with the logarithmic correction is performed in two steps using three comparators, three look-up tables, and three adders. Additional multiplexers on the inputs and pipelining registers allow this unit to be shared among the four states, according to the timing in Fig. 17: five clock periods are required for the updating of all four states.
The synthesis results indicate that the SISO23 stage requires an area mm , including the three banks of RAM; the clock frequency to support a decoding rate of 2 Mb/s is 50 MHz.
C. SISO34 (Inner)
The memory architecture has also been adopted for the SISO34. In this case, for the ACS sections, the best path out of the eight possible ones must be selected for each state, as implemented in the tree architecture of Fig. 18. The two cells in Fig. 19 are required, each one working on two levels of pipelining. Cell A receives the path metric of one state plus the proper $\pi(u;I)$ and $\pi(c;I)$ probabilities. Cell B completes the selection algorithm of (2). Six cycles are required for producing a new output and nine cycles for updating all four states.
From the synthesis, the maximum frequency is 94 MHz and the required area is 5 mm$^2$ for the whole SISO.
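The reported clock figures can be cross-checked with a back-of-the-envelope calculation, sketched here in Python. It rests on assumptions that are not stated explicitly in this section: ten serial iterations (the number quoted in the abstract), one E23 trellis step per two information bits, one E34 trellis step per three E23 output bits (hence also per two information bits), and each SISO visiting every trellis step once per iteration. The function name is hypothetical.

```python
def required_clock_hz(info_rate_bps, info_bits_per_step, cycles_per_step, iterations):
    """Clock needed by a serially iterated SISO to sustain the information rate."""
    steps_per_second = info_rate_bps // info_bits_per_step
    return steps_per_second * cycles_per_step * iterations


# SISO23: five clock periods per trellis step (all four states updated).
siso23_clock = required_clock_hz(2_000_000, 2, 5, 10)
# SISO34: nine clock periods per trellis step.
siso34_clock = required_clock_hz(2_000_000, 2, 9, 10)
```

Under these assumptions the first value equals the 50 MHz quoted for SISO23, and the second one stays below the 94 MHz maximum frequency obtained for SISO34.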
The global cost of the designed decoder is given in Table I, where the RAM cost is expressed in memory bit capacity, while the cost of the computational parts is evaluated in mm$^2$ for a 0.5-μm CMOS technology.
Simulations have been performed on the developed circuit described as the VHDL code, and the BER as a function of the SNR has been estimated. The simulation results after a number of iterations ranging from 1 to 10 are given in Fig. 20 (10 bits have been simulated) .
VII. CONCLUSIONS
Several architectural alternatives for the implementation of turbo decoders are described in this paper, ranging from full-speed parallel solutions to small-area structures with shared resources. This study can be exploited in the development of high-performance receivers with different constraints of cost and throughput. As a specific design example, this paper also shows that a low-cost turbo decoder is feasible using standard CMOS technology with an area lower than 35 mm$^2$.
Future work on this subject includes the design of a second turbo decoder with very high-speed constraints. In this case, the full-speed architecture has been adopted and the technology mapping will be performed using a true single-phase clocking (TSPC) library, which guarantees operation frequencies greater than 1 GHz.
