Presented in this paper is a low-power architecture for turbo decoding of parallel concatenated convolutional codes. The proposed architecture is derived via the concept of block-interleaved computation followed by folding, retiming and voltage scaling. Block-interleaved computation can be applied to any data processing unit that operates on data blocks and satisfies the following three properties: 1.) computations between blocks are independent, 2.) a block can be segmented into computationally independent sub-blocks, and 3.) computation within a sub-block is recursive. The application of block-interleaved computation, folding and retiming reduces the critical path delay in the add-compare-select (ACS) kernel of MAP decoders by 50%-84% with an area overhead of 14%-70%. Subsequent application of voltage scaling results in up to 65% savings in power for a block-interleaving depth of 6. Experimental results obtained by transistor-level timing and power analysis tools demonstrate power savings of 20%-44% for a block-interleaving depth of 2 in a 0.25 µm CMOS process.
INTRODUCTION
Turbo codes [1] have shown remarkable receiver performance improvement over hard-decision decoding. As a result, turbo codes have been accepted as the coding technique in next generation wireless communication standards such as Wideband CDMA (WCDMA) and the 3rd Generation Partnership Project (3GPP) for IMT-2000. Further, other turbo-based receiver techniques such as turbo equalizers [2] have been proposed in order to leverage the coding gains inherent in turbo processing.
Extensive research has already been done on low-power implementations of turbo code decoders and turbo equalizers [3]-[12]. These include low-power quantization of the log-likelihood ratio (LLR) [3]-[4], early termination [3]-[8], memory optimization [9]-[11], and optimization of computational complexity [12].
In this paper, we propose a low-power architecture for the turbo decoder of parallel concatenated convolutional codes (PCCC), where the maximum a posteriori probability (MAP) decoder forms the main computational kernel of an iterative decoding process. Note that the above-mentioned low-power techniques can be applied on top of the technique proposed in this paper for additional power savings. The proposed architecture is derived via the concept of block-interleaved computation followed by folding, retiming and voltage scaling. Block-interleaved computation can be applied to any data processing unit that operates on data blocks and satisfies the following three properties: 1.) computations between blocks are independent, 2.) a block can be segmented into computationally independent sub-blocks, and 3.) computation within a sub-block is recursive. The application of block-interleaved computation, folding and retiming reduces the critical path delay in the add-compare-select (ACS) kernel of the MAP decoder. Subsequent scaling of the power supply results in significant power savings in turbo decoders.
The rest of this paper is organized as follows. Section 2 describes the low-power technique of block-interleaved computation. A review of the turbo decoding algorithm is provided in Section 3 followed by a description of the proposed low-power decoder architecture. Experimental results are discussed in Section 4, and Section 5 concludes the paper.
BLOCK-INTERLEAVED COMPUTATION
In this section, we describe how low-power operation can be achieved via block-interleaved computation followed by folding, retiming and power supply scaling. Consider the recursive architecture in Fig. 1. Note that the architecture in Fig. 1 cannot be easily pipelined or processed in parallel due to the presence of the feedback loop. However, if the data in a block of length N is processed independently of other blocks and the computations in a block can be reformulated such that the block is segmented into computationally independent sub-blocks, then one can parallelize the architecture as shown in Fig. 2, where the parallelization factor M is 2 and a block X is divided into two sub-blocks, X1 and X2.
If we now fold [13] the parallel architecture in Fig. 2 by a factor of M = 2, we obtain the folded block-interleaved architecture shown in Fig. 3. Note that the folded block-interleaved architecture is inherently pipelined. Therefore, an application of retiming [13] (see Fig. 4) results in a reduction of the critical path delay by a factor of two over that of the original architecture in Fig. 1. It is clear that voltage scaling can now be employed to reduce the power dissipation without impacting the throughput.
In summary, block-interleaved computation requires that processing be done in a sub-block-based manner within a block where computations within a sub-block are recursive and computations between sub-blocks are independent. In such a case, processing M sub-blocks simultaneously results in a folded block-interleaved architecture which is inherently pipelined. Subsequent retiming and voltage scaling can then be used to reduce power while maintaining throughput.
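The idea can be illustrated in software. The following Python sketch is only an illustration under assumed names (`recursive_unit`, `run_sequential`, `run_block_interleaved`) and an arbitrary first-order recursion; it does not model the circuits of Figs. 1-4. It shows one shared recursive unit with M registers in its feedback loop (modelled as a FIFO) processing M equal-length, independent sub-blocks in interleaved order, and producing the same results as processing each sub-block on its own.

```python
from collections import deque

def recursive_unit(state, x):
    """Hypothetical recursive kernel: the next state depends on the previous state and the input."""
    return max(state + x, 2 * x)

def run_sequential(sub_block, init=0):
    """Reference: process one sub-block with a single delay in the feedback loop."""
    state, out = init, []
    for x in sub_block:
        state = recursive_unit(state, x)
        out.append(state)
    return out

def run_block_interleaved(sub_blocks, init=0):
    """Folded version: one kernel, M delays in the loop, inputs interleaved sample by sample."""
    m = len(sub_blocks)
    length = len(sub_blocks[0])       # all sub-blocks are assumed to have equal length
    loop = deque([init] * m)          # M registers in the feedback loop
    outs = [[] for _ in range(m)]
    for n in range(length):
        for j in range(m):            # sample n of sub-block j enters at cycle n*M + j
            state = loop.popleft()    # value written M cycles earlier (same sub-block)
            state = recursive_unit(state, sub_blocks[j][n])
            loop.append(state)
            outs[j].append(state)
    return outs

block = [3, -1, 4, 1, -5, 9, 2, -6]
x1, x2 = block[:4], block[4:]         # M = 2 sub-blocks
interleaved = run_block_interleaved([x1, x2])
print(interleaved)
```

Because the value read from the loop is always M cycles old, the M registers can be redistributed into the kernel by retiming without changing the results, which is what shortens the critical path in hardware.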
LOW-POWER TURBO DECODER ARCHITECTURE
In this section, we first review the turbo decoding algorithm and its VLSI architecture. Next, we apply the technique of block-interleaved computation described in Section 2 to derive a low-power decoder architecture.
PCCC and its turbo decoding
The turbo coding technique considered in this paper is made up of two recursive systematic convolutional (RSC) encoders connected in parallel as shown in Fig. 5. The bit sequence passed from one encoder to the other is permuted by an interleaver. The turbo decoder of PCCC contains two soft-input soft-output (SISO) MAP decoders, which are associated with the two RSC encoders as depicted in Fig. 5. The observed sequences are decoded iteratively, with each decoder passing its improved soft output information to the other. Although the two SISO MAP decoders process independent observed sequences, the turbo decoder can be implemented via the serial architecture shown in Fig. 6, where one SISO MAP decoder is shared in a time-multiplexed manner.
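The encoder structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the encoder of Fig. 5. It assumes a constraint length K = 3 code with octal polynomials 5 and 7, takes 7 as the feedback and 5 as the feedforward polynomial (an assumption), uses a made-up interleaver permutation `perm`, and omits trellis termination. The function names are hypothetical.

```python
def parity(x):
    """Return the XOR of all bits of the non-negative integer x."""
    return bin(x).count("1") & 1

def rsc_encode(bits, feedback=0o7, feedforward=0o5, k=3):
    """Parity stream of a rate-1/2 RSC encoder (the systematic stream is `bits`).

    Assumption: feedback = 7 (octal), feedforward = 5 (octal), K = 3.
    """
    state = 0                                   # K-1 memory bits, newest in the MSB
    mem_mask = (1 << (k - 1)) - 1
    out = []
    for u in bits:
        a = u ^ parity(state & feedback & mem_mask)   # input plus feedback taps
        word = (a << (k - 1)) | state                 # [a, memory bits]
        out.append(parity(word & feedforward))        # feedforward taps give the parity bit
        state = word >> 1                             # shift the register
    return out

def pccc_encode(bits, perm):
    """Parallel concatenation: systematic bits plus two parity streams."""
    parity1 = rsc_encode(bits)
    parity2 = rsc_encode([bits[i] for i in perm])     # second encoder sees permuted bits
    return bits, parity1, parity2

msg = [1, 0, 0, 0]
perm = [2, 0, 3, 1]                             # hypothetical interleaver permutation
systematic, p1, p2 = pccc_encode(msg, perm)
print(systematic, p1, p2)
```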
Numerous algorithms [14] exist that can be employed for SISO MAP decoders. Among them, the log-domain MAP algorithm (log-MAP) is popular as it formulates the computations as additions rather than multiplications [11]. If c is a block of N coded bits and y is a block of N channel output bits, then the goal of the log-MAP algorithm is to estimate the a posteriori probability LLR for the k-th coded bit, which is defined as
The forward metric ($\alpha_k(s)$), the backward metric ($\beta_k(s)$), and the branch metric ($\gamma_k(s', s)$) are defined as [11]:
where $s'$ and $s$ are trellis states and the $\max^*$ operation is implemented as
$$\max{}^*(x, y) \approx \max(x, y) + \ln\left(1 + e^{-|x-y|}\right). \qquad (5)$$
The LLR of the k-th bit is approximated as
In the turbo decoding algorithm, the updated LLRs are passed to the MAP decoder of the next iteration after being deinterleaved. The detailed derivation is omitted here; see [1], [11].
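For orientation, the log-MAP quantities named in this subsection are commonly written in the following form. This is a sketch in assumed notation ($\Lambda(c_k)$ for the LLR, $s'$ for the starting state and $s$ for the ending state of a trellis branch) and is not a transcription of the paper's own equations. The branch metric $\gamma_k(s', s)$ is left abstract, since its exact form depends on the channel model used in [11].

```latex
% Requires amsmath. Standard log-MAP relations in assumed notation.
\begin{align*}
\Lambda(c_k) &= \ln \frac{P(c_k = 1 \mid \mathbf{y})}{P(c_k = 0 \mid \mathbf{y})} \\
\alpha_k(s) &= \operatorname*{max^*}_{s'} \bigl[ \alpha_{k-1}(s') + \gamma_k(s', s) \bigr] \\
\beta_{k-1}(s') &= \operatorname*{max^*}_{s} \bigl[ \beta_k(s) + \gamma_k(s', s) \bigr] \\
\Lambda(c_k) &\approx
  \operatorname*{max^*}_{(s', s):\, c_k = 1}
    \bigl[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \bigr]
  - \operatorname*{max^*}_{(s', s):\, c_k = 0}
    \bigl[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \bigr]
\end{align*}
```

Both recursions have the same add, compare, select structure, which is why a single ACS kernel design serves the forward and backward metric computations.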
Turbo Decoder Architecture
If the α and β recursions sweep over all N received symbols y, a large amount of storage is required. Hence, a sliding-window log-MAP algorithm [11] is employed in this paper, as it minimizes the metric storage requirement. The sliding-window log-MAP algorithm exploits the property that the forward and backward metrics $\alpha_k$ and $\beta_k$ converge after a few constraint lengths (K) have been traversed in the trellis, independent of the initial conditions. We refer to this property as the warm-up property and denote the warm-up period by L. The warm-up property is employed only for computing backward metrics in this paper, as shown in Fig. 7, where the warm-up and computation periods are depicted using dashed and solid lines, respectively. To implement the log-MAP decoder whose data flow is shown in Fig. 7, the typical decoder architecture depicted in Fig. 8 is employed. The architecture comprises computation units for one branch metric, one forward recursion, and two backward recursions; delay lines to feed branch metrics into the forward and backward recursions; a buffer to store the backward metrics $\beta^1_k$ and $\beta^2_k$; and the LLR computation block. The LLR computations can be implemented in a feedforward manner, and thus this block does not limit the throughput. Note that the forward and backward recursions are computed via an array of ACS kernels in a state-parallel manner. The ACS circuit is shown in Fig. 9, where the correction factor in (5) is implemented via a look-up table (LUT). Path metric re-scaling to avoid overflows [3] is also employed. It is the ACS unit in Fig. 9 that limits the throughput and hence our ability to reduce power via voltage scaling.
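The ACS operation with a table-based correction term can be sketched as follows. This Python fragment is illustrative only: the LUT size and step (`LUT`, `LUT_STEP`), the function names, and the re-scaling rule (subtracting the largest metric) are assumptions chosen for the example, not the parameters of the circuit in Fig. 9 or the scheme of [3].

```python
import math

def max_star(x, y):
    """Exact form: max(x, y) + ln(1 + e^-|x-y|), which equals ln(e^x + e^y)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

# Hypothetical 8-entry look-up table for the correction term, indexed by
# |x - y| quantized in steps of 0.5; differences of 4.0 or more map to 0.
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(x, y):
    """max* with the correction term read from the table instead of computed."""
    idx = int(abs(x - y) / LUT_STEP)
    corr = LUT[idx] if idx < len(LUT) else 0.0
    return max(x, y) + corr

def acs(metric_a, branch_a, metric_b, branch_b, star=max_star_lut):
    """Add the branch metrics, then compare and select with the correction term."""
    return star(metric_a + branch_a, metric_b + branch_b)

def rescale(metrics):
    """Assumed re-scaling rule: subtract the largest metric to bound the dynamic range."""
    top = max(metrics)
    return [m - top for m in metrics]

new_metrics = rescale([acs(0.0, 1.2, -0.5, 0.3), acs(0.0, -0.7, -0.5, 2.1)])
print(new_metrics)
```

The add, the compare-select, and the table look-up all sit inside the recursion, so in the conventional architecture they must complete within one clock cycle.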
Low-power Decoder Architecture
The low-power decoder architecture is obtained by processing M sub-blocks of size N/M bits via sub-block interleaved computation, which is made possible by applying the warm-up property described in Section 3.2 to each of the sub-blocks. For simplicity, we refer to this architecture as the block-interleaved architecture even though the interleaving is done at the sub-block level. We also assume that the block size N and the sub-block size N/M are multiples of the warm-up period L. For example, a block of size N = 10L can be divided into M = 2 sub-blocks of size 5L. The L trellis sections between the 4L-th and 5L-th nodes can then be employed as a warm-up period for generating reliable forward metric estimates for the second sub-block. Figure 10 shows the data flow for an example with N = 10L and M = 2. Note that the beginning and ending states of the trellis are known to the decoder because the encoder terminates the trellis, so no warm-up period is required to compute metrics for the first and last trellis windows. Figure 11 shows the folded block-interleaved ACS architecture for the proposed low-power decoder, where the block-interleaving factor is M. The block-interleaved architecture processes trellis section k of each of the M sub-blocks in an interleaved manner. In contrast to the ACS circuit in Fig. 9, the folded block-interleaved architecture in Fig. 11 has M delays in the feedback loop of the ACS recursion. Thus, retiming can be employed to pipeline the critical path by M levels. The price we pay is an increase in the memory requirements and the pipelining registers. The ratio of the memory requirements of the proposed block-interleaved turbo decoding architecture to that of the conventional turbo decoding architecture is given by
where $B_{IN}$ is the size of the buffer required to hold the inputs and LLR values, and $B_{DL}$ and $B_b$ are the buffer sizes of the delay-line and backward metric RAMs in Fig. 8, respectively. For N = 1024, L = 16, M = 2, and K = 3, η is 1.1, as summarized in Table 1, and it increases linearly with M. The logic complexity also increases linearly with M; the logic complexity ratio of the two architectures was determined via synthesis and is shown in Table 1, where the encoder polynomials $[5, 7]_8$, $[13, 15]_8$, and $[23, 35]_8$ are employed for K = 3, 4, and 5, respectively. We note that the hardware complexity is increased by 14%-25% and the delay is reduced by 30%-43% for M = 2. In order to reduce power, we scale the supply voltage of the block-interleaved architecture such that the block processing time is equal to that of the conventional architecture. The conventional architecture requires N + 2L cycles to process a block (see Fig. 7). The proposed architecture requires
$$N + 2ML \qquad (8)$$
cycles for processing one block (see Fig. 10). Equating the block processing times of the two architectures, we get
$$(N + 2ML)\,\tau_{cri,p} = (N + 2L)\,\tau_{cri,s} \qquad (9)$$
where $\tau_{cri,p}$ and $\tau_{cri,s}$ are the critical path delays of the block-interleaved and conventional architectures, respectively. Thus, we can reduce the supply voltage such that $\tau_{cri,p}$ is equal to $\tau_{cri,s} \times \frac{N+2L}{N+2ML}$. The propagation delay of CMOS circuits is given by
where $C_L$ is the load capacitance, α is the velocity saturation index, β is the gate transconductance, and $V_t$ is the device threshold voltage [13]. Assuming that α = 2 in (10), and substituting (10) into (9), we get
where $V_s$ and $V_p$ are the supply voltages of the conventional and proposed block-interleaved architectures, respectively. By solving (11), the ratio of supply voltages, defined as $V_{sc} = V_p / V_s$, can be derived as
Note that $V_{sc}$ decreases more rapidly for the case when α = 1 than for α = 2. Therefore, more power savings are expected with technology scaling. The power consumption of the block-interleaved architecture can be expressed as
where $A_{sc}$ is the area increase factor due to the increased memory requirement and $P_s$ is the power consumed by the conventional architecture. Then, the power savings can be computed as
Under the assumptions α = 1 and α = 2 in (10), the critical path delay ratio, area overhead ratio, and power savings for K = 3, N = 1024, and L = 16 are depicted in Fig. 12. It is observed that beyond a certain point there are no further power savings due to the area overhead. Further, as the number of states ($= 2^{K-1}$) is increased, the power savings decrease because the area overhead increases rapidly for large values of K (see Fig. 13).
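The voltage-scaling step can be explored numerically. The sketch below is a rough illustration under stated assumptions, not the closed-form solution of the paper. It assumes an alpha-power-law delay proportional to V/(V - V_t)^α as the form of (10), a threshold voltage of 0.5 V, and a hypothetical input `delay_ratio`: the critical path delay of the block-interleaved design divided by that of the conventional design at equal supply voltage (0.5 would be ideal two-level pipelining). The cycle counts N + 2L and N + 2ML follow from the delay relation stated in the text. Power is not modelled.

```python
def cycles_conventional(n, l):
    """Cycles per block for the conventional architecture: N + 2L."""
    return n + 2 * l

def cycles_interleaved(n, l, m):
    """Cycles per block with M-way block interleaving: N + 2ML."""
    return n + 2 * m * l

def delay(v, vt, alpha):
    """Assumed alpha-power-law gate delay, up to the constant factor C_L / beta."""
    return v / (v - vt) ** alpha

def scaled_supply(vs, vt, alpha, delay_ratio, n, l, m, iters=200):
    """Bisect for V_p in (vt, vs] that equalizes the two block processing times.

    Assumes alpha >= 1, for which the delay decreases monotonically with voltage.
    """
    target = delay(vs, vt, alpha) * cycles_conventional(n, l) / (
        delay_ratio * cycles_interleaved(n, l, m))
    if target < delay(vs, vt, alpha):
        raise ValueError("no timing slack: the supply cannot be lowered")
    lo, hi = vt + 1e-9, vs            # delay(lo) is huge, delay(hi) <= target
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delay(mid, vt, alpha) > target:
            lo = mid                  # too slow at this voltage: go higher
        else:
            hi = mid                  # fast enough: try a lower voltage
    return hi

vs, vt = 2.5, 0.5                     # vt = 0.5 V is an assumed threshold voltage
for alpha in (1.0, 2.0):
    vp = scaled_supply(vs, vt, alpha, delay_ratio=0.5, n=1024, l=16, m=2)
    print(f"alpha={alpha}: V_p={vp:.3f} V, V_sc={vp / vs:.3f}")
```

Under these assumptions the solver returns a lower supply voltage for α = 1 than for α = 2, in line with the remark above that $V_{sc}$ falls faster when α = 1.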
EXPERIMENTAL RESULTS
In this section, the design flow followed in this work is briefly described, and experimental results obtained via transistor-level timing and power analysis tools are provided for M = 2. The word lengths of the internal variables of a MAP decoder used within a turbo decoder were determined via computer simulations. The word lengths were chosen such that the loss in coding gain due to quantization effects was less than 0.05 dB at a bit error rate of $10^{-4}$. Several sliding-window log-MAP decoders with different constraint lengths and ACS architectures were implemented in VHDL and their functionality was verified. Then, the designs were synthesized in Synopsys Design Compiler using the TSMC 0.25 µm standard cell library. The synthesized designs were placed and routed using Cadence Silicon Ensemble. PathMill was used to determine the critical path from post-layout simulations. The supply voltage of the block-interleaved architecture was reduced until the critical path delays of the block-interleaved and conventional architectures satisfied $\tau_{cri,p} = \tau_{cri,s} \frac{N+2L}{N+2ML}$. NanoSim was used to measure the power consumption of the MAP decoders, and parameterized dual/single-port SRAM cells [15] were employed to estimate the memory power consumption of the input buffer, LLR buffer, and interleaver. Measured results on the critical path delay are summarized in Table 2 for $V_s = V_p = 2.5$ V.
[Figure 14]
Utilizing the reduced supply voltages listed in Table 3 and the post-layout SPICE netlists, the power consumption of turbo decoding with the conventional and block-interleaved architectures was measured. For measurement purposes, randomly generated input patterns were used. The results are summarized in Table 4, where power savings of 20%-44% are achieved. As predicted in Fig. 13, the measurement results show that the power savings decrease as the constraint length K increases. The throughput in Table 4 is determined by the critical path delay, assuming that 5 iterations are carried out.
CONCLUDING REMARKS
In this paper, we proposed a new low-power design methodology that exploits block-interleaved pipelining and voltage scaling. The ACS kernel is reformulated via the proposed design methodology, thereby achieving power savings in turbo decoders. Our analysis shows that the power saving is maximized at an interleaving depth of M = 6 and decreases beyond that point because the area overhead becomes dominant. Experimental results obtained by transistor-level timing and power analysis tools demonstrate power savings of 20%-44% for a block-interleaving depth of M = 2 in a 0.25 µm CMOS process. Further, the proposed approach can be applied on top of other low-power turbo decoding design schemes such as early termination.
