Abstract-Very large scale integration (VLSI) design methodology and implementation complexities of high-speed, low-power soft-input soft-output (SISO) a posteriori probability (APP) decoders are considered. These decoders are used in iterative algorithms based on turbo codes and related concatenated codes and have shown significant advantage in error correction capability compared to conventional maximum likelihood decoders. This advantage, however, comes at the expense of increased computational complexity, decoding delay, and substantial memory overhead, all of which hinge primarily on the well-known recursion bottleneck of the SISO-APP algorithm. This paper provides a rigorous analysis of the requirements for computational hardware and memory at the architectural level based on a tile-graph approach that models the resource-time scheduling of the recursions of the algorithm. The problem of constructing the decoder architecture and optimizing it for high speed and low power is formulated in terms of the individual recursion patterns which together form a tile graph according to a tiling scheme. Using the tile-graph approach, optimized architectures are derived for the various forms of the sliding-window and parallel-window algorithms known in the literature. A proposed tiling scheme of the recursion patterns, called hybrid tiling, is shown to be particularly effective in reducing memory overhead of high-speed SISO-APP architectures. Simulations demonstrate that the proposed approach achieves savings in area and power in the range of 4.2%-53.1% over state of the art.
I. INTRODUCTION
TURBO CODES [1] and related concatenated codes [2] have proved to be extraordinarily effective error-correcting codes, to the degree that they have become known as capacity-approaching codes. The crucial innovation made by Berrou et al. in [1] was the reintroduction of the concept of iterative decoding to convolutional codes, a concept that was pioneered by Gallager in 1963 [3] in the context of low-density parity-check codes and has largely been neglected since then. The near-Shannon-limit error correction capability has led turbo codes to become the coding technique of choice in many communication and storage systems since their introduction. Some of these applications include, among others, the Third Generation Partnership Project (3GPP) for IMT-2000 [4], Consultative Committee for Space Data Systems (CCSDS) telemetry channel coding [5], and Wideband CDMA, which require throughputs in the range of 2 Mb/s to several 100 Mb/s. As these applications continue to evolve, the trend is shifting more and more toward stringent requirements on power consumption and processing speed as part of standard design practice, which has prompted the need for more efficient turbo decoder implementations if they are to remain part of the mainstream.
Turbo codes are composed of component codes, typically convolutional codes, interconnected through interleavers. Their decoders consist of an equal number of component decoders, each of which operates on its corresponding codeword and shares information with the other component decoders iteratively according to the topology of the encoder. The decoding algorithm in the component decoders is the maximum a posteriori probability (MAP) algorithm, typically implemented in the form known as the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [6]. The main advantage of a MAP decoding algorithm over a maximum likelihood decoding algorithm such as the Viterbi algorithm [7] is that it produces optimum soft information, which is crucial to the operation of these decoders. The BCJR algorithm was generalized in [8] into a soft-input soft-output a posteriori probability (SISO-APP) algorithm to be used as a building block for iterative decoding in code networks with generic topologies. The advantages of the SISO-APP algorithm over other forms of the MAP algorithm are that it is independent of the code type (systematic/nonsystematic, recursive/nonrecursive, trellis with multiple edges), and that it generates reliability information for code symbols as well as message symbols, which makes it applicable irrespective of the concatenation scheme (parallel/serial/hybrid). It is therefore the algorithm considered in this paper.
Manuscript received August 5, 2001; revised June 25, 2002. This work was supported by the National Science Foundation (NSF) under Grant CCR 99-79381 and Grant CCR 00-73490.
The authors are with the Coordinated Science Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA (e-mail: mmansour@mail.icims.csl.uiuc.edu; shanbhag@mail.icims.csl.uiuc.edu).
Digital Object Identifier 10.1109/TVLSI.2003.816136
Optimizing the SISO algorithm for hardware implementation is an essential step in the design of an iterative decoder. This paper addresses the complexity aspects of the SISO algorithm at the architectural level. Interleavers, which are the second major component of a turbo decoder, are not considered here. For a standard turbo decoder [1] composed of two SISO decoders, the SISO decoder throughput is typically up to 10-20 times higher than the turbo decoder throughput assuming 5-10 iterations per frame. For recent broad-band applications requiring throughputs in the range of 10-100 Mb/s, this translates to SISO decoder throughputs in the range of 100 Mb/s-2 Gb/s. Fiber optics-based applications are expected to stretch these figures well into the Gb/s regime. At these speeds, the challenge is to keep the power consumption within tolerable limits.
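The throughput scaling quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as an illustration only, that one SISO unit serves both component decoders, so each iteration costs two SISO passes over the frame; the function name and its parameters are hypothetical.

```python
def siso_throughput_bps(turbo_throughput_bps, iterations, passes_per_iteration=2):
    """Required SISO throughput for a target turbo decoder throughput.

    Assumption (not stated explicitly in the text): every iteration runs
    `passes_per_iteration` SISO passes over the frame, two for a standard
    turbo decoder with two component decoders.
    """
    return turbo_throughput_bps * iterations * passes_per_iteration


# Low end: 10 Mb/s turbo throughput with 5 iterations.
low = siso_throughput_bps(10e6, 5)
# High end: 100 Mb/s turbo throughput with 10 iterations.
high = siso_throughput_bps(100e6, 10)
print(low / 1e6, "Mb/s to", high / 1e9, "Gb/s")
```

Under this assumption the multiplier is 10 for 5 iterations and 20 for 10 iterations, which matches the 10-20 times range and the 100 Mb/s-2 Gb/s figures in the paragraph.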
A. Related Work
Several VLSI implementations of turbo decoders have already emerged, such as [9]-[13], the field programmable function array (FPFA) mapped turbo decoder of the Chameleon project [14], the code-programmable turbo decoder ASIC [15] from Sony, and turbo decoder cores from various commercial entities. At the turbo decoder level, most work on low-power and high-speed issues has focused on stopping criteria for the iterative decoding process and on efficient design of interleavers [16], [17]. At the SISO decoder level, algorithmic optimizations aimed at reducing storage demand and increasing throughput were considered with the introduction of the sliding window algorithm [18]-[20] and the parallel window algorithm [21]. Computational inefficiency issues of the SISO algorithm have been studied extensively: quantization of metrics and extrinsics [17], [22]-[25], metric normalization and scaling [9], [10], [17], [22], [26]-[28], reduction of log-likelihood ratio (LLR) computations [17], and the log-MAP simplification and accompanying add-compare-select unit designs [15], [17], [24].
1063-8210/03$17.00 © 2003 IEEE
The SISO algorithm has a well-known recursion computation bottleneck. Dataflow transformations were performed in [29] on the recursions of the SISO algorithm to reduce power consumption, and parameters were derived that characterize single and double flow architectures for the sliding window algorithm. High-speed SISO decoder architectures were first considered in [18] and [21] , but were limited to nonsystematic codes and the memory architecture employed was not optimized for low power. Later in [30] - [32] , these architectures were generalized to recursive systematic codes and memory optimizations were applied on the dataflow graph to reduce storage overhead.
In this paper, we perform a rigorous analysis of the dataflow optimizations of the SISO-APP algorithm for both the sliding-window and parallel-window versions. The analysis is based on a generalized tile-graph approach [33] that models the resource-time scheduling of the forward and backward recursions of the SISO-APP algorithm. Unlike the dataflow graphs considered in [18], [21], and [29]-[32], the tile graph considered in this paper is viewed as a composition of individual recursion patterns defined mathematically by a set of parameters that model storage and decoding delay effects. The problem of constructing a SISO-APP architecture is formulated as a three-step process of constructing the patterns, counting the patterns needed, and then tiling them. This simple approach covers all the architectures proposed in the literature [18]-[21], [29]-[32]. Moreover, the problem of optimizing the architecture for high speed and low power reduces to optimizing the individual patterns and their tiling scheme for minimal delay, storage overhead, and processing unit overhead. Using this approach, optimized architectures for both the sliding and parallel window algorithms are derived. The "X" architecture of [30]-[32] is a special case of the parallel window dataflow graph with hybrid tiling of individual patterns proposed in Section IV, where it is shown that this architecture is not necessarily optimal with respect to the criteria defined above.
We begin the paper in Section II with a tutorial review of the SISO-APP decoding algorithm, and then survey the existing algorithmic and architectural transformations employed to alleviate the delay and storage bottleneck of the algorithm. In Section III, we propose the tile-graph methodology for analyzing the recursions of the SISO-APP algorithm. The tile-graph approach is then applied to construct an optimized architecture for the sliding window algorithm. In Section IV, the tile-graph approach is extended to include lateral tiling of individual recursion patterns, and a new tiling scheme is devised to optimize the parallel window architecture. Synthesis results that demonstrate the effectiveness of the optimized architectures are presented in Section V. Finally, Section VI concludes the paper. The Appendix includes proofs for the supporting propositions and theorems used in Sections III and IV.
II. SISO-APP DECODING ALGORITHMS AND ARCHITECTURES: A TUTORIAL REVIEW
In this section, we present an overview of the dynamics of the SISO-APP algorithm followed by various architectures that implement the algorithm with different tradeoffs.
A. The SISO-APP Decoding Algorithm
Consider an ( , , ) time-invariant convolutional code . An encoder for is a finite state machine having memory elements that encodes a -bit data symbol into an -bit code symbol [see Fig. 1(a)]. The parameter is called the constraint length of the code, and the code rate of is defined as . For every pair of current-state, input-symbol ( , ) there corresponds a unique pair of next-state, output-symbol ( , ). To encode the sequence of data symbols , the encoder starts from state at time and performs a sequence of state transitions (1) that ends in state at time , in response to each of the data symbols shown above the arrows in (1). The outputs of the encoder, shown below the arrows in (1), constitute the encoded sequence of code symbols . From (1), it is immediate that knowing the state transition sequence uniquely determines and .
An efficient way of listing all possible sequences in (1) is a trellis, which is a state-transition diagram expanded in time. A section of the trellis for is shown in Fig. 1(b) with edges representing allowable transitions between states. For convenience, let the functions , , , and denote the starting state, ending state, input symbol, and output symbol, respectively, associated with an edge of the trellis.
The decoding problem can now be defined as follows: given a noisy version of denoted by , find the data sequence . There are two probabilistic solutions to this decoding problem. Maximum likelihood (ML) decoding determines the most likely connected path through the trellis that maximizes the probability . From , the most likely data sequence is easily determined using (1) . On the other hand, MAP decoding, which we consider here, determines by estimating each of the symbols independently using the observations . The th estimated symbol is the one that maximizes the posterior probability , and hence the name symbol-by-symbol MAP. The SISO-APP algorithm, a generalized version of the BCJR-APP algorithm [6] , is a probabilistic algorithm that solves the MAP decoding problem. The algorithm performs four steps on the trellis to decode the channel sequence .
1) Branch metric computations ( , ): For , the decoder computes code symbol channel reliabilities using the channel symbols . It also accepts an equal number of updated prior reliabilities about the code symbols which are generated by another decoder that estimates the sequence using a different observable sequence than . We denote both reliabilities by . Similarly, the decoder accepts updated data symbol prior reliabilities denoted by , one for each possible data symbol , which are generated by a similar decoder that estimates the sequence independently. The sum of the reliabilities is then associated as a branch metric with each edge of the th trellis section.
2) Forward state metric computations: From and , the algorithm computes for each state , , a forward state metric by performing a forward sweep on the trellis. is related to the probability that the trellis reaches state at time given the past observations . The trellis is initialized for with forward state metrics that reflect that the encoder starts from state as (2) The subsequent metrics for are computed according to the following forward (or α) recursion: (3) where the function max* is defined as [24] (4) and is a correction factor. Approximating max* with max results in close-to-optimal performance for medium-to-high signal-to-noise ratios (SNRs), with a degradation of approximately 0.5-0.7 dB for very low SNRs [8], [24]. The correction factors are normally provided by a lookup table (LUT) if extra accuracy is desirable (in [24] it was shown that eight values provide close to ideal performance). Note that the computations in the α-recursion are based on trellis edges rather than pairs of states, which makes the computation independent of the code type (systematic/nonsystematic, recursive/nonrecursive, code rate more than unity).
3) Backward state metric computations: Another set of backward state metrics are computed starting from the final state and moving backward. The backward state metric of state at time is related to the probability that the trellis path passes through at time given the future observations . The trellis is initialized for with backward state metrics that reflect that the encoder ends in state as (5) The preceding metrics for are computed according to the following backward (or β) recursion: (6)
4) Reliability updates ( , ): Using the branch, forward state, and backward state metrics, the algorithm generates output data and code symbol reliabilities and , for . These reliabilities represent updated estimates of the input reliabilities and obtained using information from all received symbols except the th symbol being computed, conditioned on the code constraints. The reliability of the data symbol is updated by considering all edges of the th trellis section whose associated data symbols are . Similarly, is computed for the th code symbol . The reliability update equations are given by (7) (8) where
. These posterior reliability values need to be processed further in order to interface with interleavers in a concatenated code which typically operate on quantities related to bits rather than symbols. So posterior symbol reliabilities need to be converted to extrinsic bit reliabilities prior to interleaving/deinterleaving, and then back to posterior symbol reliabilities after de-interleaving/interleaving. Together (3), (6) , (7), and (8) are referred to as the key equations of the SISO-APP decoding algorithm, which can be implemented using the building blocks shown in Fig. 2 . A module that implements the output reliability in (7) and (8) is referred to as a -Metric Processing Unit ( -MPU), and a module that implements the -recursions in (3), (6) is referred to as an -MPU ( -MPU). The operation in (4) can be implemented as an add-(compare-select) (ACS)(CS) logic plus a correction term [see Fig. 2(a) ], similar to the ACS logic used in implementing a Viterbi decoder. Fig. 2(b) shows the -MPU and -MPU blocks constructed using a binary tree of depth with ACS blocks as leaves and CS blocks as internal nodes. The MPUs shown operate on a trellis section and compute metrics in parallel. Fig. 2(c) shows the branch metric ACS logic block ( -ACS) used in (7) and (8) . The block can be implemented in either a modular or a cascaded mode. In the modular approach, the -ACS blocks are connected in a binary-tree like fashion of depth for and for to construct the -MPU as shown in Fig. 2(d) . In the cascaded mode, the -ACS blocks are cascaded together in a chain of length for and for , as shown in Fig. 2 (e) on the left, or a single -ACS block can be derived by folding the architecture on the left to iteratively compute output reliabilities as shown on the right in Fig. 2 (e). The cascaded mode requires less logic at the expense of an increase in latency. For convenience, the block -MPU ( -MPU) in 
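The four steps of the algorithm can be illustrated with a small log-domain sketch. Everything named here is an assumption made for illustration: the 4-state toy trellis, the names `max_star`, `siso_app`, and `brute_force_llr`, and the choice of a uniform backward initialization (final state unknown) instead of the terminated-frame initialization. For brevity the sketch outputs the full posterior log-ratio of each data bit rather than the extrinsic reliabilities of (7) and (8), and it checks the recursions against an exhaustive enumeration of all trellis paths.

```python
import math
from itertools import product

NEG_INF = float("-inf")


def max_star(x, y):
    """ln(e^x + e^y), written as max(x, y) plus a correction term."""
    if x == NEG_INF:
        return y
    if y == NEG_INF:
        return x
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


# Hypothetical 4-state trellis: the state holds the two most recent input bits.
# Edge index e = 2 * start_state + input_bit; entry = (start, input, end).
EDGES = [(s, u, ((s << 1) | u) & 3) for s in range(4) for u in (0, 1)]


def siso_app(gamma, start_state=0):
    """gamma[k][e] is the branch metric of edge e in trellis section k.

    Returns ln P(u_k = 1 | observations) - ln P(u_k = 0 | observations).
    """
    n = len(gamma)
    # Forward (alpha) recursion, initialized to the known starting state.
    alpha = [[NEG_INF] * 4 for _ in range(n + 1)]
    alpha[0][start_state] = 0.0
    for k in range(n):
        for e, (s, _, t) in enumerate(EDGES):
            alpha[k + 1][t] = max_star(alpha[k + 1][t], alpha[k][s] + gamma[k][e])
    # Backward (beta) recursion, initialized uniformly (final state unknown).
    beta = [[NEG_INF] * 4 for _ in range(n + 1)]
    beta[n] = [0.0] * 4
    for k in range(n - 1, -1, -1):
        for e, (s, _, t) in enumerate(EDGES):
            beta[k][s] = max_star(beta[k][s], gamma[k][e] + beta[k + 1][t])
    # Reliability update: combine alpha, gamma, and beta over the edges of section k.
    llr = []
    for k in range(n):
        per_bit = [NEG_INF, NEG_INF]
        for e, (s, u, t) in enumerate(EDGES):
            per_bit[u] = max_star(per_bit[u], alpha[k][s] + gamma[k][e] + beta[k + 1][t])
        llr.append(per_bit[1] - per_bit[0])
    return llr


def brute_force_llr(gamma, start_state=0):
    """Reference result obtained by enumerating every input sequence."""
    n = len(gamma)
    acc = [[NEG_INF, NEG_INF] for _ in range(n)]
    for bits in product((0, 1), repeat=n):
        state, metric = start_state, 0.0
        for k, u in enumerate(bits):
            e = 2 * state + u
            metric += gamma[k][e]
            state = EDGES[e][2]
        for k, u in enumerate(bits):
            acc[k][u] = max_star(acc[k][u], metric)
    return [a[1] - a[0] for a in acc]
```

Replacing `max_star` with the built-in `max` gives the max approximation mentioned in the text. The per-edge formulation mirrors the remark that the recursions operate on trellis edges rather than on pairs of states.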
B. SISO-APP Decoder Architectures
Architectures for SISO-APP decoders are best described in terms of dataflow graphs (DFGs) which provide flexibility in exploiting resource-time tradeoffs without impacting the design style [34] . Moreover, optimizations can be easily exposed on a DFG through dataflow analysis. Fig. 3(a) shows the trellis and the corresponding SISO-APP decoder DFG aligned below the trellis sections of an ( , 1, 3) code using the models of Fig. 2 . The trellis sections run from left to right across the frame symbols. In the DFG, the time index runs from top to bottom. The input branch metrics are supplied from the top. The -metrics are computed and stored in the FIFO buffers from left-to-right by the -MPUs from time 1 to . At time , output reliabilities are produced from right to left, in reverse order with respect to the trellis sections, by the -MPU using the stored -metrics and the initial -metrics, then the -metrics are updated. For full throughput architectures, storage is needed to align the input and output metrics as well as to store the intermediate input metrics. Decoding delay (or latency) is proportional to the height of the graph, and storage requirements are proportional to the number of buffers in the graph. For nonpipelined architectures, these buffers represent hardware utilization as a function of time, and hence the appropriate metric to consider is storage lifetime rather than the actual number of buffers. In either case, the term storage is used to refer to both contexts. Fig. 3(b) is a simplified representation of the DFG. The primary objective is to perform dataflow optimizations on this DFG to minimize both delay and storage requirements.
The SISO-APP algorithm, as characterized by the key equations, requires that the whole sequence of symbols be received before decoding can start. Depending on the communication setup, we distinguish among three modes of communication: terminated-frame mode, truncated-frame mode, and continuous mode. In the terminated-frame mode, normally used for short frames, the initial and final states and of the encoder that initialize the α- and β-recursions are known to the decoder, and therefore (2) and (5) apply. The DFG shown in Fig. 3(b) corresponds to the terminated-mode SISO-APP. The algorithm is not efficient for high-speed applications due to its serial processing bottleneck, which requires buffering a complete frame before decoding it (e.g., VDSL, WLAN, satellite communication, and storage systems demand constituent SISO decoder throughputs in the range of 100 Mb/s-2 Gb/s).
In the truncated-frame mode, the constraint on is relaxed and only is assumed to be known at the decoder. This removes the drawbacks of the terminated-frame algorithm in terms of storage requirements and decoding delay which are a consequence of delaying the -recursion flow schedule time units with respect to the -recursion so that the initial conditions of the -recursion are satisfied. The sliding window (SW) SISO (SW-SISO) algorithm proposed in [18] and [19] mitigates the high storage demand by employing multiple -recursion flows instead of a single flow. In [20] , a similar approach was proposed but it differs in the relative timing and storage requirements of the and -recursions. The idea is to run the -recursion on a portion of the input frame instead of the whole frame. It can be shown empirically [21] that after a recursion depth , known as the metric warmup depth that is a function of the code parameters ( , , ) and the channel SNR, metrics converge to their reliable values and are therefore valid metrics. The metrics produced during the warmup phase are invalid metrics.
There are three versions of the SW-SISO algorithm (SW-SISO-I, SW-SISO-II, and SW-SISO-III) known in the literature which differ in the -recursion initialization, recursion warmup depth, and the number of valid metrics produced per -recursion flow. In SW-SISO-I [19] [ Fig. 4(a) ], the -recursions have a warmup depth of and produce only one valid -metric per -recursion flow. The first -recursion is scheduled steps after the -recursion flow, with the following -recursion flows scheduled one step apart. The -recursions are initialized from the -metrics or uniformly as (9) In SW-SISO-II [ Fig. 4(b) ], however, the first -recursion flow is scheduled steps before the -recursion flow, resulting in no storage requirements for the and -metrics. The -recursion is initialized to uniform metrics as (10) Finally, SW-SISO-III [2] , [18] [ Fig. 4(c) ] is the same as SW-SISO-II except that each -recursion flow produces a group of valid metrics rather than a single valid metric.
The three DFGs offer a frame length worth of improvement in decoding delay (up to a constant plus a multiple of ) over the DFG in Fig. 3(b) , with obvious savings in memory requirements for SW-SISO-II and SW-SISO-III. SW-SISO-II and SW-SISO-III also achieve a good compromise between communication performance by operating on long frames, and VLSI efficiency by scheduling the recursions in a way to cut down storage (in Section III-A we provide an optimized recursion scheduling method). However, VLSI performance is still unacceptable for high-speed applications such as those mentioned earlier because decoding delay is still proportional to (e.g., a throughput of 200 Mb/s requires a -MPU with propagation delay of 5 ns which is infeasible in current VLSI technology). This leads to the third mode of the SISO-APP algorithm, the continuous mode. In this mode, encoding is performed assuming that the message frame is infinitely long. Due to storage limitations, the decoder must divide the message to be decoded into frames of length say . For these frames, however, and are not known at the decoder, and consequently the key equations cannot be directly applied to decode the frames. In this case, approximations have to be made to initialize both the and -recursions. Architectures for this mode are discussed in Section IV.
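The 200 Mb/s example above follows from a one-line calculation. The sketch assumes, for illustration, that a single MPU processes one trellis section per clock cycle and that each section decodes one bit; the function name is hypothetical.

```python
def mpu_delay_budget_ns(throughput_bps, bits_per_section=1):
    """Time available for one recursion step, in nanoseconds.

    Assumption: one trellis section is processed per clock cycle, so the
    MPU propagation delay must fit within one symbol period.
    """
    sections_per_second = throughput_bps / bits_per_section
    return 1e9 / sections_per_second


print(mpu_delay_budget_ns(200e6))  # budget for a 200 Mb/s SISO decoder
```

Because the recursion is a feedback loop, this budget cannot be relaxed by pipelining the MPU, which is why the text turns to processing several windows in parallel.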
III. OPTIMIZED SLIDING WINDOW ARCHITECTURES
In this section, we perform scheduling optimizations of the SISO-APP decoding algorithm that aim to minimize hardware storage and computation time. As already seen in Section II-B, SW-SISO-III shown in Fig. 4 (c) has smaller decoding delay than SW-SISO-I, but its storage requirements are still greater than SW-SISO-II. The analysis below shows under what conditions on the parameters ( , , ) in Fig. 4 (c) are both the decoding delay and storage requirements minimal.
A. Tile-Graph Synthesis and Analysis Methodology
We propose a tile-graph methodology [33] that performs dataflow analysis on the sliding window DFG, which is composed of a single -recursion flow and multiple -recursion flows that "slide" over the -recursion flow, to come up with an architecture that incurs minimal decoding delay and requires the least metric storage area. 1 To this end, a DFG that decodes code symbols is divided into smaller flow graphs, referred to as recursion patterns, where each recursion pattern decodes code symbols, and these recursion patterns are optimized instead. An optimized DFG can then be constructed using the optimized recursion patterns. The motivation behind this approach is the following. The -recursion flows in the DFG are independent of each other, and each -recursion flow is composed of a metric warmup phase of length followed by a valid metric computation phase of length . Therefore, each -recursion flow can be paired with a portion of the -recursion flow of length to form a recursion pattern. Fig. 5(a) shows how the SW-SISO-III DFG shown in Fig. 4(c) can be decomposed into recursion patterns identical to pattern shown in Fig. 5(b). Conversely, the SW-SISO-III DFG can be constructed by tiling the recursion patterns diagonally. Similar comments apply to the other SW-SISO DFGs considered in the previous section. A DFG constructed by tiling recursion patterns is called a tile-graph. Consequently, minimizing delay and storage area of a DFG translates to finding the most compact diagonal tiling of the individual recursion patterns. In Section IV, we consider lateral tiling of recursion patterns. Note that the proposed tile-graph method differs from other methods in the literature [18], [21], [29]-[32] in its simplicity, regularity, and generality, since it is based on recursion patterns and their tiling scheme as opposed to complete dataflow graphs with multiple recursions, which are less intuitive to analyze.
For example, the single-flow structures of [29] can be constructed and analyzed in the same way as the SW-SISO DFGs of Fig. 4 , while the double-flow structures in the same paper are handled by reflecting the DFGs of Fig. 4(b) and (c) to the right and combining each DFG and its image into a single window after proper alignment.
B. Sliding-Window DFG Construction
A recursion pattern of a sliding window architecture of the SISO-APP algorithm can be configured using three parameters: 1) is the difference between the starting time of the -recursion flow and the ending time of the -recursion flow; 2) is the number of metric computations needed to initialize the -recursion flow, already defined as the metric warmup depth; and 3) is the number of valid metric computations performed by the -recursion flow in relation to the total number of metric computations. The problem of constructing and analyzing the performance of a sliding window architecture can then be summarized in three steps: 1) construct a single recursion pattern with parameters ( , , ); 2) determine the number of patterns from ( , , ) and ; and 3) tile the patterns diagonally.
1) Recursion pattern construction: It can be shown (see Lemma 1 in the Appendix) that the coordinates of points of a recursion pattern , such as the one shown in Fig. 5(b) , can be uniquely determined from the parameters ( , , ). Using simple algebraic and geometric manipulations, the coordinates of points are given by (11) Moreover, it is not difficult to show that feasible recursion patterns can be realized from the parameters ( , , ) if and only if the parameters ( , , ) satisfy one of the following six mutually exclusive constraints:
and (12a) and (12b) and (12c) and (12d) (12e)
where , , , and . These constraints result from comparing with and , as well as with . The resulting feasible recursion pattern configuration space is shown in Fig. 6. The recursion pattern in Fig. 5(b) has parameters ( , , ) and satisfies constraint (12d). Using (11) and the constraints in (12a)-(12f), together with the resulting pattern geometries of the feasible recursion patterns in Fig. 6, the dimensions (width and height ) of any feasible recursion pattern are given by (see Lemma 2 in the Appendix) (13)
2) Determining the number of recursion patterns: The number of recursion patterns can be determined from the number of decoded symbols per pattern, , and the total number of symbols, , as , assuming is a multiple of . The last pattern must be reconfigured to have the warmup depth parameter as . As an example, the DFG in Fig. 5(a) having requires three (3, 3, 2)-patterns, and one (3, 1, 2)-pattern as the last pattern.
3) Tiling the recursion patterns: The last step is to assemble the constituent recursion patterns to construct the architecture. Since the recursion patterns are tiled diagonally, the tiling separation between adjacent patterns needs to be determined. Consider the two adjacent patterns and shown in Fig. 5(a) , and let ( , ) designate the coordinates of point of pattern . Then, the quantity is the horizontal offset of pattern from pattern [see horizontal line with double arrowheads in Fig. 5(a) ]. Since the forward recursion in pattern runs for steps, the forward recursion in pattern can start steps later, resulting in . The quantity is the vertical offset of pattern from pattern [see vertical line with double arrowheads in Fig. 5(a) ]. If the patterns are allowed to overlap, this offset is the distance from point to , resulting in . For the cases or [i.e., Fig. 6(a) , (b), and (f)], the patterns can also be tiled in a nonoverlapping way such that a pattern on the right is completely below its neighbor on the left. In this case, the offset is the distance from point to , resulting in . Tiling with overlapping patterns leads to more compact DFGs, and hence it will be considered in the rest of the paper (although the analysis performed later applies equally well to nonoverlapping patterns). The patterns in Fig. 5(a) have a horizontal offset of , and a vertical offset of .
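The counting and horizontal placement steps above can be sketched in a few lines. Since the symbols were lost in this copy, the parameter names below (`offset`, `warmup`, `valid`, `last_warmup`) are hypothetical, the tuple order follows the (3, 3, 2) example in the text, and the horizontal tiling step is assumed to equal the number of symbols decoded per pattern. The frame length of 8 in the usage example is inferred from four patterns of two symbols each, not quoted from the source.

```python
from typing import List, Tuple

Pattern = Tuple[int, int, int]  # (offset, warmup depth, valid metrics per flow)


def tile_patterns(frame_len: int, offset: int, warmup: int, valid: int,
                  last_warmup: int) -> List[Tuple[Pattern, int]]:
    """Return (pattern, starting column) pairs for a diagonally tiled DFG.

    Assumptions: the frame length is a multiple of `valid`, the last pattern
    only differs in its warmup depth, and adjacent patterns are shifted
    horizontally by `valid` trellis sections.
    """
    if valid <= 0 or frame_len % valid != 0:
        raise ValueError("frame length must be a positive multiple of valid")
    count = frame_len // valid
    tiles = []
    for i in range(count):
        depth = last_warmup if i == count - 1 else warmup
        tiles.append(((offset, depth, valid), i * valid))
    return tiles


for pattern, column in tile_patterns(8, 3, 3, 2, last_warmup=1):
    print(pattern, "starts at column", column)
```

With these inputs the sketch yields three (3, 3, 2)-patterns followed by one (3, 1, 2)-pattern, as in the Fig. 5(a) example. The vertical offset is left out because its expression did not survive in this copy.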
C. Delay and Storage Optimization
Next, we analyze the performance of the sliding window architecture again by referring to the constituent patterns. By performance we mean the decoding delay incurred by metric computations as a result of the flow of operations represented by the constituent pattern, as well as the total metric storage (lifetime) needed as a result of the relative ordering of the forward and back recursion flows. The decoding delay of a recursion pattern is simply its height plus 1, where is determined by (13) , and the delay between two adjacent patterns is the vertical offset between them, or . Therefore, the total decoding delay of an architecture composed of patterns is vertical offsets plus the delay of the first pattern. Using (13) , is given by (14) Typically, in the sliding window DFGs, the recursion patterns are not fully pipelined due to the prohibitive storage overhead for input and output metrics, but rather a pair of and -MPUs are allocated to each recursion pattern. Hence, the appropriate storage criterion to consider would be the storage lifetimes of the state metrics. The total state metric storage lifetimes , for both and , of a recursion pattern , with point designating the ending point of the -recursion flow [see Fig. 5(b) ], is given geometrically as the area of the region defined by the points (see dark grey regions in Figs. 5 and 6 ). Lemma 3 in the Appendix shows that the total metric storage (lifetimes) of an ( -pattern architecture is given by (15) if , and by if . Equations (14) and (15) characterize the performance of any sliding window architecture of the SISO-APP algorithm in terms of the parameters ( , , ) and . Note that is considered an independent variable, and is determined by the code parameters ( , , ) and the SNR, and is independent of the geometry of the pattern. Therefore, to minimize and , we consider only the parameters and .
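The delay bookkeeping described above can be written out as a short sketch. The closed form of (14) is not reproduced here; the code only encodes the verbal rule that the total delay is the delay of the first pattern (its height plus 1) plus one vertical offset for each further pattern. The function name, argument names, and the count of offsets (number of patterns minus one) are assumptions, and the numbers in the example are arbitrary.

```python
def total_decoding_delay(pattern_height: int, vertical_offset: int,
                         num_patterns: int) -> int:
    """Total delay of a DFG built from diagonally tiled recursion patterns.

    Assumption: the first pattern contributes height + 1 steps and each of
    the remaining patterns adds one vertical offset.
    """
    if num_patterns < 1:
        raise ValueError("at least one recursion pattern is required")
    first_pattern_delay = pattern_height + 1
    return (num_patterns - 1) * vertical_offset + first_pattern_delay


# Arbitrary illustrative values: 4 patterns of height 5 tiled 2 steps apart.
print(total_decoding_delay(pattern_height=5, vertical_offset=2, num_patterns=4))
```

The form makes the scaling visible: for a long frame the number of patterns dominates, so a small vertical offset matters more than the height of any single pattern.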
Theorem 1: The decoding delay and storage (lifetimes) functions are jointly minimized when or . The minimum decoding delay is or , and . The proof of the theorem is in the Appendix. Fig. 7 shows the normalized metric storage lifetimes with respect to the DFG in Fig. 3(b) and the corresponding minimum decoding delay as a function of for different values of , with being or . The staircase line designates the trajectory of the values of for which both and are minimal, and the dash-dotted line corresponds to the ideal lower bound. Fig. 8 shows an optimized DFG for a rate-1/3 3GPP turbo code of length and constraint length . The optimum dimensions of the resulting recursion pattern are and , assuming . The minimum decoding delay and the minimum metric storage lifetime requirements are 5131 and 38 874 clock cycles, respectively. Note that although the configuration has lower storage requirements per pattern, the resulting total storage requirements across the whole frame are larger.
IV. ADVANCED PARALLEL WINDOW ARCHITECTURES
In this section, we extend the recursion pattern optimization results of Section III to include lateral tiling, as well as a hybrid of diagonal and lateral tiling, of the individual recursion patterns. All recursion patterns considered in this section are optimized patterns having . From Fig. 8, it is apparent that the minimum delay of achieved by the optimized SW DFG can only be improved, and more importantly made independent of the frame length , if the patterns are tiled laterally rather than diagonally. The resulting DFG is called the parallel window (PW) DFG. We assume in this section that the PW DFG is fully pipelined, and hence we include in the analysis below the extra storage needed for input and output ( , ) reliability metrics, in addition to the state metrics. It should be noted that when the SISO-APP decoder is fully pipelined, the external interleavers become the performance bottleneck. Further structural regularity must be imposed on the interleaving scheme in order to keep up with the SISO-APP decoders in terms of throughput.
Parallel window DFGs were first considered in [21] and later modified in [30]-[32] (under the rubrics of " " and " " architectures) to minimize state metric storage. In terms of our tile-graph design methodology, the "D" architecture corresponds to lateral tiling of PW DFGs, which is the subject of Section IV-A. The " " architecture corresponds to a special case of hybrid tiling (Section IV-B) of PW DFGs with recursion patterns per window. However, as we show in Section IV-B, although this special case of the " " architecture does indeed outperform the "D" architecture, it does not necessarily yield an optimized architecture (in the sense of minimizing total metric storage) as claimed in [30]-[32].
A. Lateral Tiling of Recursion Patterns
The optimized sliding window DFG derived in the previous section considered only diagonal tiling of the individual recursion patterns. This was a consequence of the constraint imposed by the SW-SISO algorithm to initialize the -recursion in every pattern using the exact -metrics from the preceding pattern and not through approximate -metrics obtained using an -recursion warmup phase as was the case with the -recursion. In this section, we drop this constraint and consider the SISO-APP algorithm in continuous mode. In this mode, both the and -recursions of the SISO-APP when applied to decode any portion or window of a code frame are not initialized with exact metrics that are derived starting from the beginning and end of the frame, but rather with approximate metrics derived using only symbols to the left and right of the pattern as shown in Fig. 9. The same approximation used in (10) to initialize the -recursions can now be applied to initialize the -recursions as well (16) Consequently, each -recursion flow must have its own metric warmup phase. Referring to the optimized SW DFG in Fig. 8, the -recursion flow can be split into multiple (smaller) -recursion flows. The individual recursion patterns become independent, and therefore they can be tiled laterally. Fig. 10(a) shows the resulting recursion pattern which is similar to the recursion patterns used in Fig. 8 except for the adjustment due to symmetry, as well as an additional warmup phase for the -recursion flow. Since the recursion patterns are to be tiled laterally, they can equivalently be redrawn such that the two warmup recursion phases initialize the and -recursions in adjacent windows as shown in Fig. 10(b). The window width therefore corresponds to the parameters . If , say for some positive integer , the two warmup phases are folded inside the pattern into segments of length each, and the boundary points of each warmup phase are initialized outside the window as shown in Fig. 10(c). Fig. 10(d) shows the result of tiling the compact patterns laterally for the case . The PW DFG for can similarly be obtained. We assume that windows process the whole frame (with ) or a portion of it (with ) in parallel, and hence the name parallel window DFG. A PW DFG with lateral tiling is the " " architecture of [18], [21], [30], [31]. Regions with light grey shades in Fig. 10(b)-(d) represent storage for input branch metrics ( ), those with dark grey shades represent storage for both input and state metrics, while unshaded regions represent storage for output branch metrics ( , ). The total storage for input and output metrics are denoted respectively by and . Comparing the PW DFG to the optimized SW DFG in Fig. 8, the decoding delay of the PW DFG, , is now a function of the window height, which is much smaller than the decoding delay of the SW DFG, , which is a function of the frame length. However, apart from the savings due to the adjustment , there is only a slight improvement in state metric storage requirements over the optimized SW DFG since now the last -metric of a recursion pattern need not be saved extra clock cycles to interface with the succeeding pattern [compare the dark shaded regions in Figs. 8 and 10(d)]. More specifically, the total state metric storage (lifetimes) of the PW DFG operating on a frame of length and having window size is given by (24) in Table I (see Lemma 4 in the Appendix). Equation (24) implies that the state metric storage requirements per window increase quadratically with the window width W, which is equivalent to the parameters or of the recursion pattern in the case of lateral tiling. Hence, to decrease the state metric storage requirements for a given , this suggests grouping more recursion patterns with parameters into the window, in which case the storage requirements per window become quadratic in (or ) rather than , at the expense of an increase in the number of MPUs. In terms of storage for input metrics, the combined area of the shaded regions in Fig. 10(d) is given by (25) in Table I. The total output branch metric storage represented by the unshaded regions in Fig. 10(d) is given by (26) in Table I. Moreover, the total number of and MPUs required is even odd (17) and the number of and MPUs is .
B. Hybrid Tiling to Reduce Storage Requirements
Using a single recursion pattern to form a window of size in the PW DFG is just one instance of lateral tiling. Referring back to Fig. 8, multiple diagonally-tiled recursion patterns can also be grouped together to form a window of size , which in turn can be tiled laterally with similar windows to construct a complete PW DFG. We call the grouping of multiple diagonally-tiled patterns into a window and then tiling the windows laterally hybrid tiling. Consequently, a PW DFG employing hybrid tiling is parameterized with two parameters, the window size and the number of recursion patterns per window , with . Fig. 11(b)-(d) shows three examples of the PW DFG with hybrid tiling for the same window size having , 3, and 4 recursion patterns per window, respectively. For comparison, a PW DFG with lateral tiling (i.e., for ) is also shown in Fig. 11(a). The parameters ( , , ) of the recursion patterns in the windows are ( , , ). Note that for , extra buffers are needed so that the first pattern in each window interfaces with the last pattern of its adjacent window. Fig. 11(b) with corresponds to the " " architecture in [30]-[32]. As before, regions with light grey shades in Fig. 11 represent input metric storage, those with dark grey shades represent both input and state metric storage, while unshaded regions represent output metric storage. Note that protruding warmup recursions to the right of the window are trimmed since the adjacent window provides appropriate initial values to the warmup recursions.
The decoding delay in all cases is . In terms of state metric storage requirements, as increases the area of the regions with dark grey shades in Fig. 11 representing state metrics decreases at least by a factor of . Since this area in Fig. 11(a) grows quadratically with , the savings are significant. The overhead is an increase in storage for warmup metrics by a factor of , an increase in storage (for ) to interface adjacent recursion patterns in the window [see shaded regions with right inclining hatches in Fig. 11(b)-(d)], as well as extra storage to interface the first and last recursion patterns in the window [see shaded regions with left inclining hatches in Fig. 11(b)-(d)]. Table I shows a breakdown of the storage requirements of a single PW DFG using hybrid tiling. The total state metric storage of the PW DFG with hybrid tiling operating on a frame of length with window size and recursion patterns per window is given by (29) in Table I for , and by (24) for (see Lemma 5 in the Appendix). In terms of storage for input metrics, as increases the area of the regions with light grey shades above the -recursion flow in Fig. 11 remains constant while the area of the dark shaded regions below it decreases, resulting in a reduction of storage for input metrics. This quantity is given by (30) in Table I.

Fig. 11 (caption, continued). Regions with light grey shades represent storage for input metrics, those with dark grey shades represent storage for both state and input metrics, while unshaded regions represent storage for output metrics. As increases, the area of the dark grey shaded regions decreases at least by a factor of =2. The overhead is the extra storage for interfacing adjacent patterns (shaded regions with right inclining hatches) and the first with the last pattern (shaded regions with left inclining hatches), as well as the increase in the area of the unshaded regions. However, since the number of output metrics to be stored per trellis section is smaller than the number of input metrics, trading storage for input metrics with output metrics is favorable.
In fact, this reduction comes at the expense of an increase in storage requirements for output metrics as increases, since the dark shaded regions are traded with unshaded regions representing storage for output metrics in Fig. 11. A simple observation shows that the sum of the storage for input and output metrics is equal to the area of the whole window, which is constant with respect to . However, since the number of output metrics that needs to be saved per trellis section, for serially concatenated codes and otherwise, is less than the number of input metrics that needs to be saved, , trading storage for input metrics with storage for output metrics is favorable.² The total storage for output metrics is given by (31) in Table I.
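The trade described above can be illustrated with a small sketch: the input and output storage areas sum to the constant window area, so moving area from input to output storage lowers the weighted total whenever fewer output metrics than input metrics are stored per trellis section. All names and numbers below are hypothetical and chosen only for illustration.

```python
def weighted_io_storage(window_area: int, input_area: int,
                        inputs_per_section: int, outputs_per_section: int) -> int:
    """Weighted input-plus-output metric storage for one window.

    The output storage area is whatever part of the (constant) window
    area is not used for input storage.
    """
    if not 0 <= input_area <= window_area:
        raise ValueError("input_area must lie within the window area")
    output_area = window_area - input_area
    return inputs_per_section * input_area + outputs_per_section * output_area


if __name__ == "__main__":
    # Illustrative values: 3 input metrics and 1 output metric per section.
    before = weighted_io_storage(100, 60, inputs_per_section=3, outputs_per_section=1)
    after = weighted_io_storage(100, 40, inputs_per_section=3, outputs_per_section=1)
    print(before, after)
```

When the two weights are equal, the total is independent of how the window area is split, which is the constant-area observation made in the text.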
In general, hybrid tiling of PW DFGs for requires more and MPUs compared to lateral tiling, but the same number of and MPUs . The total number of and MPUs required is given by (32) in Table I (see Lemma 6 for proof) which increases with . To optimize the PW DFG for storage, we consider the objective function representing the sum of the storage requirements (state, input, and output metrics), each weighted by the appropriate number of metrics to be stored per trellis section. To incorporate the effect of the MPUs, the functions and are included, weighted by and representing the storage equivalent (either in terms of power consumption or silicon area) of a single , -MPU and a single , -MPU, respectively, corresponding to a complete trellis section (18)

² The only exception is when k = 1 for serially concatenated codes (assuming n 1), which almost never occurs since good serially concatenated codes are typically designed such that the inner decoder (the one that generates reliabilities for code symbols) has k 2.
Theorem 2: For a given window size , there exists an optimal such that is minimum. The optimum number of recursion patterns per window is given by the integer floor of or (19) and the resulting total storage is lower bounded by (20) where , , are constants that also depend on , and .
For the proof of (19) and (20), and the definitions of these constants, the reader is referred to Theorem 2 in the Appendix. In Fig. 12, we plot versus for various values of of a PW DFG constructed with hybrid tiling of recursion patterns for , derived empirically using 0.18-μm, 1.8-V CMOS technology. The plots are normalized to the storage of a PW DFG with lateral tiling having the same window size. The code parameters are those considered in Fig. 7. The minimum values attained for each are marked with squares, and the solid line corresponds to the ideal and as given by (19) and (20). As shown in the plots, the storage is minimal when , 4, 4, 4, 8, 8 for , 48, 64, 72, 96, 128, respectively. This demonstrates that the special case of the " " architecture [30], [31] corresponding to is not necessarily optimal. Moreover, increasing the number of recursion patterns beyond the optimal is not effective, and in some cases is even counterproductive.
V. SIMULATION RESULTS
In this section, we evaluate the various parallel window DFGs presented in Section IV. The exact tradeoffs among these DFGs and the optimal structure and parameter settings can only be evaluated through simulations and not analytically using models for memory and datapath. Hence, the SISO-APP decoder was implemented in VHDL using the proposed tile-graph approach featuring PW DFGs with hybrid tiling of recursion patterns. The simulations are based on a 0.18-μm 1.8-V 5-metal-layer CMOS technology parameterized macro-cell library [35], and the Synopsys tools were used for placement and routing, as well as power estimation. The library includes optimized implementations of the MPUs (in terms of individual transistor sizes) as well as custom ring-buffer implementations of the FIFO buffers. Power estimation is based on a randomly chosen noisy frame used as input to the decoder. Two turbo codes were considered: a length 1024, ratecode , and a length 5114, ratecode , having generator polynomials and , respectively. The effects on performance of metric quantization and recursion warmup depth under different window sizes were first determined by simulating the codes using the PW-SISO algorithm with fixed point representation and eight iterations per frame. Hence, all MPUs are implemented using a 6-bit datapath, and 4-bit and 6-bit ring FIFO buffers are used for branch and state metric storage, respectively. In addition, metric normalization was employed to avoid overflow, especially at high SNR, and 4-level logMAP correction factor LUTs were used.
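As a rough illustration of the last two implementation choices, the sketch below shows a logMAP max* operation whose correction term log(1 + e^(-d)) is read from a 4-level lookup table, together with a simple metric normalization step. The table contents, the bin width, and the choice of subtracting the maximum metric are assumptions made for this example; the paper does not list the LUT values or the normalization rule, and the actual datapath is fixed point rather than floating point.

```python
import math

# Hypothetical 4-level LUT for the correction term log(1 + exp(-d)).
# BIN_WIDTH and the midpoint sampling are illustrative assumptions.
BIN_WIDTH = 0.75
NUM_LEVELS = 4
CORRECTION_LUT = [
    math.log1p(math.exp(-(i + 0.5) * BIN_WIDTH)) for i in range(NUM_LEVELS)
]


def max_star_exact(a: float, b: float) -> float:
    """Exact Jacobian logarithm: log(exp(a) + exp(b))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_lut(a: float, b: float) -> float:
    """max* with the correction term taken from the 4-level LUT."""
    index = int(abs(a - b) / BIN_WIDTH)
    correction = CORRECTION_LUT[index] if index < NUM_LEVELS else 0.0
    return max(a, b) + correction


def normalize(metrics: list[float]) -> list[float]:
    """Subtract the largest metric so that values stay bounded."""
    reference = max(metrics)
    return [m - reference for m in metrics]


if __name__ == "__main__":
    print(max_star_exact(1.0, 1.5), max_star_lut(1.0, 1.5))
    print(normalize([3.0, 5.0, 1.0]))
```

Normalization leaves metric differences unchanged, so decisions based on those differences are not affected, while the LUT bounds the approximation error of the correction term.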
Figs. 14 and 15 show area results for and , while Figs. 16 and 17 show power consumption results, using the approach of PW DFG with hybrid tiling of recursion patterns. The deliverable throughputs per window at 50 MHz range between 3.2 Gb/s and 12.8 Gb/s. In each case, six window sizes are considered, , 48, 64, 72, 96, 128, and for each window size up to eight recursion patterns per DFG were simulated. The plots show a breakdown of area and power consumption among (starting from the bottom) state metrics, input metrics, output metrics, and MPUs. The case corresponds to lateral tiling of patterns or the " " architecture of [18] and [21], while the case corresponds to hybrid tiling with two patterns per DFG or the " " architecture of [30]-[32]. The figures demonstrate that the proposed tile-graph methodology implementing the PW DFG with hybrid tiling always outperforms the " " and " " architectures known in the literature in terms of silicon area and power consumption for window sizes more than 32. Moreover, the tile-graph approach is particularly effective for larger window sizes as opposed to smaller window sizes. The optimum number of patterns per window as predicted by (19) accounted for in the model used for optimization especially for large . Note that the deviation of the flat regions in the plots from the optimum values is very small, and hence the smallest or closest integer divisor of can be chosen as the "optimum" .
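The practical selection rule in the last sentence, picking the integer divisor closest to the predicted optimum, can be sketched as below. It assumes the divisor is taken of the window size; the function names and the real-valued targets in the usage example are hypothetical stand-ins for the value predicted by (19), whose closed form is not reproduced in this text. Ties are broken toward the smaller divisor, which keeps the MPU count lower.

```python
def divisors(window_size: int) -> list[int]:
    """All positive integer divisors of the window size, in ascending order."""
    if window_size < 1:
        raise ValueError("window_size must be positive")
    return [d for d in range(1, window_size + 1) if window_size % d == 0]


def closest_divisor(window_size: int, predicted_optimum: float) -> int:
    """Divisor of window_size nearest to the predicted optimum.

    Ties are resolved in favor of the smaller divisor.
    """
    return min(divisors(window_size),
               key=lambda d: (abs(d - predicted_optimum), d))


if __name__ == "__main__":
    # The targets below are made-up values, not results from the paper.
    for window_size, target in [(72, 4.6), (96, 7.2)]:
        print(window_size, closest_divisor(window_size, target))
```

Restricting the choice to divisors keeps all recursion patterns within a window the same size, which preserves the regularity of the tiling.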
As increases, the state metric storage requirements decrease considerably, by 60.4%-85.4% in power consumption over " ", and by 27.6%-71.22% over " ". Power consumption due to input metrics, on the other hand, decreases by 7.54%-17.05% over " ", and by 2.1%-8.10% over " ". The output storage overhead, however, increases with , by 58.04%-217.5% with respect to " ", and by 9.94%-114.09% with respect to " ". However, since the output requirements constitute only around 3.83% of the total requirements, this overhead is insignificant compared to the savings achieved. Finally, the overhead in terms of MPUs over " " and " " is between 2.94% and 56.25%. Proportional savings and overhead in terms of area are achieved. Fig. 18 shows the overall savings achieved by the hybrid tiling approach in terms of area and power consumption over the " " and " " architectures, as a function of the window size for both and . One drawback of the hybrid tiling approach is that it does not effectively address the input metric storage requirements, which become the dominant storage factor in terms of area and power. Other techniques aimed at mitigating this effect, such as extrinsic metric quantization and companding, coupled with resource sharing and more efficient memory architecture, as well as standard circuit techniques such as clock gating and voltage scaling, should be pursued.
VI. CONCLUSION
We have proposed a tile-graph methodology for the synthesis and analysis of SISO-APP decoders used in turbo codes. The methodology addresses the storage and delay recursion bottlenecks of the SISO-APP algorithm at the architectural level. A new parallel window DFG was proposed based on hybrid tiling of recursion patterns and was shown to achieve savings in area and power in the range of 4.2%-53.1% over existing techniques.
APPENDIX
Lemma 1: The recursion pattern defined in Fig. 5 (b) has three degrees of freedom: the scheduling aperture , the metric warmup depth , and the decoded output size . The coordinates of the four extremities are given by (11) .
Proof: Consider the pattern in Fig. 5(b) with point designating the starting point of the backward recursion after warmup. The forward recursion must run for steps along the symbols, while the backward recursion must run for an additional steps. Take the starting point A of the forward recursion as the reference point. Then the separation determines the ending point of the backward recursion, while determines both the ending point of the forward recursion and point of the backward recursion.
determines the warmup starting point of the backward recursion. Therefore, the four extremities of the pattern are completely determined relative to each other. To fix the pattern in the plane, the exact location of one of the extremities must be fixed.

Using (21) and (22), the joint minimizer of both and is given by

Case 2) If is odd, then when (23) As in the previous case, the lower bound on is However, in this case taking to satisfy both (21) and (23) results in being off from by 1. This can be neglected for all practical purposes.

Lemma 4: The total storage requirements (state, input, and output) of the PW DFG with lateral tiling operating on a frame of length with window size are given, respectively, by (24) (25) (26) where .
Proof: Referring to (15), the result from Theorem 1 yields . The PW DFG requires extra buffers for -metric warmup. However, due to lateral tiling, the (first) term in (15) Now, considering two cases for even and odd, the result in (24) follows. Next, referring to (11) the area of the unshaded regions corresponding to is equal to when is even, and when is odd. Subtracting the area of the whole window from (26) and scaling by the number of windows, , yields (25) .
Lemma 5: The total storage-state, input, and output-requirements of the PW DFG with hybrid tiling of recursion patterns and parameters ( , ) operating on a frame of length are given respectively by (29)- (31) with equal to (24) , , and appropriately defined parameters , , , .
Proof: Referring to Fig. 11 , is given by the area of the regions with dark grey shades. For the first patterns, we apply Lemma 3 with parameters . The storage requirements for the last pattern are for -recursion warmup, and for intermediate and -metrics. Finally, the first and last recursion patterns must be interfaced, requiring buffers. Hence, the resulting total storage for state metrics is (27) Note that for an extra term of must be added for -recursion warmup [compare Figs. 11(a) and 11(b) ]. The storage of the protruding warmup recursions to the right of a window must be subtracted since the adjacent window initializes the recursions. This quantity is equal to (28) where if is even and , or if is odd and . If and is even, then , and if and is odd, then . The parameters , , 2, are given by . Next, subtracting from (27) and considering two cases for even and odd, (29) follows.
The total storage for output metrics is the area of the unshaded regions in Fig. 11 . Each pattern requires or output buffers depending on whether is even or odd. In addition, each pattern except the last stores output metrics multiplied by the number of patterns below it, giving . Since the first pattern on the left does not produce an output metric for the first element, buffers must be subtracted. Summing up all terms gives (29)- (31) . Finally, the total storage for input metrics is simply the area of the window minus (31) .
Lemma 6: The total number of and MPUs of the PW DFG with hybrid tiling of recursion patterns and parameters ( , ) operating on a frame of length is given by equation (32) at the bottom of the page, and the number of , -MPUs is , where , and , , , are defined in Lemma 5.
Proof: Each recursion pattern requires MPUs for warmup, and or and -MPUs each if is even or odd, respectively. The -MPUs corresponding to the protruding warmup recursions to the right of the window must be subtracted. This quantity is given by (28). Multiplying by the number of windows and recursion patterns, (32) follows. Finally, since a window processes trellis sections, a total of and -MPUs are needed.

Theorem 2: For a given window size , there exists an optimal such that is minimum. The optimum number of recursion patterns per window is given by the integer floor of (19). The resulting total storage is lower bounded by (20).

Proof: The total storage requirements of the PW DFG with hybrid tiling, including the storage equivalents of the MPUs, are given by Using Lemma 5 and assuming , gives the optimum number of recursion patterns as the integer floor of where and . Next, evaluating for the first value of yields the lower bound on the total storage of where
