Implementation constraints imposed on iterative decoders applying message-passing algorithms are investigated. Serial implementations similar to traditional microprocessor datapaths are compared against architectures with multiple processing elements that exploit the inherent parallelism in the decoding algorithm. Turbo codes and low-density parity check codes, in particular, are evaluated in terms of their suitability for VLSI implementation in addition to their bit-error rate performance as a function of signal-to-noise ratio. It is necessary to consider efficient realizations of iterative decoders when area, power, and throughput of the decoding implementation are constrained by practical design issues of communications receivers.
Introduction
Error correction algorithms are frequently evaluated by their bit-error rate (BER) vs. signal-to-noise ratio (SNR) performance. In practice, the implementations of these algorithms are constrained by the formats and throughput/latency requirements of specific communications standards. A practical implementation of a given algorithm in either hardware or software is [...] implementation because it replaces multipliers with adders. However, evaluating the sum in the log-probability domain requires a combination of exponential and logarithmic functions. To simplify the implementation, the computation can be approximated by the maximum value of the input operands, followed by an additive correction factor that is determined by a table lookup.
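For two log-domain operands a and b, the max-plus-correction approximation follows from a standard identity, often called the Jacobian logarithm. It is stated here as a reference point and is not necessarily the exact formulation used by the authors:

```latex
\ln\left(e^{a} + e^{b}\right) = \max(a, b) + \ln\left(1 + e^{-|a - b|}\right)
```

The second term depends only on |a - b| and never exceeds ln 2, which is why a small lookup table indexed by the difference is sufficient.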
An example of the sum-product algorithm processed in the log-probability domain is the add-compare-select (ACS) recursion in a maximum a-posteriori (MAP) decoder (Figure 1a). The "add" operations evaluate the logarithm of two product terms, while the "compare" and "select" operations approximate the logarithm of a sum of exponentials. This approximation (called max-log-MAP) leads to an implementation loss of about 0.5dB in a turbo decoder system. However, adding a correction factor to the output of the ACS can restore the coding gain to within 0.1dB of the MAP decoder performance [2]. This correction factor can be provided by a table lookup based on the difference of the two sums. As in commonly used Viterbi decoders, the throughput of MAP decoders has been limited by the implementation of the ACS structure, because the single-step recursion prevents pipelining.
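The ACS step with and without the correction term can be sketched as follows. This is a minimal illustration, not the circuit of Figure 1a: the table size, the quantization step `STEP`, and all function names are assumptions chosen for the example.

```python
import math

# Hypothetical 8-entry correction table indexed by the quantized difference
# |a - b|. The step of 0.5 and the table depth are illustrative assumptions.
STEP = 0.5
CORRECTION = [math.log1p(math.exp(-i * STEP)) for i in range(8)]


def max_log(a, b):
    """Compare-select only: the max-log-MAP approximation."""
    return max(a, b)


def max_star_lut(a, b):
    """Compare-select followed by a table-lookup correction on |a - b|."""
    idx = int(abs(a - b) / STEP + 0.5)  # round to the nearest table entry
    corr = CORRECTION[idx] if idx < len(CORRECTION) else 0.0
    return max(a, b) + corr


def acs(alpha0, gamma0, alpha1, gamma1, combine=max_star_lut):
    """One add-compare-select step for a state with two incoming branches.

    alpha0, alpha1: previous state metrics (log domain).
    gamma0, gamma1: branch metrics (log domain).
    """
    return combine(alpha0 + gamma0, alpha1 + gamma1)


def exact_log_sum(a, b):
    """Reference value: ln(e^a + e^b) computed directly."""
    return math.log(math.exp(a) + math.exp(b))
```

Passing `combine=max_log` to `acs` gives the uncorrected max-log-MAP recursion, so the two variants can be compared against `exact_log_sum` on the same inputs.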
Another example of the sum-product computation in the log-probability domain can be found in LDPC decoders (Figure 1b) [...] where p_n represents the probability that a bit x_n is equal to 1. The use of the log-probability domain simplifies the evaluation of the product, but also requires the table-lookup [...]. It has been shown that iterative decoders operating in the log-probability domain can frequently achieve good coding performance with arithmetic precision of just three to five bits. This implies that the lookup tables can be efficiently implemented with simple combinatorial logic that directly realizes the required function.
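As an illustration of how a product of probabilities turns into a sum plus a tabulated function, the sketch below uses the textbook form of the LDPC check-node update. It is an assumption that this matches the expression intended above, since that expression is not recoverable here. The function `phi` is the quantity a hardware decoder would store as a small lookup table; the names and the clamping limits are hypothetical.

```python
import math


def phi(x):
    """phi(x) = -ln(tanh(x / 2)); for x > 0 this function is its own inverse."""
    x = min(max(x, 1e-9), 30.0)  # clamp so the logarithm stays finite
    return -math.log(math.tanh(x / 2.0))


def check_to_variable(llrs):
    """Extrinsic check-node outputs for a list of incoming log-likelihood ratios.

    For each edge, the signs of the other inputs are multiplied and their
    magnitudes are combined as phi(sum(phi(|L|))).
    """
    out = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]
        sign = 1.0
        total = 0.0
        for v in others:
            if v < 0:
                sign = -sign
            total += phi(abs(v))  # the product becomes a sum in the log domain
        out.append(sign * phi(total))
    return out
```

With three- to five-bit messages, `phi` has at most 32 distinct input values, which is consistent with the remark that the table reduces to simple logic.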
In addition to the calculations of marginal posterior functions, practical decoder implementations can lower the energy consumption per decoded bit by applying stopping criteria to the decoding iterations. This is done, for instance, in turbo codes by noting that the absolute log-likelihood values of all bits in a decoded block have exceeded a preset value, thus indicating sufficient confidence in the decoded output. Alternatively, the iterations in an LDPC decoder can be stopped when all parity check constraints have been met.
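Both stopping rules can be expressed in a few lines. The sketch below assumes that the parity-check matrix is given as a list of rows, each listing the variable indices it involves; the function names and the `threshold` parameter are hypothetical.

```python
def llr_confident(llrs, threshold):
    """Turbo-style rule: stop when every |LLR| in the block exceeds the threshold."""
    return all(abs(v) > threshold for v in llrs)


def parity_satisfied(checks, bits):
    """LDPC-style rule: stop when every parity check sums to zero modulo 2.

    checks: list of rows, each a list of variable-node indices.
    bits:   current hard decisions (0 or 1) for all variable nodes.
    """
    return all(sum(bits[j] for j in row) % 2 == 0 for row in checks)
```

Either predicate would be evaluated once per iteration, and the decoder exits its loop as soon as it returns True.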
Message-Passing Requirements
The other key implementation feature of an iterative decoder is the interconnect network required to facilitate the exchange of messages between nodes in a factor graph. While each node in the graph is associated with a certain arithmetic computation, each edge in the graph defines the origin and destination of a particular message. The implementation of the message-passing requirements, however, appears in different forms for turbo and LDPC decoders. The turbo decoder consists of a concatenation of MAP decoders separated by interleavers that permute the sequence of inputs. Interleaving facilitates the exchange of messages between nodes that are adjacent in more than one of the underlying graphs. Although a high-throughput interleaver can be realized as a direct-mapped network of interconnects, this can result in intractable routing congestion due to the irregular nature of the interleaving network. In practice, the interleaving function is executed by writing the relayed messages sequentially into a random access memory array and reading them out in a permuted sequence. The order of addresses used in the read access can be stored in a separate read-only memory (ROM) or computed on the fly. The latter method requires the addresses to be deterministically reproducible, and exploits the regularity in the interleaver structure to calculate addresses using, e.g., simple shifting or modulo division [3].
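The memory-based interleaver with computed read addresses can be sketched as follows. The linear congruential address rule is a hypothetical example of a "simple modulo" permutation. It is not the interleaver of [3], and the parameters `a` and `b` are assumptions.

```python
from math import gcd


def interleaver_address(i, n, a, b=0):
    """Read address for step i, computed on the fly with one multiply and a modulo."""
    return (a * i + b) % n


def interleave(messages, a, b=0):
    """Write messages sequentially into a RAM model, then read in permuted order."""
    n = len(messages)
    if gcd(a, n) != 1:
        raise ValueError("a must be coprime with the block length to give a permutation")
    ram = list(messages)  # sequential write
    return [ram[interleaver_address(i, n, a, b)] for i in range(n)]  # permuted read
```

For example, `interleave(list(range(8)), 3, 1)` reads addresses 1, 4, 7, 2, 5, 0, 3, 6. Replacing `interleaver_address` with an indexed lookup into a stored list models the ROM-based alternative.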
Likewise, an LDPC decoder is required to provide a network for messages to be passed between a large number of nodes. A direct wiring of the network leads to congestion in the interconnect fabric due to the disorganized nature of the defining graph. The congestion can be circumvented through the use of memory. Unlike the interleavers used in turbo codes, which have a one-to-one connectivity, LDPC graphs have at least a few edges emanating from each variable node. The number of edges is several times larger than that in an interleaver network, and hence requires a larger memory bandwidth.
The practical implementation of a message-passing network is dependent on the structural properties of the graph. In general, the construction of good iterative codes requires large numbers of nodes whose interconnections are defined by graphs that are expanders and have a large girth [4]. These graphs tend to have a disorganized structure, which complicates the implementation of the message-passing network by requiring long, global interconnects or memories accessed through an unstructured addressing pattern. More recently, graphs with structured patterns have emerged [5], and they simplify the implementation of the decoders.
Iterative Architectures
A common implementation technique that achieves higher throughputs uses multiple windows [6]. This approach parallelizes the message-passing algorithm over several subsets of bit sequences by partitioning the recursions into a number of independent overlapping windows. The overall throughput of a turbo decoder also depends on the implementation of the interleaver. Due to the size of the interleaver, the common approach uses memory blocks, which places the memory access time in the critical path. Memory access takes approximately 2ns (general-purpose single-ported 32kb memories in 0.13µm CMOS technology), significantly more than the 1ns required to add two pairs of short-wordlength numbers and select the maximum result in the ACS decoding logic (0.13µm CMOS ASIC design). Decreasing the average memory access time by increasing the number of I/O ports is unsuitable because it leads to quadratic growth in memory area. Hence, the memory access determines the serial symbol rate.
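The partitioning into overlapping windows can be sketched as below. The window length and the overlap (the warm-up region in which a recursion runs before its outputs are trusted) are hypothetical parameters; the text only states that the windows are independent and overlapping.

```python
def split_into_windows(block_len, window_len, overlap):
    """Partition range(block_len) into windows with warm-up margins on both sides.

    Returns a list of (warm_lo, start, end, warm_hi) tuples:
      [start, end)      positions whose outputs this window produces
      [warm_lo, start)  extra positions used to warm up the forward recursion
      [end, warm_hi)    extra positions used to warm up the backward recursion
    """
    if window_len <= 0 or overlap < 0:
        raise ValueError("window_len must be positive and overlap non-negative")
    windows = []
    for start in range(0, block_len, window_len):
        end = min(start + window_len, block_len)
        warm_lo = max(0, start - overlap)
        warm_hi = min(block_len, end + overlap)
        windows.append((warm_lo, start, end, warm_hi))
    return windows
```

Each tuple can be handed to a separate processing element, since no window needs state produced by another.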
LDPC Decoder Architectures
In LDPC decoding, there is no interdependency between simultaneous variable-to-check and check-to-variable computations. Parallel LDPC decoders benefit in throughput and power efficiency, but require the implementation of a large number of processing elements together with message passing within a congested routing network. To ease the difficulty in routing, a common approach is to partition a design into smaller subsets with minimum overlap. However, due to irregularity in the parity check matrix, design partitioning is difficult and yields little advantage. An example of a parallel LDPC decoder [7] (Figure 3a) for a 1024-bit rate-½ code requires 1536 processing elements with an excess of 26,000 [...]

In a serial LDPC decoder (Figure 3b), the task of message computations is serialized into a small number of processing elements, resulting in a latency of several thousand cycles.
Furthermore, the large expansion property of LDPC codes with good asymptotic performance leads to stalls in processing between the rounds of iteration due to the data dependencies. In order to capitalize on all hardware resources, it is more efficient to schedule all available processing elements to compute the messages in each round of decoding, storing the output messages temporarily in memory, before proceeding to a subsequent round of computations.
Although serial architectures result in less area and routing congestion, they lead to dramatically increased memory requirements. The size of the memory required depends on the total number of edges in the particular code design, which is the product of the average edge degree per bit node and the number of bits in each block of the LDPC code. For example, a serial implementation of a rate-8/9 4608-bit LDPC decoder with variable nodes having an average edge degree of four will have more than 18,000 edges in the underlying graph. It would have to perform 37,000 memory read or write operations for each iteration of decoding, which limits the total throughput. This contrasts with a turbo decoder, whose memory requirement is largely dictated by the size of the interleaver required to store one block of messages. Given the same block size, the memory requirements for the serial implementation of an LDPC decoder are several times larger than those of a turbo decoder. [...] problem by using time-shared hardware and memories in place of interconnect. This serial method currently limits the internal throughput of turbo decoders to 6.5Mbps [8] and LDPC decoders to 56Mbps [9].
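The edge and memory-access figures in the example can be reproduced with a short calculation. The factor of two is an assumption, interpreted here as one read and one write per edge message; it is chosen because it matches the quoted figure of roughly 37,000 operations.

```python
def serial_ldpc_memory_ops(num_bits, avg_variable_degree):
    """Return (edges, memory accesses) per iteration for a serial LDPC decoder.

    edges:    block length times average variable-node degree.
    accesses: assumed to be one read plus one write per edge message.
    """
    edges = num_bits * avg_variable_degree
    accesses = 2 * edges
    return edges, accesses
```

For the 4608-bit code with an average degree of four, this gives 18,432 edges and 36,864 accesses, in line with "more than 18,000" and "37,000" above.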
Platforms for Iterative Decoding
Custom ASIC is well suited for direct mapped architectures, offering even higher performance with further reduction in flexibility. An LDPC decoder [7] implemented in 0.16µm CMOS technology achieves a 1Gbps throughput by fully exploiting the parallelism in the LDPC decoding algorithm. The logic density of this implementation is limited to only 50% to accommodate a large on-chip interconnect. In addition, the parallel architecture is not easily scalable to codes with larger block sizes. For decoding within 0.1dB of the capacity bound, block sizes with tens of thousands of bits are required [10] . With at least 10 times more interconnect wires, a parallel implementation will face imminent routing congestion, and may exceed viable chip areas.
Current ASIC implementations of turbo decoders [3] are serial, targeting wireless applications. Decoding throughput is 2Mb/s with 10 iterations of the two constituent convolutional decoders. A high-throughput ASIC turbo decoder, limited by the interleaver memory access, should be able to decode at throughputs over 500Mb/s.
An alternative form of custom ASIC platform is based on analog signal processing [11]. Initial analog implementations of MAP decoders have reduced silicon area and achieved a high decoding throughput. Analog decoders operate in the probability domain, evaluating the sum-product algorithm in its raw form, as opposed to the log-probability domain in which digital decoders operate. Probabilities, represented as currents in bipolar junction transistors, are multiplied using a six-transistor Gilbert cell that takes advantage of the exponential relationship between collector current and base-emitter voltage. Currently, analog MAP decoders only exist in fully parallel architectures, with small block sizes up to a few hundred bits. For a serial architecture with larger block sizes, a track-and-hold circuit can be employed to implement a one-step recursion similar to the ACS recursion in digital implementations. Despite their benefits, analog implementations are sensitive to process and temperature variations, and are difficult to test in production. Analog circuits are also less scalable with improvements in process technology. Figure 4 and Table 2 provide summaries of the computational platforms and their performance.
Impact of Code Construction on Decoder Architectures
The desire for large SNR gains frequently conflicts with the requirements for low complexity and high flexibility of the decoder. In most classes of iterative decoders, the properties that [...] iterations. In general, the BER performance of a code improves as these numbers increase. However, because these are block codes with good expansion properties, decoding commences only after the final symbol in the block is received. A large block size not only imposes heavy computational and memory requirements on the decoder, but also leads to extended latencies.
Likewise, a large number of iterations increases the decoder latency and power while lowering the effective throughput. In serial architectures, which are already slower due to the limited number of processing elements, each additional iteration requires more pipelined hardware in order to sustain the throughput rate. This results in a further increase in area. Decoders that require more than 1000 iterations or 10^7 processing elements [10] are not suitable for parallel architectures. Large codes are much more easily mapped onto serial architectures, but will result in extended decoding latencies.
Turbo Codes
The properties that are specific to the implementation of turbo decoders are the code constraint lengths of the constituent convolutional codes, as well as the design of the interleaver. [...] of additional area and power usually offsets any benefits of a complex compute-on-the-fly strategy.
LDPC Codes
In terms of implementation, LDPC codes can be differentiated by whether the graph is structured, whether the edge degree is uniform (regular codes), and by the maximum edge degree of the check and variable nodes.
The structure of an LDPC graph, analogous to the structure of the interleaver in a turbo code, directly affects the implementation of the message-passing network. A graph with disorganized connections will lead either to routing congestion in parallel architectures or to address indexing issues in serial architectures. However, there exist examples of LDPC codes with highly structured and regular graphs. These codes are based on properties of finite fields [5] and exhibit a natural cyclic structure, which can be exploited to allow the use of fast and small shift registers. Column splitting on these codes also yields added parallelism between memory accesses in serial architectures with a limited number of parallel processing elements [13]. The demonstrated rate-½ (64,32) code has a block size of 8190 bits, and achieves a bit error rate of 10^-5 at 1.8dB away from the theoretical bound.
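To illustrate why a cyclic structure removes the need for stored connection tables, the sketch below generates the neighbors of any check row by shifting the positions of a single base row. This is a generic circulant construction used for illustration only. It is not the specific finite-field code of [5], and the function and parameter names are hypothetical.

```python
def circulant_check_neighbors(first_row_positions, n, row):
    """Variable-node indices connected to a given check row of a circulant matrix.

    first_row_positions: column indices of the ones in row 0.
    n:                   circulant size (number of columns).
    row:                 index of the requested row, a cyclic shift of row 0.
    """
    return sorted((p + row) % n for p in first_row_positions)
```

Only `first_row_positions` has to be stored. Every other row follows from an increment modulo `n`, which maps naturally onto a shift register or a small counter.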
The edge degree of a code corresponds to the number of inputs or outputs on the processing elements. This property has effects similar to those of the constraint length of a constituent convolutional code in turbo systems, because it determines the relationship between a variable node and its adjacent neighbors. Practical implementations of LDPC decoders, particularly parallel ones, benefit from a regular code with a small maximum edge degree in order to avoid detrimental arithmetic precision effects and the complexity of collating a large number of inputs and outputs at the processing elements. LDPC codes generated by the method of density evolution [10] face these issues because the codes have maximum variable degrees on the order of a hundred. To avoid truncation of the results, the output of a 100-input adder would require at least 7 bits more than the wordlength of its inputs, more than doubling the wordlength of a 5-bit decoder implementation. In addition, a parallel architecture of the decoder would have to route at least 100 signal buses from some of the processing elements. Despite having one of the best reported performances, at only 0.0045dB away from the theoretical Shannon bound, such codes are not particularly suited to the realization of a parallel VLSI decoder.
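The 7-bit figure follows from the worst-case growth of a multi-input adder, which is the base-2 logarithm of the number of inputs rounded up. A minimal check, with a hypothetical function name:

```python
def adder_output_bits(num_inputs, input_bits):
    """Output wordlength needed to sum num_inputs values without truncation.

    The growth is ceil(log2(num_inputs)), computed exactly with integer
    arithmetic as the bit length of (num_inputs - 1).
    """
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    growth = (num_inputs - 1).bit_length()
    return input_bits + growth
```

For 100 inputs of 5 bits each, the result is 12 bits, which is 7 more than the input wordlength.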
In serial LDPC decoder architectures, the edge degree of the variable nodes determines the total number of edges in the graph, which affects the size of the memory required, as previously noted. A method for reduction of the memory requirement in LDPC decoders involves a combination of a staggered decoding schedule and an approximation of the variable-to-check message computation [13]. A small number of processing elements compute a serial stream of check-to-variable messages. Unlike traditional decoding methods, each variable-to-check message is approximated by the log-likelihood ratio of the variable, and marginalization of the variable-to-check messages is not performed. The log-likelihood ratio is updated (incremented) as soon as there is any available corresponding check-to-variable message.
Compared with the classical message-passing algorithm, this method allows decoders with area or power constraints that limit the number of iterations to five or fewer to benefit from more than 75% reduction in memory requirement, at the expense of less than 0.5dB loss in coding gain. The decoder has a memory requirement that is dependent only on the total number of variable nodes in the block. It is noted that the staggered decoding will not achieve the same asymptotic results as LDPC decoding under belief propagation.
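The schedule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method of [13]: the check-node rule used here is the min-sum approximation, chosen for brevity, and the sign convention (positive log-likelihood ratio means bit 0) and all names are hypothetical. What the sketch does take from the text is that one value is stored per variable node, that this value stands in for every variable-to-check message, and that it is incremented as soon as a check-to-variable message is available.

```python
def staggered_decode(checks, channel_llrs, iterations):
    """Serial decoding with a single stored log-likelihood ratio per variable node.

    checks:       list of rows, each a list of variable-node indices.
    channel_llrs: initial log-likelihood ratios (positive means bit 0).
    Returns (hard_decisions, final_llrs).
    """
    llr = list(channel_llrs)  # the only message memory: one entry per variable
    for _ in range(iterations):
        for row in checks:
            # The stored LLR is used directly as the variable-to-check message;
            # no extrinsic subtraction (marginalization) is performed.
            inputs = [llr[v] for v in row]
            for k, v in enumerate(row):
                others = inputs[:k] + inputs[k + 1:]
                negatives = sum(1 for x in others if x < 0)
                sign = -1.0 if negatives % 2 else 1.0
                # Min-sum check rule (an assumption), applied immediately.
                llr[v] += sign * min(abs(x) for x in others)
    return [0 if x >= 0 else 1 for x in llr], llr
```

Because contributions are never subtracted back out, the same information is counted more than once, which is consistent with the remark that the schedule does not reach the asymptotic performance of belief propagation.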
In order to produce viable real-time iterative decoding, future decoders have to aggressively exploit the power and throughput advantages of parallel architectures. However, these have to be preceded by techniques in code construction that address the complexity of routing parallel implementations. Codes suitable for use in communications receivers may have to trade off SNR performance for improved partitioning properties that would ease the routing problem. An example in this direction makes use of simulated annealing to minimize the total length of interconnect while maximizing the total girth of the graph [14].
Conclusion
Implementations of iterative codes have to address the arithmetic requirements for a variety of sum-product algorithms and provide a network for message passing. While the computational requirements of iterative decoders have received attention in the research on iterative decoder implementations, the realization of the network to facilitate message passing has been less emphasized. In general, the underlying graphs describing interleavers used in turbo codes and the parity check pattern in LDPC codes have an unstructured nature, which leads to routing congestion if they are directly mapped into a network of interconnects. These issues are not specific only to turbo and LDPC codes, which this paper has primarily focused upon, but also [...]

[...] [3]: Interleaver addresses computed on the fly. Implementation was optimized for low power. A 500Mbps high-throughput MAP decoder is theoretically feasible.
Custom ASIC (analog), parallel: Analog MAP decoder in BiCMOS technology [11]. Interleavers not included. Sensitive to process and temperature variations. Difficult to test in production. Not scalable with improvements in process technology.
