Abstract: Analog soft iterative error control decoders use analog computation to decode digital information. Analog soft-information processors offer the power of iterative decoding with a very small transistor count, enabling fully parallel decoding at low cost. They can be implemented in several technologies: BiCMOS and SiGe analog decoders are suitable for high-speed applications beyond 10 Gbit per second. CMOS analog decoders, by contrast, are best suited for low-cost, low-power applications with throughput up to 1 Gbit per second. CMOS analog decoders can also be designed for micropower operation, making them uniquely applicable to emerging applications such as sensor networks and implantable devices.
I. INTRODUCTION
Analog error control decoders have been successfully demonstrated for a variety of codes. The first analog decoders implemented the Viterbi algorithm, and provided very efficient, low-cost implementations of the Add-Compare-Select portion of the algorithm. Analog Viterbi decoders attracted considerable interest, with excellent results, e.g. [1] .
More recently, it was observed that analog information processing could be used to implement fully-parallel soft iterative decoders based on the Sum-Product Algorithm [2] , [3] , [4] . The resulting circuits can be used to implement a variety of decoders, such as Maximum A-Posteriori (MAP) decoders [5] , [6] , [7] , Turbo decoders [8] , LDPC decoders, Block Product Decoders [9] , [10] , etc. For the sake of simplicity, we limit our examination to LDPC-style decoders, using BPSK transmission on AWGN channels.
A. Modern iterative decoders
An iterative decoder is constructed from two or more soft-information component decoders which operate exclusively on probabilities for both their input and output. After decoding, each component decoder computes extrinsic information probabilities. The decoders then exchange this extrinsic information and decode again. Because of this information sharing, the decoder's calculation tends to improve after each decoding iteration.
The well-known Turbo codes are constructed by concatenating two convolutional codes, separated by an essentially random interleaver, which scrambles the order of the inputs between the two decoders. In effect, the interleaver causes one decoder's output to be statistically independent of the other's. The quality of the extrinsic information, and consequently the quality of the code itself, strongly depends on the interleaver.
A Low-Density Parity Check (LDPC) code can be thought of as the concatenation of many repetition codes with many single parity-check codes. Its structure is shown in Figure 1.
Each box in Figure 1 represents a function node (or a constraint node). The boxes labeled with '=' denote a constraint of equality, and those labeled '+' denote a constraint of even parity. The equality and parity boxes can be thought of as local repetition and parity-check codes, respectively. A layer of equality nodes is connected to a layer of check nodes by a possibly random pattern of edges.
The circles in Figure 1 are known as variable nodes, which represent observable information (namely, the decoder's input and output). Each variable node represents one bit in a codeword. Some of them are information bits, some are parity bits. Taken as a whole, the graph represents the logical constraints between the information and parity bits.
For decoding, each node in the graph is mapped to a simple probability processor. A function node receives messages, consisting of probability information, on each of its edges. It then updates its outgoing messages based on its functional constraints.
For example, consider an equality node with three edges, A, B, and C. The functional constraint is $A = B = C$. The node receives messages $P_A$ and $P_B$, which are probability masses for A and B, respectively. It then computes the output for edge C, which is $P_C(1) = \eta \cdot P_A(1) \cdot P_B(1)$ and $P_C(0) = \eta \cdot P_A(0) \cdot P_B(0)$, where $\eta$ is a normalizing constant, chosen such that $P_C(0) + P_C(1) = 1$.
A check node has similar behavior. The functional constraint for a check node is $A \oplus B \oplus C = 0$, where '$\oplus$' denotes modulo-2 addition. The check node's output for edge C is thus $P_C(0) = P_A(0) \cdot P_B(0) + P_A(1) \cdot P_B(1)$ and $P_C(1) = P_A(0) \cdot P_B(1) + P_A(1) \cdot P_B(0)$. These simple sum and product operations are all that is needed for iterative decoding. This algorithm, elaborated in [11], is appropriately known as the sum-product algorithm.
After transmission on a Gaussian channel, soft probabilities for each bit arrive at the variable nodes. All messages in the decoding network are initialized to {.5, .5}. The channel probabilities are forwarded to the equality nodes, which update their messages on each edge. The equality nodes then transmit to the check nodes, which update their messages. The check nodes then transmit back to the equality nodes, and so on. After many such iterations, the final messages are sent to the variable nodes, where the most probable bit is selected.
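The message schedule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the (7,4) Hamming parity-check matrix `H`, the function names, and the example channel probabilities are assumptions chosen for brevity, not part of any decoder described here. Messages are stored as pairs (P(0), P(1)).

```python
# Hypothetical example graph: parity-check matrix of a (7,4) Hamming code.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def equality_update(msgs):
    """Product of the incoming (P0, P1) messages, normalized to sum to one."""
    p0, p1 = 1.0, 1.0
    for a0, a1 in msgs:
        p0, p1 = p0 * a0, p1 * a1
    s = p0 + p1
    return (p0 / s, p1 / s)

def check_update(msgs):
    """Probability that the modulo-2 sum of the incoming bits is 0 or 1."""
    p0, p1 = 1.0, 0.0
    for a0, a1 in msgs:
        p0, p1 = p0 * a0 + p1 * a1, p0 * a1 + p1 * a0
    return (p0, p1)

def decode(channel, H, iterations=5):
    """channel[v] = (P(bit v = 0), P(bit v = 1)) as delivered by the channel."""
    edges = [(c, v) for c, row in enumerate(H) for v, h in enumerate(row) if h]
    # All messages start at {.5, .5}.
    c2v = {e: (0.5, 0.5) for e in edges}
    v2c = {e: (0.5, 0.5) for e in edges}
    for _ in range(iterations):
        # Equality nodes: combine the channel value with the other checks.
        for (c, v) in edges:
            others = [c2v[(c2, v2)] for (c2, v2) in edges if v2 == v and c2 != c]
            v2c[(c, v)] = equality_update([channel[v]] + others)
        # Check nodes: combine the messages from the other bits in the check.
        for (c, v) in edges:
            others = [v2c[(c2, v2)] for (c2, v2) in edges if c2 == c and v2 != v]
            c2v[(c, v)] = check_update(others)
    # Final decision: most probable value at each variable node.
    bits = []
    for v in range(len(channel)):
        incoming = [c2v[(c, v2)] for (c, v2) in edges if v2 == v]
        p0, p1 = equality_update([channel[v]] + incoming)
        bits.append(1 if p1 > p0 else 0)
    return bits

# All-zero codeword sent; bit 3 arrives leaning the wrong way.
channel = [(0.9, 0.1)] * 7
channel[3] = (0.3, 0.7)
decoded = decode(channel, H)
```

In this example the three checks attached to bit 3 outvote its unreliable channel value, so the all-zero codeword is recovered.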
B. Translinear circuits
Most analog decoders are based on translinear circuits. A translinear device is an idealized transistor in which the device's current is an exponential function of the voltage between two of its terminals. An MOS translinear device, for example, has three terminals: the Gate (G), the Drain (D), and the Source (S). The symbol for an MOS device is shown in Figure 2. There are two types of devices, Positive and Negative, in which the Source and Drain terminals are oppositely oriented. Current in an MOS device flows only between the Drain and the Source, and is approximately determined by the following model:
$$I_{DS} = e^{V_{GS}}, \qquad (1)$$
where $I_{DS}$ and $V_{GS}$ are represented in normalized units, and all devices are assumed to have the same fixed shape and size. The actual device model is of course more complicated, but (1) is sufficient for introductory purposes. Bipolar transistors behave as translinear devices, as do MOS devices when biased in their subthreshold operating region.
Because of this exponential relationship, a translinear circuit may be fully analyzed using a few closed Kirchhoff voltage loops. In these loops, we only traverse the gate-source voltages of transistors. Half of the voltages will be "forward" (with the loop), and the other half will be "backward" (against the loop). We denote the set of devices with forward voltage drops by $M_f$, and the set of devices with backward voltage drops by $M_b$. Kirchhoff's voltage law requires
$$\sum_{j \in M_f} V_j = \sum_{k \in M_b} V_k, \qquad (2)$$
and substituting the exponential model (1) gives
$$\prod_{j \in M_f} I_j = \prod_{k \in M_b} I_k. \qquad (3)$$
Equation 3 is known as the Translinear Principle.
It is evident from (3), as elaborated in [2] , that translinear circuits can be used for multiplication of analog currents. On the other hand, addition of analog currents is accomplished by simply shorting wires. Therefore, by generating analog currents which are proportional to probabilities, the basic computational elements for sum and product operations are provided by translinear devices.
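As a numerical illustration of how a voltage loop turns into a current product, the following sketch assumes the normalized exponential device model I = exp(V) and a hypothetical four-device loop in which the devices carrying `i_x` and `i_y` are traversed forward and those carrying `i_u` and the output are traversed backward. The function names are illustrative only.

```python
import math

def device_current(v_gs):
    """Normalized exponential device model (assumption): I = exp(V)."""
    return math.exp(v_gs)

def device_voltage(i_ds):
    """Inverse of the model above: V = ln(I)."""
    return math.log(i_ds)

def translinear_multiply(i_x, i_y, i_u):
    """Solve the voltage loop V_x + V_y = V_u + V_out for the output current."""
    v_out = device_voltage(i_x) + device_voltage(i_y) - device_voltage(i_u)
    return device_current(v_out)

# Currents proportional to probabilities 0.9 and 0.2, with unit current i_u.
i_u = 1.0
i_out = translinear_multiply(0.9 * i_u, 0.2 * i_u, i_u)
```

Under these assumptions the loop yields `i_out = i_x * i_y / i_u`, that is, a current proportional to the product of the two probabilities.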
Reliability soft information maps directly into the space of signals used in translinear circuits. When currents are proportional to probabilities, it follows immediately that voltages represent log-probabilities, and differential voltages represent log-likelihood ratios. By slightly altering the perspective of analysis, the same circuit can be said to implement probability and log-domain processing simultaneously.
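The correspondence can be written out in one line, assuming the normalized exponential device model (so that a device voltage is the logarithm of its current) and a unit current $I_u$ that represents a probability of one:

```latex
\begin{align}
I(b) &= I_u \Pr(b), & V(b) &= \ln I(b) = \ln I_u + \ln \Pr(b), \quad b \in \{0, 1\}, \\
V(0) - V(1) &= \ln \frac{\Pr(0)}{\Pr(1)}.
\end{align}
```

The common term $\ln I_u$ cancels in the difference, which is why the differential voltage carries the log-likelihood ratio regardless of the chosen unit current.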
C. Analog Sum-Product circuits
It is a simple exercise to construct translinear circuits for Sum-Product decoding of LDPC codes. Two types of nodes are required: parity-check and equality. All nodes can be decomposed into a cascade of nodes with degree three. It turns out that only one circuit is needed to implement both of these nodes. A node's function is fully determined by a small connectivity block [3] , [2] .
An example circuit is illustrated in Figure 3, which shows a circuit for one direction of probability propagation through a node with edges X, Y and Z. The inputs are $X(0) = I_u \cdot \Pr(X = 0)$, $X(1) = I_u \cdot \Pr(X = 1)$, and so on. $I_u$ is a designated current which represents a probability of one. Outputs are produced for edge Z. There are numerous possible variations to Figure 3 for a variety of technologies and operating conditions.
The internal currents in Figure 3, labeled $I_{ij}$, represent the products of X and Y inputs, given by $I_{ij} = X(i) \cdot Y(j)$. Additions are performed by simply connecting wires together. The final results are passed through a pair of P-type current mirrors, which re-orient the currents for input to another node.
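A behavioral sketch of such a node is given below. It is a hedged illustration, not a circuit model: the dictionaries `EQUALITY` and `CHECK` are hypothetical encodings of the connectivity block, the products are scaled by the unit current so that they remain currents, and the output is assumed to be renormalized so that Z(0) + Z(1) equals the unit current.

```python
# Hypothetical connectivity blocks: for each output Z(k), the list of
# product currents I[i][j] wired to it. Unlisted products are discarded.
EQUALITY = {0: [(0, 0)], 1: [(1, 1)]}
CHECK = {0: [(0, 0), (1, 1)], 1: [(0, 1), (1, 0)]}

def node_output(x, y, connectivity, i_u=1.0):
    """x and y are current pairs (X(0), X(1)) and (Y(0), Y(1))."""
    # Product currents, scaled by i_u (assumption) to keep units of current.
    products = [[x[i] * y[j] / i_u for j in range(2)] for i in range(2)]
    # Summation is just wiring: add every product connected to each output.
    z = [sum(products[i][j] for (i, j) in connectivity[k]) for k in range(2)]
    # Assumed renormalization so that Z(0) + Z(1) = i_u.
    total = z[0] + z[1]
    return (i_u * z[0] / total, i_u * z[1] / total)

x = (0.9, 0.1)
y = (0.8, 0.2)
z_check = node_output(x, y, CHECK)
z_equal = node_output(x, y, EQUALITY)
```

The same product array serves both node types; only the wiring table changes, which mirrors the observation that a single circuit implements both functions.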
A complete decoder is constructed by interconnecting many instances of Figure 3. The code's Tanner graph (or factor graph) is mapped directly to a physical circuit: nodes become instances of Figure 3, and edges become wires. While Figure 3 is the most common topology for analog decoding, there are several other approaches in the literature [12], [13].
D. CMOS analog decoders
These circuits use MOS transistors biased in their weak-inversion, or subthreshold, operating region. In this region, the device current is typically less than 100nA. This operating condition has several consequences:
• The transistor is never "turned on." Conventional digital designers often refer to weak-inversion currents as "leakage."
• The power consumed in the transistor is in the nanowatt range or less.
• The transistor is slow: speed is directly proportional to operating current, which is very low. High throughput is obtained through parallelism.
Use of CMOS technology offers several advantages. Because of the simplicity of translinear processing circuits, a fully-parallel iterative decoder can be implemented having complexity similar to a hard-decision digital decoder for the same code. CMOS transistors can be made much smaller than bipolar transistors, reducing the physical size (and therefore cost) of the circuit. CMOS fabrication processes already have lower cost than BiCMOS, so CMOS analog decoders can be made at very low cost. CMOS decoders also have greater potential for use in system-on-chip (SoC) designs. An SoC ASIC ideally includes all RF and baseband signal processing alongside other digital circuits for a variety of applications. These digital circuits are usually CMOS, so integration of a CMOS decoder is preferred. Interestingly, the slow speed (i.e. low bandwidth) of subthreshold MOS transistors is an asset for SoC designs, because the analog decoder inherently rejects high-frequency interference from neighboring digital and RF circuits. The analog decoder also produces no high-frequency interference of its own.
A CMOS analog decoder, with a block length of 256 coded bits or more, can be designed (for present technologies) to operate with a throughput of 500 Mbit per second or more. If a lower speed is desired, then the operating current $I_u$ is simply reduced accordingly, with a corresponding reduction in power consumption. For extreme low-power applications, and for instances where $V_{dd} < 1$ V, a low-voltage topology similar to that of Figure 3 has been devised [14]. These low-voltage circuits can be integrated in SoC designs with micropower digital circuits, which are designed to require minimal supply voltage. Low-voltage analog decoders can potentially operate up to 1 Mbit per second, and offer extremely low-energy operation.
II. ANALOG DECODING ARCHITECTURES
Analog decoder implementations are steadily growing in size. The largest decoder to date has been implemented by the authors with a coded block-length of 256. While this decoder has not yet been tested, we may describe several architectural considerations which began to mature during its design. Analog decoders have unique needs for interfacing with other receiver components, and are affected in a unique way by a range of defects, mostly arising from mismatch. A thorough analysis reveals that all of these challenges can be solved.
A. Interfacing with a larger system
Information is most often communicated serially across a communication channel. Analog decoders most often decode all bits in parallel. Channel samples must therefore be stored as they arrive, so that they may all be presented together for decoding.
The clear solution, used by several designs, is to employ an array of Sample-and-Hold (S/H) circuits, consisting of switched capacitors to store incoming voltages [5] , [15] , [16] .
An S/H input array has been shown by the authors to be effective if the incoming channel information is expressed as a sequence of differential voltages, which correspond directly to log-likelihood ratios. In this case, parasitic effects such as clock feedthrough (or charge injection) have no discernible impact on performance. Another parasitic phenomenon, substrate leakage, has also been shown to have no effect on the stored differential sample. Substrate leakage does have an effect if the data is stored for a very long time (i.e. if the block length is extremely large), but the authors have shown elsewhere that millions of samples can be stored before performance is adversely affected [10].
In a sophisticated communication system, several components may precede the decoder in a receiver design. If the receiver consists only of a demodulator which outputs analog log-likelihood ratios, then an analog decoder can be directly used. If other processing must occur prior to decoding, then there are two options. First, we might perform all preprocessing stages using analog circuits. This is an attractive option in principle, in that it eliminates the need for a power hungry analog-to-digital converter in the receiver.
A second approach allows a more graceful integration between the analog and digital domains. We may simply insert a digital-to-analog converter at the front of the S/H array. If a powerful code is to be used, then the gains achieved through analog decoding outweigh the additional burden of a DAC. Some previous analog decoders have employed DACs at the input, using a separate minimally-designed DAC for each channel sample. This approach exhibited poor results [4], [17]. A single well-designed, high-quality DAC is a far superior solution, allowing analog decoders to be directly inserted in an existing receiver design.
A complete architecture for analog decoding is shown in Figure 4. The inputs are first converted into analog differential voltages, which correspond to LLRs. These analog voltages are then loaded step-by-step into an array of S/H registers. When a complete block is received, all samples are loaded into a second stage of S/H registers, which store the analog information for decoding. After decoding is finished, the analog outputs are presented to an array of comparators, whose binary decisions are forwarded to a bank of shift registers. The decoded results are output serially from the head of the shift-register chain.
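The data flow through this architecture can be mimicked by a small behavioral model. Everything here is an illustrative assumption: the class and method names are hypothetical, the decoder core is passed in as an arbitrary function, and a positive differential voltage is taken to mean that bit 0 is more likely.

```python
from collections import deque

class DecoderFrontEnd:
    """Behavioral sketch: serial in, parallel decode, serial out."""

    def __init__(self, block_len, decode):
        self.block_len = block_len
        self.decode = decode          # stand-in for the analog decoder core
        self.input_sh = []            # first S/H stage, loaded sample by sample
        self.out_shift = deque()      # output shift-register chain

    def push_sample(self, llr_voltage):
        """Store one incoming differential voltage (an LLR)."""
        self.input_sh.append(llr_voltage)
        if len(self.input_sh) == self.block_len:
            # Second S/H stage: the whole block is transferred at once.
            decode_sh = list(self.input_sh)
            self.input_sh.clear()
            soft_out = self.decode(decode_sh)
            # Comparator array: the sign of each soft output gives the bit
            # (assumed convention: positive voltage means bit 0).
            self.out_shift.extend(1 if v < 0 else 0 for v in soft_out)

    def pop_bit(self):
        """Read one decoded bit from the head of the shift-register chain."""
        return self.out_shift.popleft() if self.out_shift else None

# Pass-through "decoder" used only to exercise the data flow.
front_end = DecoderFrontEnd(4, lambda llrs: llrs)
for sample in [1.2, -0.4, 0.3, -2.0]:
    front_end.push_sample(sample)
bits = [front_end.pop_bit() for _ in range(4)]
```

Because the first stage is cleared as soon as a block is transferred, the next block can be loaded while the current one is being decoded, which is the purpose of the two-stage arrangement.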
B. Evaluating decoder performance
As with any physical implementation, analog decoders do not represent the sum-product algorithm with perfect fidelity. Even the best analog circuits exhibit a variety of small distortions. It is difficult to say whether non-ideal behavior will significantly impact performance of the decoder. One might also speculate that non-ideal behavior could lead to unexpected error flares or floors. The authors have shown, using the method of importance sampling, that this is not the case [18] .
As an example of non-ideal behavior, CMOS analog decoders can be biased to operate above threshold, where the translinear principle no longer works. The advantage of doing this is a potential order of magnitude increase in throughput. Operating above threshold causes a large reduction in the circuit's accuracy. In spite of this, experiments have found that above-threshold decoders perform well, losing up to 0.5 dB in $E_b/N_0$.
Full simulation of an analog decoding circuit, using SPICE, is computationally intensive. In fact, for block lengths greater than about 256, it may be infeasible to simulate the decoder at all. The only recourse is to use simplified models in a high-level description language. The behavior of a node such as Figure 3 can be modeled using lookup tables, for example. But even this technique proves too complex to measure low error rates in a large decoder.
For block codes and Turbo product codes, importance sampling provides a useful alternative [19]. Importance sampling works by biasing the channel noise toward significant error events. When a minimum-distance codeword $x_m$ is known, the mean of the channel noise is shifted to the midpoint between $x_m$ and the actual transmitted codeword $x$. The resulting mean-shifted noise density function is written $f^*(x)$, whereas the original zero-mean Gaussian noise density is written $f(x)$.
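The probability $P_m$ of the $x_m$ error-pattern is calculated as
$$P_m = \int 1_{x_m}(x)\, \frac{f(x)}{f^*(x)}\, f^*(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} 1_{x_m}(x_i)\, \frac{f(x_i)}{f^*(x_i)},$$
where the $N$ samples $x_i$ are drawn from $f^*$, and $1_{x_m}(x)$ is an indicator function returning 1 if the error pattern is $x_m$, and 0 otherwise. The resulting estimator is unbiased, and the number of samples required for a given precision depends only weakly on the SNR.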
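A minimal sketch of this mean-shift estimator is shown below, under several stated assumptions: BPSK symbols of ±1, white Gaussian noise of standard deviation `sigma`, and a stand-in minimum-distance decoder over a two-word codebook in place of a circuit simulation. The function names and the example codewords are hypothetical. For this two-word case the exact answer is known in closed form, which makes the sketch easy to check.

```python
import math
import random

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def min_distance_decoder(codebook):
    """Stand-in decoder: returns the codeword closest to the received vector."""
    def decode(received):
        return min(codebook,
                   key=lambda c: sum((r - s) ** 2 for r, s in zip(received, c)))
    return decode

def estimate_pm(x, x_m, sigma, n_samples, decoder, rng):
    """Importance-sampling estimate of the probability of decoding x as x_m."""
    # Noise mean that places the received vector midway between x and x_m.
    shift = [(b - a) / 2.0 for a, b in zip(x, x_m)]
    total = 0.0
    for _ in range(n_samples):
        noise = [s + rng.gauss(0.0, sigma) for s in shift]   # drawn from f*
        received = [a + n for a, n in zip(x, noise)]
        if decoder(received) == x_m:
            # Weight f(noise) / f*(noise) for the two Gaussian densities.
            log_w = sum((n - s) ** 2 - n ** 2
                        for n, s in zip(noise, shift)) / (2.0 * sigma ** 2)
            total += math.exp(log_w)
    return total / n_samples

# Transmitted word and a competitor differing in four positions.
x = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
x_m = (-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0)
sigma = 2.0 / 3.0
rng = random.Random(1)
estimate = estimate_pm(x, x_m, sigma, 20000, min_distance_decoder([x, x_m]), rng)
exact = q_function(2.0 / sigma)   # half the Euclidean distance, in sigmas
```

Roughly half of the biased samples produce the error event, so a modest number of samples suffices even though the event itself has probability near 10^-3 in this example.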
For small codes with a well-understood codeword geometry, this method can be used with SPICE simulations to provide extremely accurate predictions of decoder performance. As an example, importance-sampling simulations for an analog (8,4) Hamming decoder are shown in Figure 5. To produce these results, a SPICE simulation file is automatically generated for each sample. Only a few hundred samples are needed for each data point. Figure 5 shows results for subthreshold and above-threshold operation, along with above-threshold measurements from a physical decoder.
For larger codes, importance sampling still applies, but SPICE may be unusable. In this case, the system may be modeled using a high-level analog hardware description language. This was demonstrated in [20], and the results are shown in Figure 6 for a $(16, 11)^2$ Turbo product code. Both full and punctured versions of the code are represented. Also shown is simulated data from an Advanced Hardware Architectures decoder for the full $(16, 11)^2$ Turbo product code. Importance sampling allows evaluation of analog decoder performance during the design stage, which would otherwise be impossible. For small codes, it allows a level of detail which would not otherwise be available. In addition to this, importance sampling allows simulating a decoder's performance at very low error rates.
C. Mismatch and large decoders
It is well known that the performance of all analog circuits, usually measured by precision or accuracy, is reduced by random variations in device characteristics, known as mismatch. Mismatch has not been found to be a serious problem for the analog decoders implemented to date. All of these decoders, however, are for simple codes with short block lengths. The question remains whether mismatch will "build up" as the decoder's size increases, resulting in increasing performance loss or even unexpected error floors. The counterpoint to this fear is that a decoder should be robust against "internal noise," just as it is robust against channel noise.
Both of these arguments are conjectures only. The authors have conducted a concrete analysis, based on Density Evolution [21] , [22] for LDPC decoders. Density Evolution makes the important assumption that the code's size is arbitrarily large in order to avoid intractable signal correlations. If mismatch has a destructive effect which scales with code size, then Density Evolution should find mismatch at its worst.
To implement the Density Evolution analysis, we employ the Gaussian approximation, in which we need only track the mean log-likelihood ratio after each iteration [21] . To compute this mean while accounting for mismatch variations, an adaptive Monte Carlo integration, based on the Vegas algorithm [23] , is performed over eight dimensions. Two of those dimensions represent the incoming messages, and the other six are mismatch parameters. By tracking the mean, we are able to determine whether the decoder converges to error-free behavior. Beyond a certain signal-to-noise ratio, called the threshold, decoding is error-free. Below that threshold, decoding is not error-free.
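For orientation, the mismatch-free baseline of such an analysis can be sketched as a one-dimensional recursion on the mean LLR. This sketch makes several assumptions that go beyond the text: a regular (3,6) ensemble, BPSK on an AWGN channel with noise standard deviation `sigma`, and a commonly used exponential curve fit for the function relating a mean LLR to the expected tanh term. It omits the mismatch parameters and the Monte Carlo integration entirely.

```python
import math

# Curve-fit constants for phi(m) ~ exp(ALPHA * m**GAMMA + BETA); treating
# this fit as valid for all m > 0 is an additional simplifying assumption.
ALPHA, BETA, GAMMA = -0.4527, 0.0218, 0.86

def phi(m):
    """Approximate 1 - E[tanh(u/2)] for u ~ N(m, 2m)."""
    return math.exp(ALPHA * m ** GAMMA + BETA)

def phi_inv(y):
    """Inverse of the curve fit above."""
    return ((math.log(y) - BETA) / ALPHA) ** (1.0 / GAMMA)

def converges(sigma, dv=3, dc=6, max_iter=2000, cap=50.0):
    """True if the mean check-to-variable LLR grows past `cap`."""
    m_ch = 2.0 / sigma ** 2     # mean channel LLR for unit-energy BPSK
    m_c = 0.0                   # mean check-to-variable message
    for _ in range(max_iter):
        m_v = m_ch + (dv - 1) * m_c                  # variable-node update
        y = 1.0 - (1.0 - phi(m_v)) ** (dc - 1)       # check-node update
        m_c = phi_inv(y)
        if m_c > cap:
            return True                              # effectively error-free
    return False                                     # stuck at a fixed point

low_noise_ok = converges(0.80)
high_noise_ok = converges(0.95)
```

A threshold estimate follows by bisecting on `sigma` between a converging and a non-converging value; an analysis that includes mismatch would replace the two closed-form updates with numerically integrated means.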
Using this analysis, we obtain the threshold in dB, $\tau(\sigma_m)$, as a function of the standard deviation of mismatch, $\sigma_m$. Using $\tau_0 = \tau(0)$, we define the threshold loss as $\Delta\tau(\sigma_m) = \tau(\sigma_m) - \tau_0$. The threshold loss is found to be nearly the same for a variety of regular LDPC ensembles, including all those evaluated in [21]. The threshold loss versus mismatch is shown in Figure 7. The results demonstrate that mismatch can build up catastrophically for large decoders, but only when $\sigma_m$ is large. Mismatch better than 25% is easily obtained, even in sloppy designs. We can only conclude that mismatch poses no threat to the performance of a sufficiently powerful analog decoder.
D. Comparators and offset errors
A complete analog decoding architecture must employ one or more comparators to convert the analog soft outputs into digital decisions. It is somewhat easier to design a good low-speed, low-energy comparator than a high-speed one, so a large array of output comparators is preferred. All comparators exhibit an unwanted input offset voltage, which is mainly due to mismatch. There are several techniques for compensating offset in comparator designs, but for analog decoders we are interested in a minimalist approach. A very simple, low-energy comparator design may exhibit a large variance of offset voltages. We now address the effect this will have on performance.
As in the Density Evolution analysis, we employ the Gaussian approximation, in which we assume that the log-likelihood ratio at the decoder's output, $X$, is Gaussian distributed, and that its mean is proportional to its variance. We also assume that the offset voltage is Gaussian distributed with zero mean. Because a differential voltage is directly proportional to a log-likelihood ratio, we may speak in terms of the offset log-likelihood, $X_o$. We further assume that the decoder can employ a built-in self-test in which any offset magnitude greater than a specified limit $L$ is detected, and such chips are discarded as failures. Usually $L$ should be chosen to be a fraction of the decoder's maximum possible output.
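Based on these assumptions, we may say that the zero-offset error probability, $P_e$, is given by
$$P_e = Q\left(\frac{\mu}{\sigma}\right),$$
where $\mu$ and $\sigma^2$ are the mean and variance of $X$, and $Q()$ is the well-known Gaussian error integral function. When the offset is accounted for, we find that the error probability with offset, $P_e(X_o)$, is given by
$$P_e(X_o) = \frac{\int_{-L}^{L} Q\left(\frac{\mu - x}{\sigma}\right) G\left(\frac{x}{\sigma_o}\right) dx}{\int_{-L}^{L} G\left(\frac{x}{\sigma_o}\right) dx},$$
where $\sigma_o$ is the standard deviation of the offset and $G()$ is the Normal density function. Figure 8 displays the ratio
$$\frac{P_e(X_o)}{P_e}$$
as a function of offset standard deviation (expressed in the LLR domain). Results are presented for several zero-offset BERs. If the offset deviation is less than two, then the BER is increased by a constant factor independent of SNR. It therefore seems that a large range of offsets is acceptable.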
In practice, the offset deviation depends on a variety of factors including the exact circuit design, the fabrication technology, and the signal scaling between LLRs and differential voltages. These factors can be evaluated for any particular design. In general, as Figure 8 demonstrates, the limiting concern in comparator design is not performance but yield. Suppose we want a failure rate of one chip per thousand, and each chip has one thousand comparators. Because we discard any chip in which a comparator's offset magnitude exceeds L, we must achieve a comparator failure rate better than one in a million. The decoder determines L, and the comparator design must be adjusted accordingly to meet yield requirements.
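The yield arithmetic in this example can be checked with a short calculation. The sketch assumes that comparator offsets are independent across the chip and zero-mean Gaussian, as stated above; the function names are hypothetical. It computes the per-comparator failure rate implied by the chip-level target, and then the limit $L$ expressed as a multiple of the offset standard deviation.

```python
import math

def per_comparator_failure(chip_failure, n_comparators):
    """Failure rate each comparator must meet, assuming independent offsets."""
    return 1.0 - (1.0 - chip_failure) ** (1.0 / n_comparators)

def two_sided_tail(k):
    """P(|offset| > k standard deviations) for a zero-mean Gaussian offset."""
    return math.erfc(k / math.sqrt(2.0))

def required_limit_in_sigmas(p_fail, lo=0.0, hi=10.0):
    """Bisection for the k at which the two-sided tail equals p_fail."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if two_sided_tail(mid) > p_fail:
            lo = mid      # tail still too heavy: the limit must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# One failed chip per thousand, one thousand comparators per chip.
p_comp = per_comparator_failure(1e-3, 1000)
k = required_limit_in_sigmas(p_comp)
```

Under these assumptions the per-comparator rate comes out very close to one in a million, and the limit $L$ must sit a little under five offset standard deviations from zero. Read the other way, once the decoder fixes $L$, the comparator's offset standard deviation must be held below roughly $L/5$.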
III. OUTLOOK
We seem to be standing on the verge of a revolution in digital receiver processing, with iterative algorithms poised to replace other, conventional detection and decoding algorithms. Iterative receiver processing of signals from intersymbol interference channels, as well as from code-division multiple access (CDMA) and multiple-input multiple-output (MIMO) channels, has already been studied in a series of papers, and its strong potential to outperform conventional receiver structures has been well documented.
In [24] it is shown that simple cancellation type receivers, combined with judicious choices of soft-output error control codes and rate/power allocations, can achieve the information theoretic capacity of random CDMA channels. The extension of analog processing techniques to such cancellation receivers is immediate. The signals to be subtracted are modulated by the soft-bits of the different users. These soft-bits are generated as $\tilde{b} = \Pr(b = 1) - \Pr(b = 0)$, which can easily be accomplished with analog circuitry.
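The soft-bit has a simple form in both signal domains, as the following sketch shows. The function names are hypothetical, and the log-likelihood ratio is assumed here to be defined as ln(Pr(b = 1) / Pr(b = 0)).

```python
import math

def soft_bit_from_probability(p1):
    """Soft-bit Pr(b = 1) - Pr(b = 0), given Pr(b = 1)."""
    return p1 - (1.0 - p1)

def soft_bit_from_llr(llr):
    """Same quantity from llr = ln(Pr(b = 1) / Pr(b = 0)) (assumed convention)."""
    return math.tanh(llr / 2.0)

p1 = 0.8
from_prob = soft_bit_from_probability(p1)
from_llr = soft_bit_from_llr(math.log(p1 / (1.0 - p1)))
```

In the probability domain the soft-bit is a difference of two currents; in the log domain it is the hyperbolic tangent of half the differential voltage, so either representation used by the decoder leads to it directly.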
