Abstract: Several programmable analog interleaver architectures for iterative decoders are proposed. The architectures are evaluated in terms of transistor count, path resistance, path capacitance, and programming logic. Interleavers built out of networks consisting of three layers of small crossbars are often found to be the best, reducing both switch count and capacitance by over 70% for an interleaver size of 100 compared with a full crossbar, while maintaining full programmability.
I. INTRODUCTION

Several families of very powerful error-correcting codes, whose performance on additive white Gaussian noise channels approaches the Shannon limit, have recently emerged. Among these are Turbo codes [1], serially concatenated convolutional codes [2], and low-density parity-check codes [3]. Decoding algorithms for such codes follow an iterative procedure to successively improve estimates of decoded symbols. Unfortunately, such iterative procedures lead to long decoding latencies and to high power consumption, both undesirable for wireless devices. Hence analog techniques are now being considered as an option for decoding [4]-[6]. To date, most analog decoders take advantage of the low silicon requirements of the associated circuitry in order to parallelize the decoding process. In other words, the trellises or factor graphs [7] on which soft-output algorithms operate are laid out in silicon. This has proven to be quite effective and has led to very fast and low-power analog maximum a posteriori (MAP) decoders [8].
Fig. 1 shows a typical turbo decoder block diagram. Turbo decoders and other such iterative decoders require interleavers, preferably programmable, which permute the decoded symbols from one soft-output decoder for use by another soft-output decoder. In Fig. 1 the interleavers are denoted (interleaver) and (deinterleaver). In an analog decoder, the interleaver performs a spatial permutation of the soft information from one soft-output decoder, to be used by the other decoder. The permutation is denoted , and the permuted bits are determined such that . One of the challenges of building an analog iterative decoder involves the design of these analog interleavers, each capable of performing any desired permutation of its symbols. The concepts involved in the design of analog interleavers are similar in many ways to those encountered in switching, for example in broadband photonic switching [9] and in multiprocessor memory architectures [10].
In this paper, several architectures for programmable analog interleavers are discussed and compared. The paper is organized as follows. Section II describes in general terms some implementation issues specific to analog interleavers. Section III introduces crossbars as a method of generating programmable analog interleavers. Section IV proposes several programmable analog interleaver architectures, based on networks of crossbars. Section V evaluates and compares the architectures based on several metrics. Section VI provides a delay analysis of the interleaver designs. Section VII concludes this paper.
II. IMPLEMENTATION ISSUES
A. Temporal Versus Spatial Interleaving
In a digital decoder design, an interleaver is normally implemented using SRAM memories. As the first soft-output decoder in a Turbo decoder produces its extrinsic information in serial order, it writes the data in the same order into an SRAM memory. Then the second soft-output decoder reads the information in a permuted order, the order either being stored in a separate memory block or being generated on-the-fly using algorithmic techniques. Using the former method, different permutations can be realized by storing a different reading order in the separate memory block. A similar procedure is used for the deinterleaving process, using the reverse order of input and output pointers to the memory.
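The write-serially, read-permuted procedure described above can be sketched in a few lines. This is only an illustration: the function names and the example reading order are hypothetical, and a Python list stands in for the SRAM.

```python
def interleave(symbols, read_order):
    """Write symbols serially into memory, then read them in permuted order."""
    memory = list(symbols)                        # serial write
    return [memory[addr] for addr in read_order]  # permuted read


def deinterleave(symbols, read_order):
    """Swap the pointer roles: permuted write, then serial read."""
    memory = [None] * len(symbols)
    for addr, value in zip(read_order, symbols):
        memory[addr] = value                      # permuted write
    return memory                                 # serial read


if __name__ == "__main__":
    order = [2, 0, 3, 1]            # hypothetical stored reading order
    data = ["a", "b", "c", "d"]
    mixed = interleave(data, order)
    print(mixed, deinterleave(mixed, order))
```

Storing a different `read_order` realizes a different permutation without touching the datapath, which is the programmability the spatial designs below aim to retain.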
We could extend this notion of interleaving in time to the analog domain by designing an analog memory cell and reading from and writing to that cell in a way analogous to the digital memory cell. The analog memory cell would likely store a voltage on a capacitor. However, this would discretize the decoding process in time and would, therefore, eliminate many of the benefits of continuous-time analog decoding.
Thus, instead of designing an interleaver which performs permutations in time, we will concentrate on interleavers which perform permutations in space, where symbols at one location on a chip are permuted to another location using switches.
B. Fixed Versus Programmable Interleaver Design
The current state-of-the-art analog interleaver is a fixed design implemented using wires. The design permutes a subset of the code's information bits in a regular, predetermined fashion [11]. If all information bits in a code are permuted this way, then wires per interleaver are required ( wires are required if differential signaling is used).
In order to function with various standards, an interleaver should, however, be programmable to perform several, if not all, of the possible permutations of its inputs. Hence, in this paper, we advance the state of the art by concentrating only on programmable interleaver designs, and specifically on those designs on which every permutation can be programmed.
C. Voltage Mode Versus Current Mode
The analysis of current mode circuitry is quite different from the analysis of voltage mode circuitry. In a current mode circuit, currents which are used by more than one subsequent circuit must be copied, whereas in a voltage mode circuit the voltage can be used as input to several subsequent circuits. In our analysis, we will assume that currents are copied before entry into a purely passive permutation network. Furthermore, we assume that each input to the permutation network is connected to exactly one output, thus making current copying unnecessary within the permutation network.
For this purpose we assume that switches are composed of a pass transistor, which passes either voltages or currents. Since pass transistors are bidirectional in nature, the same programming memories can be used to program an interleaver and its corresponding deinterleaver. More sophisticated switches like transmission gates could of course be used instead of pass transistors with little effect on the analysis herein.
With this type of permutation network, we can implement turbo decoders since they satisfy the one input to one output criterion, as well as other codes such as low-density parity-check codes which do not satisfy the criterion but where inputs being copied to multiple outputs can be copied before entry into the network.
III. INTERLEAVER DESIGN USING CROSSBARS
A. Crossbar Design
For the purpose of designing programmable spatial interleavers, we will use networks of complete crossbars with inputs and outputs, termed of size . In a crossbar, a separate switch connects each input to each output. The n-channel metal-oxide-semiconductor (NMOS) pass transistor is chosen as a switch because of its easy programmability, its bidirectional nature, and its ability to deal with both voltage-mode and current-mode signals. The state of a switch is stored in a flip-flop, located close to the switch, and is programmed at startup in a serial fashion.
A crossbar of inputs and outputs can be built using switches per channel; for example, a turbo decoder with an interleaver-deinterleaver pair, which uses differential signaling would have four channels. In such a crossbar, each signal passes through exactly one transistor to reach any output from any input. Furthermore, each signal is loaded by the capacitance of transistor sources and drains, and, thus, the total parasitic capacitance is , where corresponds to one source or one drain capacitance. Such a crossbar, having four inputs and outputs, is depicted in Fig. 2 .
B. Programmability
Two methods of programming the crossbar switches are available. The first one uses one flip-flop per switch. The other method uses fewer flip-flops, but uses decoders to program the switches. Using the first method, a flip-flop is placed near every switch. The output of the flip-flop programs the switch. The flip-flops are programmed serially at power-up, and can be set up as a scan chain for ease of testability. Using this method, a total of flip-flops and no extra decoding logic are required. This method is the one depicted in the crossbar in Fig. 2 .
Since each input in a crossbar is connected to exactly one output (as specified in Section II-C), we can use a one-of- decoder for each input to program which switch connected to that input is active. A crossbar requires one such decoder for each of the inputs, each decoder having itself inputs. Thus, flip-flops are used. An upper bound on the number of transistors in the decoding logic is , assuming a static CMOS-based decoder, an example of which is depicted in Fig. 3. It is an upper bound since only signals need to be produced by each decoder, and not necessarily all combinations. The static CMOS decoder is used since it has a strong pull-up that maximizes the voltage range of the pass transistor which it controls. Fig. 4 depicts a flip-flop/decoder arrangement for the crossbar. Note that this method might be impractical in terms of layout for large crossbars, since wires must run in parallel from each decoder to each row of switches; the multiple-crossbar networks of Sections IV-B to IV-E should not run into this problem, though.
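The two programming methods can be compared with a small sketch. The function names are hypothetical, and the decoder word width is an assumption: a binary-coded one-of-n decoder is taken to need the smallest number of bits that can address n outputs.

```python
def crossbar_switch_matrix(perm):
    """Method 1: one flip-flop per switch. Row i holds the switch states
    of input i; exactly one switch per row is on."""
    n = len(perm)
    return [[1 if perm[i] == j else 0 for j in range(n)] for i in range(n)]


def decoder_words(perm):
    """Method 2: one binary word per input, expanded by a one-of-n decoder.
    Assumption: the word is the binary index of the selected output."""
    n = len(perm)
    bits = max(1, (n - 1).bit_length())
    return [format(perm[i], f"0{bits}b") for i in range(n)]


def flip_flop_counts(n):
    """Return (flip-flops for method 1, flip-flops for method 2)."""
    bits = max(1, (n - 1).bit_length())
    return n * n, n * bits


if __name__ == "__main__":
    perm = [2, 0, 3, 1]   # hypothetical: input i is connected to output perm[i]
    for row in crossbar_switch_matrix(perm):
        print(row)
    print(decoder_words(perm), flip_flop_counts(len(perm)))
```

Under these assumptions the decoder method trades flip-flops for decoding logic, which is why both counts are tracked as separate metrics in Section IV.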
IV. INTERLEAVER ARCHITECTURES
In this section, we explore several interleaver architectures composed of one or several crossbars. Designs will be evaluated and compared in terms of five metrics.
The first metric is the total number of switches per channel used. A smaller number of switches usually implies a lower silicon area.
The second metric is the number of switches a signal must go through from any input to any output. A greater number of switches incurs a larger amount of circuit nonidealities such as parasitic resistance or noise.
The third metric is the total parasitic capacitance along any path through the interleaver. This metric is quantified in terms of the number of transistor sources and drains on any path, where the capacitance of one source or one drain is denoted . For the purposes of this metric, switches are assumed to be NMOS pass transistors, though the analysis is easily extended to other types of switches. Wiring capacitance is ignored, though it should be emphasized that it could increase total path capacitance for large designs.
The fourth metric is the total number of flip-flops used to program the interleaver. A flip-flop/decoder arrangement, as discussed in Section III-B, is assumed. Otherwise, the number of flip-flops equals the number of switches per channel. A fifth, related metric counts the total number of transistors in any logic used to decode the programming bits. The decoder logic is assumed to be built in static CMOS, as depicted in Fig. 3.
Most of the designs described below contain restrictions on the number of inputs and outputs. In cases where it would be advantageous (in terms of transistor count, path length, or capacitance), a frame can be buffered with zeros, using the next larger available interleaver size; some known dc voltage or current can be applied to unused inputs. Five interleaver architectures are described in the next subsections.
A. Design Using One Crossbar
An obvious interleaver design uses one single crossbar where . Of course, it uses switches. Each signal passes through exactly one transistor, and is loaded by source or drain capacitances. The flip-flop count, assuming decoders, is . The number of transistors in the decoding logic is bounded by . In the next subsections, we see how to drastically reduce the transistor count and capacitance by allowing an increase in the number of pass transistors along any path.
B. Design Using Networks of Butterfly Switches
The second programmable interleaver is based on sorting networks. A sorting network is a combination of parallel processors which can sort a sequence of length . If a sorting network is capable of sorting any sequence of length , it is obvious that it is also capable of producing any permutation of its inputs.
One such sorting network, realized using parallel butterfly switches, is described in [12]; note that a butterfly switch is simply a crossbar where . It is constructed using several levels of butterfly switches, with the wires between subsequent levels performing a known, fixed permutation. Each butterfly switch is programmed using one configuration bit stored in a shift register. All permutations of the bit positions are possible by using levels, each containing butterfly switches. An interleaver designed using butterfly switches uses switches per channel. Each path from input to output crosses pass transistors. The total source/drain capacitance along any path is . Fig. 5 shows a transistor-level implementation of a butterfly switch and how it is programmed. Fig. 6 shows a complete interleaver of size . Though perhaps excessive in terms of path length, this interleaver has some interesting very large-scale integration (VLSI) properties, namely the reuse of a simple building block, the butterfly switch. Also, since butterfly switches only have two states, they can be programmed using one flip-flop each. No extra logic is required if we assume flip-flops which produce both and outputs. The total number of flip-flops is given by .
C. Design Using Three Levels of Small Crossbars
The third interleaver architecture is designed using a network of intermediate-sized crossbars, and is similar to another sorting network first published during the 1960s and used in telephone switches [13]. If is a square number, it is always possible to build a network, consisting of a total of crossbars of size , which can implement all permutations of . The crossbars are organized into three layers of size crossbars. Between successive layers, each crossbar has exactly one connection to each crossbar on the next level. That is, the th crossbar's th output (where and ) on the first level is connected to the th crossbar's th input on the second level, with the same interconnection pattern between the second and third levels. Such a network for is shown in Fig. 7. The switch count per channel for this style of interleaver is . Each signal passes through exactly three transistors. The total source/drain capacitance is . The number of flip-flops is given by and the transistor count in the decoding logic is bounded by .
D. Hierarchical Design
If is the square of a square number (that is, , an integer), it is possible to recursively replace each -sized crossbar from Section IV-C with its own network of -sized crossbars. Such a network contains switches per channel, distributed over crossbars, has a path length of nine, and has a source-drain capacitance of along any path. Such a simplification may only be useful for large sizes of . The flip-flop count is given by and the transistor count for the decoding logic is bounded by .
E. Improvement on the Design Using Three Levels of Small Crossbars
In Section IV-C, a three-level interleaver architecture composed of crossbars of size was introduced. Each level contained crossbars, with the restriction that must be a square number. The concept can also be generalized to the case where is a composite number [14]. Consider a network containing crossbars of size on the first level, crossbars of size on the second level, and crossbars of size on the third level, where . Between successive levels, each crossbar transmits one of its outputs to each of the crossbars on the next level. That is, the th crossbar's th output (where and ) on the first level is connected to the th crossbar's th input on the second level; the th crossbar's th output (where and ) on the second level is connected to the th crossbar's th input on the third level. We call these networks. Fig. 8 depicts such a network for , with and . This type of architecture uses switches per channel. The path length is always three transistors, and the path capacitance is . The flip-flop count is given by and the number of decoding transistors is bounded by .
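The text states that such a three-level network can realize every permutation but does not say how the switches are chosen. One standard way, sketched here and not taken from the paper, is to split the connections into perfect matchings between first-level and third-level crossbars and give each matching to one second-level crossbar. The names `num_outer` (number of first-level crossbars) and `size` (inputs per first-level crossbar) are hypothetical, as is the convention that input i sits on first-level crossbar i // size.

```python
import random


def route_three_level(perm, num_outer, size):
    """Return middle[i], the second-level crossbar carrying input i,
    such that no inter-level wire is used by two signals."""
    n = num_outer * size
    assert sorted(perm) == list(range(n))
    # pending[a][b]: inputs on first-level crossbar a bound for third-level crossbar b
    pending = [[[] for _ in range(num_outer)] for _ in range(num_outer)]
    for i, dest in enumerate(perm):
        pending[i // size][dest // size].append(i)
    middle = [None] * n
    for m in range(size):                    # one perfect matching per middle crossbar
        owner = [None] * num_outer           # third-level crossbar -> first-level crossbar

        def augment(a, seen):
            # Augmenting-path search (Kuhn's bipartite matching algorithm)
            for b in range(num_outer):
                if pending[a][b] and b not in seen:
                    seen.add(b)
                    if owner[b] is None or augment(owner[b], seen):
                        owner[b] = a
                        return True
            return False

        for a in range(num_outer):
            if not augment(a, set()):
                raise RuntimeError("no perfect matching found")
        for b, a in enumerate(owner):
            middle[pending[a][b].pop()] = m
    return middle


def routing_is_conflict_free(perm, middle, size):
    """Check that every first-to-second and second-to-third wire is used once."""
    first = {(i // size, m) for i, m in enumerate(middle)}
    third = {(m, perm[i] // size) for i, m in enumerate(middle)}
    return len(first) == len(perm) and len(third) == len(perm)


if __name__ == "__main__":
    rng = random.Random(0)
    perm = list(range(100))
    rng.shuffle(perm)
    middle = route_three_level(perm, num_outer=20, size=5)
    print(routing_is_conflict_free(perm, middle, size=5))
```

A perfect matching always exists at each step because every first-level and third-level crossbar still has the same number of unrouted signals, so the remaining connection graph is regular.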
V. COMPARISON OF INTERLEAVING ARCHITECTURES: A QUANTITATIVE APPROACH
We now compare the five interleaver architectures presented above by counting the total number of switches, path length, path capacitance, flip-flop count, and decoder logic complexity. We compare the architectures for example interleaver sizes of , and . Results are given in Tables I-V. Zero padding is used where is not an allowed size in the given architecture. The networks are frequently the best in terms of switch count and capacitance. For example, with , a network uses about 70% fewer switches and has about 70% less path capacitance than a full crossbar. Fig. 9 plots the switch count for all architectures described in Section IV versus interleaver size for all interleaver sizes between 2 and 256. Zero buffering is assumed where the interleaver size is not allowed in the given architecture, or if it would lead to a smaller design in terms of switch count. Fig. 10 plots the capacitance per path, in terms of unit capacitance. Again, the networks are clearly the winners.
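The roughly 70% figures for an interleaver size of 100 can be checked with a short calculation. The counts below are derived from the structural descriptions in Sections III and IV, not copied from the tables, and rest on two assumptions: an n-by-m crossbar has n times m switches, and a signal crossing an n-by-m crossbar sees m source/drain capacitances on its input wire and n on its output wire. The function and parameter names are hypothetical.

```python
def crossbar_metrics(n):
    """Single n-by-n crossbar: switches per channel, path length, path capacitance
    (in units of one source or drain capacitance)."""
    return {"switches": n * n, "path_length": 1, "path_cap": 2 * n}


def three_level_metrics(num_outer, size):
    """Three-level network: num_outer crossbars of size-by-size on levels one and
    three, and size crossbars of num_outer-by-num_outer on level two."""
    switches = 2 * num_outer * size * size + size * num_outer * num_outer
    path_cap = 2 * size + 2 * num_outer + 2 * size
    return {"switches": switches, "path_length": 3, "path_cap": path_cap}


def reduction(full, reduced):
    """Fractional saving of `reduced` relative to `full`."""
    return 1.0 - reduced / full


if __name__ == "__main__":
    full = crossbar_metrics(100)
    net = three_level_metrics(10, 10)
    print(full, net)
    print(reduction(full["switches"], net["switches"]),
          reduction(full["path_cap"], net["path_cap"]))
```

Under these assumptions a size-100 network built from ten-input crossbars uses 3000 switches against 10 000 for the full crossbar, and 60 unit capacitances per path against 200.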
VI. ELMORE DELAY INTERPRETATION
We have now seen that by tolerating an increase in the number of switches a signal must cross on its path through the interleaver, we can significantly decrease interleaver size and path capacitance. By using networks, path capacitance is reduced by 70% for an interleaver size of , as opposed to a full crossbar. However, the advantage of decreasing the capacitance by two-thirds while tripling the resistance is not immediately obvious. The advantage can be explained in terms of Elmore delay [15] . If the capacitance and resistance in a resistance-capacitance (R-C) network are not lumped into one single resistor and one single capacitor, but distributed over a few distinct nodes, then the delay is not as large as might be expected by calculating a simple R-C delay, and can be calculated more accurately by the Elmore delay.
Let us first examine the R-C network corresponding to a single crossbar, where the total path resistance is and the total capacitance due to sources and drains is . The capacitance is distributed so that are located on either side of the resistor, as illustrated in Fig. 11(a) . Also, the network is driven by a voltage source with finite impedance .
The Elmore delay of this network is . We now shift our attention to the R-C network corresponding to a network, as illustrated in Fig. 11(b) . Each resistor is equal to . Here the capacitance is distributed onto four nodes. The first and last node each get , whereas the two internal nodes each get . This is a direct result of the structure of a network, where twice as many sources and drains are located on the two inner wiring networks as on the inputs and outputs. In this case the Elmore delay is . In order to compare to , the following function is plotted in Fig. 12 :
In the function, the ratio of network capacitance to crossbar capacitance is and the ratio of source to switch resistance is . The ratio of Elmore delays is , where implies that the network is faster. Fig. 12 plots for and . Values of are obtained from path capacitance, as calculated earlier, and values of are dependent on specific switches and circuits driving the interleaver, though we would expect .
Fig. 12. Plot of , the ratio of Elmore delay for a P; Q network to the Elmore delay for a full crossbar.
The shaded region in Fig. 12 indicates where it is advantageous to use networks, since the delay through the interleaver is less than that through a crossbar. It is important to note that the region is greater than that contained by .
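The delay comparison can be reproduced numerically for an RC ladder. This sketch follows the capacitance split described in the text: half of the crossbar capacitance on each side of its single switch, and for the three-level network the first and last nodes carrying half as much as each of the two inner nodes. The names `alpha` (network-to-crossbar capacitance ratio) and `beta` (source-to-switch resistance ratio) are hypothetical labels, and equal switch resistance in both designs is an assumption.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay at the last node of an RC ladder.
    resistances[k] feeds node k; capacitances[k] sits at node k."""
    delay = 0.0
    upstream = 0.0
    for r, c in zip(resistances, capacitances):
        upstream += r              # total resistance from the source to this node
        delay += upstream * c
    return delay


def delay_ratio(alpha, beta):
    """Network delay divided by crossbar delay, with the switch resistance
    and the crossbar capacitance both normalized to 1."""
    crossbar = elmore_delay([beta, 1.0], [0.5, 0.5])
    network = elmore_delay(
        [beta, 1.0, 1.0, 1.0],
        [alpha / 6, alpha / 3, alpha / 3, alpha / 6],
    )
    return network / crossbar


if __name__ == "__main__":
    for beta in (0.0, 1.0, 10.0):
        print(beta, delay_ratio(0.3, beta))
```

With a zero source resistance the network only breaks even when its capacitance is one third of the crossbar's; any nonzero source resistance relaxes that requirement, because the source resistance charges the whole path capacitance in both designs.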
VII. CONCLUSION
This paper has explored five novel architectures for programmable analog interleavers.
The first is based on a single crossbar. The other four are based on networks of crossbars. The second consists entirely of butterfly switches; it only requires very simple programming logic, but each signal has to pass through a large number of transistors. The third method, valid for interleavers with a square number of inputs, requires signals to pass through three levels of crossbars, each crossbar having a number of inputs equal to the square root of the total number of inputs for the interleaver. A fourth method proposes to recursively decompose the smaller crossbars from the third method into even smaller crossbars; however, signals have to pass through a greater number of transistors.
The final method, and the one which seems the most promising in terms of several metrics including switch count and path capacitance, involves decomposing the interleaver size into two factors and , and building a three-level network of crossbars whose sizes are given by those factors. Significant decreases in total transistor count and in capacitance are achieved.
