Abstract-Circuits with feedback paths are significantly slower than comparable circuits without feedback. The feedback also implies data dependency, which rules out the usual parallel implementations and further exacerbates the throughput problem. This paper discusses a new high-throughput solution for systems with finite-level feedback values. As an example, we consider coding and signal processing systems for optical communications, which usually have very simple feedback. Our method uses architectural techniques and requires no detailed circuit tuning for high speedup. We demonstrate the method by realizing a 2 micron CMOS layout of a bimode 3B4B line coder. Simulation estimates that, using standard cell design, the chip achieves a coding rate of 1.4 Gb/s. Other design options are discussed.
[15] for classification of prior line codes.) Simple coding techniques like these are helpful because they can compensate for nonideal characteristics of optical communications systems. Other coding applications include binary signaling of ternary or quaternary alphabets, data access control, and general user-defined applications.
In view of throughput characteristics, general coding and signal processing systems can be categorized into either feedforward or feedback systems. In a feedforward system, the input propagates through the circuits to the output along a unidirectional path. In contrast, a feedback system contains closed loops within the circuits, i.e., a feedback path exists between the output and the input.
A feedback path places a more stringent throughput limit than a feedforward unidirectional path. For example, consider a unidirectional path from the input to the output. Depending upon the circuits' state, the propagation delay may vary from a maximum $\tau_{\max}$ to a minimum $\tau_{\min}$. According to Fig. 1(a), the circuit can operate correctly if the inputs are spaced at least $\Delta\tau = \tau_{\max} - \tau_{\min}$ apart [16]. Therefore, an upper bound on a feedforward system's throughput is $1/\Delta\tau$, which we call the latchless pipeline bound. In a feedback loop, the propagation delay still may vary from $\tau_{\max}$ to $\tau_{\min}$. The timing should always allocate $\tau_{\max}$ for aligning the feedback with the next input, as shown in Fig. 1(b). A feedback system's throughput is therefore bounded by $1/\tau_{\max}$.

The throughput discrepancy between feedforward and feedback systems becomes even larger in terms of available solutions. In feedforward systems, higher throughput is possible through parallelism. For example, since conventional block codes require no feedback, a coding rate higher than the $1/\Delta\tau$ bound is possible with interleaving, as shown in Fig. 2(a), or with serial-parallel-serial conversions, as shown in Fig. 2(b). The throughput rate of the parallel systems in Fig. 2 is ideally three times the individual rate of the block coders. However, the two well-known architectural techniques in Fig. 2 do not apply to feedback systems in general. As we will discuss in Section II, these parallel techniques lengthen the critical path and propagation delay of feedback systems, and thus counteract the advantages of parallelism.

A first-cut solution to improving the $1/\tau_{\max}$ bound is to start with the architectural techniques in Fig. 2. Unfortunately, replacing the block coders directly with FSM coders in Fig. 2(a) does not generate correct coding functionality. To correct this, we must route the previous FSM's output to the next FSM's input, as illustrated in Fig. 3(a). The input again should synchronize with the previous output.
For example, as the input stream switches from the top FSM coder to the second FSM coder in Fig. 3(a), the second FSM coder needs to wait for the top FSM's output to code the input. This implies the input rate is still $1/\tau_{\max}$, i.e., no throughput improvement at all. Similarly, modification of serial-parallel-serial conversions as in Fig. 3(b) does not improve the throughput.
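As a rough numerical illustration of the two bounds discussed above, the following Python sketch computes the latchless pipeline bound (one over the spread between maximum and minimum delay), the feedback bound (one over the maximum delay), and the ideal rate of a three-way interleaved feedforward coder as in Fig. 2. The function names and the delay values are assumptions chosen only for illustration; they are not taken from the figures.

```python
def latchless_pipeline_bound(tau_max_ns, tau_min_ns):
    """Feedforward bound 1/(tau_max - tau_min), returned in MHz."""
    return 1e3 / (tau_max_ns - tau_min_ns)


def feedback_bound(tau_max_ns):
    """Feedback bound 1/tau_max, returned in MHz."""
    return 1e3 / tau_max_ns


def interleaved_rate(unit_rate_mhz, ways):
    """Ideal rate of `ways` interleaved feedforward units (no feedback)."""
    return unit_rate_mhz * ways


# Hypothetical delays in nanoseconds, not measured values.
TAU_MAX_NS, TAU_MIN_NS = 12.5, 7.5

ff_mhz = latchless_pipeline_bound(TAU_MAX_NS, TAU_MIN_NS)
fb_mhz = feedback_bound(TAU_MAX_NS)
print(ff_mhz, fb_mhz, interleaved_rate(ff_mhz, 3))
```

With these assumed delays the feedforward path tolerates a much higher input rate than the loop, and only the feedforward rate scales with the number of interleaved units.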
To overcome this functionality-throughput impasse, we can expand the FSM to one that processes multiple inputs simultaneously (see Table I). The increase in complexity inevitably lengthens $\tau_{\max}$, and therefore counteracts the throughput improvement. In practice, two types of implementations with reasonable $\tau_{\max}$ exist. The first type is to construct circuits directly from the new state transition diagram. The additional input wires and logic induce higher complexity and heavier circuit loading. As a result, $\tau_{\max}$ increases, but not in proportion to the expansion factor. The second is to tabulate all combinations of K inputs in a PLA or memory. Similarly, $\tau_{\max}$ increases due to additional loading. Empirical results show that for an expansion factor K, the increase in $\tau_{\max}$ usually tracks between $O(\log K)$ and $O(K)$ asymptotically. Circuit sizing can help reduce increases in $\tau_{\max}$.¹
The exponentially growing complexity and the increasing $\tau_{\max}$ are two limiting factors to the expansion method.
As a well-known architectural technique, the method has been successful for small throughput improvements in relatively simple FSM's. But in situations where the required speedup K is high or the alphabet size is large, we need a method with a complexity proportional to the speedup, rather than exponential, and with a $\tau_{\max}$ insensitive to the increase in complexity. We will present such a method in the next section.
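The expansion method can be sketched in Python as follows. The two-state, binary-input FSM tables below are a hypothetical toy coder (not the coder of Fig. 4), and `expand` is an illustrative name; the sketch only shows how tabulating all combinations of K inputs makes the table grow exponentially with K.

```python
from itertools import product

# Hypothetical toy FSM: NEXT[state][x] and OUT[state][x] for binary x.
NEXT = ((0, 1), (1, 0))
OUT = ((0, 1), (1, 0))


def expand(next_tbl, out_tbl, k):
    """Tabulate an FSM that consumes k inputs per step."""
    n_states = len(next_tbl)
    alphabet = range(len(next_tbl[0]))
    table = {}
    for s0 in range(n_states):
        for xs in product(alphabet, repeat=k):
            s, ys = s0, []
            for x in xs:
                ys.append(out_tbl[s][x])
                s = next_tbl[s][x]
            # Key: (initial state, k inputs); value: (ending state, k outputs).
            table[(s0, xs)] = (s, tuple(ys))
    return table


for k in (1, 2, 3, 4):
    # Number of table entries is n_states * alphabet_size ** k.
    print(k, len(expand(NEXT, OUT, k)))
```

Each additional input per step multiplies the number of table entries by the alphabet size, which is the exponential growth the text refers to.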
III. THE POST-SELECTION METHOD

A. The Algorithm
Consider the operation of an FSM. As time evolves, the state transition sequence follows a path in a full tree, as illustrated in Fig. 6(a). For example, if the first three inputs to the FSM coder in Fig. 4 are (0, 1, 0), the state transitions correspond to the darkened path in Fig. 6. Since the tree soon gets intractable after a few inputs, we prefer describing our method with trellises like Fig. 6(b). The trellis in Fig. 6(b) results from merging identical nodes at the same tree level within Fig. 6(a). Alternatively, we can generate the trellis by expanding the state transition diagram in Fig. 4 against the time index. The state transition sequence again maps to a path in the trellis, while each transition corresponds to extending the path by one arc from the current node to a node at the next stage.
The throughput limitation comes from data dependency between path extensions. That is, in order to extend the path, we need to know the current node position in the trellis, which in turn depends on the previous path extension. If data dependency in path extensions can be relaxed from between adjacent trellis stages to between K stages, different parts of the trellis can be traversed in parallel. This way we can achieve a throughput higher than the $1/\tau_{\max}$ bound.
More specifically, consider breaking the trellis into blocks of length K, as shown in Fig. 7. Data dependency between processing different blocks can be relaxed by computing for all possible initial node positions for each block. Because path traversing always starts from every possible initial node position, traversing results from the previous block are not necessary. The parallelism is twofold: among different initial node positions and among different blocks. Notice that the number of parallel blocks is unlimited in principle, which translates to a large speedup.

The algorithm starts with processing blocks of inputs in parallel, and generates for each block N path segments, one for each initial node, as illustrated in Fig. 8(a). To determine the correct path, we need to link the path segments sequentially through the blocks. The linking only concerns the initial node and the ending node of each path segment, not the detailed information within. Therefore, as shown in Fig. 8(b), each path segment can be regarded as an arc between its initial node and ending node during linking. This is exactly the same formulation as the original path traversing, but now extending the path by one arc corresponds to processing one block of inputs. Because the processing in Fig. 8(b) is the only part that contains data dependency, the throughput bound has been relaxed from $1/\tau_{\max}$ to $K/\tau_{\max}$, where K is the block length.

¹Without sizing, $\tau_{\max}$ increases rapidly after K grows beyond a threshold, which is when the expanded PLA overloads the precharge circuits.
Fig. 8. (a) ... processing is to find path segments for each block in parallel. (b) ...

We can verify that the above algorithm indeed achieves a K times throughput improvement. The throughput of block parallel processing is bounded by $1/\Delta\tau$, and can be further improved as in Fig. 2 if $1/\Delta\tau < K/\tau_{\max}$. The throughput of linking is again bounded by $1/\tau_{\max}$. Because the formulation in Fig. 8(b) is the same as the original FSM, the $\tau_{\max}$ is also the same as the original.
As each block contains results for K inputs, the throughput rate is K times better than the original.
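The two steps of the algorithm can be sketched in Python as below. The FSM tables are a hypothetical two-state toy coder and the names `run_fsm` and `post_select` are illustrative assumptions; in hardware the first step would run in parallel, whereas this sketch only mimics the data flow sequentially to show that linking the path segments reproduces the original FSM's result.

```python
# Hypothetical toy FSM: NEXT[state][x] and OUT[state][x] for binary x.
NEXT = ((0, 1), (1, 0))
OUT = ((0, 1), (1, 0))


def run_fsm(state, inputs):
    """Reference: process inputs one at a time (the feedback-limited path)."""
    outputs = []
    for x in inputs:
        outputs.append(OUT[state][x])
        state = NEXT[state][x]
    return state, outputs


def post_select(state, inputs, k):
    """Block processing for every initial state, then sequential linking."""
    blocks = [inputs[i:i + k] for i in range(0, len(inputs), k)]
    # Step 1 (no dependency between blocks): one path segment per block
    # and per possible block initial state.
    segments = [[run_fsm(s0, blk) for s0 in range(len(NEXT))]
                for blk in blocks]
    # Step 2 (the only data-dependent part): pick the segment whose
    # initial state equals the previous block's ending state.
    outputs = []
    for seg in segments:
        state, ys = seg[state]
        outputs.extend(ys)
    return state, outputs


data = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
print(post_select(0, data, 3) == run_fsm(0, data))
```

Step 1 performs N FSM operations per input and step 2 performs one selection per block of K inputs, which matches the (N + 1/K) operation count discussed next.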
The complexity of the algorithm depends on the implementation, which in turn depends on its architecture and level of hardware sharing. In terms of the amount of computation, the algorithm requires $(N + 1/K)$ FSM operations per input for an N-state FSM with block length K.
Interestingly, the overhead ratio $(N + 1/K)$ is insensitive to the speedup K, i.e., near $O(1)$. In some cases, the overhead can be reduced to (log
In the following, we show an architecture that fully exploits the efficiency of the algorithm.
B. The Architecture
As parallelism exists among processing of different blocks and different block initial nodes (states), the architecture can take many forms. Here we present a parallel pipelined architecture suitable for coding systems in optical communications.
The post-selection architecture in Fig. 9 implements the algorithm in Fig. 8. Each processing unit array (PU array) in Fig. 9 is a linear array that processes a block of K inputs for a specific block initial state. Each PU array generates K outputs and a block ending state, which are sent to a selector. The selector uses the previous block ending state to determine which PU array's output should be selected. That is, if the previous block ending state is S0, then the selector chooses the output from the PU array with initial state S0. In terms of our algorithm, the selector implements the linking process, whereas the PU arrays implement parallel block processing. While a PU array computes only for a specific block initial state, different blocks may share the PU array through pipelining.
The PU array is a multilevel logic circuit that transforms K inputs into the corresponding K outputs and block ending state for the given initial state. Although the PU array can be synthesized directly from its I/O description, the complexity is usually too high due to combinatorial inputs. A better implementation is to cascade multiple FSM's together with input and output skew. Each FSM in the cascade takes the input of the corresponding stage and the state output from the FSM at the previous stage, and computes the new state value and the output. Notice that the longest feedforward path in the cascaded FSM is the state data path. However, because the cascade can be fully pipelined, the PU array will not become the throughput bottleneck. For example, each FSM can be decomposed into two-level AND-OR logic gates. Pipelining at the gate level is possible by inserting pipeline latches between the AND and OR gates. Although this is hardly necessary, it illustrates that the real throughput limit is in the selector in Fig. 9.
The throughput of the post-selection architecture is $K/\tau_{\mathrm{sel}}$, where $\tau_{\mathrm{sel}}$ is the (worst-case) propagation delay of the selector. The complexity is KN FSM's, plus pipeline latches and a selector. Therefore, we conclude the architecture indeed has a complexity proportional to the speedup, as opposed to the exponential complexity of the expansion method.
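To make the complexity contrast concrete, the sketch below compares the KN FSM count of the post-selection architecture with the relative size of an expanded FSM. The growth factor A^(K-1) for the expansion method is an assumption inferred from the theoretical values quoted in the design example of Section IV (A and A squared for expansion by 2 and 3); the function names are illustrative, and latches and the selector are ignored.

```python
def post_selection_fsm_count(n_states, k):
    """Number of original-size FSM's in N PU arrays of K cascaded stages."""
    return k * n_states


def expansion_relative_size(alphabet_param, k):
    """Assumed size of a K-expanded FSM relative to the original: A**(K-1)."""
    return alphabet_param ** (k - 1)


# N = 2 and A = 3 are the values used for the bimode 3B4B coder in the text.
N_STATES, A = 2, 3
for k in (2, 3, 7):
    print(k, post_selection_fsm_count(N_STATES, k),
          expansion_relative_size(A, k))
```

Under these assumptions the two approaches are comparable for small K, but at K = 7 the post-selection count is 14 original-size FSM's while the expanded FSM would be hundreds of times larger than the original.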
IV. DESIGN EXAMPLE
We use an example to demonstrate the method and its design options. Consider the 3B2T-RBS code, which uses the optimum 3B2T line code in Table II with relative biphase signaling (RBS) [1]. (The 3B2T stands for coding three binary inputs into two ternary outputs.) Relative biphase signaling transmits a ternary signal with two binary signals in a state-dependent way. While [1] suggests implementing RBS for the 3B2T code with a four-state FSM, a simpler two-state FSM in Table III supports the same RBS rules. Therefore, the 3B2T-RBS code is equivalent to the bimode 3B4B line code in Table IV.

When implemented with a two-phase standard cell PLA, our simulation at the switch RC level estimates that the bimode 3B4B coder in 2 micron CMOS ($\lambda$ = 1 $\mu$m) runs at about a 70 MHz clock, or 210 Mb/s. ($\tau_{\max}$ = 13 ns for an N = 2, A = 3 FSM in our previous notation.) A speedup of seven (K = 7) can improve the coding rate to a 1.4 Gb/s optical rate.
For comparison, we check the feasibility of the expansion method. When expanded by 2 and 3, the FSM becomes a bimode 6B8B coder and a bimode 9B12B coder, respectively. The bimode 6B8B coder and the bimode 9B12B coder are 3.1 and 12.1 times larger² than the bimode 3B4B coder, in fair agreement with the theoretical values $A = 3$ and $A^2 = 9$. Simulation shows that the clock rate drops to 50 MHz ($\tau_{\max}$ = 19 ns) for the bimode 6B8B coder and to 14 MHz ($\tau_{\max}$ = 68 ns) for the bimode 9B12B coder. The clock rate decreases because larger PLA's require longer time to precharge and evaluate, as we can see from the PHI1 (precharge) and PHI2 (evaluate) waveforms in Fig. 11(a)-(c). Thus, the bimode 6B8B coder achieves a 6 × 50M = 300 Mb/s rate, which means a speedup of 1.4 for expansion factor 2. The bimode 9B12B coder runs at 9 × 14M = 126 Mb/s, which means no speedup at all for expansion factor 3 and higher. Thus, the limitation to the expansion method is more than just the complexity overhead. Unless we have fast precharge and evaluation circuits to keep up with the size increase of the PLA, a speedup higher than 1.4 cannot be achieved with direct FSM synthesis in this technology.

²We compare the core size of the coders.
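The rates quoted for the expansion method follow from simple arithmetic, reproduced in the sketch below. The clock rates are the simulated values given in the text; the function and variable names are illustrative only.

```python
BITS_PER_WORD = 3  # the 3B4B coder takes three binary inputs per clock

# Expansion factor -> simulated clock rate in MHz, as quoted in the text.
CLOCK_MHZ = {1: 70, 2: 50, 3: 14}


def coding_rate_mbps(expansion, clock_mhz):
    """Input bits coded per second, in Mb/s."""
    return BITS_PER_WORD * expansion * clock_mhz


base = coding_rate_mbps(1, CLOCK_MHZ[1])
for k, clk in CLOCK_MHZ.items():
    rate = coding_rate_mbps(k, clk)
    # Prints expansion factor, rate in Mb/s, and speedup over the 3B4B coder.
    print(k, rate, round(rate / base, 2))
```

The speedup for expansion factor 3 comes out below one, which is the "no speedup at all" case described above.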
In contrast, our algorithm and post-selection architecture can achieve the seven times speedup easily without any FSM redesign. As the bimode 3B4B coder is a two-state FSM, the post-selection architecture needs two PU arrays, each of which has seven cascaded PLA's. The two PU arrays share the same input pipeline skew, as shown in the layout in Fig. 12. Our design uses Magic and Oct CAD tools with the LagerIV design manager [20], and requires no physical chip editing like sizing transistors or defining feedthrough cells. Simulation estimates that the PLA clock remains at two-phase 70 MHz (13 ns). The input and output pipeline latches are also clocked at 70 MHz. The reason why we can maintain the same clock rate is that the PLA's in cascade have the same loading effects as the original PLA. The selector on the left runs faster than the pipeline clock ($\tau_{\mathrm{sel}}$ < 10 ns < 13 ns).
Thus, the speedup is exactly seven times for a throughput of 7 × 3 × 70M ≈ 1.4 Gb/s. (In fact, because we obtain the speedup through architectural techniques, the chip should always be seven times faster than the actual circuit speed of a bimode 3B4B coder.) The overall size is only 2.5 × 2.5 mm, which explains why we did not use hand design to minimize the chip area.
We expect an area reduction of at least 30% with partial custom design.
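The resource count and throughput of this design can be checked with a few lines of Python. The numbers are those given in the text (two states, seven cascaded PLA's, three bits per word, 70 MHz, 13 ns clock period); the 10 ns selector delay is the upper bound quoted above rather than a measured value, and the function names are illustrative assumptions.

```python
def post_selection_design(n_states, k, bits_per_word, clock_mhz):
    """Summarize a post-selection design built from the original PLA."""
    return {
        "pu_arrays": n_states,          # one PU array per block initial state
        "cascaded_plas": n_states * k,  # K cascaded PLA's per PU array
        "rate_mbps": k * bits_per_word * clock_mhz,
    }


def selector_keeps_up(selector_delay_ns, clock_period_ns):
    """The selector must settle within one pipeline clock period."""
    return selector_delay_ns < clock_period_ns


design = post_selection_design(n_states=2, k=7, bits_per_word=3, clock_mhz=70)
print(design)
print(selector_keeps_up(10.0, 13.0))
```

The computed rate is 1470 Mb/s, which the text rounds to 1.4 Gb/s.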
Alternative implementations with different clock rates are possible. To run at a faster clock rate, each FSM coder in the PU array can use two-level AND-OR gates with pipelining transmission gates between them. Clock rates at about 100 MHz are possible, but care must be taken in adjusting clock skew and compacting the pipelined AND-OR gate cells. Although this design is more demanding, the improvement in clock speed means fewer cascaded stages in the PU arrays, less routing area, smaller multiplexer bit width, and fewer I/O pads.
Implementations at clock rates slower than the original coder's clock rate also exist. This approach is a combination of the post-selection method and the expansion method. For example, if we first expand the bimode 3B4B coder by two, the result is a bimode 6B8B PLA coder at 50 MHz. Each PU array in the post-selection architecture can cascade five bimode 6B8B coders in a pipeline for a throughput rate higher than 1.4 Gb/s. Although the number of cascaded coders is fewer, each coder is larger than the original. The low clock rate design is helpful if the original FSM coder is relatively simple. In this case, the clock rate slowdown and the PLA expansion are not serious. However, if the clock rate is too slow, a separate clock and interface are probably necessary for chip I/O.
Summarizing, high speedup implementations exist with clock rates higher, lower, and the same as the original. The choice depends on applications. Generally speaking, if the original FSM is complicated it is advantageous to implement at higher clock rates. The dual argument is also true: a low-clock-rate implementation is suitable for simple FSM's. Otherwise, the post-selection architecture should cascade the original FSM's directly.
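The three clock-rate options can be compared by asking how many cascaded coders each PU array needs to reach a target rate. The sketch below is an illustration under stated assumptions: the 1400 Mb/s target and the use of a ceiling are choices made here, the 100 MHz figure is the approximate rate mentioned for the pipelined AND-OR option, and the names are hypothetical.

```python
import math


def stages_needed(target_mbps, bits_per_coder, clock_mhz):
    """Smallest cascade length whose rate meets or exceeds the target."""
    return math.ceil(target_mbps / (bits_per_coder * clock_mhz))


# Option name -> (input bits per coder, clock rate in MHz).
OPTIONS = {
    "same clock, 3B4B PLA": (3, 70),
    "faster clock, pipelined AND-OR 3B4B": (3, 100),
    "slower clock, expanded 6B8B PLA": (6, 50),
}

TARGET_MBPS = 1400
for name, (bits, clk) in OPTIONS.items():
    k = stages_needed(TARGET_MBPS, bits, clk)
    # Prints the option, the cascade length, and the resulting rate in Mb/s.
    print(name, k, k * bits * clk)
```

The cascade lengths for the same-clock and slower-clock options agree with the seven PLA's and five 6B8B coders mentioned in the text; the value for the faster-clock option is only an estimate from the approximate 100 MHz figure.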
The previous discussion also applies to composite feedback systems. If a system contains multiple steps of FSM coding or multiple feedback loops, the design procedure is to apply our method to the throughput-limiting loop recursively. Alternatively, different loops can be combined as a big FSM or cascaded multilevel logic before the post-selection method is applied. This often leads to a better utilization of chip area and I/O.
V. A SOLUTION FOR COMPLEX FEEDBACK SYSTEMS
Most feedback coding systems in optical communications have relatively few feedback states. However, in general, the number of states can be large, which renders the post-selection method ineffective. This suggests the following alternative algorithm.
Consider again the trellis blocks in Fig. 7. Instead of processing multiple path segments first and finding the right segment later, we change our strategy to finding the correct block initial state first before processing the block. That is, similar to the idea in
1) Precompute, for each block, the dependency between the current and the next block initial state from the K inputs in the block.
2) Compute recursively the block initial states using the result in step 1.
3) Process each block from its initial state.
Note that both steps 1 and 3 are feedforward computations, meaning that higher throughput is (always) possible.
Step 2 requires a feedback of the previous block initial state, and thus is the throughput-limiting step. The architecture in Fig. 13 embodies the algorithm. Algorithm steps 1, 2, and 3 map to the precomputation, the modified FSM, and the PU array in Fig. 13, respectively.
In step 1, the dependency between the current and the next block initial state is a function of the K inputs in the current block. As each block initial state has at most N possible next block initial states, the dependency can be described with an N-level indicator (with log(N) bits). The precomputation circuits can be synthesized directly from their logic function, such as using multilevel AND-OR gates [19].
Step 2 itself is a new FSM whose input is the N-level indicator from the precomputation and whose state is the block initial state of the original FSM. Thus, the modified FSM in Fig. 13 has the same N states as the original FSM, but the input alphabet size is extended from A to N. Using the block initial state generated by the modified FSM, the PU array completes the block processing in step 3 using a cascade structure as before.
Because the precomputation and the PU array can be fully pipelined, the throughput of the architecture is $K/\tau_{\max}^{*}$, where $\tau_{\max}^{*}$ is the propagation delay of the modified FSM. Since the modified FSM is more complex than the original, the throughput improvement is less than K, or $K\tau_{\max}/\tau_{\max}^{*}$ to be exact, where $\tau_{\max}$ is the propagation delay of the original FSM.
The complexity of the precomputation and the modified FSM depends on applications. A more sophisticated complexity analysis and architecture details can be found in [19], wherein the resemblance and the mathematical relation between the FSM architecture in Fig. 13 and the well-known block-state filter structure are also discussed.
VI. CONCLUSION
We have shown architectural techniques for small-state feedback circuits that significantly improve the throughput without requiring circuit design efforts or advanced technologies. The method is flexible in terms of achievable implementations and speedups.
For higher speedup and more complex feedback systems, our methods outperform the conventional expansion method in terms of speed and die area.
