ABSTRACT: Low-power error correction is required for third-generation digital wireless devices. Adaptive reduced state sequence detection (A-RSSD) modifies a Viterbi decoder to use far less computational effort than is typical. RSSD neglects the oldest p bits of the encoder's state machine, treating the code as if it were of length K' = K-p. Through successive reduction of p, decoding can proceed with more effort until a frame is correctly decoded. This paper describes the only known VLSI implementation of A-RSSD. The presented architecture is an adaptive strength, state-parallel, bit-serial structure. It features soft-decision, continuous stream traceback decoding, with K ranging from 3 to 11. As such it employs between 4 and 1024 ACS units. The Branch Metric Computer and ACS units are mostly conventional, while special consideration must be given to branch label generation, sub-state estimation, and ACS interconnection structure. Other low-power techniques are also applied, specifically with respect to clock gating and traceback RAM structure. Design tradeoffs are discussed, and performance estimates are presented.
I. INTRODUCTION
An introduction to convolutional coding and the conventional Viterbi algorithm is offered in section II. Reduced state sequence detection and its adaptive counterpart are introduced in sections III and IV, followed by a short look at Viterbi implementation structures in section V. Specific VLSI considerations for RSSD and A-RSSD comprise sections VI and VII, concluded by initial synthesis and simulation results.
II. CONVOLUTIONAL CODING
Forward error correcting codes improve communication system performance in the presence of interference and/or noise. Through coding, systems can operate in harsh environments and signal re-transmission is often prevented. Convolutional codes intentionally introduce controlled interference based on previously transmitted data. This initial hashing of information provides greater resilience to noise, spreading the information over multiple bit periods. The challenge of the decoder is to determine the original data from the noisy sequence of channel symbols. Cellular mobile channels are one of the particularly important applications currently using convolutional codes.
A convolutional encoder, figure 1, uses shift registers and XOR operations to generate the channel symbols. The rate (R) of a code is the ratio of data bits (k) to channel bits (n). In cellular systems, typical rates are 1/2, 1/3, or 1/4. For spectral efficiency there is interest in higher rate codes, 2/3 for example; however, the performance of such codes does not normally warrant the increased receiver complexity. The constraint length (K) refers to the number of input symbols that have an effect on the transmitted channel symbol. For example, a code with a constraint length of K=11 will transmit n channel bits based on the current symbol and 10 previous symbols, or (1+10)*k bits. A transmitter memory of length m=K-1 and depth k is required. The n channel bits are dependent on only certain bits of the transmitter's memory, expressed in the form of n generator polynomials. Each K-bit generator defines the bits to be XORed, forming the corresponding channel symbol. The polynomials are selected to maximize the free distance between any two codewords.
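The shift-register and XOR structure described above can be sketched in a few lines. This is a minimal illustration only, assuming a rate 1/n code (k = 1), a state register that keeps the newest bit in its most-significant position, and a toy K = 3 code with generators 7 and 5 (octal); the function name `conv_encode` is hypothetical and not taken from the paper.

```python
def conv_encode(bits, generators, K):
    """Rate 1/n convolutional encoder sketch (k = 1).

    bits       -- iterable of 0/1 data bits
    generators -- n integers, each a K-bit generator polynomial whose
                  most-significant bit taps the current input bit
    K          -- constraint length
    """
    m = K - 1                  # transmitter memory length
    state = 0                  # m previous bits, newest in the MSB
    out = []
    for b in bits:
        window = (b << m) | state       # current bit followed by the memory
        for g in generators:
            # AND selects the tapped bits; the parity of the result is their XOR
            out.append(bin(window & g).count("1") & 1)
        state = window >> 1             # shift: the oldest bit is dropped
    return out


# Assumed toy code for illustration: K = 3, generators (7, 5) octal
print(conv_encode([1, 0, 1, 1], [0o7, 0o5], K=3))  # [1, 1, 1, 0, 0, 0, 0, 1]
```

Each data bit produces n = 2 channel bits here, so the sketch has rate 1/2.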
As the constraint length of the code increases (corresponding to
A. Decoding Convolutional Signals
To determine the original data stream, the Viterbi algorithm performs Maximum Likelihood Sequence Estimation (MLSE) on the received signals. The algorithm traces a number of paths (corresponding to different data sequences) through a trellis, calculating their probabilities and comparing them against each other. The more likely contender paths are extended for further analysis in the next symbol period. The algorithm is normally divided into stages. Branch metric calculation determines the probability of particular transitions in the trellis. Add-Compare-Select (ACS) processes are then used to add the incumbent path's probability to the current transition probability, compare to find the best contender into each state, and extend the winning path for the next cycle.
Branch Metric Generation
The transmitter is modeled as a state machine with 2^m states. For each possible state of the transmitter memory, the decoder uses the generator polynomials to hypothesize the n output symbols produced for each of the 2^k possible inputs. Each of the input options is assigned a branch, representing a state transition from the current state to the one the encoder would assume if the hypothesized symbol proves accurate. Each of these branches is logically labelled with the n expected channel symbols produced for such a transition (figure 2).
Figure 2: Branch Label Generation. For each transmitter memory state at time t, the branch labels are determined for each possible transition, caused by the input, to a memory state at time t+1.
For each symbol period, a comparison is made between the received symbols and the labels of each branch. A branch's probability is credited or penalized based on how closely the hypothesized symbols match those from the channel. (Collins, in [1], has shown that it serves to either credit or penalize a branch, but there is no gain in doing both.) The penalty assigned to each branch is referred to as its branch metric, and hence transitions with lower metrics are considered more probable.
Hard or soft decision schemes are used in computing the branch metrics. With hard decisions, the incoming samples are quantized as a logical 0 or 1. The assigned penalty is then the Hamming distance between the received channel symbol and the hypothesized label. Soft decisions quantize the input signal to multiple bits, representing how strong each sample is. For each bit where the branch label disagrees with the sign of the quantized signal, the corresponding magnitude is added to the branch penalty. In this way, a strong disagreement between expected and received symbols penalizes a branch more than an input that only slightly disagrees. Figure 3 shows an example of soft decision metric calculations. Overall, soft decisions perform approximately 3dB better than hard decisions.
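The two metric schemes can be illustrated with a short sketch. It assumes penalty-only metrics (lower is better), counts disagreeing bits for the hard-decision case, and takes a signed-sample convention in which a negative sample means logical 0 and a non-negative one logical 1. Both the convention and the function names are assumptions made for illustration.

```python
def hard_metric(received_bits, label_bits):
    """Number of positions where the hard-quantized symbols and the label differ."""
    return sum(r != l for r, l in zip(received_bits, label_bits))


def soft_metric(samples, label_bits):
    """Penalty-only soft-decision branch metric.

    samples -- signed quantized values; assumed convention: negative means
               logical 0, non-negative means logical 1
    """
    penalty = 0
    for s, l in zip(samples, label_bits):
        decided = 1 if s >= 0 else 0
        if decided != l:
            penalty += abs(s)      # a strong disagreement costs more
    return penalty


# A strong '1' (+3) and a weak '0' (-1) compared against two labels
print(soft_metric([3, -1], [0, 0]))   # 3: the label disagrees with the strong sample
print(soft_metric([3, -1], [1, 1]))   # 1: the label disagrees with the weak sample
```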
Add-Compare-Select
A path metric refers to the accumulated branch metrics of a series of transitions over time. The ACS unit at each receiving node adds each of the incoming branch metrics to its incumbent path metric, resulting in 2^k newly contending path metrics. The ACS selects the lowest path metric for extension as the winner of that stage, eliminating the others. It must also save an indication of which path won the comparison at each time step.
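A single ACS operation can be sketched as follows. The representation of the contenders as (path metric, branch metric) pairs and the name `acs` are illustrative assumptions; ties are resolved here in favour of the first contender, which the text does not specify.

```python
def acs(candidates):
    """Add-compare-select for one receiving node.

    candidates -- list of (path_metric, branch_metric) pairs, one per
                  incoming branch (2**k of them)
    Returns (winning_metric, index_of_winning_branch).
    """
    sums = [pm + bm for pm, bm in candidates]              # add
    winner = min(range(len(sums)), key=sums.__getitem__)   # compare
    return sums[winner], winner                            # select


metric, decision = acs([(5, 2), (3, 3)])
print(metric, decision)   # 6 1
```

The returned index is the decision that would be written to traceback memory for this node and time step.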
Traceback Operation
After an input frame is processed, the node with the lowest accumulated path metric is the winner. This node corresponds to the last state of the transmitter. Its most-significant bit(s) are then the last k data bits of the frame. Having stored the winning branch into each state for all t, it is possible to re-trace the transitions back through the trellis. The traceback starts at the winning state, determining the source behind each transition. In each step the most significant k bits of the source are the decoded data. The traceback memory contains the winners of each node and is of size 2^m*k*fn bits. For large frames (fn>15K) there is a memory-efficient technique referred to as stream decoding. Once 5*K symbols have been processed, it is assumed that all contending paths have converged somewhere in the trellis. At t=10*K traceback begins from an arbitrary state. Once it reaches back to input symbol 5K, the traceback produces valid data through to the front of the queue. This method produces 5K data symbols with 10K traceback operations. The decode can be pipelined, however, using one traceback unit from 10K down to 5K while another traces from 5K down to the initial symbol, providing uninterrupted, though reversed, data.
III. REDUCED STATE SEQUENCE DETECTION
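The frame-by-frame procedure (branch metrics, ACS, stored decisions, then traceback from the lowest-metric node) can be put together in one small sketch. It is not the architecture of this paper: it assumes hard decisions, a rate 1/n code with k = 1, an encoder that starts in state 0, and a state register with the newest bit in the MSB. The name `viterbi_decode` and the toy K = 3 code in the usage lines are hypothetical.

```python
def viterbi_decode(symbols, generators, K):
    """Hard-decision, frame-by-frame Viterbi decoder sketch (rate 1/n, k = 1).

    symbols -- flat list of received 0/1 channel bits, n per data bit
    Assumes the encoder starts in state 0 and keeps the newest bit in the MSB.
    """
    n, m = len(generators), K - 1
    n_states, mask = 1 << m, (1 << m) - 1
    INF = float("inf")
    metrics = [0] + [INF] * (n_states - 1)
    decisions = []                                # traceback memory
    for t in range(0, len(symbols), n):
        rx = symbols[t:t + n]
        new_metrics, winners = [], []
        for state in range(n_states):             # receiving node
            best, best_lsb = INF, 0
            for lsb in (0, 1):                    # the two incoming branches
                window = (state << 1) | lsb       # K bits: input bit + source state
                source = window & mask
                label = [bin(window & g).count("1") & 1 for g in generators]
                bm = sum(r != l for r, l in zip(rx, label))
                if metrics[source] + bm < best:   # add, compare
                    best, best_lsb = metrics[source] + bm, lsb
            new_metrics.append(best)              # select
            winners.append(best_lsb)
        metrics = new_metrics
        decisions.append(winners)
    # Traceback from the node with the lowest accumulated path metric
    state = min(range(n_states), key=metrics.__getitem__)
    decoded = []
    for winners in reversed(decisions):
        decoded.append(state >> (m - 1))          # the MSB is the data bit
        state = ((state << 1) | winners[state]) & mask
    return decoded[::-1]


# Codeword of [1, 0, 1, 1, 0, 0] under the assumed (7, 5) K = 3 code
codeword = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1]
print(viterbi_decode(codeword, [0o7, 0o5], K=3))   # [1, 0, 1, 1, 0, 0]
```

Only one decision bit per state and time step is stored, which is what makes the 2^m*k*fn-bit traceback memory sufficient.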
Each ACS normally takes 2^k input metrics, adds them to their branch metrics, and outputs the winning result to multiple ACSs for the next stage. Anderson's [2] reduced state sequence detection (RSSD) involves grouping multiple states into classes according to free distance constraints. This grouping reduces the implementation complexity since there is now only one unique output metric per class. As a performance metric, Anderson was concerned with achieving optimal error correction while minimizing decoder complexity. For good codes (in terms of free distance), the classes trivially degenerate to include only one state. This meant that it was impossible to achieve the same error correcting power with fewer states. Matolak and Wilson [3] revived RSSD with new motivation.
They simulated various complexity decoders applied to a universal, high-constraint length code. Fundamental to their notion is that a decoder with only 2^(m-p) states can decode a symbol sequence typically requiring 2^m states. To perform this reduced complexity decoding, they revert to Anderson's RSSD, but form classes based on simplified criteria. They group states with their (m-p) most-significant bits in common, equivalent to neglecting the p oldest bits in the transmitter's memory. Their work can be generalized to higher rate codes by neglecting the p oldest symbols. This class grouping results in 2^(m-p) 'super-states', each containing 2^p sub-states. An example of this grouping for a native K4 code with RSSD-p, p=1, is shown in figure 4.
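The grouping rule is simple enough to express directly. The sketch below assumes states are stored as m-bit integers with the newest bit in the MSB, so the p oldest bits are the p least-significant ones; the function names are illustrative.

```python
def super_state(state, p):
    """Class of a full m-bit state: keep the (m-p) most-significant bits."""
    return state >> p


def sub_state(state, p):
    """The p neglected (oldest) bits of the state."""
    return state & ((1 << p) - 1)


# Native K = 4 code (m = 3) decoded as RSSD-1
m, p = 3, 1
classes = {}
for s in range(1 << m):
    classes.setdefault(super_state(s, p), []).append(s)
print(classes)   # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```

For m = 3 and p = 1 this gives 2^(m-p) = 4 super-states of 2^p = 2 sub-states each.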
IV. ADAPTIVE RSSD
Simmons, in [4] , proposes combining various strength RSSD decoders with CRC error checking to significantly reduce the power requirements of a Viterbi decoder. The adaptive decoder estimates the received frame with very few states (large p) and, if unsuccessful, makes another attempt with more states.
The algorithm was coded and simulation results, in terms of frame error rate, are plotted. The results were generated for a rate 1/2, universal K11 code with generators (3346, 2751) in octal, frame sizes of 1024 bits, under a Rayleigh fading channel. Using a simple scheme in which the effort is doubled after each unsuccessful attempt, the power used is:
P_tot = P_4-st + PE_4-st * P_8-st + PE_8-st * P_16-st + ...
where P_x-st represents the power consumed for an x-state decode, and PE_x-st is the probability of error for an x-state decode. As an example of the power savings potential, at SINR = 8dB:
P_tot = P4 + 0.16*2 P4 + 0.012*4 P4 + 0.005*8 P4 + ... = 1.42 P4
P4 is the power consumed for a 4-state RSSD decode, and the power consumption is linearly dependent on the number of states used in the decode. In current cellular communication schemes a K9 (256-state) decoder is in constant use. Assuming a structure similar to the RSSD decoder and correcting for an approximate RSSD overhead of 40%, a standard scheme uses 38.25 P4. Thus a power reduction factor of 27x is realized using the adaptive technique. As the number of users on a cellular channel increases and data becomes more prominent, higher constraint lengths are expected. A similar analysis shows that the adaptive technique reduces the power consumption of a K11 decoder by a factor of 108x.
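The expected-power sum can be checked numerically. The sketch below evaluates only the three error probabilities quoted for 8 dB, so it stops short of the full series; the function name and the list-based interface are assumptions for illustration.

```python
def adaptive_power(error_probs):
    """Expected power, in units of P4, under the effort-doubling scheme.

    error_probs -- frame error probabilities of the 4-, 8-, 16-, ... state
                   attempts; each further attempt costs twice the previous
                   one and runs only if the previous attempt failed.
    """
    total, cost = 1.0, 1.0           # the 4-state attempt always runs
    for pe in error_probs:
        cost *= 2                    # next attempt uses twice the states
        total += pe * cost
    return total


listed = adaptive_power([0.16, 0.012, 0.005])   # only the terms quoted in the text
print(round(listed, 3))                         # 1.408
print(round(38.25 / 1.42))                      # 27
```

The three quoted terms already give about 1.41 P4; the remaining, smaller terms of the series account for the difference from the 1.42 P4 stated above.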
V. VITERBI IMPLEMENTATION
Viterbi implementation decisions depend on the constraint length, code rate, bit rate, frame size, area, and power requirements of the target system.
The number of ACS operations in a standard Viterbi decoder is 2^(k(K-1)), each operation with complexity proportional to 2^k. If the total number of operations can be performed within a symbol period, a serial decoder implementation using only one ACS is adequate. However, as the constraint length, code rate, or data rate increase, more parallel versions must be considered.
The main advantage of the fully state-parallel implementation is that the supported data-rate is independent of the constraint length of the decoder, exchanging die area for decode speed. The state-parallel, bit-serial approach offers the advantage of significantly reduced area over the bit-parallel approach, with only marginal decrease in decode speed as the critical timing path is typically much shorter. Example 1 contains a typical decode scenario.
With respect to power consumption, each approach will have similar dissipation for a given data rate. The frame size will normally determine which traceback scheme is employed. Small frame sizes use frame-by-frame decoding. Large frames, or no frames at all, lend themselves to a streamed traceback approach to reduce memory and area requirements.
Example 1: State-serial vs. state-parallel (bit-parallel, bit-serial)

VI. RSSD IMPLEMENTATION
As in standard Viterbi, for each symbol that enters the decoder, contender paths are evaluated and penalized according to their branch labels. To determine the branch labels under RSSD, the decoder must somehow estimate the p-bit sub-state of the transmitter. The branch labels in the decoder are no longer static as in all previous implementations, but are selected or generated on-the-fly based on the (m-p) reduced description of the super-state and the estimated p bits describing the sub-state.
As in Collins, the configuration, then, of a generic rate 1/n butterfly with p-bit reduction is shown in figure 6.
The n-bit labels entering the MSB0 node of the butterfly are denoted lbl_lsb0 and lbl_lsb1, corresponding to the contributing source super-state. In the typical Viterbi algorithm these labels are constant and based on the state and input bit(s) causing the transition. In the RSSD case, however, they must be determined from the super-state IDs (x,0) and (x,1) and the estimated sub-states (y) and (z), by applying each generator to the concatenated binary string. If a rate 1/n code is used with leading 1's in the generators, the branch labels into the MSB1 node of the butterfly (corresponding to an input data bit of 1) are the complement of those into the MSB0 node and thus don't require significant consideration. To dynamically generate the branch labels, the K-bit strings (0,x,0,y) and (0,x,1,z) can be passed from the source node to the butterfly and ANDed with the n generator polynomials, XORing the results to give the two n-bit branch labels. It is noted, however, that (0,x) is common to both nodes in a butterfly, and thus a portion of the n-bit labels can be precomputed as gen_(m..p+1)(0,x) and stored in the receiving butterfly. These savings allow two optimizations: the source nodes no longer need to maintain information regarding their own super-state IDs (x,0/1), and fewer computations need to be done on the fly. The labels are now calculated as:
lbl_lsb0 = precomp(0,x) XOR gen_(p..0)(0,y)
lbl_lsb1 = precomp(0,x) XOR gen_(p..0)(1,z)
The XOR with precomp(0,x) then completes the branch computation in preparation for the standard ACS cycles.
An alternative to the presented generation scheme uses either ROM cells or register banks to address the proper branch labels, considering the estimated p-bit sub-states y and z. This presents power savings, as the branch labels are precomputed and the generator lines would not need constant switching. This approach would be reasonable only for small values of p, however, since n*2^p memory cells are required per butterfly.
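Because label generation is an AND followed by an XOR reduction, it is linear over the bits of the string, so the contribution of (0,x) can be separated from that of the low p+1 bits. The following sketch checks that split against the direct computation. The bit layout (input bit, x, source LSB, sub-state, from MSB to LSB), the function names, and the use of the K11 generators with p = 8 are illustrative assumptions.

```python
def parity(v):
    """XOR of all bits of v."""
    return bin(v).count("1") & 1


def full_label(gens, string_k):
    """Label computed from the complete K-bit string."""
    return [parity(g & string_k) for g in gens]


def split_label(gens, p, x, lsb, sub):
    """Same label from a static precomputed part and an on-the-fly part.

    x   -- the (m-p-1) bits shared by both source super-states
    lsb -- 0 or 1, selecting source super-state (x,0) or (x,1)
    sub -- estimated p-bit sub-state (y or z)
    """
    low_bits = p + 1
    low_mask = (1 << low_bits) - 1
    precomp = [parity((g >> low_bits) & x) for g in gens]            # from (0,x)
    dynamic = [parity(g & low_mask & ((lsb << p) | sub)) for g in gens]
    return [a ^ b for a, b in zip(precomp, dynamic)]


gens, p = (0o3346, 0o2751), 8            # K = 11, so x is a single bit here
x, lsb, sub = 1, 0, 0b10110001
string_k = (x << (p + 1)) | (lsb << p) | sub
print(full_label(gens, string_k) == split_label(gens, p, x, lsb, sub))   # True
```

Only the `dynamic` part depends on the estimated sub-state, which is why the `precomp` part can be held in the receiving butterfly.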
A significant amount of area in a parallel Viterbi implementation is consumed by the wiring between ACS units. This is the motivation for co-locating the two nodes with different MSB into a butterfly (as in figure 6) such that only one wire must run from each source node to carry the path metric. This wiring density is one of the primary reasons to shy away from bit-parallel Viterbi implementations. To reduce wiring, rather than add separate connections for the y and z lines used in label generation, it is reasonable to multiplex them onto the LSB0 and LSB1 feeds normally used for the path metric bits. No loss is encountered with respect to cycle count since the branch label generation phase must always precede metric calculation.
The preceding discussion assumes that the estimated sub-states y and z are known. In fact this requires a scheme to track the movements through the trellis. Figure 6 suggests the solution. During the label generation phase of p+1 cycles, the sub-states (y, z) from time t0 are passed over the multiplexed LSB0/1 feeds. Depending on which branch becomes the ultimate winner, the new estimated sub-state becomes the p-bit string (0, y-lsb) or (1, z-lsb). To implement this source tracking, a p-bit shift register for each feed is enabled during label generation. It captures the old source, prepends a 0 or 1 depending on the feed, and discards the least significant bit. The discarded bit represents the oldest bit, no longer considered by the transmitter. Following the ACS steps, the winning branch is saved for each node in the butterfly. This winner now selects which of the two candidate tails, y'=(0, y-lsb) or z'=(1, z-lsb), to shift out, and the cycle repeats for the next stage's label generation.
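The source-tracking update amounts to a one-bit shift of the sub-state with the feed bit prepended. A minimal sketch follows, assuming p >= 1 and integer-coded sub-states with the oldest bit in the LSB; the function names are hypothetical.

```python
def next_sub_state(source_sub, source_lsb, p):
    """Sub-state estimate after one transition.

    source_sub -- p-bit estimated sub-state of the source (y or z)
    source_lsb -- 0 for the (x,0) feed, 1 for the (x,1) feed
    The feed bit is prepended and the oldest (least significant) bit dropped.
    """
    return (source_lsb << (p - 1)) | (source_sub >> 1)


def select_tail(y, z, winner, p):
    """Keep the candidate tail of whichever branch won the ACS comparison."""
    return next_sub_state(z, 1, p) if winner else next_sub_state(y, 0, p)


p = 8
print(format(select_tail(0b10110001, 0b00001111, winner=0, p=p), "08b"))  # 01011000
print(format(select_tail(0b10110001, 0b00001111, winner=1, p=p), "08b"))  # 10000111
```

This matches what a full m-bit state register would hold in its low p bits after the same shift, which is the property the RSSD add-on relies on.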
The apparatus for source tracking and dynamic label generation is shown in figure 7, with a 'black-box' ACS unit. One control signal is used to enable either the label generator and source tracking, or the ACS unit. Another control signal is used to designate cycle p+1 for bit-stuffing. To implement RSSD, the approximate overhead per generic rate 1/n ACS is: p+n/2 DFFs, n+1 AND gates and three 2x1 muxes. Additional costs are the n generator lines and 2 control lines that must be bussed through the decoder. Decode time increases by 1 cycle (bit-parallel), or p+1 cycles (bit-serial) per ACS operation. Considering modifications to Collins' particular bit-serial implementation, discussed further in this paper, for a K11 code running with p=8 (4 states), the overhead for each ACS (in terms of cell count) is 40% (neglecting control circuitry).
To implement RSSD with a state-serial decoder the sub-states must be saved in memory between operations. The RSSD adaptation can also be applied to higher rate codes but is not discussed in this paper.
The transparency of this approach is particularly appealing. All of the surrounding circuitry of the decoder (ACS, traceback memory, decode structure, IO buffering, etc.) can be of arbitrary architecture and approaches the trellis as if it were of a standard 2^(m-p) states.
VII. ADAPTIVE-RSSD IMPLEMENTATION
To implement Adaptive-RSSD there are a number of design considerations. Delay of the decoder is partially dependent on SINR and must be addressed in terms of buffering and system specifications. A simple implementation would use multiple autonomous decoders of various RSSD reduction levels; however, one re-configurable decoder presents certain advantages. Traceback and decoding also need to be examined to determine the most efficient solution in terms of area, cost, delay and power consumption.
With regard to delay, there are three data sources of concern. Delay-tolerant data is typically high rate and presents no problems. Broadcast data (e.g. video/digital radio) can initially be delayed but must then be uninterrupted. This is solved through output buffering. Delay-intolerant data, such as voice, must be uninterrupted and cannot experience significant delay. If the decoder uses an asynchronous interface, it can be clocked much faster than required for the incoming data. Assuming a voice rate on the order of 14.4 kbps, there is 69 µs between bits, whereas it only takes 0.4 µs for the proposed decoder to process each bit (at 50 MHz). Should the decoder fail in its initial RSSD attempt, which by necessity waits for every input bit, the subsequent attempts can proceed on the buffered channel data very quickly, adding only negligible delay.
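The timing argument can be checked with a few lines of arithmetic. The figures are those quoted in the text (14.4 kbps voice, 0.4 µs per decoded bit); the helper names are illustrative only.

```python
def bit_period_us(bit_rate_bps):
    """Time between incoming bits, in microseconds."""
    return 1e6 / bit_rate_bps


def speed_headroom(bit_rate_bps, decode_us_per_bit):
    """How many per-bit decode steps fit between two incoming bits."""
    return bit_period_us(bit_rate_bps) / decode_us_per_bit


period = bit_period_us(14_400)             # roughly 69 us
headroom = speed_headroom(14_400, 0.4)     # decoder is over 170 times faster
print(f"{period:.1f} us between bits, headroom of about {headroom:.0f}x")
```

With this much headroom, a retry over buffered channel data costs well under one percent of the time the frame took to arrive.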
The simplest way to implement A-RSSD is with multiple, independent decoders of different reduction factors, p. Each decoder would consist of 2^(m-p-1) butterflies and contain its own traceback memory and logic. With multiple units it is possible to pipeline frames. One may undergo RSSD-8 initial decoding while another, more troubled frame, may be in its third attempt at RSSD-6. This technique is especially valuable, however, for high volume, multi-channel Viterbi decoders such as those encountered in a cellular base station. Decode 'jobs' can be scheduled onto appropriate RSSD-p decoders based on channel properties and already encountered errors. Such a scheduling algorithm is straightforward and not considered in this paper. An example hardware implementation, on a single 1 cm x 1 cm die in 0.18 µm technology, can support approximately 10 K11 users or 40 K9 users. A sample configuration based on K11 A-RSSD can simultaneously support 512 users at RSSD-8 (K'=3), 256 users at RSSD-7, 128 users at RSSD-6 ... with up to 5 users using full RSSD-0 (K11) decoding. Even considering the RSSD overhead, the same die that supported only 10 K11 channels can now support up to 1024 K11 channels.
In a single channel application, rather than have separate RSSD decoders for each level of effort, area requirements can nearly be halved if the decoder circuitry can be reconfigured, on the fly, to operate at any reduction level. With this scheme special consideration must be given to the butterfly interconnection structure, traceback lengths and memory interface, and the sub-state estimation registers in the RSSD add-on units.
As we reduce the ACS nodes used, raising p, there is no possible selection of units having the necessary IO connections to their neighbours (refer to figure 4). It becomes necessary to re-configure the connections between nodes as we change decode strengths. As the butterfly interconnection network already poses a wiring problem, and with the goal of low power consumption, it is important to minimize the capacitance and overhead of any switching network. To alleviate these problems, the ACS nodes/butterflies chosen for computation at a particular RSSD-p are those with their p least significant bits zeroed. Under this scheme, for a K11 decoder with 512 butterflies and maximum reduction of RSSD-8, butterflies 0 and 256 are used in every level of decoding. When decoding with p<8, butterflies 128 and 384 enter service; at p<7, butterflies 64, 192, 320, 448 are added, etc. This selection of nodes has two advantages. Only half of the connections ever need to be re-configured as we change p. This switching is done using an inexpensive hierarchy of MUXs, selecting from 8 inputs on butterflies 0 and 256 down to one on butterflies 2, 6, 10, 14 ... 510. Odd-numbered butterflies are only ever used in RSSD-0 decoding and therefore never require reconfiguration. Referring to figure 4, the re-wiring can be conceptually viewed as signal 'promotion' of the LSB0 feeds. The other advantage to choosing nodes with the least significant bits zeroed is that the precomputed portion of the branch label, precomp(0,x), is still valid when the same butterfly is used in a larger trellis.
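The selection rule (butterflies whose p least significant index bits are zero) can be enumerated directly. The sketch assumes 9-bit butterfly indices for the K11 decoder (m = 10, 512 butterflies); the function names are illustrative.

```python
def active_butterflies(p, m=10):
    """Butterflies in use at RSSD-p: indices whose p least-significant
    bits are zero (indices are m-1 bits wide)."""
    return list(range(0, 1 << (m - 1), 1 << p))


def first_level(index, p_max=8):
    """Largest p (weakest decode) at which a butterfly is already active."""
    p = 0
    while p < p_max and index % (1 << (p + 1)) == 0:
        p += 1
    return p


print(active_butterflies(8))          # [0, 256]
print(active_butterflies(7))          # [0, 128, 256, 384]
newly_at_rssd1 = sorted(set(active_butterflies(1)) - set(active_butterflies(2)))
print(newly_at_rssd1[:4])             # [2, 6, 10, 14]
```

Under this rule the butterflies that first enter service at RSSD-1 are the even indices not divisible by 4, and odd indices appear only at RSSD-0.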
To use the same butterfly in different states of reduction, when operating in anything other than their native reduction state, the source-tracking shift registers of figure 7 must be right justified. This adds additional overhead into the RSSD patch but is mitigated by using multiplexed flip-flops normally intended for verification purposes.
During each symbol period, each active node stores its winning branch into traceback memory. This requires anywhere from 4 to 1024 k-bit parallel writes, depending on the stage of reduction. One option is a custom high ported RAM, specifically designed to reduce power consumption when in the higher, more likely, reduction stages. The density and complexity of such a specialized RAM would suffer however. A more practical solution uses standard RAM cells. An interface unit bridges the RAM to the ACS units. This interface converts the one massive parallel write into 1, 2, 4, 8, 16 or 32 serial 32-bit writes. It specifically allocates addresses to each butterfly such that when in the most probable states of reduction only a single word of memory is addressed, resulting in considerable power savings. Two read cycles/bit are additionally required for the pipelined traceback units. In each read cycle, one of 1024 read-select lines is raised by the traceback logic to indicate which bit to read. The memory interface is responsible for translating this request into an address for the RAM, and masking the appropriate bit to supply the traceback logic. The RAM must perform up to 34 data cycles (RSSD-0) in each symbol period. RSSD-0 takes 9 cycles to perform the ACS operation. The RAM must be clocked at least 32/9=3.56 times faster than the ACS units so as not to be the bottleneck. An alternative to raising the clock frequency is to make the RAM configuration more granular, adding more RAM cells so that the 32 writes can be performed in a more parallel fashion.
VIII. SYNTHESIS RESULTS
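The number of serial RAM writes per reduction level follows from the count of active states and the word width. The sketch below assumes k = 1 (one decision bit per state), the 32-bit word and 9 ACS cycles quoted in the text, and hypothetical helper names.

```python
def serial_writes(p, m=10, word_bits=32):
    """32-bit RAM writes needed per symbol period at RSSD-p (k = 1)."""
    active_states = 1 << (m - p)          # one decision bit per active state
    return max(1, active_states // word_bits)


ACS_CYCLES_RSSD0 = 9                      # bit-serial ACS cycles, from the text
READS_PER_SYMBOL = 2                      # pipelined traceback reads, from the text

writes_by_level = {p: serial_writes(p) for p in range(8, -1, -1)}
print(writes_by_level)
print(serial_writes(0) + READS_PER_SYMBOL)              # 34 RAM cycles at RSSD-0
print(round(serial_writes(0) / ACS_CYCLES_RSSD0, 2))    # 3.56
```

RSSD-8 through RSSD-5 all fit in a single word, which is where the power saving in the most probable reduction states comes from.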
A prototype K11, rate 1/2, 8 stage asynchronous A-RSSD, state-parallel, bit-serial, 3 bit soft-decision decoder has been designed, synthesized and tested with 0.18 µm TSMC supplied standard cells. The block diagram of the system is shown. It uses one re-configurable block of 512 butterflies. The traceback memory and IO buffer cells are from Virage Logic Corp., provided via CMC. Metric generation is based on Collins' bit-serial implementation and the resultant butterfly is shown. Metric reduction is accomplished via a wired-AND to indicate that all active metrics are above the maximum dynamic range. Once noted, clearing the MSB and inverting the 2nd MSB of each metric in the system implements a constant subtraction. The serial comparator is believed to be new, and uses only 5 cells.
The streaming pipelined traceback addressing scheme is also taken from Collins and requires storage of 3*TB_LEN*2^m bits. Minimum traceback length is 5*K_Max = 55, and so TB_LEN of 64 was chosen for simplicity, making the memory storage requirements 3*64*1024 bits = 24 kB. A variable traceback length is possible. For RSSD-8 a TB length of 10 will suffice, cutting the delay and reducing the memory cycles by 1-(fn+30)/(fn+192), or 13% for a 1024 bit frame. The traceback logic is separate from the ACS array and consists of 2 flip-flops per state, wired in the same manner as the ACS units themselves. As in Collins, these units implement reverse flow token-passing to indicate which node is on the winning path.
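The memory size and the saving from a shorter traceback length can be reproduced with the formulas quoted above. The function names and default arguments are illustrative assumptions.

```python
def traceback_bits(tb_len, m=10):
    """Storage for the streaming pipelined traceback: 3 * TB_LEN * 2^m bits."""
    return 3 * tb_len * (1 << m)


def cycle_reduction(fn, tb_short=10, tb_full=64):
    """Fractional saving in memory cycles when TB_LEN drops from tb_full
    to tb_short for a frame of fn bits: 1 - (fn + 3*short)/(fn + 3*full)."""
    return 1 - (fn + 3 * tb_short) / (fn + 3 * tb_full)


bits = traceback_bits(64)
print(bits, bits // 8 // 1024)               # 196608 24
print(round(cycle_reduction(1024) * 100))    # 13
```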
Area requirements of the entire design, pre-layout and routing, are broken down by function. The power consumption estimates for a 1 Mbps data rate were obtained by appropriately annotating switching activities and analyzing the compiled design with Synopsys Power Compiler. The power consumption scales linearly as the data rate changes.
The maximum decode speed is constrained by the commercial RAM cells. These are documented to operate at 166 MHz, so the ACS units can be clocked at up to 46.5 MHz. At RSSD-0 it
