Abstract-A novel architecture design to speed up the Viterbi algorithm is proposed. By increasing the number of states in the trellis, the serial operation of a traditional add-compare-select unit is transformed into a parallel operation, thus achieving a substantial speed increase. The proposed architecture increases the speed by 33% at the expense of a fairly modest increase in area, making it an attractive approach for high-speed applications. A simple example illustrates the proposed algorithm in a maximum-likelihood sequence detector. A comparative synthesis is made to compare the proposed architecture with other approaches, and synthesis simulations confirm the projected throughput gain. The proposed algorithm is also extended to the block-processing architecture, and we show that an additional 50% speedup is achieved.
I. INTRODUCTION
The Viterbi algorithm (VA) was first introduced in 1967 as a means to decode convolutional codes [1]. Later, Forney [2] showed that the VA can be applied to implement maximum-likelihood sequence estimation (MLSE). Since then, the VA has been widely used in many communication systems in both the maximum-likelihood (ML) convolutional decoder and the ML sequence detector.
Many high-speed applications adopt the VA operating at several hundred megabits per second, and this operating clock speed has been increasing constantly. Examples include the ML sequence detector (MLSD) in magnetic recording systems and convolutional decoders for error correction. There is therefore a strong need to achieve a higher speed, and this motivates the work presented here.
The main operation unit performing the VA is called an add-compare-select (ACS) unit. Because of its feedback loop, the ACS unit is considered the bottleneck in actual implementations of high-speed applications. Several architectures have been proposed for speeding up the VA operation. In [3], the throughput of the VA was increased by using multiple ACS units in a parallel implementation, thus trading area for speed. With similar approaches, Thapar et al. [4] and Black et al. [5] extended the block processing by applying one stage of lookahead to both the ACS and traceback recursions, resulting in a radix-4 ACS and a radix-16 traceback iteration. This architecture is referred to as "block processing." In this letter, we propose a new architecture of the ACS unit [6] that speeds up the VA by increasing the number of states of the trellis. By reformulating the VA, the proposed architecture provides an alternative approach to high-throughput design. We will refer to the proposed architecture as the "double state." With the double state, the serial operation of the ACS unit can be transformed into a parallel operation. This new approach enables us to speed up the ACS operation with a fairly modest increase in area. We also extend the double-state technique to the block-processing architecture. The overall speed analysis shows that when the double-state technique is applied to the conventional Viterbi processor and to the block processor, speedups of 33% and 50% are expected, respectively.
This letter is organized as follows. Section II briefly describes the VA and the ACS unit. The new architecture for the fast VA is proposed in Section III. In Section IV, a simple example of the proposed architecture is provided for MLSE, and projections for area and speed are presented along with actual synthesis data that confirm the estimates. A comparison between the proposed architecture and the block-processing approach is also made. In Section V, we extend the proposed double-state approach to the block-processing technique. Finally, Section VI concludes the letter.
II. VITERBI ALGORITHM (VA)
In this section, we will briefly explain the VA and also introduce the notations used throughout this letter.
Consider the MLSD case first. Assume the channel response polynomial is given by $H(D) = \sum_{k=0}^{K} h_k D^k$, where $D$ denotes a delay operator. The VA recursively finds the most likely path by accumulating the branch metric (BM) for each state, where the number of states is $M^K$. Here, $M$ represents the size of the input alphabet and $K$ denotes the channel memory length. The BM for each transition in a trellis is computed using $H(D)$ and the trellis input corresponding to the transition. In each state of a trellis, the previous state metric (SM) and the corresponding BM are added together, and then the accumulated SM is updated recursively by choosing the minimum over all possible cases:
$$SM_j^{n} = \min_{i} \left( SM_i^{n-1} + BM_{ij}^{n} \right)$$
where $SM_i^{n}$ represents the SM of the $i$th state at time $n$, and $BM_{ij}^{n}$ denotes the BM at time $n$ associated with a transition from the $i$th state to the $j$th state. This update operation is performed in the ACS unit in the VA. Each operation in the ACS unit is carried out in a purely serial way, which makes the ACS unit the worst bottleneck for the overall throughput. This is illustrated in Fig. 2 for a binary input case. In this diagram, a triangle and a trapezoid represent a comparator and a multiplexer, respectively. The multiplexer chooses either the input at the top or the input at the bottom as its output, depending on the middle input at the left side. In the following section, we propose a new architecture that speeds up the ACS unit by changing the serial ACS unit into a parallel structure.
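The serial add, compare, and select steps described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware of Fig. 2: the function name `acs_step`, the predecessor lists, and the numeric metrics are all hypothetical values chosen for the example.

```python
from typing import List, Tuple

def acs_step(prev_sm: List[float],
             bm: List[List[float]],
             preds: List[List[int]]) -> Tuple[List[float], List[int]]:
    """One conventional ACS update: add first, then compare, then select."""
    new_sm, decisions = [], []
    for j, pred in enumerate(preds):
        # Add: one candidate metric per predecessor i of state j.
        candidates = [prev_sm[i] + bm[i][j] for i in pred]
        # Compare and select: keep the smallest candidate and its origin.
        best = min(range(len(pred)), key=lambda k: candidates[k])
        new_sm.append(candidates[best])
        decisions.append(pred[best])
    return new_sm, decisions

# Two-state binary trellis: each state is reachable from both states.
preds = [[0, 1], [0, 1]]
prev_sm = [0.0, 1.0]
bm = [[0.5, 2.0], [0.0, 0.25]]  # bm[i][j]: hypothetical metric for i -> j
new_sm, decisions = acs_step(prev_sm, bm, preds)
print(new_sm, decisions)
```

Note that the comparison cannot start until the additions have finished, which is the serial dependency the text identifies as the bottleneck.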
III. NEW STRUCTURE: DOUBLE STATE
For simplicity, we assume a binary input case (input alphabet size $M = 2$). This result can be easily generalized to a multilevel input case. We also continue to explain the new structure for the MLSD case; applying the same structure to the ML convolutional decoder is straightforward.
First, note that the channel response polynomial $H(D)$ of order $K$, with coefficients $h_k$, could be written as $H(D) = \sum_{k=0}^{K} h_k D^k + 0 \cdot D^{K+1}$. As an example, Fig. 3 illustrates two equivalent trellises for the two-state channel and the four-state channel. The numbers next to the states represent the input sequence. (For example, 10 at a left-hand side state of the trellis represents inputs 1 and 0 at two consecutive times, with the older input written first.) Also, the numbers on the arrow lines show the ideal channel output associated with each transition.
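The effect of the appended zero coefficient can be checked numerically. The sketch below assumes a hypothetical order-1 channel with coefficients `[1, -1]` (chosen only for illustration, not taken from Fig. 3) and binary inputs, and verifies that both transitions into any state of the extended trellis have the same ideal output.

```python
from itertools import product

def ideal_output(h, inputs_newest_first):
    """Noise-free channel output: sum over k of h[k] * a[n-k]."""
    return sum(hk * a for hk, a in zip(h, inputs_newest_first))

h = [1, -1]        # hypothetical order-1 channel (assumption for this sketch)
h_ext = h + [0]    # same channel with a zero coefficient appended

# Extended trellis: previous state (a[n-2], a[n-1]), new input a[n].
outputs = {}
for a_n2, a_n1, a_n in product([0, 1], repeat=3):
    y = ideal_output(h_ext, [a_n, a_n1, a_n2])
    outputs[((a_n2, a_n1), (a_n1, a_n))] = y

# The two starting states of each ending state differ only in the oldest bit,
# which is multiplied by the zero coefficient.
same = all(outputs[((0, a1), (a1, a0))] == outputs[((1, a1), (a1, a0))]
           for a1, a0 in product([0, 1], repeat=2))
print(same)
```

Because the ideal output is identical for both incoming branches, any BM computed from it is identical as well.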
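For a channel whose last coefficient is zero, the BMs for the two transitions that end in the same state are identical, because the two starting states differ only in the oldest bit position. In this case, the Viterbi processor has $2^{K+1}$ states, even though $H(D)$ is actually a polynomial of order $K$ (thus the term "double state"). With the double state in a trellis, the BMs ending in one state are all the same. With $SM_i^{n}$ denoting the SM of state $i$ at time $n$ and $BM_{ij}^{n}$ the BM of the transition from state $i$ to state $j$, this means that when choosing the minimum of the two possible SMs $SM_{i}^{n-1} + BM_{ij}^{n}$ and $SM_{i'}^{n-1} + BM_{i'j}^{n}$, we can select the lesser of the two previous SMs $SM_{i}^{n-1}$ and $SM_{i'}^{n-1}$ without waiting for the addition of the BMs. (In this case, $BM_{ij}^{n} = BM_{i'j}^{n}$.) Equivalently, we perform the following recursion:
$$SM_j^{n} = \min_{i} \left( SM_i^{n-1} \right) + BM_{j}^{n}$$
where $BM_{j}^{n}$ denotes the common BM of all transitions ending in state $j$.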
For example, in the current state 00 of Fig. 3, the two incoming paths from the previous states 00 and 10 have the same BM 0. This applies to all the other states, since in the double state, the oldest input to the Viterbi processor makes no contribution to the BM of any state transition. Therefore, in the double-state structure, the "Add" operation which computes the SM can be carried out at the same time as the "Compare" operation. This new structure is shown in Fig. 4. As shown in this diagram, the two BMs entering the same state are identical. A careful investigation of the double-state trellis reveals that further hardware savings are possible. The current states 00 and 01 in Fig. 3 share the same pair of previous states 00 and 10. Therefore, if the current state 00 chooses the path from the previous state 10 over the one from the previous state 00, then the same decision is made at the "Select" operation for the current state 01. The same holds for the other pair of states, 10 and 11. Thus, every two states in the double-state structure can share the same decision-making unit in their "Compare" operation. This can be easily generalized to a multilevel input case. Combining the two states which share the same previous states in the double state, the new ACS structure is illustrated in Fig. 5. In this diagram, the two states with common previous states share the comparator, thus reducing the hardware complexity. As a result, one ACS unit of the type shown in Fig. 5 is used for every two states of the double-state architecture.
At first it appears that the double-state trellis in Fig. 3 requires twice as much computation for the SM as the ordinary trellis. However, it will be shown in the following section that the double-state architecture computes no redundant information. In fact, every transition shown in the ordinary trellis must be computed, whereas only half of the transitions shown in the double-state trellis are used for computation.
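The compare-add-select update with a shared comparator can be sketched as follows for a four-state double-state trellis. This is a software model under stated assumptions, not the circuit of Fig. 5: states are indexed by the 2-bit value (older bit, newer bit), `bm_end[j]` is a hypothetical common BM for both branches entering state `j`, and the metric values are made up.

```python
def double_state_step(prev_sm, bm_end):
    """Compare-add-select on a four-state double-state trellis."""
    new_sm = [0] * 4
    decisions = [0] * 4
    for mid in (0, 1):                    # newer bit shared by both predecessors
        p0, p1 = 0b00 | mid, 0b10 | mid   # predecessors differ in the oldest bit
        # One comparison of previous SMs serves both successor states.
        winner = p0 if prev_sm[p0] <= prev_sm[p1] else p1
        for new_bit in (0, 1):
            j = (mid << 1) | new_bit
            # In hardware the additions would run in parallel with the
            # comparison; here the winner's sum is simply selected.
            new_sm[j] = prev_sm[winner] + bm_end[j]
            decisions[j] = winner
    return new_sm, decisions

def conventional_step(prev_sm, bm_end):
    """Reference add-compare-select on the same trellis."""
    out = []
    for j in range(4):
        mid = j >> 1
        out.append(min(prev_sm[mid] + bm_end[j], prev_sm[0b10 | mid] + bm_end[j]))
    return out

prev_sm = [3, 1, 2, 5]   # hypothetical previous state metrics
bm_end = [0, 1, 1, 4]    # hypothetical BM per ending state
new_sm, decisions = double_state_step(prev_sm, bm_end)
print(new_sm, decisions)
```

Current states 00 and 01 (indices 0 and 1) record the same winning predecessor, which is why one comparator can serve both.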
Note that the proposed architecture can be explained in terms of a simple lookahead with nonlinearities in the loop [7], since it can be viewed as a retiming of the existing trellis update. Basically, the proposed architecture transforms the conventional ACS unit into a "compare-add-select" operation. In contrast, the differential trellis decoding (DTD) algorithm proposed in [8] is based on a "compare-select-add" method. While DTD reduces the number of additions by increasing the number of preprocessing operations in the trellis, the double-state architecture speeds up the operation by increasing the number of states in the trellis. Therefore, the double-state approach is better suited to high-speed applications where additional complexity is acceptable.
IV. EXAMPLE AND AREA SPEED ANALYSIS
This section shows a simple example explaining that the double-state structure contains no redundancy at all. Area and speed estimates are also given, along with the actual synthesis data.
A. Example
Fig. 6 compares the ordinary VA and the double state for the binary channel shown in Fig. 3, together with the input sequence to the Viterbi processor. The BM is computed using a normalized equation based on the ideal channel output. The thick line and the dashed lines indicate the survivor path and the discarded paths, respectively.
In this example, it is readily apparent that the two representations are exactly the same. Each transition in the ordinary trellis appears in the double-state trellis. In the ordinary trellis, the discarded paths are conventionally not shown, while in the double-state trellis representation, there are no hidden discarded paths. The only difference between the two trellises is that each decision in the double state is made one step later. For example, the choice between two candidate paths that state 0 of the ordinary trellis makes at a given time is made by the double-state trellis at the following time step. This extra latency can, however, be corrected by noting that knowledge of the current sample value is not necessary to make a decision. Another point to note is that the double-state trellis does not make any decision at the initial stage, so the transitions shown at the first time step can be chosen arbitrarily and this first decision is neglected.
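The equivalence of the two trellises can also be checked with a small simulation. The sketch below is not the example of Fig. 6: it assumes a hypothetical channel with ideal output `a[n] - a[n-1]`, a plain squared-error BM rather than the normalized one, all-zero initial metrics, and made-up integer samples.

```python
def ordinary_viterbi(samples):
    """Two-state trellis: state = previous input bit; add, then compare."""
    sm = [0, 0]
    for y in samples:
        sm = [min(sm[s] + (y - (a - s)) ** 2 for s in (0, 1)) for a in (0, 1)]
    return sm

def double_state_viterbi(samples):
    """Four-state trellis: state = (older bit, newer bit); compare, then add."""
    sm = {(p, q): 0 for p in (0, 1) for q in (0, 1)}
    for y in samples:
        new = {}
        for q in (0, 1):
            # This comparison needs no current sample: it is the decision the
            # ordinary trellis made for state q one step earlier.
            best_prev = min(sm[(0, q)], sm[(1, q)])
            for a in (0, 1):
                new[(q, a)] = best_prev + (y - (a - q)) ** 2
        sm = new
    return sm

samples = [1, -1, 0, 1, 0]   # hypothetical received samples
ordinary = ordinary_viterbi(samples)
double = double_state_viterbi(samples)
# Resolving the pending comparison recovers the ordinary state metrics.
collapsed = [min(double[(0, a)], double[(1, a)]) for a in (0, 1)]
print(ordinary, collapsed)
```

The pending comparison in `collapsed` is the one-step latency described above: the double-state trellis holds both candidates for each ordinary state until the next cycle.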
B. Area-Speed Analysis
Compared to the ordinary VA, the proposed double-state structure requires twice as many adders, SM registers, and multiplexers. Everything else remains the same, including the path memory, since the number of surviving paths in the double-state trellis is the same as that in the ordinary trellis. As the adders, SM registers, and multiplexers are a substantial, but not a dominant, portion of the area of the ordinary Viterbi processor, the expected area of the double-state implementation is roughly 50% more than that of the ordinary implementation. However, this area estimate assumes the worst-case scenario, and the increase can be lower in actual designs. This is confirmed at the end of this section. In some applications, the Viterbi processor comprises only a small part of the complete chip. For example, in an eight-state EPR4 Viterbi detector chip [9] which includes timing recovery, an adaptive equalizer, a continuous-time filter, an encoder/decoder, and servo processing, the Viterbi detector would take only about 8% of the total area of the chip. A 50% increase in the detector area therefore amounts to only about 4% of the chip area in this case.
Transistor-level simulations of an ACS designed for high-speed operation show that, of a clock cycle, about 50% is used by the add, 25% by the compare operation, 5% by the multiplexing, and 20% by the register setup and propagation delays. In the proposed double-state structure, the addition proceeds at the same time as the comparison, thus saving 25% of the clock cycle. The cycle shrinks to 75% of its original length, so the double state should run $1/0.75 \approx 1.33$ times as fast, that is, 33% faster than the ordinary ACS unit.
A comparative synthesis for the 16-state MLSD with a fixed channel response polynomial was made [10] to compare the speed and area estimates for the proposed architecture and the block-processing approach, and the result is summarized in Table I. It demonstrates that the double-state architecture provides a 30% throughput increase with 32% additional area over the conventional implementation, whereas the synthesis based on block processing yields a 58% speedup with a 151% area increase. While the throughput gain of the proposed architecture is smaller than that of the block approach, the former requires a much smaller area than the latter. It was also noted that for applications where the channel response polynomial is programmable [11], the area increase for the block-processing technique is much higher due to the increased complexity of the BM computations. It should also be mentioned that the area and speed comparison presented in Table I may vary, depending on the implementation. Thus, it provides a relative figure of merit in terms of area and speed.
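The 33% figure follows directly from the cycle breakdown quoted above. A minimal sketch of the arithmetic, where the helper name `speedup_from_overlap` and the dictionary keys are hypothetical labels for this example:

```python
def speedup_from_overlap(cycle_fractions, hidden):
    """Speedup when the stages in `hidden` leave the critical path."""
    total = sum(cycle_fractions.values())
    remaining = total - sum(cycle_fractions[name] for name in hidden)
    return total / remaining

# Percent of one clock cycle, as quoted in the text.
acs_cycle = {"add": 50, "compare": 25, "mux": 5, "register": 20}

# The compare delay is hidden behind the longer add delay.
gain = speedup_from_overlap(acs_cycle, hidden=["compare"])
print(f"speedup factor: {gain:.2f}")
```

The calculation assumes the comparison is fully hidden behind the addition, which holds here because the add takes longer than the compare.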
V. EXTENSION TO THE BLOCK-PROCESSING TECHNIQUE
In this section, we extend the proposed double-state architecture to the block-processing technique described in [3]-[5] and show that the Viterbi processor throughput can increase even further. Consider the four-state trellis described in Fig. 7(a). By processing multiple trellis stages in one cycle, block processing can speed up the whole operation. Fig. 7(b) shows the trellis obtained by merging two consecutive stages. In this example, the number of inputs to each ending state is four, and the ACS unit implementing this merged trellis is illustrated in Fig. 8. Note that this ACS unit operates at half the rate of the conventional Viterbi processor, and the BM is computed by adding the two separate BMs of the individual trellis stages.
We can improve the throughput even further by extending the proposed double-state approach to this block-processing architecture. This can be done by adding two zero terms at the end of the channel response polynomial. Then, all four inputs to any ending state have the same BM values, and this allows us to employ the same principle as the double-state technique described in Section III. Following the approach used in the double-state technique, which combines states sharing the same starting states, the resulting architecture is illustrated in Fig. 9. Note that each of these ACS units serves a group of ending states with common starting states, so the trellis needs fewer ACS units than it has states. The number of adders and multiplexers is quadrupled in this case, and the total complexity is certainly much greater than that of the block-processing architecture.
As for the throughput gain, the double-state approach is expected to benefit more when it is applied to the block-processing trellis than to the standard Viterbi processor, since the four-input compare operation in the block-processing technique takes longer than the normal two-input compare operation. It is estimated that the adder and the four-input comparator take up about 40% and 35% of one processing cycle, respectively. Hiding the 35% compare delay behind the addition shortens the cycle to 65% of its original length, so combining the double-state technique with the block-processing architecture results in a speedup factor of $1/0.65 \approx 1.5$ over the block-processing technique. Based on the throughput gain of about 1.6 for block processing reported in the previous section, the overall speedup factor of the combined structure over the conventional Viterbi processor would be $1.6 \times 1.5 = 2.4$, a gain of 140%. Note that a comparable speedup could also be achieved with a three-stage block-processing technique employing an eight-input comparator, but the three-stage block processor would be more complex.
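The merged two-stage update can be sketched for a four-state binary trellis as follows. This is an illustrative software model, not the circuit of Fig. 8: the state indexing (older bit, newer bit) and the BM tables are hypothetical, and the result is checked against two ordinary single-stage updates.

```python
def successors(state):
    """Next states of a 2-bit shift-register state (older bit, newer bit)."""
    newer = state & 1
    return [(newer << 1) | bit for bit in (0, 1)]

def single_step(prev_sm, bm):
    """Ordinary two-input ACS update over one trellis stage."""
    new_sm = [None] * 4
    for i in range(4):
        for j in successors(i):
            cand = prev_sm[i] + bm[i][j]
            if new_sm[j] is None or cand < new_sm[j]:
                new_sm[j] = cand
    return new_sm

def merged_step(prev_sm, bm_first, bm_second):
    """Block-processing update: four-input ACS over two merged stages."""
    new_sm = []
    for j in range(4):
        # The merged BM is the sum of the two per-stage BMs along each path.
        candidates = [prev_sm[i] + bm_first[i][k] + bm_second[k][j]
                      for i in range(4) for k in successors(i)
                      if j in successors(k)]
        new_sm.append(min(candidates))
    return new_sm

# Hypothetical integer BMs; only entries on valid transitions are used.
bm_first = [[(3 * i + 5 * k) % 7 for k in range(4)] for i in range(4)]
bm_second = [[(2 * k + j * j) % 5 for j in range(4)] for k in range(4)]
prev_sm = [0, 2, 1, 3]

merged = merged_step(prev_sm, bm_first, bm_second)
two_steps = single_step(single_step(prev_sm, bm_first), bm_second)
fan_in = [sum(1 for i in range(4) for k in successors(i) if j in successors(k))
          for j in range(4)]
print(merged, fan_in)
```

Each ending state has four incoming merged branches, so the compare stage becomes a four-input comparison that runs once per two input samples.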
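The combined speedup quoted above can be reproduced with a short calculation. The variable names are hypothetical, and the 1.6 block-processing gain is the rounded value used in the text rather than the exact synthesis figure.

```python
# Fraction of the block-processing cycle spent in the four-input comparator.
compare_fraction = 0.35

# Hiding the compare delay behind the add shortens the cycle to 65%.
block_double_gain = 1 / (1 - compare_fraction)

# The text rounds this factor to 1.5 and the block-processing gain to 1.6.
block_gain = 1.6
overall_factor = block_gain * 1.5
overall_percent = (overall_factor - 1) * 100

print(f"double state over block processing: {block_double_gain:.2f}")
print(f"overall gain over conventional: {overall_percent:.0f}%")
```

Using the unrounded factor of about 1.54 instead of 1.5 would give a slightly larger overall figure, so the 140% value is best read as an estimate.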
VI. CONCLUSIONS
In this letter, we have proposed a novel ACS unit which speeds up the VA. By increasing the number of states of the ordinary trellis, the serial operation in the ACS unit is reorganized so that the "Add" and "Compare" operations are carried out at the same time, thus shortening the operation cycle. We have shown that speedups of 33% and 50% can be achieved when the proposed algorithm is applied to the conventional Viterbi processor and to the block processor, respectively, and this is confirmed through synthesis simulation. In particular, following the "system on a chip" trend, the increased area becomes small compared with the whole chip size.
Therefore, we conclude that the proposed double-state architecture is an attractive approach to achieving high speed in very large-scale integration (VLSI) chip designs for many communication applications, at a fairly modest cost in area and power dissipation.
