Abstract-In exploiting the potentials of highly parallel architectures to speed up the computation rate of systems enabled by VLSI, special attention has to be paid to designing algorithms such that they can be mapped onto parallel hardware.
Summarizing, architectures favorable for VLSI have the following four general properties, here referred to as the VLSI constraints:
1) local communication;
2) high regularity;
3) small area consumption (efficiency);
4) linear scale solution.
Of course, it is a difficult task to find architectures for specific problems that satisfy all four VLSI constraints, and it certainly cannot be solved for an arbitrary problem. In this paper, we show that an architecture for the Viterbi algorithm (VA) can be derived which comes very close to meeting all four VLSI constraints, even though its major part is a nonlinear data-dependent feedback loop. First, in Section II, we find an appropriate mathematical description, i.e., an algebra for the VA which allows the ACS recursion to be written as a linear equation. This mathematical description is used in Section III to derive a linear scale solution. In Section IV, constraint 1 (local communication) is met at the word level whereas, in Section V, the granularity is broken down to the bit level. This allows the implementation by a bit-level systolic array with bit-level local communication and regularity, leading to a very efficient architecture.
II. THE VITERBI ALGORITHM AND ITS ALGEBRA
The VA was introduced in 1967 as a method to decode convolutional codes [1]. Later, it was shown to be a special case of dynamic programming [2]. In the meantime, the VA has found widespread applications, e.g., in digital communication, magnetic recording, speech recognition, etc. For a comprehensive tutorial on the VA, see [3]. To introduce the notation, a brief summary of the VA is provided below.
Underlying is a discrete-time Markov chain with a finite number N of states $s_i$. For each state, the VA calculates the optimum path which leads to that state and discards all other paths already at time k as nonoptimal. This is accomplished by summing a probability measure called the path metric $\gamma_{i,k}$ for each state $s_i$ at every time instant k. At the next time instant k + 1, depending on the newly observed transition, a transition metric $\lambda_{ij,k}$ is calculated for all possible transition branches from state $s_j$ to state $s_i$ ($s_j \to s_i$) of the trellis.
The algorithm for obtaining the updated $\gamma_{i,k+1}$ is called the ACS recursion of the VA (ACS: add-compare-select) and can be described in the following way. For each state $s_i$ and all its preceding states $s_j$, choose that path as optimum according to the following decision:
$$\gamma_{i,k+1} = \max_{s_j \to s_i} \left( \lambda_{ij,k} + \gamma_{j,k} \right). \qquad (1)$$
The surviving path has to be updated for each state and has to be stored in an additional memory, called the survivor memory. For a sufficiently large number of observed transitions (survivor depth D), it is highly probable that all N paths merge when they are traced back.
Hence the number D of transitions which have to be stored as the path leading to each state is finite, which allows the estimated transition of time instant k-D to be determined.
An implementation of the VA, referred to as the Viterbi processor (VP), can be divided into three basic units as shown in Fig. 2. The input data are used in the transition metric unit (TMU) to calculate the transition metrics $\lambda_{ij,k}$, which then are accumulated recursively as path metrics $\gamma_{i,k}$ in the add-compare-select unit (ACSU). The survivor memory unit (SMU) processes the decisions made in the TMU and ACSU and outputs the estimated data.
A. Algebraic Laws of Add and Max
When parallel branches exist¹ (e.g., a and b), i.e., branches that start at the same state and end at the same state, the maximum of their transition metrics can be selected before the path metric is added (3). This property of the two operations maximum selection ("max") and addition ("add") given in (3) should now be examined closer. In a more general notation, this amounts to
$$\max(a + c,\; b + c) = \max(a, b) + c, \qquad (4)$$
which is equivalent to the distributive law of multiplication and addition as
$$ac + bc = (a + b)c. \qquad (5)$$
¹Parallel branches exist, e.g., in most TCM codes (TCM: trellis coded modulation). The notation used here assumes that the maximum metric of each set of parallel branches is found prior to the ACS operation being performed. It is the one referred to as $\lambda_{ij,k}$.
The distributive law is one of the basic axioms so that, e.g., the conventional addition and multiplication of integers and real numbers have the algebraic structure of a ring. Therefore, we now undertake a more detailed examination of the operations "add" and "max" with the goal of finding a set of axioms that form an algebraic structure. To simplify the notation and emphasize the fact that "max" and "add" are distributive (4), the two symbols, $\otimes$ for "add" and $\oplus$ for "max," will be used from here on. Since, algebraically, add operates like the multiplication and max like the addition, to ease the understanding, we refer to $\oplus$ (max) as the "algebraic addition."
Let R be a set of elements {a, b, c, ...} for which the algebraic sum $a \oplus b$ and product $a \otimes b$ of any two elements of R are defined. Here, R can be the set of all reals, integers, etc. As is well known, $\otimes$ (the arithmetic addition) forms an Abelian group over R, i.e., $(R, \otimes)$ meets the axioms of associativity, commutativity, existence of a neutral element, and existence of an inverse element.
B. Semiring Algebra
Now the question arises whether the operations $\oplus$ and $\otimes$ can satisfy more laws, i.e., what algebraic structure is defined by $(R \cup \{Q\}, \oplus, \otimes)$? If we could find an expansion of R so that $(R, \oplus)$ would form a group, i.e., we could construct an inverse element, then $(R \cup \{Q\}, \oplus, \otimes)$ would form a ring. However, no such expansion of the set R can be found (as can be shown by applying general methods [4]). By including the element Q, "max" and "add" form a semiring over R. (Notice that the term "semiring" sometimes is used in the context of lattice theory.)
Axioms of the Semiring:
3) R contains a zero element Q (neutral element of $\oplus$) and a unity element 1 (neutral element of $\otimes$) such that $a \oplus Q = a$ and $a \otimes 1 = a$.
(Remark: this can be achieved by defining Q and 1 appropriately; for "max" and "add," the neutral elements are $Q = -\infty$ and $1 = 0$.)
The difference between a ring and a semiring is that the semiring is defined without any inverse elements, neither for $\oplus$ nor for $\otimes$. Therefore, all operations allowed in a ring are allowed in a semiring with the exception of operations that require the existence of an inverse element of $\oplus$ or $\otimes$. We mention that, in our case, $(R, \otimes)$ forms a group since the inverse element $a^{-1}$ exists. Thus, operations which require the existence of $a^{-1}$ are allowed, too (e.g., the calculation of fractions, see the Appendix).
Because the distributive law is met, superpositions can be undertaken in the same way as known from linear algebra; thus, semiring algebra is linear. Beyond the mathematical beauty of this property, we, of course, want to show its practical significance for the set of N ACS recursion equations (1) of the VA.
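As an illustration of the semiring laws discussed above, the following minimal sketch checks the neutral elements, associativity, and the distributive law numerically for "max" and "add". The function names (`oplus`, `otimes`, `semiring_laws_hold`) and the encoding of Q as floating-point negative infinity are assumptions made for this example only; they do not appear in the paper.

```python
import itertools

Q = float("-inf")  # zero element: neutral for "max" (assumed encoding)
ONE = 0            # unity element: neutral for "add"


def oplus(a, b):
    """Algebraic addition: maximum selection."""
    return max(a, b)


def otimes(a, b):
    """Algebraic multiplication: ordinary addition."""
    return a + b


def semiring_laws_hold(values):
    """Check neutral elements, associativity and distributivity on samples."""
    for a in values:
        if oplus(a, Q) != a or otimes(a, ONE) != a or otimes(a, Q) != Q:
            return False
    for a, b, c in itertools.product(values, repeat=3):
        if oplus(oplus(a, b), c) != oplus(a, oplus(b, c)):
            return False
        if otimes(otimes(a, b), c) != otimes(a, otimes(b, c)):
            return False
        # distributive law: (a (+) b) (x) c == (a (x) c) (+) (b (x) c)
        if otimes(oplus(a, b), c) != oplus(otimes(a, c), otimes(b, c)):
            return False
    return True


samples = [Q, -4, 0, 3, 7]
print(semiring_laws_hold(samples))
```

Integer samples are used so that all comparisons are exact; a numerical check of this kind is of course no substitute for the algebraic proof.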
C. Algebraic Formulation of the VA
By using the notation of the semiring, the four-state ACS recursion (2) can be rewritten as a linear form (6) and, by setting $\lambda_{ij,k} := Q$ for all $\lambda_{ij,k}$ that do not occur in the ACS recursion (6), the general form of (6) can be written as
$$\gamma_{i,k+1} = \bigoplus_{j=1}^{N} \left( \lambda_{ij,k} \otimes \gamma_{j,k} \right). \qquad (7)$$
It is to be noticed that the linear form can formally be written as a matrix-vector product, where the operations of matrix-vector multiplication are defined in analogy to the well-known definitions of conventional linear algebra. This can be exploited to rewrite the set of N ACS recursions (6), (7) as one matrix-vector equation
$$\Gamma_{k+1} = \Lambda_k \otimes \Gamma_k, \qquad (8)$$
where the vector of path metrics $\Gamma_k$ is given by
$$\Gamma_k = (\gamma_{1,k}, \ldots, \gamma_{N,k})^T. \qquad (9)$$
The N × N transition matrix $\Lambda_k$ comprises all transition metrics of time instant k which, for the trellis example given in Fig. 1 (6), results in (10), where, as was mentioned above, the elements (i, j) of the matrix take on the value $\lambda_{ij,k}$ (and those corresponding to transitions $s_j \to s_i$, which are not present in the trellis, have the value $\lambda_{ij,k} = Q$). As will be shown shortly, the matrix-vector formulation is of far more than formal interest. In particular, the linearity of the recursion, in analogy to conventional linear recursions [6], [8], [10], allows us to employ the powerful tool of look-ahead techniques developed for linear vector recursions. The fact that the ACS recursion can be interpreted as a matrix-vector product has previously been used to schedule the dataflow [11], [12], but here we may make use of the fact that the ACS recursion can be written as an algebraic matrix-vector product. This is the crucial difference which now allows the calculation and execution of all algebraic operations defined for semirings.
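The equivalence between the branch-by-branch ACS recursion and the algebraic matrix-vector product can be sketched as follows. The four-state matrix below follows a shift-register connectivity with two predecessors per state; it and all metric values are hypothetical and are not claimed to reproduce the trellis of Fig. 1. Q is again encoded as negative infinity, and the function names are assumptions of this example.

```python
Q = float("-inf")  # "zero" element marking absent transitions (assumed encoding)


def acs_direct(lam, gamma):
    """Add-compare-select over the branches that exist in the trellis."""
    new_gamma = []
    for i in range(len(gamma)):
        candidates = [lam[i][j] + gamma[j]
                      for j in range(len(gamma)) if lam[i][j] != Q]
        new_gamma.append(max(candidates, default=Q))
    return new_gamma


def mp_matvec(lam, gamma):
    """Matrix-vector product with (+) = max and (x) = add."""
    return [max(lam[i][j] + gamma[j] for j in range(len(gamma)))
            for i in range(len(lam))]


# lam[i][j] is the metric of the transition s_j -> s_i (hypothetical values).
lam = [
    [1, Q, 4, Q],
    [2, Q, 0, Q],
    [Q, 3, Q, 1],
    [Q, 0, Q, 5],
]
gamma = [0, 2, 1, 3]

print(acs_direct(lam, gamma))
print(mp_matvec(lam, gamma))
```

Because Q is neutral for the maximum and absorbing for the addition, the absent branches drop out of the matrix-vector product without any special handling.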
III. THE ALGEBRAIC M-STEP VA AND ITS IMPLEMENTATION
As can be seen in the block diagram (Fig. 2), a high-speed VP can only be realized by a high-speed implementation of all three units. Since the TMU and SMU are of simple feedforward structure, these units can easily be pipelined or implemented with parallel hardware. However, the nonlinear ACS recursion is a major bottleneck for a high-speed implementation of the VA. Two ad hoc solutions were presented to break this bottleneck, leading to an architecture that is a linear scale solution [13]-[15]. These methods were modified or extended in [16]-[19]. In the following, we exploit the fact that the operations of the VA obey the axioms of a semiring [20].
A. The Algebraic M-Step Viterbi Algorithm
We recall that the ACS recursion can be written as a matrix-vector product (8), which for the following time instant reads
$$\Gamma_{k+2} = \Lambda_{k+1} \otimes \Gamma_{k+1}. \qquad (11)$$
With the help of the rules of semiring algebra, we are now allowed to insert (8) into (11) and, repeating this over M steps, obtain
$$\Gamma_{k+M} = {}_M\Lambda_k \otimes \Gamma_k, \qquad {}_M\Lambda_k = \Lambda_{k+M-1} \otimes \cdots \otimes \Lambda_{k+1} \otimes \Lambda_k. \qquad (13)$$
Since the VA is written in (11) as a linear homogeneous recursion for the path metric vector $\Gamma_k$, the recursion over M steps (13) can be interpreted as the solution of the linear homogeneous recursion with initial vector $\Gamma_k$. The important observation is that the updating of the vector of path metrics $\Gamma_k$ does not have to take place at every time instant kT, i.e., at rate 1/T, but can also be carried out at a much slower rate 1/(MT), where M can be chosen freely. The computation of ${}_M\Lambda_k, {}_M\Lambda_{k+M}, \ldots$ is not a part of the feedback loop but can be carried out outside the loop with pipelined and/or parallel look-ahead hardware, in analogy to the results known for conventional linear recursions [33]. By implementing (13), the computation of the ACS recursion is carried out in steps of MT and, therefore, it is referred to as the M-step ACS recursion. A block diagram of this method is given in Fig. 3, where one can see that two units are required to compute the M-step ACS recursion: the A-ACSU to compute the "M-step transition matrix" ${}_M\Lambda_k$, and the M-step ACSU to carry out the recursion (13).
Column j of the M-step transition matrix is given by
$$\mathrm{column}_j({}_M\Lambda_k) = \Lambda_{k+M-1} \otimes \cdots \otimes \Lambda_{k+1} \otimes \mathrm{column}_j(\Lambda_k). \qquad (16)$$
By carrying out this M-fold multiplication of (16) from right to left, i.e., in time sequence, it can be seen that this is exactly as if the conventional VA were carried out over M - 1 steps with initial path metrics of time k + 1 set to $\Gamma_{k+1} = \mathrm{column}_j(\Lambda_k)$. Furthermore, since $\mathrm{column}_j(\Lambda_k)$ is made up of all transition metrics of state $s_j$ of time k to all other states, computing (16) is equivalent to decoding the "rooted one-step trellis of state $s_j$" shown in Fig. 4. Writing (16) for all columns of ${}_M\Lambda_k$ leads to one such rooted trellis for each state $s_j$, with finite block length M. Thus, the computation of ${}_M\Lambda_k$ corresponds exactly to the decoding of the complete set of N rooted one-step trellises of one M-step, as described in [13], and (8) describes the decoding of the M-step trellis.
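The M-step recursion (13) can be checked numerically with a minimal sketch: M conventional ACS steps are compared with one multiplication by the look-ahead product of the M transition matrices. The random integer metrics, the helper names, and the encoding of Q as negative infinity are assumptions made for this example only.

```python
import random

Q = float("-inf")  # "zero" element of the semiring (assumed encoding)


def mp_matvec(lam, gamma):
    """Matrix-vector product with (+) = max and (x) = add."""
    return [max(l + g for l, g in zip(row, gamma)) for row in lam]


def mp_matmul(a, b):
    """Matrix-matrix product with (+) = max and (x) = add."""
    n = len(a)
    return [[max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_transition_matrix(n, rng):
    """Hypothetical transition matrix; about half of the entries are absent."""
    return [[rng.choice([Q, rng.randint(0, 9)]) for _ in range(n)]
            for _ in range(n)]


def m_step_check(n=4, m=5, seed=0):
    rng = random.Random(seed)
    lams = [random_transition_matrix(n, rng) for _ in range(m)]  # Lambda_k ...
    gamma0 = [rng.randint(0, 9) for _ in range(n)]

    # Conventional VA: one ACS update per time instant (rate 1/T).
    gamma = gamma0
    for lam in lams:
        gamma = mp_matvec(lam, gamma)

    # Look-ahead: form the M-step transition matrix outside the loop,
    # then apply it once (rate 1/(MT)).
    m_step = lams[0]
    for lam in lams[1:]:
        m_step = mp_matmul(lam, m_step)  # later matrices multiply from the left
    return gamma, mp_matvec(m_step, gamma0)


step_by_step, look_ahead = m_step_check()
print(step_by_step == look_ahead)
```

The agreement of both results relies only on the associativity of the semiring matrix product, which is what allows the matrix products to be moved out of the feedback loop.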
C. The A-ACSU and the M-Step ACSU
Since we now have a matrix-vector notation of the M-step decoding principle, we can derive optimized systolic architectures. There are two possibilities to realize the M-step ACS recursion, as known from conventional linear recurrences [33]: either by look-ahead computation via pipeline interleaving or by block processing [10]. If look-ahead computation via pipeline interleaving is chosen, the achievable speed is limited by the maximum clock rate that the technology (CMOS, ECL, etc.) allows. In the case of block processing, blocks of successive input data are processed in parallel. If the delays of interconnections are not taken into account, an arbitrary speedup of the recurrence is possible. Therefore, we concentrate here on the discussion of block processing.
Below, we refer to the cycle time of the feedback loop of the M-step ACSU as $\tau$. The M-step ACSU does not operate with the high rate 1/T but with the slow rate $1/(MT) = 1/\tau$. At each new cycle of the M-step ACSU, a new M-step transition matrix, which is the M-fold product of the transition matrices, has to be fed in from the A-ACSU. Fig. 5(a) shows the simple flow diagram of the operations being carried out in the ACSU's, with a pipeline stage between the matrix multipliers. As can easily be seen, for a given latency $\tau$ of a matrix multiplier, a number of $M = \tau/T$ of such multipliers has to be implemented. This shows the linear dependency between the throughput 1/T and the hardware complexity which goes with M (linear scale solution).
The M-fold matrix multiplication carried out in the A-ACSU does not have to be implemented as a pipeline structure. The alternative, however, requires a large amount of interconnection wiring between the multipliers, whereas the pipeline solution is highly regular. In the following, we will concentrate on the discussion of the systolic pipelined form as shown in Fig. 5(a). However, we want to stress the fact that, due to semiring algebra, all kinds of different architectures can be applied [10], [19], [21] as they are known for linear recurrences (see, e.g., [22]). The problem of finding the appropriate architecture for the matrix multipliers will be discussed in Sections IV and V.
D. Block Processing Architectures
After discussing realization aspects of the A-ACSU and M-step ACSU, it is also necessary to show that the TMU and SMU of the M-step VP can be implemented efficiently.
As can be seen in Fig. 5(a), at any time M transition matrices have to be available and fed into the A-ACSU in parallel. This can be achieved by implementing M TMU's in parallel, one for each $\Lambda$ input of the A-ACSU.
If the M-step ACSU calculates the value of $\Gamma_k$ only for every k = nM time instant, i.e., it carries out a coarse grain decoding of the trellis, the additional information of the fine grain decoding being performed by the A-ACSU has to be processed additionally in the SMU [13]. However, since each matrix multiplication of the A-ACSU yields $N^2$ elements of a new matrix and, thus, also the results of $N^2$ max operations, this leads to a severe problem of buffering, wiring, and processing in the SMU. Therefore, we propose to buffer the transition matrices and to process them a second time in the "survivor memory unit ACSU" (SMU-ACSU) to obtain the path metrics $\Gamma_k$ of all time instants, which is a method known from linear algebraic systems [22] (see Fig. 6). For easier understanding, the method of calculating the "missing" $\Gamma_k$ is illustrated in Fig. 6, without pipelining.
However, the processing elements are placed in a way which indicates the schedule of processing with pipelining. Each M-step, which is fed in on the left-hand side as one vertical block, is led through the architecture in a horizontal direction while being processed. This way of illustration shows very clearly which amount and shape of buffering/skewing is needed.
Since all $\Gamma_k$ are calculated step-by-step in the SMU-ACSU, all decisions are also made as in a conventional VP. Therefore, it is now easy to understand that the trace-back SMU [23], also called a pointer-organized SMU [27], realized in conventional VP's, can also be implemented for this blockwise decoding scheme of the M-step VP.
The above described method, referred to as method A, seems to have two negative features. It requires a long global wiring connection to feed the M-step path metrics $\Gamma_{k=nM}$ into the SMU-ACSU, and it requires $O(M^2)$ memory between the A-ACSU and the SMU-ACSU to buffer the transition matrices. However, since the buffering can be implemented by first-in first-out (FIFO) addressed RAM's which can be implemented very efficiently, in most cases, the $O(M^2)$ dependency of the buffering will not lead to realization problems. The long connection wire of $\Gamma_{k=nM}$ can be avoided by making use of the algebraic identity of the transposed matrix multiplication, i.e., for two matrices X and Y,
$$(X \otimes Y)^T = Y^T \otimes X^T. \qquad (18)$$
By using this identity, method A can be transformed to the architecture shown in Fig. 7, referred to as method B, which does not require the long wiring connection for $\Gamma_{k=nM}$. (Note that the matrix transposition identity used in (18) requires, in addition to the axioms of the semiring, that $\otimes$ is commutative.)
Rather than trying to avoid the necessity of a long wiring connection, this problem can be solved by substantially reducing the amount of wiring through the implementation of an SMU for the M-step ACSU, as if it were a VP on its own. This M-step SMU allows the unique indication of the state sequence at all time instants k = nM with a decoding delay D, the survivor depth of the M-step VP (D, of course, in this case has to be a multiple of M). Then, instead of feeding all path metrics $\Gamma_{k=nM} = (\gamma_{1,k}, \ldots, \gamma_{N,k})^T$ to the SMU-ACSU, it is necessary only to transmit a pointer which indicates the decoded state $s_d$ of time instant nM. This is referred to as method C (Fig. 8). The pointer is used to construct the input path metric for the SMU-ACSU having "zero" (Q) entries for all states $s_i \neq s_d$ and 1 for state $s_i = s_d$. In addition, if the SMU is implemented with the trace-back method [23], this pointer can be used to indicate the beginning state of tracing back the decoded path, as shown in Fig. 8.
Having shown how communication can be minimized for the long wiring connection, it is also important to mention that the communication wiring between the A-ACSU and SMU-ACSU can be drastically reduced. In many applications, the input to the TMU is one sample, i.e., a relatively short word, say G bits. The TMU then calculates the $N^2$ elements of the transition matrix from these G bits which, in most cases, amounts to a relatively large number of H bits. In these cases it is, therefore, more efficient not to buffer the transition matrices but the input symbols and then to implement an additional TMU for the SMU-ACSU (see Fig. 8). Even though the area consumption for the TMU is doubled, in cases with large N, M, and large H/G, this cuts down the implementation area significantly, since the size of the buffers (RAM's) grows as $O(M^2)$.
E. Implementation Aspects
Factors determining whether a single-chip implementation is possible include parameters such as the number of states N, the block length M, the number of bits, and also the technology chosen (e.g., semicustom/full-custom, CMOS/BiCMOS, ...). These factors also have to be considered to decide which of the three discussed methods is best. However, there will also be problems where M, N, etc., are so large that a complete M-step VP does not fit onto one chip. In this case, a partitioning of the M-step VP must be found such that the number of different modules is small (to cut down design cost) and the amount of interconnections is not prohibitively large. The architectures discussed in Section III-D were especially drawn so that a regularity can be recognized in a vertical direction. In other words, we can design a horizontal "VP slice," as shown in Fig. 9, and stack M such VP slices vertically over each other and add the M-step ACSU to obtain a complete M-step VP. Note that the VP slice shown in Fig. 9, for method A or C, can simply be reconfigured to obtain a VP slice for method B. It might be possible to implement not only one, but two or more, VP slices on one chip or, if necessary, also only part of one VP slice (e.g., 1/2, ..., 1/N). This depends not only on the area consumption but also on the pin count, which can become extremely large for large N and for a large number of bits of the arithmetic. For large N and a trellis which is a shuffle exchange graph [25], it can be advantageous to exploit the special topology of the trellis [26] to partition it, as has also been proposed for a register exchange implementation of the SMU of a conventional VP. Since this still results in a large communication network, we instead propose to implement the VP slice for shuffle-exchange trellises as presented in Section IV-C.
IV. SYSTOLIC ARRAY ARCHITECTURE
In the preceding section, the M-step VP was derived based on semiring algebra and its implementation was described in terms of general, high-level functions. Since the architectures presented are based on matrix operations, they can be readily described at the word-level operations involving the matrix elements, providing greater detail.
For most applications, the TMU of a VP is a transversal filter, which can have adaptive coefficients (e.g., when using the VA for equalization of intersymbol interference). Much research has been done to find optimum architectures for this type of filter, which has led to a very efficient bit-level systolic array solution [27] .
Another part of the M-step VP, the pointer-organized SMU, does not differ very much from an implementation for a conventional VP, since it is always preferably realized in a blockwise operating mode [24] . The open question now is to find an appropriate architecture for the matrix multipliers.
A. Matrix Multipliers
As is commonly known, a matrix multiplier is well suited for VLSI if it is implemented as a systolic array, either of a hexagonal or square type [28], [29]. For the present application, the hexagonal type has three major drawbacks. First, the chaining of more than one array to the systolic pipeline structure of the A-ACSU [Fig. 5(a)] cannot be realized without prohibitive detour wiring to feed in the transition matrices, due to the placement of the I/O ports. Second, the latency of a hexagonal array is three times as large as that of a square array. Third, the hexagonal array requires more processing elements to be implemented [30] because the state diagram of the codes to be decoded is typically a shuffle exchange graph [25], which leads to an N × N matrix of large bandwidth N + 1. Therefore, the suggested realization, which is shown in Fig. 10 for the case N = 2, is a square-type systolic array. Each of the $N^2$ elements of the resulting matrix is the (algebraic) inner product of two vectors of the two input matrices and is accumulated locally in one of the $N^2$ processing elements (PE's) of the array. Thus, after completion of the computation of one matrix element, it has to be fed out, requiring the implementation of N output buses as shown in Fig. 10. This allows the realization of the systolic pipeline structure of the A-ACSU simply by abutment of M - 1 square matrix-multiplication arrays.
B. Twin-Array A-ACSU
In the same way as the upper matrix is distributed locally in the array and the output matrix is fed out with the help of buses, a second square-type array can be designed where now the matrix is fed in by buses and the output matrix is accumulated and fed out with local communication, illustrated for N = 2 in Fig. 11. In this type of array, the (i, j)th PE receives only the (i, j)th element of the upper input matrix and computes all combinations of this element with the other input matrix that are necessary for the matrix multiplication. Of course, this type of array can also be used to implement the A-ACSU. Since both types have the same size and an equivalent communication network, it makes little practical difference if one or the other is chosen. Both are characterized by the fact that they not only have local communication, but also buses, and there seems to be no way to eliminate the need for buses in square-type arrays. However, the A-ACSU performs not only a single, but an M-fold matrix multiplication, allowing elimination of the buses as follows. The first multiplier of the A-ACSU is implemented with output buses, as shown in Fig. 10, and the second one with input buses (Fig. 11). As can be seen in Fig. 11, the array with input buses does not distribute the elements of the upper input matrix over the array, but each PE gets its matrix element from the bus only once during one matrix multiplication and stores it locally. So, the first array with output buses accumulates the elements of the output matrix locally while the second array processes the upper matrix locally. A close analysis even shows that the (i, j)th PE of the first array accumulates and emits exactly that matrix element of the output matrix which is then processed in the (i, j)th PE of the second array.
This can be exploited to interleave the first and second array to a twin array, which has only local communication and no global buses anymore, as illustrated in Fig. 12 . Note that this array was found ad hoc. Due to the irregularity in combining both types of square arrays to the twin array, it cannot yet be derived systematically with approaches described, e.g., in [31], [32].
C. Code-Optimized A-ACSU
As was mentioned above, the codes to be decoded almost always have a shuffle exchange structure. The number of states that can be reached by one transition is a constant number b for every state. Hence, the N × N transition matrices only have Nb nonzero elements, which can lead to a weak utilization of the matrix multiplication arrays.
The code given in Fig. 1 is a simple example having a shuffle exchange graph with N = 4 and b = 2, which already leads to 50% "zero" (Q) entries in $\Lambda_k$ (10). However, already the product of only two matrices has no zero entries. In general, as can easily be proven, the number of nonzero entries of the product of L matrices of a shuffle exchange code equals $\min(b^L N, N^2)$. Since the systolic pipeline structure of the A-ACSU computes an M-fold matrix multiplication, it is important to examine the product of a full matrix with the matrix of a shuffle exchange trellis. This is illustrated for the trellis of Fig. 1 in Fig. 13. It can be noticed that all PE's of the multiplication array are active only every second time instant. Therefore, the PE's accumulating new matrix elements, which generally would be implemented as shown in Fig. 10, can also be implemented with two time delays (buffers), as shown in Fig. 14. By using these double-latched PE's, the calculation of the bottom and top half of the product matrix can be carried out pipeline-interleaved [33] in one array of size N × b. As can be seen by the example shown in Fig. 14, this solution is easily generalized and can be used for all trellises with a shuffle exchange graph (e.g., for all linear convolutional codes). It can also be shown that it is possible to implement a code-optimized twin array. However, a simple code-optimized array, since it does not have long buses, in most cases will be more favorable for VLSI than the more complex code-optimized twin array. Note that, since we only make use of semiring algebra and the systematic structure of shuffle exchange matrices, the code-optimized arrays can also be applied to other problems based on the same assumptions, e.g., FFT or sorting.
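The fill-in of the matrix product stated above can be checked with a short sketch. It assumes a shift-register style shuffle exchange connectivity in which state $s_j$ has transitions to the states with index $(bj + u) \bmod N$, u = 0, ..., b - 1, with N a power of b; this particular indexing, the all-zero branch metrics, and the function names are assumptions of the example, not taken from the paper.

```python
Q = float("-inf")  # "zero" element marking absent transitions (assumed encoding)


def shuffle_exchange_matrix(n, b):
    """Transition matrix with branches s_j -> s_i, i = (b*j + u) mod n."""
    lam = [[Q] * n for _ in range(n)]
    for j in range(n):
        for u in range(b):
            lam[(b * j + u) % n][j] = 0  # arbitrary finite branch metric
    return lam


def mp_matmul(a, b):
    """Matrix-matrix product with (+) = max and (x) = add."""
    n = len(a)
    return [[max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_nonzero(mat):
    """Number of entries different from the semiring zero Q."""
    return sum(x != Q for row in mat for x in row)


def nonzero_counts(n, b, max_l):
    """Non-Q entries of the product of L = 1 .. max_l transition matrices."""
    lam = shuffle_exchange_matrix(n, b)
    prod = lam
    counts = [count_nonzero(prod)]
    for _ in range(max_l - 1):
        prod = mp_matmul(lam, prod)
        counts.append(count_nonzero(prod))
    return counts


print(nonzero_counts(4, 2, 3))
print([min(2 ** l * 4, 4 * 4) for l in range(1, 4)])
```

For N = 4 and b = 2 the single matrix is half filled and the product of two matrices is already full, in agreement with the statement in the text.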
D. The M-Step ACSU and the SMU-ACSU
As was discussed in the previous subsections, the A-ACSU can be implemented by a systolic array. Then, of course, the matrix-vector product (13), which is carried out in the M-step ACSU, can also be easily implemented by a systolic array which basically comprises one row of the multiplication arrays of the A-ACSU, feeding its results back to itself [28]-[30]. This was shown for a conventional VD in [11], [12]. Instead of implementing one row of the matrix-multiplication array, as is done for the M-step ACSU, the matrix-vector products of the SMU-ACSU can be implemented with one column of the array. Thus, it is possible to realize all units that perform the add-compare-select operations with systolic arrays that comprise the same basic PE's.
E. Complexity Considerations
The complexity of VLSI architectures is measured by the (area) × (throughput cycle time) product, the AT measure [37]. (It is a measure of area per processing power.) Each matrix-matrix multiplication in the square systolic array requires the computation of $N^2$ elements by the $N^2$ PE's, which carry out an inner product of complexity N (N clock cycles of each PE). Hence, the AT measure concerning the processing hardware of the square systolic arrays is $O(N^3)$. Due to the systolic structure, the complexity of communication wiring is only $O(N^2)$. Since ACS hardware is the dominant part of an M-step VP implementation and the M-step VP is a linear scale solution (excluding buffering), these measures also hold for the complexity of the whole M-step VP.
By the use of code-optimized arrays, the complexity of processing hardware can be reduced substantially to $O(bN^2)$ and that of wiring to $O(bN)$.
The complexity measures are compared, in Table I, to the complexity of a conventional high-speed VP architecture, where one ACS processor is implemented per state (see, e.g., [38]). The comparison is given in Table I for two cases: once for a fully connected trellis, where each state is connected to each state in the trellis by a single transition; and once for a shuffle exchange trellis. As can be seen, the complexity of the processing hardware of an M-step VP increases by a factor N over the conventional VP, whereas the complexity of wiring is divided by N.
The granularity of the architecture presented in the preceding section is at the word level, i.e., each processing element (PE) of the systolic array carries out word-level operations (see Fig. 10). To increase the efficiency of the hardware, i.e., to minimize the complexity measured by the area-time (AT) product, it would be desirable to pipeline the word-level operations such that the clock cycle time is reduced significantly without an excessive increase in area. The best way of doing so and increasing the regularity of the architecture at the same time is by pipelining the word-level operations at the bit level, i.e., to pipeline each PE at the bit level. The result of this additional pipelining is shown in Fig. 15(a) for the example of w = 3 bit carry-ripple arithmetic. The addition leads to a carry ripple from the LSB (least significant bit) to the MSB (most significant bit), and the maximum selection operating at bit level can only be implemented with a flag ripple from MSB to LSB. This is due to the fact that, by examining only the MSB's of two numbers, one can uniquely determine the MSB of their maximum, while the LSB of their maximum cannot be obtained by examining only their LSB's. Inevitably, this leads to an amount of skewing latches that depends on the number of bits by $O(w^2)$ and to a latency varying as $O(w)$. The dependency by $O(w^2)$ can be eliminated by the use of carry-look-ahead logic for the adder and flag-look-ahead logic for the maximum selector, as indicated in Fig. 15(b). This, however, requires the implementation of additional logic of complexity $O(w \log w)$ and has latency $O(\log w)$. Since the skewing latches can only be avoided by eliminating their cause, either the max-ripple or the add-ripple has to be deleted.
Because the maximum selection is a nonlinear operation, it is certainly difficult to find a solution without ripple. However, this is not the case for the addition. 
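The MSB-to-LSB flag ripple of a bit-level maximum selection for ordinary (nonredundant) binary numbers can be sketched as follows. This is a behavioral illustration only, not a description of the circuit in Fig. 15; the three-valued flag and the function names are assumptions of this example.

```python
def to_bits(value, width):
    """Unsigned binary representation, MSB first."""
    return [(value >> i) & 1 for i in reversed(range(width))]


def from_bits(bits):
    """Value of an MSB-first bit list."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def max_msb_first(a_bits, b_bits):
    """Bit-level maximum selection with a flag rippling from MSB to LSB.

    flag = 0: undecided so far, +1: A is larger, -1: B is larger.
    Each output bit depends only on the current bit pair and the flag
    received from the next higher bit level.
    """
    flag = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        if flag == 0 and a != b:
            flag = 1 if a > b else -1
        out.append(a if flag >= 0 else b)
    return out


a, b, w = 5, 6, 3
print(from_bits(max_msb_first(to_bits(a, w), to_bits(b, w))))
```

Each output bit is available as soon as the flag from the higher bit level has arrived, which is why the maximum selection ripples in the direction opposite to the carry of the addition.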
A. Carry-Save Arithmetic
The addition of two binary numbers, with the help of 1-bit carry-ripple full adders, is shown in Fig. 16(a), which leads to the sum $S = \sum_i s_i 2^i$. Instead of leading each carry bit $c_i$ to the next full adder, it also can be combined with the sum bit $s_i$ to a new digit $v_i := s_i + c_i$, as shown in Fig. 16(b). Then, the sum S equals $S = \sum_i (s_i + c_i) 2^i = \sum_i v_i 2^i$, where $v_i \in \{0, 1, 2\}$.
The advantage of this carry-save (CS) addition is that it eliminates the carry ripple, although it leads to a redundant number representation due to the fact that the binary number has ternary weights $v_i$. Hence, the CS addition is a solution to eliminating the skewing latches in the PE if it is possible to derive a CS maximum selector. This certainly is no trivial task, since one given value can be represented by many different CS numbers, because of the ternary weights (digits) $v_i$.
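The carry-save idea can be illustrated with a small sketch that adds two ordinary binary numbers without propagating carries: the sum bit of each position and the carry arriving from the next lower position are merged into one digit in {0, 1, 2}. The function names and the LSB-first digit order are choices made for this example; the sketch models the arithmetic only, not the circuit of Fig. 16(b).

```python
def cs_add(a, b, width):
    """Carry-save sum of two unsigned numbers, digits LSB first.

    s_i is the carry-free sum bit of position i, and c_i is the carry
    generated at position i - 1, so no carry ever ripples.
    """
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    s = [x ^ y for x, y in zip(a_bits, b_bits)] + [0]
    c = [0] + [x & y for x, y in zip(a_bits, b_bits)]
    return [si + ci for si, ci in zip(s, c)]  # digits v_i in {0, 1, 2}


def cs_value(digits):
    """Value of a carry-save number given LSB first."""
    return sum(d << i for i, d in enumerate(digits))


digits = cs_add(3, 1, 2)
print(digits, cs_value(digits))

# Redundancy: different digit vectors can represent the same value.
print(cs_value([2, 0]), cs_value([0, 1]))
```

The last line shows the redundancy mentioned in the text: the digit vectors (2, 0) and (0, 1), LSB first, both represent the value 2, which is what makes a digit-by-digit comparison of CS numbers nontrivial.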
B. The Carry-Save Maximum Selection
The problem now to be examined is to find the maximum $G$ of two CS numbers $A$ and $B$. In the following, the pair of digits $a_i$, $b_i$ will be written as $(a, b)_i$, and the maximum and minimum possible values of $A$, $B$ as $A_{\max}$, $A_{\min}$ and $B_{\max}$, $B_{\min}$, respectively. As can easily be seen, a bit-level CS-maximum selection can only operate with a flag ripple from the MSB to the LSB. Therefore, we have to begin the examination with the MSB and start with Case 1. This shows that, for $|a_{w-1} - b_{w-1}| = 2$, the maximization can be made despite the redundancy of CS numbers.

Case 2, $a_{w-1} = b_{w-1}$: Since this case is trivial, the only case left to be examined is the following.
Case 3, $(1, 0)_{w-1}$, which is equivalent to $(2, 1)_{w-1}$:
Here, no final decision can be made on this bit level; we will refer to it as a "predecision for A." As will be shown below, even though B might be the maximum, no error of the value of the output $G$ is made by giving out $g_{w-1} := a_{w-1}$. This is important, since it allows the implementation of a CS maximizer with one flag ripple from MSB to LSB and no feedback from the lower to higher bit-levels.
For further examination, assume the predecision for A, $(1, 0)_{w-1}$, has been made on bit-level $w-1$ and is signaled to bit-level $w-2$, where we find $(0, 0)_{w-2}$. However, given the following: As can be seen, this is exactly analogous to the case $(1, 0)_{w-1}$ shifted by one bit-level. Therefore, the same examination has to be carried out at bit-level $w-3$, etc.
The last case which remains to be examined is the following.
Case 3.3, $(1, 0)_{w-1}(0, 2)_{w-2}$, which is equivalent to $(2, 1)_{w-1}(0, 2)_{w-2}$: In both cases, we obtain that the weighted sums of the examined digits of A and B are equal, i.e., $a_{w-1} 2^{w-1} = b_{w-1} 2^{w-1} + b_{w-2} 2^{w-2}$.
Thus, no final decision can be made. Because of the equality of the two numbers, A and B, at the two most significant bit levels, the predecision for A is removed and the decision procedure starts anew on bit-level $w-3$. This procedure is repeated until a decision at a lower bit-level is possible.
Note that, as can be seen from this analysis, a predecision for A at a lower bit-level can only be turned into a final decision (Case 3.1), be sent downstream (Case 3.2), or be removed (Case 3.3). It never occurs that a predecision for A is directly converted to a predecision or even a decision for B. If the predecision is removed on bit-level $j$, this means that the weighted sums of the digits of A and B examined down to that level are equal.
Therefore, if a predecision for A occurs, no error has been made if $g_i := a_i$ was given out at higher bit-levels ($i \ge j$), which proves that a bit-level maximum selection can be realized for CS numbers, referred to as CSM. Note that it may even happen that, during one CS-maximum selection, one or more predecisions for A will alternate with predecisions for B. The decision finding can be visualized in the state transition diagram given in Fig. 17, which shows the five existing states (cases) that can be transmitted from a higher to a lower bit-level and the state transitions that can take place on the lower bit-level.
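The decision procedure described above can be sketched in a few lines of Python. This is a behavioral model under stated assumptions, not the 30-gate design of the paper: the five states are encoded here as a clamped difference flag `d` (+2/-2 final decision for A/B, +1/-1 predecision for A/B, 0 undecided), which is an illustrative choice and not necessarily the encoding of Fig. 17. The names `cs_max`, `cs_value`, and `check_all` are hypothetical.

```python
from itertools import product


def cs_max(a, b):
    """MSB-first maximum selection of two carry-save numbers.

    a, b: digit sequences, MSB first, digits in {0, 1, 2}.
    Returns the digit list g (MSB first) of a CS number equal to max(A, B).
    """
    assert len(a) == len(b)
    d = 0  # flag rippling from MSB to LSB, there is no feedback upward
    g = []
    for ai, bi in zip(a, b):
        # Select the output digit using the flag arriving from the higher level.
        if d > 0:
            g.append(ai)           # (pre)decision for A: give out a_i
        elif d < 0:
            g.append(bi)           # (pre)decision for B: give out b_i
        else:
            g.append(max(ai, bi))  # undecided: the larger digit wins
        # The old difference doubles in weight and the new digits are added.
        # Clamping to +-2 is safe because a difference of 2 can no longer
        # be compensated by the lower digits.
        d = max(-2, min(2, 2 * d + ai - bi))
    return g


def cs_value(digits):
    """Value of a CS number given MSB first."""
    v = 0
    for x in digits:
        v = 2 * v + x
    return v


def check_all(w):
    """Exhaustively compare cs_max with the ordinary maximum for w digits."""
    nums = list(product((0, 1, 2), repeat=w))
    return all(cs_value(cs_max(a, b)) == max(cs_value(a), cs_value(b))
               for a in nums for b in nums)
```

As a usage example, `cs_max([1, 0, 0], [0, 2, 1])` first makes a predecision for A, removes it on the second level because the prefixes are equal, and then decides for B on the last level; the result `[1, 0, 1]` has the value 5, the maximum of A = 4 and B = 5.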
We were able to carry out the optimized design of a CSM with only 30 gates, which shows that its implementation requires very little hardware. As can easily be proved, a CSM can also be realized for 2's complement CS numbers. Therefore, in addition to the minimum selection, the absolute value of a number, $|A| = \max(-A, A)$, can also be implemented with CS arithmetic, which might be of interest for a wide field of applications [34].
C. The Carry-Save Processing Element
Due to the fact that CS arithmetic can be applied for the ACS operations "add" and "max," the PE of the array can be realized as a systolic array as given in Fig. 18.² As is shown, it comprises almost identical n bit slices (ACS-BS's) and thus leads to a high regularity of the ACS hardware of an M-step VP. Since the output of the PE is skewed exactly as is needed as input by the next PE, no skewing latches have to be implemented. The inputs of the PE, which are the elements of the transition matrices $A_k$, are computed in the TMU. As has been stated above, the TMU can easily be built as an (adaptive) transversal filter, which is optimally implemented as a bit-level systolic array with CS arithmetic [27]. We, therefore, assume that the input elements of the A-ACSU are given as CS numbers and are skewed appropriately.

²Note that CS arithmetic, applied to conventional VP's, also leads to a new interesting architecture, which can be pipelined on the bit level [34].
D. Bit-Level Normalization Concept
The continuous accumulation of transition metrics as path metrics leads to the problem of normalizing the path metrics.

1) Normalization in the M-Step ACSU: After showing the possibility of implementing a bit-level systolic PE, the question still remains as to how normalization in the bit-level systolic M-step ACSU can be realized. Since the pipelining which was applied to obtain the systolic PE assumes only bit-level communication, normalization also has to be implemented with bit-level communication to allow the use of a pipelined PE in the M-step ACSU. It is known [35] that the maximum difference between any two path metrics, $\Delta\gamma_{\max}$, is upper bounded for conventional VP's by
where K is the constraint length of the trellis (code).
The idea is to implement a normalization unit limited to the MSB level that operates as follows. If the carry-save MSB (CS-MSB) equals 2 for any path metric, then, by the argument below, all other path metrics have a CS-MSB which at least equals 1. Thus, we can normalize by subtracting 1 from all CS-MSB's until, at a later time instant, by further accumulation, a CS-MSB takes on the value 2 and the normalization has to be repeated. Now, we only have to ensure that the smallest path metric ($E_{\min}$) with CS-MSB equal to 2 that can initialize the normalization is at least $\Delta\gamma_{\max} + 1$ greater than the maximum metric ($F_{\max}$) with CS-MSB equal to 0, so that $E_{\min}$ and $F_{\max}$ can never occur simultaneously. The number $h$ of additional bit levels that this requires depends on K, the constraint length of the code. Since the M-step trellis, in most cases, is fully connected (i.e., each state is connected to all other states by one M-step transition), we can generally assume K = 2 for the M-step trellis. This yields that only h = 3 additional bit levels are needed in the M-step ACSU compared to the M-step transition metrics computed in the A-ACSU.
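The MSB-level normalization rule can be written down as a short behavioral sketch in Python. It assumes that each path metric is stored as a list of CS digits with the MSB first and that the bound on the path-metric difference holds, so that every CS-MSB is at least 1 whenever one of them equals 2; the function names `normalize_msb` and `cs_value` are hypothetical.

```python
def cs_value(digits):
    """Value of a CS number given MSB first, digits in {0, 1, 2}."""
    v = 0
    for x in digits:
        v = 2 * v + x
    return v


def normalize_msb(metrics):
    """Subtract 1 from every CS-MSB as soon as any CS-MSB equals 2.

    metrics: list of digit lists (MSB first). Only the MSB digits are
    inspected and changed, so the rule needs no word-level communication.
    """
    if any(m[0] == 2 for m in metrics):
        # Assumed to be guaranteed by the bound on the metric difference.
        assert all(m[0] >= 1 for m in metrics)
        return [[m[0] - 1] + list(m[1:]) for m in metrics]
    return [list(m) for m in metrics]
```

Since the same amount is subtracted from all path metrics, their differences, and hence all later decisions, are unchanged. For example, the metrics `[2, 0, 1]` and `[1, 2, 0]` (values 9 and 8) become `[1, 0, 1]` and `[0, 2, 0]` (values 5 and 4).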
2) Normalization in the A-ACSU: Now we want to address the question of whether the intermediate results during the M-fold matrix multiplication in the A-ACSU can be normalized. This would keep the number of bit levels constant and would avoid an increase by $\log_2 M$ bits due to the accumulation of M numbers.
By using the same arguments that hold for the derivation of (19) [35], we can also find the maximum difference between any two elements of any M-fold product of transition matrices [by analysis of the set of N-rooted one-step trellises, see Fig. 4]. The only difference is that $K - 1$ has to be replaced by $2(K - 1)$ where, in this case, K is the constraint length of the original trellis. Hence, we can also implement a normalization in the pipeline [Fig. 5(a)] and tree structure [Fig. 5(b)] of matrix multipliers, leading to the result that the number of bit levels required is not a function of M but is a fixed number which is a function of the constraint length K of the original trellis. Assuming that the maximum value of one element of the transition matrix $A_k$ has G bits, the number H of additional bit-levels required in the A-ACSU is given by
With the help of the bit-level systolic PE presented here, it is now possible to realize the complete M-step VP as a bit-level systolic array. Besides the advantage of easy pipelining, the use of CS arithmetic also has the major advantage of an easy-to-implement normalization of the path metrics of the M-step ACSU and A-ACSU.
E. Complexity Considerations
To examine the real advantages of the CS arithmetic presented here, we need to compare the complexity of the different alternatives by computing their AT measure. The three possibilities are carry-ripple, carry-look-ahead, and carry-save arithmetic. In all cases, we assume pipelining such that the clock frequency is independent of the word-length w, i.e., $O(1)$.
The area consumption of the carry-ripple realization [Fig. 15(a)] is $O(w^2)$, due to buffering (skewing triangles); thus, the AT measure is $(AT)_{\text{carry-ripple}} = O(w^2)$. Since the CS arithmetic does not require skewing triangles between concatenated PE's, its complexity is $(AT)_{\text{carry-save}} = O(w)$ (25) and its latency is $O(1)$.
Even if we take into account that one CS adder and maximum selector needs about twice the logic of a conventional binary implementation, this shows that, for a moderate word-length w , CS arithmetic is the most efficient choice. When implementing a PE with CS arithmetic, we further have the advantage that it is highly regular, even at the bit level.
VI. IMPLEMENTATION RESULTS
As was shown in Section V, each PE can be implemented by a bit-level systolic array. Since all ACS bit-slices (ACS-BS's) of the PE are almost equal (see Fig. 18), only a few ACS-BS's have to be designed which then, simply by abutment, form a PE. Now the question that arises is how to implement the systolic PE in the two-dimensional matrix-multiplication array, which is embedded in the one-dimensional systolic pipeline structure [Fig. 5(a)]. The PE can either be implemented by horizontal (shown in Fig. 18) or vertical abutment of its ACS-BS's (90° rotation of the PE of Fig. 18). Since the rectangular shape of the A-ACSU is given by $N \times MN$ PE's, we chose to implement the PE vertically which, for a square-shaped ACS-BS, leads to a more square-shaped A-ACSU array of $wN \times MN$ ACS-BS's (w: number of bit levels). Furthermore, this solution requires only a horizontal wiring channel for the input matrix, while the additional vertical output can be implemented by over-the-cell wiring of the ACS-BS's. First design results of a section of the array are shown in Fig. 19. This is a study for a code-optimized VP slice, which we are currently designing for an N = 4 state convolutional code.
We carried out the complete design of the chip with 2 μm CMOS standard cells (ES2) that were placed and routed with the flattened netlist (CADENCE). As an architecture, we chose method C (Fig. 8) with code-optimized arrays and placed four VP slices, plus the M-step ACSU and M-step SMU, on a 120 mm² chip (see Fig. 20). Due to the bit-level systolic array concept with bit-local communication, the channels of the standard cell block (8000 cells, 100 mm²) consume only the same area as the standard cells. The buffering is implemented with 15 RAM's, which consume less than 10% of the total area. The chip is laid out for M = 16, which is achieved by cascading four chips. We implemented the PE's with CS arithmetic on every second bit level, as proposed in [34]. Furthermore, we incorporated the algorithmic-level modifications proposed in [21], which allow us to further cascade M-step VP's (four chips) for very high data rates. With a clock frequency of 50 MHz (simulation), each chip achieves a decoding rate of 50 MHz (four VP slices on one chip, but N = 4 clock cycles per matrix multiplication). Thus, e.g., a 1 Gb/s (GHz) decoding rate could be achieved by cascading 20 chips. Assuming 1 μm CMOS technology, dynamic latches, and a full custom design of the CSM, this 1 Gb/s decoder could be implemented on a single 200 mm² chip, which shows that the concepts shown are realizable today.
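The throughput figures quoted above can be checked with a small back-of-the-envelope calculation in Python. The formula below is only one reading of the parenthetical remark (clock rate times the number of VP slices, divided by the N clock cycles per matrix multiplication), combined with the linear scaling by cascading; the function names `chip_rate_mhz` and `chips_needed` are hypothetical.

```python
import math


def chip_rate_mhz(f_clk_mhz, slices, cycles_per_product):
    """Decoding rate of one chip, assuming the slices work in parallel and
    each needs `cycles_per_product` clock cycles per matrix multiplication."""
    return f_clk_mhz * slices / cycles_per_product


def chips_needed(target_mhz, rate_per_chip_mhz):
    """Number of cascaded chips, assuming the rate scales linearly."""
    return math.ceil(target_mhz / rate_per_chip_mhz)


rate = chip_rate_mhz(f_clk_mhz=50, slices=4, cycles_per_product=4)
chips = chips_needed(target_mhz=1000, rate_per_chip_mhz=rate)
```

With the numbers given in the text (50 MHz clock, four VP slices, N = 4), this reproduces the 50 MHz rate per chip and the 20 chips for a 1 Gb/s decoder.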
VII. CONCLUSIONS

By identifying that the two operations of the ACS recursion form an algebraic structure called a semiring, we were able to write the Viterbi algorithm as a linear vector recursion. Thus, all kinds of different architectures known for linear recurrences can be applied to the VA. Since the high-speed VP presented here basically processes matrices, it can be implemented by systolic array structures known for the matrix-multiplication problem. Furthermore, by proving that CS arithmetic can be applied, three levels of hierarchy are obtained. A bit-level systolic array is embedded in a word-level systolic array, which is one element of a systolic pipeline structure (or of other architectures known for linear recurrences). Since the conventional two ring operations of multiplication and addition are replaced by the semiring operations "addition" and "maximum selection," the conventional problem of linear bit accumulation with M in an M-fold concatenation of matrix multipliers is replaced by a fixed upper bound which is independent of M. Therefore, the presented architecture, characterized by fine granularity, high regularity, and local communication, comes very close to meeting all four VLSI constraints and is well-suited for VLSI implementation. It can be linearly scaled (by cascading M - 1 arrays) to achieve very high throughput rates through the implementation of parallel hardware. Neither the communication network nor the amount of active processing hardware breaks the linearity of the linear scale solution. Only the RAM buffers grow quadratically with the throughput rate. These combined features, specifically the linear scale solution, cannot be achieved by the conventional implementations of VP's summarized in [36]. The complexity of the processing hardware is larger by a factor of N than that of conventional VP's, but the communication network (wiring) is smaller by a factor of N.
Since the presented M-step solution for the Viterbi algorithm is based only on semiring algebra, it can be generally applied to achieve high-speed parallel processing architectures for other applications, which can also be described by semiring algebra [10].
APPENDIX
The operation $\otimes$ (arithmetic addition) forms a group, since the inverse element exists (arithmetic subtraction), which leads to the result that the calculation of fractions can be carried out in $(R, \otimes, \oplus)$. The algebraic division, thus, is equivalent to the arithmetic subtraction. Given an N = 2 state ACS recursion, where $\delta_{1,k} := \lambda_{11,k}/\lambda_{12,k}$, $\delta_{2,k} := \lambda_{21,k}/\lambda_{12,k}$, and $\delta_{3,k} := \lambda_{22,k}/\lambda_{12,k}$.
Thus, the VA can be algebraically transformed to a recursion of differences (A3) (for N > 2, to a set of recursions of differences), resulting in fewer operations per recursion step and an inherent normalization. This was shown for N = 2 states by conventional calculus [39], [40]. With the help of algebraic calculus, it can now easily be examined for N > 2 states, which might lead to some interesting results; in most cases, however, it probably will not lead to such substantial simplifications as for N = 2.
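The effect of such a transformation can be illustrated numerically for N = 2 with a short Python sketch. It does not reproduce the lost equations of the Appendix or the $\delta$ quantities defined there; it only checks, in ordinary arithmetic, that a recursion on the single difference of the two path metrics yields the same decisions as the full ACS recursion. The index convention `lam[i][j]` (branch from state j+1 to state i+1), the random test data, and all function names are assumptions made for this illustration.

```python
import random


def acs_full(g1, g2, lam):
    """One ACS step on both path metrics (add, compare, select the maximum)."""
    c1 = (g1 + lam[0][0], g2 + lam[0][1])
    c2 = (g1 + lam[1][0], g2 + lam[1][1])
    decisions = (c1[1] > c1[0], c2[1] > c2[0])
    return max(c1), max(c2), decisions


def acs_diff(delta, lam):
    """The same step carried out on the difference delta = g1 - g2 only."""
    c1 = (delta + lam[0][0], lam[0][1])
    c2 = (delta + lam[1][0], lam[1][1])
    decisions = (c1[1] > c1[0], c2[1] > c2[0])
    return max(c1) - max(c2), decisions


def compare(steps=1000, seed=0):
    """Run both recursions on random integer branch metrics and compare."""
    rng = random.Random(seed)
    g1 = g2 = delta = 0
    for _ in range(steps):
        lam = [[rng.randint(-8, 8) for _ in range(2)] for _ in range(2)]
        g1, g2, dec_full = acs_full(g1, g2, lam)
        delta, dec_diff = acs_diff(delta, lam)
        if dec_full != dec_diff or delta != g1 - g2:
            return False
    return True
```

The difference recursion never accumulates the absolute metric values, which is the inherent normalization mentioned above: `delta` stays within the range of the branch metric differences while `g1` and `g2` may drift without bound.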
