Abstract-The problem of survivor memory management of a Viterbi detector is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a trace-back scheme, which has small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. This allows solutions to be designed with greatly reduced latency and/or complexity, and allows a tradeoff between latency and complexity to be achieved. VLSI case studies of specific new solutions have shown that at minimal latency more than 50% savings are possible in hardware complexity as well as power consumption.
I. INTRODUCTION

DYNAMIC PROGRAMMING is a well-established approach for a large variety of problems concerning multistage decision processes [1]. One specific application of dynamic programming is the search for the best path through a graph of weighted branches.
These branch weights will be referred to as branch metrics in the following. The path through the graph which is to be found is the one with the maximum (or minimum) cost, i.e., the maximum (or minimum) value of accumulated branch metrics. An example of such a graph is the trellis (the state transition diagram) of a discrete-time finite state machine.
The state sequence of the finite state machine marks a path through the trellis. If this path is to be estimated with the help of noisy measurements of the output of the finite state machine, and if this is solved by dynamic programming, then in communications this is called the "Viterbi algorithm" (VA) [2]. The VA was introduced in 1967 as a method to decode convolutional codes [3]. In the meantime the VA has found widespread applications in communications such as, e.g., digital transmission, magnetic recording, and speech recognition. A comprehensive tutorial on the VA is given in [4].
The VA can be divided into three functional units: the branch metric unit (BMU), the add-compare-select unit (ACSU), and the survivor memory unit (SMU). Whereas the BMU and ACSU perform arithmetic operations such as addition, multiplication, and maximum/minimum selection, the SMU has to trace the course of a path with the help of decision pointers that were generated in the ACSU. Two basic methods for implementing the SMU are known, the register-exchange and the trace-back SMU, of which the first has minimal latency but large hardware complexity, and the latter has a smaller hardware complexity but longer latency. The focus of this paper is on providing a novel algebraic framework for describing the survivor memory management problem. This enables the easy design of new SMU architectures, tailored to the desired latency/complexity optimization goal. In the following, a brief introduction to the VA is given in Section II. Section III describes the survivor memory problem and introduces its algebraic formulation [5]. Based on this, the following two sections outline architectural alternatives, i.e., continuous-flow processing in Section IV and block processing in Section V.
II. THE VITERBI ALGORITHM
Assume a discrete-time finite state machine with N states. Without loss of generality we assume that the transition diagram and the transition rate 1/T are constant in time. The trellis, which shows the transition dynamics, is a two-dimensional graph which is described in vertical direction by N states and in horizontal direction by time instants kT (T = 1). The states of time instant k are connected with those of time k + 1 by the branches of time interval (k, k + 1).
Below we refer to a specific state i at time instant k as "node $s_{i,k}$". The best path through the trellis is calculated recursively by the VA, where best can mean, e.g., the "most likely". This is done recursively by computing N paths, i.e., the optimum path to each of the N nodes of time k. The N new optimum paths of time k + 1 are calculated with the help of the old paths and the branch metrics of time step (k, k + 1). This shall be explained for the simple trellis shown in Fig. 1(a). As indicated in Fig. 1(b), the path metric of node $s_{1,k+1}$ is obtained by adding the branch metrics of its incoming branches to the path metrics of the preceding nodes and selecting the best of the resulting sums, and the path metric of node $s_{2,k+1}$ is computed in analogy. This is referred to as the add-compare-select (ACS) recursion of the VA.
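The ACS recursion can be illustrated with a minimal Python sketch. The function name `acs_step`, the dictionary layout of the branch metrics, and the choice of maximization as the "best" criterion are assumptions made for illustration only; they are not taken from the paper.

```python
def acs_step(path_metrics, branch_metrics, predecessors):
    """One add-compare-select step from time k to time k + 1 (illustrative).

    path_metrics[j]        -- metric of the best path to state j at time k
    branch_metrics[(j, i)] -- metric of the branch from state j to state i
    predecessors[i]        -- states j that have a branch into state i
    Returns the new path metrics and, per state, the chosen predecessor.
    """
    new_metrics, decisions = [], []
    for i, preds in enumerate(predecessors):
        # Add: extend every incoming path by its branch metric.
        candidates = [(path_metrics[j] + branch_metrics[(j, i)], j) for j in preds]
        # Compare and select: keep the best sum (maximum is an assumption here).
        best_metric, best_pred = max(candidates)
        new_metrics.append(best_metric)
        decisions.append(best_pred)
    return new_metrics, decisions


# Hypothetical two-state, fully connected trellis.
metrics, decisions = acs_step(
    [0.0, 1.0],
    {(0, 0): 1.0, (1, 0): 0.5, (0, 1): 0.2, (1, 1): 2.0},
    [[0, 1], [0, 1]],
)
print(metrics, decisions)  # [1.5, 3.0] [1, 1]
```

The list `decisions` plays the role of the decision pointers that the ACSU hands to the survivor memory.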
The problem which needs to be solved is to determine the best (unique) path with the help of the decisions of the ACS recursion. If all N paths are traced back in time, then they merge into a unique path, and this is exactly the best one which is to be found. The number of time steps that have to be traced back for the paths to have merged with high probability is called the survivor depth D. Therefore, in a practical implementation of the VA the latency of decoding is at least D time steps.
An implementation of the VA, referred to as Viterbi detector (VD), can be divided into three basic units, as shown in Fig. 2. The input data is used in the branch metric unit (BMU) to calculate the set of branch metrics for each new time step. These are then fed to the add-compare-select unit (ACSU), which accumulates the branch metrics recursively as path metrics according to the ACS recursion. The survivor memory unit (SMU) processes the decisions which are being made in the ACSU, and outputs the estimated path with a latency of at least D.
The problem solved by the SMU can therefore be stated as: find the state of time k - D. This is classically solved either by a register-exchange implementation, which has minimal latency but large hardware complexity and power consumption, or by a trace-back scheme, which has small power consumption but larger latency. Here an algebraic formulation of the survivor memory management is introduced which provides a framework for the derivation of new algorithmic and architectural solutions. VLSI case studies of specific new solutions have shown that more than 50% savings are possible in hardware complexity as well as power consumption.
III. THE SURVIVOR MEMORY UNIT
The unit of the VD which is of concern in this paper is the SMU. Generally, two basic methods have been proposed for solving the problem of processing the decisions made in the ACSU to obtain the detected path: the register-exchange (RE) and the trace-back (TB) SMU [6].
In case of an RE-SMU the new decisions of each iteration k are used to compute and store all N paths recursively, one to every state. Then the state of time k - D is simply determined by reading out the state of time k - D of one of the paths.
In case of a TB-SMU the decisions are stored in a RAM, and then one path is traced back recursively D steps by using the stored decisions to determine the state of time k - D. At first glance this might seem not to be well suited for VLSI, since at each time step one new decision is written to the RAM and D decisions are read during the trace-back, making this a bottleneck for the iteration speed of the VD. However, by block-wise tracing back more than D steps at a time, a block of more than one state is determined per trace-back. Combining this with multiple trace-back pointers operating on multiple RAMs in parallel has allowed for the derivation of many efficient hardware solutions [7]-[9].
A. The Trace-Back SMU
A more detailed description of the trace-back scheme is as follows. At time k the current decision of state i points to its preceding state, for which we will use the notation $\delta_k(i)$. By recursively following these decision pointers back over D time steps, the state of time k - D is determined.
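The pointer-following trace-back can be sketched in a few lines of Python. The function name `trace_back` and the list-of-lists storage of the decisions are hypothetical choices for illustration; a hardware TB-SMU would keep the decisions in RAM instead.

```python
def trace_back(decisions, start_state, depth):
    """Follow decision pointers back over `depth` time steps (illustrative).

    decisions[t][i] -- predecessor chosen for state i at time step t,
                       with the oldest time step first.
    Returns the state reached `depth` steps before the newest time step.
    """
    if depth > len(decisions):
        raise ValueError("not enough stored decisions for this depth")
    state = start_state
    newest = len(decisions) - 1
    for t in range(newest, newest - depth, -1):
        state = decisions[t][state]  # one pointer lookup per time step
    return state


# Hypothetical decisions for a two-state trellis over three time steps.
stored = [[0, 0], [1, 0], [1, 1]]
print(trace_back(stored, 0, 3), trace_back(stored, 1, 3))  # 0 0
```

In this small example both starting states lead to the same state after three steps, which mirrors the merging of the paths that the survivor depth D relies on.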
As can be seen by the nature of this decision trace-back, the usual way of implementation is by using multiplexers to pick the next decision pointer in the scheme. However, this trace-back can also be formulated algebraically by introducing another notation for $\delta_k(i)$.
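For $\delta_k(i) = j$, the decision of state i can be written as a row vector of length N that contains a 1 in position j and zeros elsewhere. Collecting the N row vectors of time k yields the decision matrix $A_k$, and the trace-back over D steps becomes the product of a row vector $b$, which selects the starting state, with the decision matrices:

$$ b \cdot A_k \cdot A_{k-1} \cdots A_{k-D+1} \qquad (5) $$

Furthermore, due to the simplicity of the matrix operations it is clear that this can also be done by simple gate logic.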
The most important aspect of (5) is that the multiplication operation is associative. Therefore it can be carried out not only from left to right, but also in an arbitrary order as, e.g., in a faster tree-like manner. In the following we shall now make use of this algebraic feature.
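The associativity argument can be checked with a short Python sketch. It assumes that each decision matrix has exactly one 1 per row, in the column of the preceding state, and that products are formed with AND as multiplication and OR as addition. The function names and this Boolean encoding are assumptions for illustration, not definitions quoted from the paper.

```python
from functools import reduce


def decision_matrix(delta):
    """Row i holds a single 1 in column delta[i], the predecessor of state i."""
    n = len(delta)
    return [[1 if delta[i] == j else 0 for j in range(n)] for i in range(n)]


def bool_matmul(a, b):
    """Matrix product with AND as multiplication and OR as addition."""
    return [[int(any(a[i][m] and b[m][j] for m in range(len(b))))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def product_left_to_right(mats):
    """Sequential evaluation: ((A1 A2) A3) ..."""
    return reduce(bool_matmul, mats)


def product_tree(mats):
    """Tree-like evaluation: multiply neighbouring pairs level by level."""
    level = list(mats)
    while len(level) > 1:
        paired = [bool_matmul(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element moves up unchanged
        level = paired
    return level[0]


# Hypothetical decisions, newest time step first, for a two-state trellis.
mats = [decision_matrix(d) for d in ([1, 1], [1, 0], [0, 0], [1, 0])]
assert product_tree(mats) == product_left_to_right(mats)
print(bool_matmul([[1, 0]], product_tree(mats)))  # row picked by b = (1, 0)
```

Both evaluation orders give the same product, so the order can be chosen to suit the target architecture; the tree order needs only about log2(D) sequential multiplication levels.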
IV. PIPELINE INTERLEAVING LOOK-AHEAD ARCHITECTURES
Since the trace-back decoding of the decisions principally has to take place at every new time instant, it is clear that the multiplication given in (5) is to be viewed as a sliding-window operation over the sequence $\{A_k\}$. Hence, at time $k+1$

$$ b \cdot A_{k+1} \cdot A_k \cdots A_{k-D+2} \qquad (6) $$

has to be evaluated, and so on. It is to be noticed that, due to the fact that the associative law holds, the $(D-1)$-fold matrix-matrix multiplication of (5)

$$ A_k \cdot A_{k-1} \cdots A_{k-D+1} \qquad (7) $$

can be carried out first, and then the row of interest can be picked by applying $b$.
The continuous "sliding window" computation of the expression (7) is analogous to the type of operation which is referred to as "pipeline interleaving look-ahead computation" for the parallelization of linear feedback loops¹ [10], [11].
Hence, all pipeline interleaving architectures known for look-ahead computation can be applied for the continuous (sliding) evaluation of (7).
A. The Register-Exchange SMU
For notational ease the short-hand notation shall be introduced. The architecture known as "linear look-ahead" [10] for the sliding-window evaluation of (7) is shown in Fig. 4. In this case the current $A_k$ is multiplied with D stored values in parallel, to obtain the following D results

$$ A_k, \quad A_k \cdot A_{k-1}, \quad \ldots, \quad A_k \cdot A_{k-1} \cdots A_{k-D+1} \qquad (8) $$

As can be seen, the first element, $A_k$, indicates the preceding states of the N current paths. The next element, $A_k \cdot A_{k-1}$, determines the state of two time steps back of every current path. By carrying this on, it can be seen that (8) yields exactly the state sequence of all N current paths of time k over the whole interval (k - D + 1, k). Thus, it can easily be seen that the linear look-ahead architecture of Fig. 4 is the algebraic formulation of the RE-SMU.
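The correspondence between the linear look-ahead products and the register exchange can be made concrete with a small Python sketch. It relies on the assumption that every row of a decision matrix contains a single 1, so that multiplying a stored product from the left by the newest decision matrix simply copies, for each state, the row of its chosen predecessor. Each row can therefore be stored as a plain state index. The function name `re_smu_update` and the list representation are hypothetical.

```python
def re_smu_update(survivors, delta, depth):
    """One register-exchange update (illustrative).

    survivors[i] -- past states of the path ending in state i, newest first
    delta[i]     -- predecessor chosen for state i in the current time step
    depth        -- number of past states kept per path (survivor depth D)
    """
    return [([delta[i]] + survivors[delta[i]])[:depth]
            for i in range(len(delta))]


# Hypothetical decisions for a two-state trellis, oldest time step first.
survivors = [[], []]
for delta in ([0, 0], [1, 0], [1, 1]):
    survivors = re_smu_update(survivors, delta, depth=3)
print(survivors)  # [[1, 0, 0], [1, 0, 0]]
```

The last entry of every list is the state D steps back, which can be read out directly. This is the minimal-latency behaviour of the RE-SMU, paid for by updating all N paths in every time step.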
¹ One other very important application of such a D-fold multiplication is the carry computation of a binary adder, for which different algorithms are known as, e.g., carry-ripple, carry-skip, carry-select, and carry-look-ahead [10]. Now these architectures can all be transferred to derive SMU realizations.

Since these multipliers operate sequentially on one single decision matrix at a time, their complexity is exactly that of one stage of a conventional RE-SMU.

(12)
The contents of the vector in expression (12) are exactly what is computed by a register-exchange SMU of length M, see Section IV-A. Hence, combinations of register-exchange and trace-back promise to yield further solutions of interest.
In addition, note that the algebraic formulation can also lead to simplified software implementations.
For example, for an N = 2 state problem it can easily be seen that the logarithmic look-ahead RE-SMU of Fig. 5 can be much more efficient to implement than any other solution.
VI. DISCUSSION
Due to the variety of possible different technologies that may be used for implementing the architectures discussed in this paper, it is difficult to find an objective measure to compare them. To allow for some objective comparisons to be made, the total amount of memory must be divided into memory which can be realized by RAM and memory that must be realized by registers. In addition, the multiplications can be divided into vector-matrix and matrix-matrix multiplications, where the latter is N times as complex as the former since it comprises N vector-matrix multiplications.
A basic measure of power consumption is the number of vector-matrix multiplications and the number of read and write (R/W) operations that are necessary. Therefore, for power consumption comparisons, the number of R/W operations must be added as a measure.
Using these more detailed measures, the solutions are compared in Table I. It can be seen that the algebraic formulation of the SMU problem allowed for an easy design of new architectures, which are sample points in the large space of solutions with differing latency, memory complexity, and arithmetic complexity. The algebraic formulation enables solutions to be designed with greatly reduced latency and/or complexity, and it allows for a tradeoff between latency, hardware complexity, and power consumption.
VII. CONCLUSION
In this paper an algebraic formulation of the survivor memory management of Viterbi detectors is introduced. This reveals the fact that the problem of survivor memory implementation is analogous to the realization of look-ahead in parallelized linear feedback loops. Hence, next to finding new solutions, a wide range of known solutions can be transferred and adapted from this well-known problem. These mainly present novel approaches for survivor memory realization. VLSI case studies of novel algorithms and architectures have shown that 50% savings in hardware and/or latency can be achieved.
The algebraic formulation introduced here is related to the algebraic formulation of the add-compare-select recursion of the Viterbi detector, introduced in [14], [15]. Hence, it is now easy to derive well-matched survivor memory realizations also for all parallelized Viterbi detectors.
