Abstract: The Viterbi algorithm (VA) is characterized by a graph, called a trellis, which defines the transitions between states. Defining an area-efficient architecture for the VA is equivalent to obtaining an efficient mapping of the trellis. In this paper, we present a methodology that permits the efficient hardware mapping of the VA onto a processor network of arbitrary size. This formal model is employed for the partitioning of the computations among an arbitrary number of processors in such a way that the data are recirculated, optimizing the use of the processing elements (PE's) and the communications. The algorithm is thus mapped onto a column of PE's, and an optimal design solution is obtained for a particular set of area and/or speed constraints. Furthermore, we study the management of the survivor path memory, including its mapping and distribution among the processors. As a result, we obtain a regular and modular design appropriate for VLSI implementation in which the only necessary communications between processors are the data recirculations between stages.
I. INTRODUCTION
THE VITERBI algorithm (VA) is known to be an efficient method for the realization of maximum-likelihood (ML) decoding of convolutional codes [1], [2]. It is based on the study of a weighted graph (trellis), which is used to attempt the reconstruction of the input sequence to the convolutional encoder from the coded sequence received over a noisy channel. Consequently, the objective is to find, in a dynamic manner, the best path (the ML sequence) through the trellis. Other applications of the VA are related to communications (trellis-coded modulation, TCM) [3] and image compression (trellis-coded quantization, TCQ) [4].
In order to establish a notation and briefly introduce the operation of the decoder, it is convenient to start by describing the encoder. A convolutional encoder consists of a shift register with a fixed number of stages and a set of binary function generators. We denote as state the content of the least significant bits of the shift register. The data input, usually binary, is shifted along the shift register by a fixed number of bits each instant of time. From the state diagram it is possible to obtain the trellis. It is a two-dimensional (2-D) representation in which the states are represented in the vertical direction and the temporal transitions are marked in the horizontal direction. The states at instant $t$ are connected to those at instant $t+1$ by branches that specify possible state transitions. In the case of a convolutional code, the transition scheme is repeated in time. Consequently, it is enough to specify the possible transitions of the trellis that occur between instants $t$ and $t+1$.
In Fig. 1(a), we depict a trellis diagram of four states, and in Fig. 1(b) we display the same trellis with the states rearranged so that the butterfly structure of any trellis of a convolutional code is evident. From now on, each state carries a temporal subscript indicating its instant, although this subscript will be dropped whenever its specification is not necessary. Each transition has a branch metric (BM) associated with it that is a measure of the unlikelihood of the transition.
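As an illustration of the butterfly structure, the following minimal sketch enumerates the transitions of a trellis with one input bit per instant. It assumes that the state shifts left by one bit and that the new input bit enters in the least significant position, which is the convention implied by the right-shift update of (2). The function names are hypothetical and are not taken from the paper.

```python
def successors(state, n_bits):
    """Return the two states reachable from `state` in one step.

    Assumed convention: the state shifts left by one bit, the input bit
    enters at the least significant position, and the top bit is dropped.
    """
    mask = (1 << n_bits) - 1
    return [((state << 1) | u) & mask for u in (0, 1)]


def butterflies(n_bits):
    """Group the trellis into butterflies.

    Each entry is (pair of predecessor states, pair of successor states).
    """
    half = 1 << (n_bits - 1)
    return [((b, b + half), (2 * b, 2 * b + 1)) for b in range(half)]


if __name__ == "__main__":
    # Four-state trellis: two butterflies.
    for preds, succs in butterflies(2):
        print(preds, "->", succs)
```

Under this assumed convention, both predecessors of a butterfly reach exactly the same two successors, which is what allows one PE to compute a whole butterfly from two incoming PM's.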
The ML path through the trellis is calculated recursively by the VA. This is done by computing the optimum path to each of the nodes at time $t+1$ with the help of the old paths at time $t$ and the BM's of the transitions. The paths are represented by a path metric (PM), so that the path metric $PM_j(t+1)$ corresponding to state $j$ at instant $t+1$ can be obtained from its possible preceding states $i$ as
$$PM_j(t+1) = \min_{i \rightarrow j} \left[ PM_i(t) + BM_{i,j}(t) \right] \qquad (1)$$
where $BM_{i,j}(t)$ is the BM associated with the transition from state $i$ to state $j$. The central unit of a Viterbi decoder is this data-dependent feedback loop, which performs an add-compare-select (ACS) operation. This nonlinear recursion is the only bottleneck for a high-speed parallel implementation.
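The ACS recursion of (1) can be sketched in a few lines for the case of two predecessors per state. This is only an illustrative sketch: the data layout (a list of PM's and a dictionary of BM's keyed by transition), the tie-breaking rule, and the name `acs_step` are assumptions, and the predecessor convention is the right shift described for (2).

```python
def acs_step(pm, bm, n_bits):
    """Compute one trellis stage with add-compare-select.

    pm[i]      : path metric of state i at the current instant
    bm[(i, j)] : branch metric of the transition i -> j
    Returns the new path metrics and one decision bit per state.
    """
    n_states = 1 << n_bits
    half = n_states >> 1
    new_pm = [0] * n_states
    decisions = [0] * n_states
    for j in range(n_states):
        p0 = j >> 1            # predecessor selected by decision bit 0
        p1 = (j >> 1) | half   # predecessor selected by decision bit 1
        c0 = pm[p0] + bm[(p0, j)]   # add
        c1 = pm[p1] + bm[(p1, j)]
        decisions[j] = int(c1 < c0)  # compare (ties keep predecessor p0)
        new_pm[j] = min(c0, c1)      # select the smaller metric
    return new_pm, decisions


if __name__ == "__main__":
    n_bits = 2
    half = 1 << (n_bits - 1)
    # Made-up metrics: every branch costs 1.
    bm = {}
    for j in range(1 << n_bits):
        bm[(j >> 1, j)] = 1
        bm[((j >> 1) | half, j)] = 1
    print(acs_step([0, 3, 1, 2], bm, n_bits))
```

The metrics in the usage example are invented for illustration and do not come from any table of the paper.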
There are several alternatives for the implementation of the VA in a VLSI architecture: the state-serial strategy [5], in which one (or a few) processor(s) is used for the computation of the whole trellis; parallel processing [6], [7], in which an ACS is assigned to each one of the states of the trellis; and intermediate solutions [8]-[11], in which there are more states than ACS's and several states share the same ACS.
The error probability of convolutional codes decreases exponentially with the constraint length [10], and thus longer convolutional codes are of great interest for high-performance systems. This implies an increase in the number of states in the trellis, which makes a fully parallel scheme impractical. On the other hand, state-serial schemes are too slow to be used in real systems with many states. For this reason, a need arises for intermediate solutions in which the number of ACS's can be preset according to the speed requirements and the available area, so that each ACS is shared by more than one state of the trellis. However, the previous methodologies [10], [11] for mapping the trellis onto an ACS network are based on matrix permutation techniques where the working matrix is heuristic.
In this paper, we propose an area-efficient architecture that trades speed for area, allowing the states to be processed in an arbitrary number of ACS's. For this, we have made use of the similarity between trellis diagrams for convolutional codes and the computational flow diagram of the fast Fourier transform (FFT), and have developed a mapping method where it is easier to find an optimal design solution for particular area or speed constraints. We present the mathematical model that permits representing the flow of data between states of the trellis as well as its mapping onto a processor column. This mapping determines the structure of each processor and the interconnections required for the computation of the successive stages of a trellis with an arbitrary number of states. Variations of this model have been applied to the FFT [13], [14] and tridiagonal systems [15]. In addition, this same methodology is employed for the mapping of the memory scheme used for storing the survivor path of the processor network. The result is an architecture that is highly regular, made up of a column of processors, and suitable for implementation as an application-specific VLSI architecture.
In the next section, we describe the area-efficient architecture model we have employed for the mapping of the VA. In Section III, we introduce the methodology we propose for the mapping of the trellis onto a processor network of arbitrary size. After this, in Section IV, we describe the scheme employed for the management of the survivor path. The structure of the processing elements of the architecture is presented in Section V. Then, in Section VI, we present a generalization of the scheduling method to a generic convolutional trellis. In Section VII, we evaluate the resulting architecture. Finally, in Section VIII we present the main conclusions of the work.
II. AREA EFFICIENT ARCHITECTURE MODEL
The computations associated with a trellis involve three steps: 1) branch metric computation, 2) PM updating and storage, and 3) path storage and output sequence selection. In this work, we consider the BM computation as the Hamming distance from the values received from the noisy channel to the output that would be produced in a noiseless channel [16]. Another unlikelihood measure could equally be implemented. Furthermore, the architecture remains valid for the other Viterbi applications by just replacing the BM generator.
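For hard-decision inputs, the Hamming-distance BM described above reduces to counting the positions in which the received symbol differs from the symbol the encoder would have produced. A minimal sketch follows; the function name and the representation of symbols as bit tuples are assumptions made for illustration.

```python
def hamming_bm(received, expected):
    """Branch metric: number of positions where two bit tuples differ."""
    if len(received) != len(expected):
        raise ValueError("received and expected symbols differ in length")
    return sum(r != e for r, e in zip(received, expected))


if __name__ == "__main__":
    received = (1, 0)
    # Metric of the received symbol against every possible 2-bit encoder output.
    for expected in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(expected, hamming_bm(received, expected))
```

Replacing this function with another unlikelihood measure (for example a squared Euclidean distance for soft decisions) would leave the rest of the data flow unchanged.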
In the following, we develop a method for the design and partitioning of the trellis corresponding to convolutional codes in an area-efficient architecture that trades speed for area. One feature of our scheme is that each ACS is shared by a fixed subset of the states. Therefore, given the number of states, the algorithm is mapped onto a column of P ACS's and an optimal design solution is obtained for a particular set of area or speed constraints.
The general structure of the architecture model we are going to use is presented in Fig. 2. We assume that the number of incoming edges to each state and the number of outgoing edges from that state is two, i.e., that the convolutional encoder only receives one input bit each instant of time. We will later show that it is easy to generalize to other convolutional codes. For the two paths entering a state, the accumulated PM is calculated by adding the BM associated with the state transition and the PM of the preceding state. The two sums are compared, the smaller of the two is stored as the new PM of the state, and the decision is output.
Each processing element (PE) of the architecture is made up of two ACS's, which compute the two states corresponding to the same butterfly, and thus we will consider a total of P/2 processors, where P is the number of ACS's. The computation of the states corresponding to a butterfly requires the use of the PM's of its two possible previous states. For example, in Fig. 1(b) it can be observed that the two states of a butterfly are computed from the same pair of preceding states. In order to do this, there is a local routing in each one of the processors and a global routing between processors, thanks to which the data are recirculated and presented to each processor in the correct order and instant of time for processing the data of the next stage of the trellis.
A. Memories for Storing the Survivor Path
The final objective of the VA is to find the survivor path, i.e., the path through the trellis that best matches the received sequence as measured by the calculated path metrics. In order to do this, we update the content of a memory each instant of time, which allows us to reconstruct the survivor path.
Each processing cycle, in addition to computing the two PM's associated with the two states of the butterfly, each PE outputs the decision bits that permit reconstructing the state sequence that has occurred. Writing the current state as the bit string $(s_{n-1}, \ldots, s_1, s_0)$, the decision bit $d$ of that state for the transition from time $t-1$ to $t$ permits the estimation of the previous state according to the following update:
$$(s_{n-1}, \ldots, s_1, s_0) \rightarrow (d, s_{n-1}, \ldots, s_1) \qquad (2)$$
which corresponds to a right shift of the current state introducing the value of the decision bit in the vacant position. We call decision vector the set of the decision bits, one per state, that are computed each stage of the trellis. The storage of the decision vectors permits the later reconstruction of the paths associated with each state as we have indicated in (2). This path reconstruction operation, based on going through the state sequence backward, is called trace-back [17], [18].
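The trace-back operation can be sketched directly from the right-shift update of (2). In this minimal sketch the decision vectors are stored as a list of lists, oldest stage first; that layout, the function names, and the decision values in the usage example are assumptions made for illustration and are not the contents of Table I.

```python
def previous_state(state, decision_bit, n_bits):
    """Right-shift the state and insert the decision bit in the vacant MSB."""
    return (state >> 1) | (decision_bit << (n_bits - 1))


def trace_back(decision_vectors, start_state, n_bits):
    """Reconstruct the state sequence that ends in `start_state`.

    decision_vectors[t][s] is the decision bit of state s at stage t,
    with the oldest stage first. The returned path is oldest state first.
    """
    path = [start_state]
    state = start_state
    for vector in reversed(decision_vectors):
        state = previous_state(state, vector[state], n_bits)
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Made-up decision vectors for a four-state trellis.
    vectors = [[0, 0, 1, 1], [1, 0, 0, 1]]
    print(trace_back(vectors, start_state=0, n_bits=2))
```

Only one decision bit per stage is read during the trace-back, namely the bit of the state currently being visited.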
The decision vectors are stored in a decision memory that is partitioned into two regions, one for reading and one for writing (Fig. 3). In each transition between stages, we store the decision vector for the reconstruction of the transition produced for each one of the states. If the states are traced back in time, they converge to a single path (survivor path) [17]. The number of time steps that have to be traced back for the paths to have merged with high probability is called the survivor depth. For this reason, traditionally, we can consider that the reading part of the memory is divided into a merge block and a decode block. If a trace-back operation is carried out over the merge block taking as initial state any of the states, we obtain the starting state for reading the survivor path (decode block). This can be clearly seen in Fig. 3, where the paths associated with the states of the trellis (merge block) converge to a single path (survivor path), which can be read in the decode block and presented as the output of the Viterbi decoder. Therefore, the VA has a minimum decoding latency that is set by the lengths of these blocks. In Table I, we indicate a possible decision vector sequence for a four-state trellis (4-b decision vectors). In order to illustrate how the reconstruction of the survivor path is carried out, let us assume a system in which the reading of the survivor ...
With the traditional trace-back method, the reconstruction of the survivor path implies performing a trace-back over all the stages of the merge block for estimating the starting state of the survivor path, and then carrying out its decoding (trace-back operation) in the decode block. This technique may be improved by an a priori estimation of the starting state using a procedure called trace-forward [17], where the trace-back of the merge block is avoided.
In this case, a trace-back over the merge block is not necessary, as the starting state is estimated during the writing of the decision vectors, and thus the merge block can be eliminated from the reading memory area. The basis of the trace-forward is the dynamic estimation of the predecessor of each state, so that, once the survivor depth has elapsed, we ensure that the predecessors of all the states coincide with the starting state. Consequently, this fixes a lower bound on the size of the memory.
III. TRELLIS PARTITIONING AND SCHEDULING
In this section, we discuss how to partition the trellis states into P ACS's and schedule the states within each ACS, where P is chosen depending upon the amount of area saving desired. The objective is to find a mapping method that permits finding the correct data flow through a set of ACS's and, consequently, the local and global interconnection networks between ACS's for arbitrary values of P and of the number of states. We thus present the reordering/permutation of the data generated in each stage so that the indexing is constant throughout all the stages of the algorithm, producing a regular structure and a simple control of the hardware. This processing method permits a PE column to perform parallel computations on the data and, using a fixed recirculation network, make them recirculate over it, thus optimizing both the use of the PE's and the communications. A similar strategy has been used in FFT and tridiagonal systems [13]-[15].
Each PE is made up of two ACS's, a local memory for storing the decision vectors, and the modules for the calculation of the BM's and the trace-forward. It receives as inputs the PM's of its preceding states and provides as outputs the PM's of the new states. These data must be recirculated in order to be used in the next stage in the PE that requires them.
Therefore, within each PE we can differentiate two sections: processing section and routing section. In addition, there is a global routing among PE's. While the computation of the butterflies (1) is performed in the PE's, the routing circuit will recirculate the calculated elements and deliver them to the PE's as data, in an appropriate order for them to compute the butterflies corresponding to a new stage of the trellis. As an example, in Fig. 4(a) we present the recirculation needed for an eight-state trellis. The data must be rearranged in such a way that their true final position corresponds to a perfect shuffle over the index of the data item. The hardware mapping of this data rearrangement will have to be carried out in an efficient manner for the performance of each processor to be optimal.
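The net effect of the recirculation can be checked with a short sketch. It assumes a particular layout that is not spelled out in the text: butterfly b writes its two new PM's at positions 2b and 2b+1 (so the position equals the new state number) and reads its two inputs from positions 2b and 2b+1 of the rearranged array. Under that assumption, moving every item from position i to the circular left shift of i delivers to each butterfly exactly its two predecessor states. The function names are hypothetical.

```python
def rotl(index, n_bits, k=1):
    """Circular left shift of an n_bits-wide index by k bits."""
    mask = (1 << n_bits) - 1
    return ((index << k) | (index >> (n_bits - k))) & mask


def recirculate(values, n_bits):
    """Perfect shuffle: the item at position i moves to position rotl(i)."""
    out = [None] * len(values)
    for i, value in enumerate(values):
        out[rotl(i, n_bits)] = value
    return out


if __name__ == "__main__":
    # Eight-state trellis: label each item with the state it belongs to.
    shuffled = recirculate(list(range(8)), 3)
    # Consecutive pairs are the inputs of butterflies 0, 1, 2, 3.
    print([tuple(shuffled[2 * b:2 * b + 2]) for b in range(4)])
```

Each pair printed by the usage example holds a state and the state obtained by setting its most significant bit, which are the two predecessors shared by one butterfly.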
In order to facilitate the representation of the routing by means of mathematical operators and its later mapping onto a hardware structure, we will denote each state by means of a three-dimensional (3-D) index, obtained by splitting the bits of the state number into three consecutive fields. As we will later see, the differentiation into three fields will permit the association of one of them to the indexing of each processor, another to the paths, and the other to the processing cycles. The perfect shuffle operation (operator $\sigma$), which reflects the global data flow, is defined over the $n$ bits of the index as
$$\sigma(s_{n-1}, \ldots, s_0) = (s_{n-k-1}, \ldots, s_0, s_{n-1}, \ldots, s_{n-k}) \qquad (3)$$
which corresponds to a circular left shift of $k$ bits, with $k$ the number of shifts that are produced in the convolutional encoder each cycle.
We define the inverse decimation and inverse concatenation operators (4), which act on the second and third fields of the index: the former introduces the bits of the third field in the second field, and the latter introduces the most significant bits from the second field in the third field.
The perfect shuffle operation can be rewritten as a combination of three operators that are executed sequentially from left to right, where the last one is the perfect shuffle or left shift operator applied only to the first and third fields. Consequently, the sequential application of these three operators to the index of each state leads to the perfect shuffle of this index.
The 3-D index of each state is interpreted as the triad of indexes that index the PE, the computation cycle in which it is processed, and the input-output path of the PE [13] . That is, the set of indexes can be interpreted as [PE, CYCLE, PATH], so that reading the index of each state we will be able to read the computation cycle in which it is going to be computed, the PE where the computation is going to take place, and the path for inputting and exiting the PE.
Therefore, the decomposition of the perfect shuffle into three operators can be interpreted as follows: with the perfect shuffle operator restricted to the PE and PATH fields, we specify the global routing between PE's and between which paths it takes place (each PE has two output paths and two input paths corresponding to the calculation of a state butterfly). With the inverse decimation and inverse concatenation operators, we specify the internal data flow in each PE. In this case, we operate over indexes CYCLE and PATH.
In Fig. 4(b) , we indicate the result of the execution of these three operators over an eight-state trellis indicating by means of arrows the flow of each data item. We have considered the mapping onto two PE's, and thus two computation cycles will be required, in each one a total of two butterflies will be computed. It will thus be necessary to associate 1 b to the PE index in order to specify the two PE's, 1 b to the PATH, as we assign the computation of one butterfly to each PE each instant of time and 1 b to CYCLE, as we will need two cycles of time in order to sequentially compute two pairs of butterflies in the PE network. The flow of data item [0, 0, 1] has been specified as well as the result of applying the operators to it.
In Fig. 5(a), we indicate the internal data flow of each PE and the interconnection between PE's for the example of Fig. 4. The two local operators indicate the internal interconnections of each PE. In particular, with the first one a 2-D data array is transformed into a one-dimensional (1-D) array by eliminating the PATH field, whereas with the second one the opposite happens. Thus, e.g., applying the first operator to [0, 0, 1] indicates that the state that is produced in PE 0 in cycle 0 and through path 1 will occupy position 01 in the 1-D array. Applying the second operator to the previous result indicates that the value in position 01 of the array must move to path 0 of the PE in order to be gathered in position 1. On the other hand, the perfect shuffle operator affects the external interconnections between PE's. Thus, e.g., applying it to this result indicates that output path 0 of PE 0 is connected to input path 0 of the same PE.
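The worked example for item [0, 0, 1] suggests one way to read the three-step routing on the index [PE, CYCLE, PATH]. The sketch below encodes that reading for one PATH bit: a local step that merges PATH into CYCLE, a local step that moves the most significant bit of the merged position into PATH, and a global circular left shift over the PE and PATH bits only. The function names are hypothetical, and the correspondence between these functions and the operators of (4) is an interpretation of the example, not a definition taken from the paper.

```python
def unpack(index, cycle_bits):
    """Split a state number into the fields [PE, CYCLE, PATH] (PATH is 1 bit)."""
    path = index & 1
    cycle = (index >> 1) & ((1 << cycle_bits) - 1)
    pe = index >> (1 + cycle_bits)
    return pe, cycle, path


def merge_path_into_cycle(pe, cycle, path):
    """Local step 1: append PATH to CYCLE, giving a 1-D queue position."""
    return pe, (cycle << 1) | path


def split_msb_into_path(pe, position, cycle_bits):
    """Local step 2: the MSB of the queue position becomes the new PATH."""
    return pe, position & ((1 << cycle_bits) - 1), position >> cycle_bits


def shuffle_pe_path(pe, cycle, path, pe_bits):
    """Global step: circular left shift by one bit over the PE and PATH bits."""
    width = pe_bits + 1
    word = (pe << 1) | path
    word = ((word << 1) | (word >> (width - 1))) & ((1 << width) - 1)
    return word >> 1, cycle, word & 1


def route(pe, cycle, path, pe_bits, cycle_bits):
    """Apply the three steps in sequence and return the destination fields."""
    pe, position = merge_path_into_cycle(pe, cycle, path)
    pe, cycle, path = split_msb_into_path(pe, position, cycle_bits)
    return shuffle_pe_path(pe, cycle, path, pe_bits)


if __name__ == "__main__":
    # Eight states on two PE's: one bit per field.
    print(route(0, 0, 1, pe_bits=1, cycle_bits=1))
```

For every index, the composition of the three steps coincides with a circular left shift of the whole index by one bit, while the only inter-PE wiring is the fixed shuffle over the PE and PATH bits.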
The implementation of the inverse decimation and inverse concatenation operators, which make up the routing section of the PE, can be carried out through first-in first-out (FIFO) queues. Their structure is shown in Fig. 5(b). It basically consists of a FIFO queue with as many cells as states that must be processed in the PE. It must have two inputs in the first two cells. In the first one of them (cell 0), we introduce PATH 0 from the processing section [Fig. 5(a)], and in the second one (cell 1) PATH 1.
After the computation and storage of the PM's associated with each butterfly, a two position right shift of the FIFO queue must be carried out, freeing the first two cells for the storage of the data associated with the computation of the next butterfly.
Once all the butterflies of a stage have been computed, and consequently once the queue is full, the outputs are gathered from the queue. With this organization of the FIFO queues, we obtain the data flow that is displayed in Fig. 5(a).
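The write side of the routing queue can be modeled with a short sketch: two inputs feed cells 0 and 1, and a two-position right shift frees them for the next butterfly. The class name and methods are hypothetical, and the read side (which cells the outputs are taken from) is deliberately not modeled, since it depends on details of Fig. 5(b) that are not reproduced in the text.

```python
class RoutingFifo:
    """Behavioral model of the write side of a PE routing queue."""

    def __init__(self, n_cells):
        if n_cells < 2 or n_cells % 2:
            raise ValueError("n_cells must be an even number >= 2")
        self.cells = [None] * n_cells

    def store_butterfly(self, path0, path1):
        """Write the two results of a butterfly into cells 0 and 1."""
        self.cells[0], self.cells[1] = path0, path1

    def shift_two(self):
        """Two-position right shift, freeing the first two cells."""
        self.cells = [None, None] + self.cells[:-2]

    def is_full(self):
        """True once every cell holds a value."""
        return all(cell is not None for cell in self.cells)


if __name__ == "__main__":
    # A PE that processes four states, i.e., two butterflies per stage.
    fifo = RoutingFifo(4)
    fifo.store_butterfly("a0", "a1")
    fifo.shift_two()
    fifo.store_butterfly("b0", "b1")
    print(fifo.cells, fifo.is_full())
```

In this model no shift is applied after the last butterfly of a stage, so that the queue is full when the outputs are gathered.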
IV. IMPLEMENTATION OF THE TRACE FORWARD UNIT
With the trace-forward methodology, we try to avoid a trace back operation of the path of any of the states for the estimation of the starting state of the survivor path. This method is based on the dynamic estimation of this state by means of the use of a set of registers and multiplexers with an interconnection structure that is similar to that of the ACS's. For this reason, the area efficient topologies for the ACS's can be directly translated to the trace-forward units.
In Fig. 6, we present the structure of the trace-forward unit for the example of Fig. 1. Each multiplexer has two inputs, and its output is used as the input to two multiplexers for the next processing stage, with a connectivity structure that is the same as that of the trellis that defines the transitions between states. The control signals for each multiplexer are the decision bits associated with each state. If the decision bit of the state being computed is 0, the state before the current one is the first state of the butterfly computed, and the content of its associated register is gathered as the tail of the current state.
In each of the registers we store the predecessor (TAIL) of its state associated with a reference instant, and this value is updated in each processing cycle. Once the survivor depth has elapsed, we will be able to say that the contents of the registers, that is, the tails associated with the reference instant, are the starting state of the merge block. At the initial reference instant, we store as tail of each state the state itself (each state is its own predecessor). For example, if at the following instant the decision bit associated with state 00 is 1, it means that its predecessor at the reference instant is state 10, and this value is stored in its associated register. This updating process is repeated until the survivor depth is reached. The trace-forward structure is distributed among the PE's so that the same data recirculation is required for the TAIL's and for the PM's that are being computed. That is, as the connectivity of the trace-forward computation unit is the same as that of the trellis being computed, the area-efficient architectures we have studied can be applied to this trace-forward unit.
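The register-and-multiplexer update of the trace-forward unit can be sketched as follows. Each state keeps a TAIL register; at every stage the register of a state is loaded with the TAIL of the predecessor selected by its decision bit. The function names and the decision values in the usage example are assumptions made for illustration.

```python
def trace_forward_step(tails, decisions, n_bits):
    """Update all TAIL registers for one trellis stage.

    The decision bit of state s selects which of its two predecessors
    hands over its TAIL, mirroring the multiplexer structure.
    """
    half = 1 << (n_bits - 1)
    return [tails[(s >> 1) | (decisions[s] * half)] for s in range(len(tails))]


def trace_forward(decision_vectors, n_bits):
    """Run the update from the reference instant over all given stages."""
    tails = list(range(1 << n_bits))  # each state is its own predecessor
    for decisions in decision_vectors:
        tails = trace_forward_step(tails, decisions, n_bits)
    return tails


if __name__ == "__main__":
    # One stage in which the decision bit of state 00 is 1: its tail becomes 10.
    print(trace_forward_step([0, 1, 2, 3], [1, 0, 0, 0], 2))
    # Made-up decision vectors for two stages.
    print(trace_forward([[0, 0, 1, 1], [1, 0, 0, 1]], 2))
```

After enough stages the registers are expected to hold the same value with high probability, which is the starting state that a trace-back over the same decision vectors would have reached.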
V. ARCHITECTURE OF THE PROCESSOR
In Fig. 7, we present the structure of a PE in a system in which the number of states is larger than the number of ACS's (four ACS's). A total of two butterflies are computed sequentially in each PE. In particular, we indicate the data flow for the computation of the first and second ones. For the calculation of the BM's, each PE stores the real values produced in the convolutional encoder in the transitions of the butterflies it computes. We include a scheme for the selection of the values associated with the butterfly that is currently being processed, a scheme that is not necessary when the number of states is the same as the number of ACS's. The branch metrics of the real transitions are calculated in the BM blocks as the Hamming distance between these stored values and the values from the noisy channel.
The BM's, together with the PM's that are received as inputs, are introduced in the ACS's in order to compute the PM's of the new states at the next instant.
The outputs are the decision bits of the transition. These decision bits, which must be stored in a local memory for the subsequent reconstruction of the survivor path, are used as control signals for the selection of the TAIL of each state. The routing section is made up of two FIFO queues of the same length.
The FIFO queues are not necessary when the number of ACS's is the same as that of states.
As we indicate in Fig. 7, we have considered memories that are distributed among the processors. A total of two butterflies are computed in each processor in the example, which leads to a total of four decision bits per processor and trellis stage. In Fig. 8, we display the structure of the memories and their selection logic. Once the initial state of the survivor path has been defined, we read the decode block from the last word backward. The state index allows us to select the memory to be activated and the bit required in the activated word. After reading this bit, we construct the preceding state using (2) and carry out the same selection over the previous word.
As we indicate in Fig. 8 , the most significant bit of the state is used for the selection of the memory and the two least significant bits for the selection of the decision bit in the selected word. This selected bit is used for the construction of the preceding state through the shift of the register that stores the state.
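The selection logic can be sketched for the eight-state example with two memories of 4-b words. The sketch assumes that bit position i of a word holds the decision bit of the state whose two least significant bits equal i; this ordering, the function names, and the memory contents in the usage example are assumptions made for illustration.

```python
def read_decision(memories, stage, state, select_bits=2):
    """Select the memory with the MSB of the state and the bit with its LSBs."""
    memory = state >> select_bits
    bit_index = state & ((1 << select_bits) - 1)
    return memories[memory][stage][bit_index]


def read_decode_block(memories, start_state, n_stages, n_bits=3):
    """Read the decode block from the last word backward.

    Returns the visited states, newest first.
    """
    state = start_state
    states = [state]
    for stage in range(n_stages - 1, -1, -1):
        d = read_decision(memories, stage, state)
        state = (state >> 1) | (d << (n_bits - 1))  # update (2)
        states.append(state)
    return states


if __name__ == "__main__":
    # memories[m][t] is the 4-b word written by PE m at stage t (made-up data).
    memories = [
        [[0, 1, 0, 1], [1, 0, 0, 0]],
        [[0, 0, 1, 1], [0, 1, 1, 0]],
    ]
    print(read_decode_block(memories, start_state=5, n_stages=2))
```

Only one memory is activated per stage, which is what makes the distribution of the decision memory among the PE's inexpensive to read.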
VI. GENERALIZATION OF THE SCHEDULING METHOD TO A GENERIC CONVOLUTIONAL TRELLIS
The method of scheduling and partitioning over a network with an arbitrary number of PE's can be generalized to a generic convolutional trellis.
Let $k$ denote the number of shifts that are produced in the register of the convolutional encoder in each processing cycle. In this case, we define the state as the content of the least significant bits of the shift register. From each state, we can go to a total of $2^k$ states in each transition between stages of the trellis, and, consequently, each butterfly involves a correspondingly larger number of states. Writing the current state as the bit string $(s_{n-1}, \ldots, s_1, s_0)$, equation (2) for the reconstruction of the sequence of states from the decision bits is transformed into
$$(s_{n-1}, \ldots, s_1, s_0) \rightarrow (d_{k-1}, \ldots, d_0, s_{n-1}, \ldots, s_k) \qquad (6)$$
where $(d_{k-1}, \ldots, d_0)$ represents a set of $k$ decision bits per state, and, consequently, the dimension of the complete decision vector is $k$ bits times the number of states. The partitioning and scheduling methodology presented in Section III and the hardware interpretation of the operators remain completely valid.
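The generalized state reconstruction can be sketched as a right shift by several bits with the decision bits filling the vacated positions. The symbol k for the number of shifts per cycle and the function names are assumptions made for illustration.

```python
def previous_state_k(state, decision, n_bits, k):
    """Right-shift the state by k bits and insert the k decision bits on top."""
    if not 0 <= decision < (1 << k):
        raise ValueError("decision must fit in k bits")
    return (state >> k) | (decision << (n_bits - k))


def predecessors_k(state, n_bits, k):
    """All possible predecessors of a state, one per decision value."""
    return [previous_state_k(state, d, n_bits, k) for d in range(1 << k)]


if __name__ == "__main__":
    # 16-state trellis with two shifts per cycle: four predecessors per state.
    print(predecessors_k(0b0110, n_bits=4, k=2))
```

With k = 1 the function reduces to the single-bit update of (2).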
VII. COMPARISON
In this section, we evaluate our area-efficient VLSI architecture with respect to implementations based on systolic meshes [6] and cascade arrays [8] , [9] . We also evaluate the methodology we propose with respect to the one proposed in [10] and [11] .
The Viterbi decoding process can be formulated in terms of a repeated general matrix-vector multiplication, for which systolic arrays are suitable. When using the linear systolic array [6], it is necessary to have as many ACS's as states, i.e., it is a fully parallel design in which each ACS is devoted to the processing of one state. All the processors are identical and only require simple add-compare-select operations. The interconnection structure only requires adjacent-neighbor interconnections, and the values of the PM's are recirculated to all the processors in order to process all the states. After an initiation period, each processor requires several ACS unit delays in order to process its associated state. A correct data recirculation permits the processing of the next stage without any loss of clock cycles. The disadvantage is that the utilization of the array processors becomes very inefficient for a low-connectivity trellis diagram: in the case of a convolutional coder, the utilization of the processors is only a small fraction.
The canonic cascade architecture of the Viterbi decoder (CCVD) [8], [9] is another parallel solution in which the processors are arranged as a ring, together with local memory and switches. The cascade family is characterized by the assignment of all the computations in a given stage of a trellis to the same ACS. This strategy differs from the traditional one, in which the assignment is studied within each stage. Each processor is followed by two shift registers and a cross-point switch. The architecture requires relatively small interprocessor wire area, as only near-neighbor interprocessor wiring is allowed. However, the size of the registers grows with the number of states, and the timing control of the design is not simple. The utilization of each processor is about 50%. In [9], the CCVD is modified to reduce the number of processors so that each processor is utilized 100%; the proposed design is called the folded cascade architecture of the Viterbi decoder (FCVD).
In Table II, we present the comparison among the architectures. The comparison is performed in terms of the number of processors, utilization, and number of processing cycles (in ACS unit delays). In order to make the comparison with the cascade architectures, the processing cycles were particularized for a fixed number of stages in a trellis with a given number of states. As an example, we show the number of processing cycles for a 16-state trellis. In the systolic designs and in the cascade architectures, the number of ACS's depends on the number of states in the trellis; that is, the designer loses degrees of flexibility, and the design can be unsuitable depending on the area constraints. Our methodology permits the mapping of the trellis onto an arbitrary number of ACS's with a 100% processor utilization. To make the comparison with the CCVD and FCVD architectures, we apply our methodology to two architectures with as many ACS's as the CCVD and FCVD cases. In our design, we obtain a 100% processor utilization and fewer processing cycles than for the other architectures. We get a similar number of processing cycles as in the case of the FCVD architecture, but cascade architectures present an irregular design and a complex timing control that make the selection of our architectures advantageous.
An algorithm that allows a number of trellis states to share one ACS has been proposed in [11]. High-speed techniques and area-efficient techniques can be seen as two different ways of exploiting the design space, i.e., two paths in the area-time diagram. The technique presented in [11] provides a continuous means of trading speed for area, unlike other area-efficient techniques (such as those presented above) that suggest discrete design choices in the area-time diagram. Its disadvantage is that the methodology is based on the utilization of "heuristic" matrices that represent the partitioning and scheduling of the trellis states. In contrast, our methodology allows the designer to find the architecture configuration and the correct mapping of the data for any number of processors in a deterministic and easy way.
VIII. CONCLUSION
In this paper, we propose a new class of area-efficient architectures for the VA that allows a number of trellis states to share one ACS. A systematic approach for partitioning, scheduling, and mapping the trellis states associated with convolutional codes to P ACS's is presented.
Based on the perfect shuffle, inverse concatenation, and inverse decimation operators, the partition and schedule of the trellis can be mapped onto an area-efficient architecture with an arbitrary number of processors and a fixed global routing between them. We present a methodology in which it is easy to find an optimal design solution for particular area or speed constraints. The use of mapping operators makes it easy to find the global routing between processors, the implementation by means of FIFO queues of the local interconnection networks of each PE, the implementation and distribution of the trace-forward unit, and the structure and operation of the memories for storing the survivor path.
The biggest advantage of our methodology is the development of a formal model for representing the trellis and the data flow that permits its hardware mapping onto a PE network in a direct way. This differentiates it from the heuristic methods proposed in previous works.
