Introduction
Maximum likelihood decoding of convolutional and trellis codes based on the Viterbi algorithm is an important problem in digital communication. It can be viewed as a technique to find the shortest path in a weighted graph, called the trellis diagram, using dynamic programming [1]-[2]. While the performance of convolutional codes improves dramatically as the constraint length increases, the decoding effort increases exponentially. Our aim in this paper is to develop architectures for long constraint length codes which are regular, have simple interprocessor connections and are flexible enough to meet a wide range of design requirements.
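The dynamic-programming view can be illustrated with a minimal sketch of one add-compare-select update for a rate 1/n trellis. The function name `acs_step`, the `branch_metric` callback and the state-numbering convention (the new input bit enters as the least significant bit of the state) are assumptions made for illustration and are not taken from the architecture described in this paper.

```python
def acs_step(path_metrics, branch_metric):
    """One add-compare-select update over a radix-2 shift-register trellis.

    path_metrics: list of 2**v accumulated path metrics, indexed by state.
    branch_metric(prev_state, next_state): hypothetical callback returning
    the cost of that transition for the current received symbol.
    Returns (new_metrics, survivors); survivors[s] is the chosen parent of s.
    """
    num_states = len(path_metrics)
    half = num_states // 2
    new_metrics = [0] * num_states
    survivors = [0] * num_states
    for s in range(num_states):
        # Assumed convention: next = ((prev << 1) | input_bit) mod 2**v,
        # so the two parents of s are s >> 1 and (s >> 1) + 2**(v-1).
        parents = (s >> 1, (s >> 1) + half)
        candidates = [path_metrics[p] + branch_metric(p, s) for p in parents]  # add
        best = 0 if candidates[0] <= candidates[1] else 1                      # compare
        new_metrics[s] = candidates[best]                                      # select
        survivors[s] = parents[best]
    return new_metrics, survivors
```

Repeating this step once per trellis stage and following the stored survivors backwards yields the minimum-metric path through the trellis.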
Traditionally, implementations of the Viterbi algorithm have been (i) state-parallel, where the path metrics of all states are updated in parallel using multiple path metric updating units (called Add-Compare-Select units or ACS units), (ii) state-serial, where the path metrics are updated sequentially using a single unit, and (iii) a combination of both. Based on these schemes, several architectures have been proposed for the case when the input to the encoder is a single bit (code rate R = 1/n) [3]-[7], [11], [13]. In such a case, the trellis diagram has a simple and symmetric structure, and can easily be mapped onto one or more processing units. Many of the architectures for code rate 1/n that have been implemented in VLSI [8]-[10] achieve decoding rates as high as 140 Mb/s.
For code rate k/n, the encoder has k input bits connected to k shift registers (1 bit per register) [2]. When the sizes of the shift registers differ by at most one bit, the encoder can be modelled as a radix-$2^k$ shift register [12]-[14]. The k input bits are then shifted in bulk into the radix-$2^k$ shift register in each cycle. The output codewords of this structure are the same as those of the original encoder structure. Daneshgaran and Yao [14] have developed an architecture for code rate R = k/n based on this structure. Their architecture is based on a systematic procedure called the Iterative Collapse Algorithm (ICA), which allows downscaling to any level. It also achieves a better than linear tradeoff between hardware complexity and computation time. However, the number of interconnections between the modules can be as large as that of a fully connected graph, making its implementation cumbersome.
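As a rough sketch of the radix-$2^k$ model, the state can be treated as a number written in base $2^k$, with one digit (k bits) shifted in per cycle. The helper names `radix_next_state` and `successors`, the assumption that all k registers have equal length (`num_digits` bits each), and the choice to shift the new digit in at the least significant end are illustrative and not taken from [12]-[14].

```python
def radix_next_state(state, symbol, k, num_digits):
    """Shift one radix-2**k digit (k input bits) into the register.

    state: integer with num_digits base-2**k digits (assumed layout).
    symbol: the k input bits packed into one integer in [0, 2**k).
    """
    radix = 1 << k
    if not 0 <= symbol < radix:
        raise ValueError("symbol must fit in k bits")
    # The oldest digit falls off the top; the new digit enters at the bottom.
    return (state * radix + symbol) % (radix ** num_digits)


def successors(state, k, num_digits):
    """All 2**k states reachable from `state` in one trellis stage."""
    return [radix_next_state(state, u, k, num_digits) for u in range(1 << k)]
```

With k = 2 and two digits per state there are 16 states, and each state has four successors and four parents, which is why the interconnection cost grows quickly with k.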
In general, the sizes of the encoder's shift registers may differ by more than one bit, and the design of a Viterbi decoder for code rate R = k/n should be based on the original encoder structure. The state transitions for this case are, however, much more complicated, and the trellis diagram no longer has the simple structure of an R = 1/n decoder. Chang and Yao have proposed an architecture based on the original encoder structure in [11]. Their architecture consists of a systolic array of processors, with each state being assigned to a processor. The disadvantages of their architecture are that it cannot be scaled, and that it is less effective in decoding codes when the shift registers are not of equal size. Black and Meng [13] have proposed a unified approach for the scheduling of accumulated path metric (APM) updates for both R = 1/n and R = k/n decoders. The matrix framework of their unified approach is elegant and supports architectures ranging from cascade to parallel cascade architectures. However, for R = k/n decoders, the transpose units (between every two ACS units) that are required to route the data are quite complex. As a result, both the latency and the complexity of their architecture increase.
In this paper, a new architecture is presented for the implementation of a long constraint length Viterbi Decoder (VD) when the number of input bits to the trellis diagram is greater than one. The encoder structure under study is based on the original convolutional encoder. The trellis diagram is analyzed hierarchically so that each level in the hierarchy can be designed independently. This decomposition makes the state transitions in a level quite regular and straightforward. Furthermore, the concepts, design procedures and the trellis diagram developed for code rate 1/n can be applied here. The resulting architecture is regular, achieves 100% processor and interconnection utilization, has a foldable global topology, and is very flexible. Also, the number of global interconnections between the modules is quite small: at each stage each module communicates with only two others.
The rest of the paper is organized as follows. In Section 2, the global topology of the whole system is described. Then in Section 3, the detailed design of each level, including issues of communication, allocation and scheduling of ACS units, is presented. Memory design, including storage of the accumulated path metrics and survivor sequences, is described briefly in Section 4. In Section 5, it is shown that this architecture has a much better than linear tradeoff between area complexity and computation time. A procedure to fold the global topology is described in Section 6. The paper is concluded in Section 7.
If we aggregate all those states having the same bits in the kth block into one single node called the supernode, the trellis diagram for R = k/n reduces to the diagram for R = 1/n. Since the connection pattern between these supernodes is the same at every stage of the trellis diagram, we refer to this pattern as fixed. In this case, every supernode communicates with two parent and two descendant supernodes at every stage. The communication pattern between supernodes can also vary from stage to stage with a period of $l_k$. We refer to such a communication pattern as dynamic. This pattern is accomplished by circularly shifting the bits of the kth block by one bit at each trellis stage. The result is a trellis diagram with periodic butterflies as described in [12]. For dynamic communications, every supernode communicates externally with only one parent and one descendant supernode at every stage of the trellis diagram. Note that while the interconnections between the supernodes are different for fixed and dynamic connections, the interconnections between the states within every supernode are the same.
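The fixed supernode pattern can be checked on a small example. The sketch below enumerates every state of a hypothetical encoder with k shift registers, applies one trellis stage, and projects each transition onto the contents of the kth register. The function name `supernode_edges`, the representation of a state as a tuple of register contents, and the shift direction are assumptions made for illustration.

```python
from itertools import product


def supernode_edges(lengths):
    """Project rate-k/n state transitions onto the contents of the last register.

    lengths: hypothetical tuple (l_1, ..., l_k) of shift-register lengths in bits.
    Returns the set of (parent_supernode, child_supernode) pairs.
    """
    k = len(lengths)
    edges = set()
    # Enumerate every state as a tuple of register contents.
    for regs in product(*(range(1 << l) for l in lengths)):
        # Each register shifts in its own input bit at every stage.
        for bits in product((0, 1), repeat=k):
            nxt = tuple(((r << 1) | b) & ((1 << l) - 1)
                        for r, b, l in zip(regs, bits, lengths))
            # A supernode is identified by the kth block only.
            edges.add((regs[-1], nxt[-1]))
    return edges
```

For register lengths (1, 2, 3) this yields 8 supernodes, each with exactly two descendant and two parent supernodes, which is the connection pattern of a rate 1/n trellis with 8 states.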
Processor Clustering and Interconnections
Since the interconnection between the ACS units is quite costly, we propose a second aggregation phase similar to [14] to increase the overall efficiency of the communication links to 100%. Define the set of ACS units on the left-hand side of eqn (1) Figure 4 . 2
Memory Management
In this section we briefly describe the memory required to store the APMs in-place and the survivor sequences. The memory to store the APMs is distributed among the ACS units and is controlled locally, while the survivor sequences are stored in a single survivor memory.
In-place Storage of the Accumulated Path Metrics (APM)
To update the APMs of the states, an ACS unit has to read a set of parent APMs from its local memory in an order dictated by the update schedule. At the same time, the ACS unit receives from its parent ACS units a set of newly updated parent APMs for computations in the next cycle. Figure 5 describes the time steps at which the APMs of the 64 parent states required by the update schedule shown in Figure 4 are generated (updated). Every state label in Figure 5 represents two parent states from two supernodes which are updated at the same time. The notation $i_j$ represents the jth time step of the ith interval. There are only two kinds of access for the stored APMs, as shown in Figure 6. The memory scheme can be generalized for the case when $j_1$ 2 and $j_n = 2$ for $n = 2, \ldots, m$. Since each ACS unit updates $2^{j_m + \cdots + j_1}$
Survivor Memory Management
In our design of the VD, the time required to complete one global cycle is $2^{j_m + \cdots + j_1}$ clock cycles. This is usually long enough to perform the trace-back operation. Hence, the throughput is not actually limited by the rate at which the survivor sequence can be decoded. The classical one-pointer or K-pointer trace-back technique can be easily employed here, the specific choice depending on the value of $j_m + \cdots + j_1$. Thus, the structure of the survivor memory used in our design is similar to the one proposed in [15].
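For readers unfamiliar with trace-back, the basic operation can be sketched as follows. This is a plain backward walk over stored survivor decisions, not the one-pointer or K-pointer memory organization of [15]. The name `trace_back`, the layout `survivors[t][s]` (the chosen parent of state s at stage t) and the convention that the input bit is the least significant bit of the new state are assumptions made for illustration.

```python
def trace_back(survivors, end_state):
    """Recover the surviving state sequence and the decoded bits.

    survivors: list over trellis stages; survivors[t][s] is the parent
    state selected for state s at stage t (assumed layout).
    end_state: state from which the backward walk starts.
    """
    path = [end_state]
    state = end_state
    # Walk from the last stage back to the first, following parent pointers.
    for stage in reversed(survivors):
        state = stage[state]
        path.append(state)
    path.reverse()
    # Assumed convention: the input bit is the LSB of each newly entered state.
    bits = [s & 1 for s in path[1:]]
    return path, bits
```

The walk takes one memory read per trellis stage, so it fits within a global cycle whenever the trace-back depth does not exceed the number of clock cycles available in that cycle.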
Tradeoffs between Area and Computation Time
In this section we compare the purely state-parallel VD, where an ACS unit is assigned to every state, with the proposed design, where one ACS unit is assigned to $2^{j_m + \cdots + j_1}$ states. Let the propagation delay (with storage and communication time included) of an ACS unit be $T_d$ and let the data path be pipelined to $N$ levels. If the delay of each pipelining latch is $t_l$, then the delay per stage of the pipelined ACS unit is $(T_d + N t_l)/N = T_c$, where $T_c$ is set equal to the clock period (i.e. one time step) and $T_c < T_d$. Figure 8 compares the area and time complexities of the purely state-parallel implementation with the proposed architecture. The global cycle time of our architecture is increased by a factor of Since the number of global communication links is also reduced by a factor of $2^{j_m + \cdots + j_1}$, we conclude that the proposed architecture has a better than linear tradeoff between hardware complexity and speed. The factor $2^{j_m + \cdots + j_1}$ is determined by the value of $j_m + \cdots + j_1$, and so the individual values of $j_m, \ldots, j_1$ do not directly affect the performance of the VD. This enables us to choose suitable values for $j_m, \ldots, j_1$ to obtain different ACS unit allocations or efficient APM memory access (see Section 4.1), and still maintain the same performance improvement.
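The timing quantities above can be evaluated with a small sketch. The function names and the numerical values are hypothetical and chosen only to make the arithmetic easy to follow; the formulas are the clock period $(T_d + N t_l)/N$ and the global cycle of $2^{j_m + \cdots + j_1}$ clock cycles stated in the text.

```python
def pipelined_clock_period(t_d, n_levels, t_latch):
    """Clock period T_c = (T_d + N * t_l) / N of an ACS unit pipelined to N levels."""
    return (t_d + n_levels * t_latch) / n_levels


def global_cycle_time(j_values, t_c):
    """Duration of one global cycle: 2**(j_m + ... + j_1) clock periods."""
    return (2 ** sum(j_values)) * t_c


# Illustrative numbers only (arbitrary time units), not measurements.
t_c = pipelined_clock_period(t_d=20, n_levels=4, t_latch=1)
cycle_a = global_cycle_time((2, 2), t_c)   # j_2 = 2, j_1 = 2
cycle_b = global_cycle_time((1, 3), t_c)   # same sum, different split
```

Both choices of the $j_i$ give the same global cycle time because only their sum enters the exponent, which is the freedom exploited when selecting an ACS unit allocation.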
Folding the Global Topology
In the analysis so far, the number of supernodes is $2^{l_k}$ since the kth block has been chosen at the highest level. While any of the blocks could have been chosen at the highest level, hypernodes. All the hypernodes are processed in parallel and each hypernode sequentially processes the supernodes assigned to it. The scheduling and interconnections between the ACS units in a hypernode follow the same rules as discussed in Section 3.
Conclusions
In this paper we presented a novel architecture to implement a long constraint length Viterbi decoder for the case when the number of input bits to the trellis diagram is more than one. The architecture has been designed in a hierarchical fashion by breaking the system into several levels and designing each level independently. The resulting architecture is regular, supports folding and achieves a better than linear tradeoff between hardware complexity and computation time. Another notable feature of this architecture is its flexibility. Different architectures can be obtained for different parameter choices (the $j_i$'s), and yet all the architectures achieve the same performance improvement as long as the value of $(j_k + \cdots + j_1)$ remains fixed.
