where she has joined the Center for Telecommunications Research. Her current teaching and research interests are in the areas of high-speed digital signal processing, parallel processing, VLSI circuit design, and error-correcting coding.
A New Transform Algorithm for Viterbi Decoding
Abstract-Implementation of the Viterbi decoding algorithm has attracted a great deal of interest in many applications, but the excessive hardware/time consumption caused by the dynamic and backtracking decoding procedures makes it difficult to design efficient VLSI circuits for practical applications.
A new transform algorithm for maximum likelihood decoding is derived from trellis coding and Viterbi decoding processes. Dynamic trellis search operations are parallelized and formulated into a set of simple matrix operations referred to as the Viterbi transform (VT).
Based on the VT, the excessive memory accesses and complicated data transfer scheme demanded by the trellis search are eliminated. Efficient VLSI array implementations of the VT have been developed. Long constraint length codes can be decoded by combining the processors as the building blocks.
I. INTRODUCTION
The Viterbi decoding algorithm (VA) [1], [2] is a maximum likelihood decoding algorithm for convolutional codes [3]. Its implementation in VLSI circuits is increasingly in demand for diverse applications such as deep-space and satellite communications, text and voice recognition, and many others [4]-[7]. However, the excessive storage requirements and implementation complexity can be major problems, especially with codes of longer constraint lengths. A number of special architectures for the VA have been proposed [8]-[10], and numerous algorithmic variations have been explored [12], [13] to facilitate hardware implementations and reduce computation time. Chang and Yao proposed the combination of several stages with a strongly connected trellis diagram [11]. Our work follows their merging philosophy, but results in a new algorithm which greatly reduces data transfer and access requirements.
We have analyzed the encoding computations of convolutional codes and developed an integrated algorithm for Viterbi decoding operations. The algorithm will be called the Viterbi transform (VT). By computational modifications, the decoding processes of several stages are efficiently transformed into simple matrix operations, and branch code generation is integrated into the decoding processes. The dominant time consumption in VLSI circuits, which is caused by data access operations, is greatly reduced, because this transformation limits memory access to local access without global memory requirements, and the backtracking is eliminated.
A modular processor element is developed to construct the Viterbi decoder for long constraint length codes. An efficient systolic VT processor for the (2, 1, 4) code is presented; it may be used as the building block for any m with m being radix 4.
II. CONVOLUTIONAL CODING SYSTEM
An $(n, k, m)$ convolutional encoder (Fig. 1) is a finite state machine with $k$ inputs, $X_1, X_2, \ldots, X_k$, and $n$ outputs.
In Fig. 2, for state $S_4$ at $t+3$, $J(4) = (2, 6)$. Path metrics for all the candidate paths of $S_j$, denoted by $pm(i, j, t)$ for all $i \in J(j)$, will be evaluated via the likelihood measurement function $f$, and the maximum metric among them will be chosen as the survivor metric $sm(j, t)$. Let $r(t)$ be the n-bit word received at the $t$th stage and $w(i, j)$ the valid codeword of transition $e_{i,j}$. The $pm(i, j, t)$ and $sm(j, t)$ are related as
$$sm(j, t) = X(pm(i, j, t)) \;\; (= \max\{pm(i, j, t);\ \forall i \in J(j)\}) \qquad (2)$$
where $X$ is the maximum comparator and is applied over all $S_i$ with $i \in J(j)$. From (2), the operands relating to survivor metric evaluation for $S_j$ are $r(t)$, $\{w(i, j) \mid i \in J(j)\}$, and $\{sm(i, t-1) \mid i \in J(j)\}$.
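As a rough illustration of (2), the survivor selection for one state can be sketched in Python. The sketch assumes that the likelihood function f simply adds a branch metric to the previous survivor metric and that larger values are better; the function name `survivor_metric`, its arguments, and the metric values are hypothetical and are not taken from the paper. Only $J(4) = (2, 6)$ comes from the text.

```python
def survivor_metric(j, predecessors, sm_prev, branch_metric):
    """Evaluate (2) for state j under an additive-metric assumption.

    predecessors:  dict mapping j to the index set J(j)
    sm_prev:       dict mapping i to sm(i, t-1)
    branch_metric: callable (i, j) -> metric of transition e_{i,j} at stage t
    Returns (sm(j, t), index i of the winning source state).
    """
    # Path metric of every candidate path entering state j.
    candidates = {i: sm_prev[i] + branch_metric(i, j) for i in predecessors[j]}
    # The comparator X keeps the largest path metric.
    best_i = max(candidates, key=candidates.get)
    return candidates[best_i], best_i


# J(4) = (2, 6) as quoted for Fig. 2; all metric values are made up.
J = {4: (2, 6)}
sm_prev = {2: 5, 6: 3}
bm = {(2, 4): 1, (6, 4): 4}
result = survivor_metric(4, J, sm_prev, lambda i, j: bm[(i, j)])  # (7, 6)
```

Every call needs the survivor metrics of all states in $J(j)$, which is the data access pattern the following sections set out to reduce.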
III. TRELLIS DIAGRAM INTEGRATION
By topological modifications, p stages of the trellis diagram may be integrated, with each p-branch path selected as a single branch in the reconfigured trellis diagram. Two variations of the trellis diagram for (n, 1, 3) codes are illustrated in Fig. 3(a) and (b). In Fig. 3(a), every two stages are integrated and each branch is exactly a 2-branch path in the original trellis (Fig. 2). In Fig. 3(b), the merged 3-branch paths are illustrated. The advantages of mapping the Viterbi decoding algorithm onto the integrated trellis diagrams are analyzed as follows.
Referring to Fig. 2, assume that $pm(1, 3, t+1)$ is determined as the survivor metric for $S_3$ at the $(t+1)$th stage and $pm(3, 6, t+2)$ as the survivor metric for $S_6$ at the $(t+2)$th stage (as circled in Fig. 2).
Then the 3-branch path $p_{6,4,t+3}$ (as highlighted in Fig. 2), composed of $e_{1,3}$, $e_{3,6}$, and $e_{6,4}$, is one of the candidate paths of $S_4$ at the $(t+3)$th stage. In the conventional VA, $pm(6, 4, t+3)$ is iteratively calculated stage by stage as
$$\begin{aligned}
pm(6, 4, t+3) &= f(r(t+3), w(6, 4), sm(6, t+2)) \\
&= f(r(t+3), w(6, 4), X(pm(i, 6, t+2))) \\
&= f(r(t+3), w(6, 4), pm(3, 6, t+2)) \\
&= f(r(t+3), w(6, 4), f(r(t+2), w(3, 6), sm(3, t+1))) \\
&= f(r(t+3), w(6, 4), f(r(t+2), w(3, 6), pm(1, 3, t+1)))
\end{aligned}$$
In this stage-by-stage dynamic process, survivor metrics for all $S_i$ with $i \in J(j)$ should be available to update the survivor metric of $S_j$. This requires excessive data access operations, which are the most time-consuming operations in VLSI circuits.
With p stages merged into a single one, p successive branches on the trellis diagram are taken as one integrated branch. If the sum of the branch metrics of these p successive branches can be calculated in a single computation procedure and added to the survivor metrics evaluated p stages before to form the new path metrics, then the survivor metric for each state is updated only every p stages, and the number of data access operations can be reduced by a factor of p.
However, in order for the branch/path/survivor evaluations defined on the integrated trellis to be compatible with those on the original VA trellis, integration of the trellis diagram is subject to the constraint that a merged branch on the integrated trellis must be in one-to-one correspondence with a p-branch path in the original trellis.
Fig. 3. (a) The integrated trellis diagram (p = 2). (b) The integrated trellis
A brief proof given in Appendix A shows that for a p-branch path to be uniquely specified by its source and destination states, p should be no larger than m. For the extreme case p = m, the trellis diagram with m stages integrated is in the form of a complete bipartite $K_{N,N}$ graph [16]. The Viterbi transform is derived according to the modified $K_{N,N}$ trellis diagram. We use $E_{i,j}$ to denote an m-branch path linking $S_i$ and $S_j$ in the $K_{N,N}$ graph. The trellis diagram of an (n, 1, 3) code with three stages integrated, as shown in Fig. 3(b), is a $K_{8,8}$ graph. The marked path is exactly the 3-branch path highlighted in Fig. 2.
The dynamic decoding procedures of the VA will be modified for $E_{i,j}$ so that the data access/transfer for metric evaluations demanded by the component branches can be eliminated. For clarification, only (n, 1, m) convolutional codes will be discussed for the derivation of the VT, and we start our discussion with the encoding process.
A. State Code Assignment
where $G_m$ is a $(2m \times mn)$ extended matrix composed of m G's, with each G being located one row lower than the previous one:
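The shifted placement of G inside $G_m$ described above can be sketched with NumPy. This is only an illustration under stated assumptions: G is taken to be an $(m+1) \times n$ binary matrix with one row per shift-register tap, each copy of G is shifted one row down and n columns to the right, and the function name `extended_generator` and the example G are hypothetical.

```python
import numpy as np


def extended_generator(G):
    """Build the (2m x mn) extended matrix from an (m+1) x n generator matrix G.

    Assumption: copy k of G occupies rows k..k+m and columns k*n..(k+1)*n-1,
    i.e. each copy sits one row lower (and one block column further right)
    than the previous one.
    """
    G = np.asarray(G) % 2
    m, n = G.shape[0] - 1, G.shape[1]
    Gm = np.zeros((2 * m, m * n), dtype=int)
    for k in range(m):
        Gm[k:k + m + 1, k * n:(k + 1) * n] = G
    return Gm


# Hypothetical (2, 1, 2) example: three tap rows, two output bits per stage.
G_example = [[1, 1],
             [1, 0],
             [1, 1]]
Gm_example = extended_generator(G_example)  # shape (4, 4)
```

With m + 1 rows per copy and m - 1 downward shifts, the result has exactly 2m rows, which agrees with the stated $(2m \times mn)$ size.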
The extended m-stage branch code derived from (10)
We define a matrix operator "*" to be:
D. The m-Stage Branch Metric Evaluation
In the case of a binary symmetric channel (BSC), $w(i, j)$ and $r(t)$ are n-tuple vectors, and the branch metric $bm(i, j, t)$ may be measured from the Hamming distance between $w(i, j)$ and $r(t)$. A maximum likelihood decoder will choose $w(i, j)$ as the codeword sent if $bm(i, j, t)$ is the minimum among the Hamming distances between $r(t)$ and all the codewords. The branch metric can be implemented by the inner product operation "$\times$" as
$$Bm(i, j, t) = (W(i, j) \oplus R(t)) \times I_m$$
where $I_m$ is the nm-tuple unit vector extended from $I$.
E. The m-Stage Path Metric Evaluation
After evaluation of the branch metric, the path metric must be calculated. The path metric $pm(i, j, t)$ is defined as the sum of $bm(i, j, t)$ and $sm(i, t-1)$.
With m stages integrated, $J(j)$ for each state contains all the numbers from 0 to $N-1$. The path metric and survivor metric for the integrated trellis diagram are denoted as $Pm(i, j, t)$ and $Sm(j, t)$, which can be evaluated analogously to $pm(i, j, t)$ and $sm(j, t)$, with $X$ operated from $i = 0$ to $N-1$:
$$Sm(j, t) = X(Pm(i, j, t))$$
Thus, the codeword generation, branch metric, and path metric calculations for each state in the integrated trellis can be done by the simple matrix operation given in (17).
F. Elimination of Backtracking
In standard Viterbi decoding, the final decoded symbols must be traced out with backtracking procedures. In evaluating $Sm(j, t)$, we record the code of the source state of the path possessing the largest path metric, that is,
$$U(j, t) = H(Pm(i, j, t)). \qquad (19)$$
For example, if $Pm(1, 4, 6)$ is chosen to be the survivor metric of $S_4$ at $t = 6$,
$$Pm(1, 4, 6) = X(Pm(i, 4, 6))$$
then the code of $S_1$ is regarded as the maximum likelihood source code of the path leading to $S_4$ at $t = 6$.
Thus, if $Sm(j, t)$ is the maximum among the survivor metrics for all states, the most likely decoded code may be directly read out from $U(j, t)$. The usual backtracking process in the VA is then eliminated.
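The branch metric, path metric, survivor selection, and source-state recording of Sections D to F can be sketched together as one update of the integrated trellis. The sketch assumes the metric is a Hamming distance, so the comparator keeps the smallest path metric (as the Min comparator of Section V-A does); with a likelihood-type metric the same code would use `argmax` and `max` instead. The function name `vt_step`, the array layout, and all numerical values are hypothetical.

```python
import numpy as np


def vt_step(W, R, Sm_prev):
    """One integrated-stage update over all N x N branches.

    W:       int array (N, N, n*m), W[i, j] is the m-stage branch code of E_{i,j}
    R:       int array (n*m,), the received bits of the m merged stages
    Sm_prev: int array (N,), survivor metrics of the previous integrated stage
    Returns (Sm, U): new survivor metrics and the winning source index per state.
    """
    W, R, Sm_prev = np.asarray(W), np.asarray(R), np.asarray(Sm_prev)
    unit = np.ones(W.shape[-1], dtype=int)   # all-ones nm-tuple
    Bm = (W ^ R) @ unit                      # Hamming distances, shape (N, N)
    Pm = Bm + Sm_prev[:, None]               # Pm[i, j] = Bm[i, j] + Sm_prev[i]
    U = Pm.argmin(axis=0)                    # source state recorded per destination
    Sm = Pm.min(axis=0)                      # survivor metric per destination
    return Sm, U


# Toy data with N = 2 states and nm = 2 bits; values are illustrative only.
W_toy = [[[0, 0], [1, 1]],
         [[1, 1], [0, 1]]]
Sm_toy, U_toy = vt_step(W_toy, [1, 1], [0, 1])  # Sm_toy = [1, 0], U_toy = [1, 0]
```

Because `U` already holds the source state of each survivor, the decoded code for the best final state can be read from it directly, which is the sense in which backtracking is avoided.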
G. Transformation Properties
Branch metrics, path metrics, and survivor metrics will be evaluated stage by stage, as listed in Table I. The path metric is calculated as
$$Pm(i, j, t) = [(C(j) * G_{mU} \oplus C(i) * G_{mL} \oplus R(t)) \times I_m] + Sm(i, t-1)$$
Computations are listed in Table II. The decoded information sequences obtained via the VA and the VT procedures are both [0100100100].
V. SYSTOLIC ARRAY PROCESSOR FOR THE VITERBI TRANSFORM
By using the VT, a number of features can be obtained for the VLSI array design [19], [20].
Locality: According to the matrix operation (17), systolic array implementation for the VT may be realized with local data access. Thus the complicated wiring problem incurred by the long distance data transfer in the original trellis is eliminated.
Regularity and Asynchrony: In the VT, branch code generation is integrated with the decoding processes. Hence, there is no need for additional branch code generation circuitry or memories [8], [10].
Evaluation of branch/path/survivor metrics may also be rhythmically operated without extra time delays for branch code access [8], [10].
Modularity: Computations relating to $C(i)$ and $C(j)$, i.e., $C(i) * G_{mL}$ and $C(j) * G_{mU}$, can be isolated and operated in parallel by modular processing blocks to reduce processing time.
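The separation of the source-state and destination-state computations can be illustrated as follows. The sketch assumes that $G_{mL}$ and $G_{mU}$ are the top and bottom m rows of the extended matrix, that a state code lists the register bits oldest first, and that "*" is matrix multiplication modulo 2. All function names and the example generator are hypothetical. The stage-by-stage encoder is included only to check the result.

```python
import numpy as np
from itertools import product


def split_extended_generator(G):
    """Build the extended matrix from (m+1) x n G; return its top and bottom m rows."""
    G = np.asarray(G) % 2
    m, n = G.shape[0] - 1, G.shape[1]
    Gm = np.zeros((2 * m, m * n), dtype=int)
    for k in range(m):
        Gm[k:k + m + 1, k * n:(k + 1) * n] = G   # copy k, one row lower each time
    return Gm[:m], Gm[m:]


def branch_codes(G):
    """Return W with W[i, j] = C(i)*GmL xor C(j)*GmU for all N x N state pairs."""
    GmL, GmU = split_extended_generator(G)
    m = GmL.shape[0]
    codes = np.array(list(product((0, 1), repeat=m)))  # row i is state code C(i)
    src = codes @ GmL % 2   # depends on the source state only
    dst = codes @ GmU % 2   # depends on the destination state only
    return src[:, None, :] ^ dst[None, :, :]


def encode_stage_by_stage(G, state, inputs):
    """Reference: conventional shift-register encoding, one stage at a time."""
    G = np.asarray(G) % 2
    reg, out = list(state), []
    for u in inputs:
        window = np.array(reg + [u])          # register contents plus the new bit
        out.extend((window @ G % 2).tolist())
        reg = reg[1:] + [u]                   # shift the new bit into the register
    return out
```

Under these assumptions, `branch_codes(G)[i, j]` matches `encode_stage_by_stage(G, C(i), C(j))` for every state pair, because after m stages the register holds exactly the m input bits, so the destination code equals the input sequence. The two products `src` and `dst` have no data dependence on each other, which is what allows them to be assigned to separate processing blocks.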
A. The Systolic (2, 1, 4) Processor Design
Due to the triangular property of $G_{mL}$ and $G_{mU}$, as shown in Fig. 4(a), $G_m$ may be installed in a 2-D array processor.
c) $Sm(j, t) = \mathrm{Min}(Sm(j, t), Pm(i, j, t))$ (where Min is the minimum comparator).
$Sm(j, t)$ will be transmitted with $B(j)$ and iteratively updated through the linear array. As soon as $Sm(j, t)$ emerges from the right, it will be recirculated into the linear array for the evaluation of $Sm(j, t+m)$. At the same time, all the $Sm(j, t)$ will be compared sequentially in the COM processor to obtain the maximum survivor metric. Within $2^m$ execution cycles of PE2, m decoded symbols will be sent out from the COM processor.
B. System Extension by Modularization
To preserve the architecture of the (2, 1, 4) VT processor, the generator matrix of a
C. Performance Comparison
Computations required for the branches in a single stage using the VA processor and for the m-stage branches in an integrated stage using the VT processor are analyzed as follows. As listed in a), there are four computation modules in the VA. In most Viterbi decoder designs, these four modules are implemented with four building blocks: the codeword generation mechanism, the branch metric calculation mechanism, the path metric calculation mechanism, and the survivor metric decision mechanism. The major design complexity of a Viterbi decoder is caused by the data interconnection within these building blocks. With the VT procedures derived here, although more path metrics are to be calculated than with the VA, the computation procedures are more concise and hence the building blocks and the routing schemes in most Viterbi decoders [8]
$$\begin{bmatrix}
G_0 & & & & & & & \\
G_1 & G_0 & & & & & & \\
G_2 & G_1 & G_0 & & & & & \\
G_3 & G_2 & G_1 & G_0 & & & & \\
G_4 & G_3 & G_2 & G_1 & G_0 & & & \\
G_5 & G_4 & G_3 & G_2 & G_1 & G_0 & & \\
G_6 & G_5 & G_4 & G_3 & G_2 & G_1 & G_0 & \\
G_7 & G_6 & G_5 & G_4 & G_3 & G_2 & G_1 & G_0 \\
G_8 & G_7 & G_6 & G_5 & G_4 & G_3 & G_2 & G_1 \\
 & G_8 & G_7 & G_6 & G_5 & G_4 & G_3 & G_2 \\
 & & G_8 & G_7 & G_6 & G_5 & G_4 & G_3 \\
 & & & G_8 & G_7 & G_6 & G_5 & G_4 \\
 & & & & G_8 & G_7 & G_6 & G_5 \\
 & & & & & G_8 & G_7 & G_6 \\
 & & & & & & G_8 & G_7 \\
 & & & & & & & G_8
\end{bmatrix}$$
For small m (< 4), the concise computation of the VT can be implemented with a single chip design of a VT processor as illustrated in Fig. 6, in which the routing scheme is rather simple. For larger m, the concise computation can be realized by the systolic (2, 1, 4m) VT processor design as introduced in Section V. In the systolic (2, 1, 4m) VT processor, data access per symbol is reduced and confined to local access, and branch codes are efficiently generated by a systolic array as shown in Fig. 4(b). Survivor metrics and survivor paths are all stored in local memory and are locally transferred. The processing time required to decode m symbols with the VT processor [Fig. 4(b)] is only the processing time of the pipeline composed of $2^{m+1}$ PE2's as shown in BLOCK 3 [Fig. 4(b)].
That is, the sum of the execution cycles for $2^{m+1}$ Hamming distance evaluations for nm-tuple vectors, $2^{m+1}$ additions, and $2^{m+1}$ comparisons.
VII. CONCLUSION
The Viterbi transformation can be regarded as a VLSI-oriented algorithm because it is developed to take advantage of high-density, low-cost, and high-speed VLSI. The dynamic decoding procedures in m successive stages are parallelized. The integrated decoding operations are formulated into matrix operations. For longer constraint lengths, the Viterbi decoders can be derived from the combination of basic m = 4 VT processors such that a single standard VLSI design can be used.
APPENDIX A
A trellis diagram is derived from a tree graph. There are $N$ $(= 2^m)$ m-level trees embedded in a trellis diagram. They are rooted at each of the N states and reach all of the possible N states m stages later. Hence, there is exactly one path between any pair of nodes which are m stages apart.
Over p stages (p < m), there is either one path or none between any arbitrary pair of nodes. Hence, each p-branch path may be uniquely specified by its source state and destination state.
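The counting argument above can be checked numerically for small m. The sketch below assumes a shift-register trellis in which an input bit u takes state s to (2s + u) mod N; this convention is an assumption, although it does reproduce the transitions $e_{1,3}$, $e_{3,6}$, $e_{6,4}$ and the set $J(4) = (2, 6)$ quoted for Fig. 2. The function name `path_counts` is hypothetical.

```python
import numpy as np


def path_counts(m, p):
    """Count the distinct p-branch paths between every ordered pair of states.

    Entry [i, j] of the returned matrix is the number of p-branch paths from
    state i to state j in the assumed shift-register trellis with N = 2**m states.
    """
    N = 2 ** m
    A = np.zeros((N, N), dtype=int)          # one-stage adjacency matrix
    for s in range(N):
        for u in (0, 1):
            A[s, ((s << 1) | u) & (N - 1)] = 1   # shift the input bit into the state
    return np.linalg.matrix_power(A, p)


counts_equal = path_counts(3, 3)   # p = m: every entry is 1
counts_less = path_counts(3, 2)    # p < m: every entry is 0 or 1
counts_more = path_counts(3, 4)    # p > m: every entry is 2, so paths are no longer unique
```

For m = 3 this gives exactly one path per state pair when p = 3, at most one when p = 2, and two when p = 4, which matches the bound that p should be no larger than m.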
