Computers Math. Applic. Vol. 17, No. 12, pp. 1511-1521, 1989 0097-4943/89 $3.00 + 0.00
Printed in Great Britain. All rights reserved Copyright © 1989 Pergamon Press plc
EFFICIENT COMPUTING METHODS FOR PARALLEL
PROCESSING: AN IMPLEMENTATION OF
THE VITERBI ALGORITHM
K. A. WEN and J. F. WANG 
Institute of Electrical and Computer Engineering, National Cheng Kung University, 1, University Road, 
Tainan, Taiwan, Republic of China 
(Received 15 January 1988; in revised form 26 July 1988) 
Abstract--Efficient computing methods are exploited for parallel processing of the most important trellis
search algorithm, i.e. the Viterbi decoding algorithm (VA). The complicated data transfer scheme and the
rather time-consuming computations caused by dynamic trellis search procedures are reorganized into
matrix operations, so that the well-developed systolic processors for matrix operations can be adapted to
implement the whole decoding procedure of VA. A certain number of AND/EOR operations for
maximum likelihood estimation are saved. Flexible time/area performance is provided, and a T-fold
speedup can be obtained when T consecutive stages are parallelized.
1. INTRODUCTION 
The mapping problem [1] has long been studied for the speed advantage of VLSI array processors. Most
of the existing mapping methods [2-5] are developed for a certain class of algorithms, such as
FORTRAN-like algorithms, algorithms with DO-loops, etc. In a systematic work on exploring
parallel processing for various algorithms, we concentrated on computational analysis instead of
being directly involved in algorithm mapping. By reorganizing the algorithm procedures into
simple computations, especially into matrix operations, most well-developed arithmetic processors
can be employed with minor modifications for these algorithms. This substantially facilitates the array
processor design for various algorithms.
In the following, we present a case study on the Viterbi algorithm (VA) [6], which is essentially
a dynamic trellis search algorithm first proposed as an important technique in operations research
and is now increasingly demanded in modern signal processing systems [7-9]. In recent years, a
great deal of design effort has been devoted to finding efficient VLSI implementations of VA
[10-13]. All the proposed works concentrate on parallelization of the $q^m$ states in a single stage
of the trellis, where m is the memory length and q is the alphabet size. With the strongly connected
trellis proposed in [12], the decoding procedures of at most m stages are parallelized. Besides, all the
proposed structures have a fixed topology for specified values of m and q, and the logic speed
performance to be expected is fixed.
By computational analysis, the decoding procedures for an arbitrary number of stages can be
operated concurrently. The tradeoff between computation speed and degree of integration can be
adjusted as needed, and flexible time/area performance can then be obtained to suit various
time/area constraints. Efficient computing methods for branch distance evaluation are also derived
and, in some cases, half of the AND/EOR operations can be saved. Since the trellis search
procedures and the branch distance computations are both reorganized into matrix operations,
parallel processing of VA can be implemented on the systolic array processors for matrix
operations.
2. EFFICIENT COMPUTATION FOR TRELLIS SEARCH
The convolutional code generated from an encoder with n outputs, k inputs and m memories
is specified as an (n, k, m) code. At the receiver, a Viterbi decoder is employed to determine
the survivor code sequence which most resembles the received code sequence [18]. The excessive
computations for maximum likelihood estimation and the complicated data transfer schemes for
dynamic trellis search are widely regarded as the crucial problems in VLSI design. In Fig. 1, the
trellis transfer scheme for a rather small m (= 5) is shown. Sorting algorithms have been used to
implement locally connected parallel processors of VA [12]. In contrast, we modify the all-pair
shortest path algorithm [14] and apply it to the multi-stage trellis diagram, so that the data transfer
and computations demanded in VA can be implemented on the well-developed array processors for
parallel processing.
Fig. 1. Trellis diagram for a 32-state Viterbi decoder.
2.1. Parallelization on survivor distance evaluation 
For clarity, the computation procedures for survivor distance evaluation will be explained by
a simple example of a 4-state Viterbi decoder, whose trellis diagram is shown in Fig. 2. Different
indexes for "time interval" and "stage" are introduced in the figure. Each directed branch from
$S_i$ to $S_j$ is denoted as $e_{i,j}$ and is labeled with the branch distance. Extracting a single time interval
of the trellis, the subgraph can be resketched as a weighted digraph denoted as M(t). The cost matrix
[14] of M(t) can be constructed with the (i, j)th entry (i.e. the entry located on the ith row and
jth column) being the branch distance of $e_{i,j}$. We use $A(t) = [a_{i,j}(t)]$ to denote the cost matrix of
M(t). If $e_{i,j}$ does not exist in M(t), $a_{i,j}(t)$ is set to $\infty$. In Fig. 3, M(1), M(2) and M(3)
are shown together with their cost matrices A(1), A(2) and A(3).
Fig. 2. Trellis diagram for a 4-state Viterbi decoder.
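$$A(1) = \begin{bmatrix} 1 & 1 & \infty & \infty \\ \infty & \infty & 3 & 0 \\ 2 & 1 & \infty & \infty \\ \infty & \infty & 1 & 2 \end{bmatrix}, \quad
A(2) = \begin{bmatrix} 1 & 0 & \infty & \infty \\ \infty & \infty & 3 & 1 \\ 0 & 1 & \infty & \infty \\ \infty & \infty & 2 & 1 \end{bmatrix}, \quad
A(3) = \begin{bmatrix} 1 & 3 & \infty & \infty \\ \infty & \infty & 2 & 2 \\ 2 & 1 & \infty & \infty \\ \infty & \infty & 0 & 1 \end{bmatrix}$$
Fig. 3. M(t) and A(t), t = 1, 2, 3.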
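Let $X = [x_{i,j}]$, $Y = [y_{i,j}]$ and $Z = [z_{i,j}]$ be $n \times n$ matrices. We define a modified matrix-matrix
multiplication as
$$Z = X \odot Y, \qquad z_{i,j} = \phi(x_{i,s} + y_{s,j}) \quad \text{for } s = 0 \text{ to } n - 1, \tag{1}$$
where $\phi$ is the minimum comparator. The (i, j)th entry in $A(t) \odot A(t + 1)$ is exactly the shortest
path length [14] between $S_i$ at the tth stage and $S_j$ at the (t + 2)th stage. Let B(p, q) (q > p) be the
"product" of A(p), A(p + 1), ..., A(q), i.e. $B(p, q) = A(p) \odot A(p + 1) \odot \cdots \odot A(q)$; the (i, j)th
entry in B(p, q) is exactly the shortest path length between $S_i$ at the pth stage and $S_j$ at the (q + 1)th
stage. For the trellis in Fig. 2,
$$B(1,2) = A(1) \odot A(2) = \begin{bmatrix} 1 & 1 & \infty & \infty \\ \infty & \infty & 3 & 0 \\ 2 & 1 & \infty & \infty \\ \infty & \infty & 1 & 2 \end{bmatrix} \odot \begin{bmatrix} 1 & 0 & \infty & \infty \\ \infty & \infty & 3 & 1 \\ 0 & 1 & \infty & \infty \\ \infty & \infty & 2 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 4 & 2 \\ 3 & 4 & 2 & 1 \\ 3 & 2 & 4 & 2 \\ 1 & 2 & 4 & 3 \end{bmatrix}, \tag{2}$$
$$B(1,3) = B(1,2) \odot A(3) = \begin{bmatrix} 2 & 1 & 4 & 2 \\ 3 & 4 & 2 & 1 \\ 3 & 2 & 4 & 2 \\ 1 & 2 & 4 & 3 \end{bmatrix} \odot \begin{bmatrix} 1 & 3 & \infty & \infty \\ \infty & \infty & 2 & 2 \\ 2 & 1 & \infty & \infty \\ \infty & \infty & 0 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 5 & 2 & 3 \\ 4 & 3 & 1 & 2 \\ 4 & 5 & 2 & 3 \\ 2 & 4 & 3 & 4 \end{bmatrix}. \tag{3}$$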
The minimum value in the jth column of B(1, 3), circled with a dotted line, is exactly the survivor
distance (or metric) of $S_j$ at the 4th stage. If the minimum value is located in the ith row, it indicates
that the survivor (shortest) path leading to $S_j$ at the 4th stage must be rooted at $S_i$ at the 1st
stage.
Among all the entries in B(1, 3), the minimum one, circled with a solid line, indicates that the
survivor path from $S_1$ (at the 1st stage) to $S_2$ (at the 4th stage) is the shortest survivor path between
the 1st stage and the 4th stage.
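The worked example in (2) and (3) can be checked with a short script. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the function names `min_plus` and `survivors` are ours, and the cost matrices are the values as read from Fig. 3 and equation (2), with `INF` standing for $\infty$.

```python
INF = float("inf")  # cost of a branch that does not exist in M(t)

def min_plus(X, Y):
    """Modified product (1): z[i][j] = min over s of x[i][s] + y[s][j]."""
    n = len(X)
    return [[min(X[i][s] + Y[s][j] for s in range(n)) for j in range(n)]
            for i in range(n)]

def survivors(B):
    """For each column j, return (survivor distance, root state i)."""
    n = len(B)
    return [min((B[i][j], i) for i in range(n)) for j in range(n)]

# Cost matrices of the 4-state example (assumed transcription of Fig. 3).
A1 = [[1, 1, INF, INF], [INF, INF, 3, 0], [2, 1, INF, INF], [INF, INF, 1, 2]]
A2 = [[1, 0, INF, INF], [INF, INF, 3, 1], [0, 1, INF, INF], [INF, INF, 2, 1]]
A3 = [[1, 3, INF, INF], [INF, INF, 2, 2], [2, 1, INF, INF], [INF, INF, 0, 1]]

B12 = min_plus(A1, A2)    # shortest paths from stage 1 to stage 3
B13 = min_plus(B12, A3)   # shortest paths from stage 1 to stage 4

print(B13)
print(survivors(B13))
```

With these inputs the smallest entry of B(1, 3) is 1, in row 1 and column 2, which corresponds to the path from $S_1$ to $S_2$ described above.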
The well-developed array processor for matrix multiplication [15] can be employed to implement
the matrix operation defined in (1) with a minor modification of the processing element. Thus, the
survivor distance evaluation for $q^m$ states can be completed in at most 3q [m/2] - 1 execution cycles
of simple ACS (addition-comparison-selection) operations.
2.2. Parallelization on trellis search
Parallelization of the decoding procedures for successive stages is possible owing to the
associative property of the operator $\odot$. That is,
$$X \odot Y \odot Z \odot W = (X \odot Y) \odot (Z \odot W). \tag{4}$$
Equation (4) can be verified as
$$T1 = [t1_{i,j}] = X \odot Y, \qquad t1_{i,j} = \phi(x_{i,k} + y_{k,j},\ k = 0 \text{ to } N - 1), \quad N = q^m, \tag{5}$$
$$T2 = [t2_{i,j}] = T1 \odot Z, \qquad t2_{i,j} = \phi(\phi(x_{i,k} + y_{k,r},\ k = 0 \text{ to } N - 1) + z_{r,j},\ r = 0 \text{ to } N - 1), \tag{6}$$
$$T3 = [t3_{i,j}] = T2 \odot W, \qquad t3_{i,j} = \phi(\phi(\phi(x_{i,k} + y_{k,r},\ k = 0 \text{ to } N - 1) + z_{r,s},\ r = 0 \text{ to } N - 1) + w_{s,j},\ s = 0 \text{ to } N - 1), \tag{7}$$
$$T4 = [t4_{i,j}] = Z \odot W, \qquad t4_{i,j} = \phi(z_{i,k} + w_{k,j},\ k = 0 \text{ to } N - 1), \tag{8}$$
$$T5 = [t5_{i,j}] = T1 \odot T4, \qquad t5_{i,j} = \phi(\phi(x_{i,k} + y_{k,r},\ k = 0 \text{ to } N - 1) + \phi(z_{r,s} + w_{s,j},\ s = 0 \text{ to } N - 1),\ r = 0 \text{ to } N - 1). \tag{9}$$
Since adding a constant commutes with taking the minimum, both (7) and (9) reduce to the minimum of
$x_{i,k} + y_{k,r} + z_{r,s} + w_{s,j}$ over all k, r and s. The results of (7) and (9) are therefore identical, and
$\odot$ obeys the associative law. Hence, survivor distance
evaluation for T consecutive stages, say from the (t + 1)th time interval to the (t + T)th time
interval, can be obtained by evaluating the product of A(t + 1), A(t + 2), ..., A(t + T) pairwise
in the binary tree processor [16], as shown in Fig. 4. By employing T/2 matrix processors for
parallel-in-serial-out (PISO) processing (Fig. 4a), the total product of A(t + 1), A(t + 2), ..., A(t + T) can
be obtained in log(T/2) steps and a speedup of T/log(T/2) can be achieved. By employing T - 1
matrix processors for pipeline processing (Fig. 4b), a speedup of T can be achieved.
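The pairwise evaluation allowed by (4) can be sketched in software. The Python fragment below is a minimal illustration, not a model of the hardware in Fig. 4: `tree_product` and `random_cost_matrix` are hypothetical helper names, the cost matrices are random test data, and the returned depth simply counts the pairwise levels used by this sketch rather than reproducing the step counts quoted above.

```python
from functools import reduce
import random

INF = float("inf")

def min_plus(X, Y):
    """Modified product (1): z[i][j] = min over s of x[i][s] + y[s][j]."""
    n = len(X)
    return [[min(X[i][s] + Y[s][j] for s in range(n)) for j in range(n)]
            for i in range(n)]

def tree_product(mats):
    """Combine adjacent pairs level by level, as in a binary tree."""
    level, depth = list(mats), 0
    while len(level) > 1:
        nxt = [min_plus(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # an unpaired matrix moves up unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

def random_cost_matrix(n, rng):
    """Random test data only; INF marks a missing branch."""
    return [[rng.choice([INF, 0, 1, 2, 3]) for _ in range(n)] for _ in range(n)]

rng = random.Random(0)
mats = [random_cost_matrix(4, rng) for _ in range(8)]   # T = 8 stages
serial = reduce(min_plus, mats)      # stage-by-stage evaluation
tree, depth = tree_product(mats)     # pairwise (tree) evaluation
print(tree == serial, depth)
```

Because the operator is associative, the two evaluation orders give the same matrix; the products within one level are independent of each other and could be computed concurrently.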
2.3. Survivor path determination 
A Viterbi decoder at the receiver requires arbitrarily long path memories to store the semi-infinite
survivor paths for each state (i.e. the shortest path leading to each state from the original time
interval). For practical implementation, however, a truncated-memory decoder retains only the
finite-length survivor paths determined in the last Q time intervals, where Q is the truncation length [18].
In this case, backtracking procedures should be employed to obtain the decoded symbol for the
previous Q time interval. That is, the first branch of the shortest Q-length survivor path should
be traced out, and the k-bit input symbol specified for that branch will be sent out from the decoder
as the decoded maximum likelihood symbol for the previous Q time interval. Conventionally, large
memories are arranged to store the Q-length (kQ-bit) survivor paths for all the $q^m$ states, and the
first branch of the survivor path is obtained by memory access. Besides, based on dynamic
programming, a certain memory management strategy should be adopted to refresh the survivor
path for each state at each stage.
Fig. 4. The binary tree machine: (a) for PISO processing; (b) for pipeline processing.
With our computation procedures, the path memories as well as the memory operations can all
be eliminated, because the source states of the shortest Q-length survivor paths determined at the
tth and (t + 1)th stages (say $S_i$ and $S_j$) can be easily observed from
$B(t - Q, t - 1) = A(t - Q) \odot A(t - Q + 1) \odot \cdots \odot A(t - 1)$ and
$B(t - Q + 1, t) = A(t - Q + 1) \odot A(t - Q + 2) \odot \cdots \odot A(t)$, and
$e_{i,j}$ is exactly the first branch of the Q-length survivor path determined by the decoder at the tth
stage. Hence, there is no need for path memories, which have long been regarded as a crucial
problem in Viterbi decoder design.
In Fig. 2, the path rooted at $S_1$ at the 1st stage and leading to $S_2$ at the 4th stage is determined by (3)
as the shortest 3-length path, and $S_1$ is the source state of that path. If $S_2$ is determined from
$A(2) \odot A(3) \odot A(4)$ as the source state of the shortest 3-length path between the 2nd and 5th stages,
the k-bit input code specified for $e_{1,2}$ will be sent out as the decoded symbol for the first time interval.
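The source-state observation can be illustrated as follows. This Python sketch is hedged in two ways: the helper names `chain` and `source_state` are ours, and while A(1) to A(3) are the values read from Fig. 3, the matrix `A4` is purely hypothetical (the paper does not list A(4)); it was chosen only so that $S_2$ becomes the source state of the second window.

```python
INF = float("inf")

def min_plus(X, Y):
    """Modified product (1): z[i][j] = min over s of x[i][s] + y[s][j]."""
    n = len(X)
    return [[min(X[i][s] + Y[s][j] for s in range(n)) for j in range(n)]
            for i in range(n)]

def chain(mats):
    """B(p, q) = A(p) (.) A(p+1) (.) ... (.) A(q), evaluated left to right."""
    B = mats[0]
    for A in mats[1:]:
        B = min_plus(B, A)
    return B

def source_state(B):
    """Row index of the smallest entry of B: root of the shortest path."""
    n = len(B)
    return min((B[i][j], i) for i in range(n) for j in range(n))[1]

# A(1)..A(3): assumed transcription of Fig. 3.
A1 = [[1, 1, INF, INF], [INF, INF, 3, 0], [2, 1, INF, INF], [INF, INF, 1, 2]]
A2 = [[1, 0, INF, INF], [INF, INF, 3, 1], [0, 1, INF, INF], [INF, INF, 2, 1]]
A3 = [[1, 3, INF, INF], [INF, INF, 2, 2], [2, 1, INF, INF], [INF, INF, 0, 1]]
# A(4): HYPOTHETICAL values, not taken from the paper.
A4 = [[0, 2, INF, INF], [INF, INF, 1, 2], [1, 2, INF, INF], [INF, INF, 2, 1]]

i = source_state(chain([A1, A2, A3]))   # window over stages 1..4
j = source_state(chain([A2, A3, A4]))   # window over stages 2..5
first_branch = (i, j)                   # the branch e_{i,j} to be decoded
print(first_branch)
```

With this assumed data the two windows give source states $S_1$ and $S_2$, so the decoded symbol is the input code of $e_{1,2}$, in line with the example in the text.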
2.4. Time/area analysis
With the concise computation for trellis search, the Viterbi decoding procedures specified for the
$q^m$ states in a single stage are integrated in a matrix operation. Parallel processing for an arbitrary
number of consecutive stages is derived. With T consecutive received symbols being processed by
log(T/2) steps of the matrix operation, the hardware logic speed required for the matrix processor [12]
is $\Omega(\log(T/2)/(wT))$, where w is the symbol interval. That is, if the symbols are transmitted to the decoder
every w time period, the execution cycle of the matrix processor should be less than Tw/log(T/2) to
obtain real-time operation. With various settings of T (i.e. with various numbers of stages being
operated in parallel), the factor log(T/2)/T can be adjusted. Hence, if a certain matrix processor with
execution cycle w' is specified, real-time operation can be achieved by adopting T matrix processors
for parallel processing such that Tw/log(T/2) = w'. Or, with a certain data transmission rate and
size of overall architecture being specified, a suitable matrix processor with w' < Tw/log(T/2) should
be available for real-time operation. The hardware logic speed for the binary tree machine as illustrated
in Fig. 4b is $\Omega(1/(wT))$. Thus, various time/area performances can be provided for various
applications.
3. EFFICIENT COMPUTATION FOR BRANCH DISTANCE EVALUATION 
In Fig. 5, the state diagram of a (3, 1, 3) code is illustrated, where $e_{i,j}$ is labeled with the 1-bit
(k = 1) input code and the 3-bit (n = 3) output code. The n-bit output code is referred to as the
branch code of $e_{i,j}$. Let V be the set of all the branch codes generated from the encoder. Given
that r(t) is the n-bit code received at the tth time interval, a Viterbi decoder will compare r(t)
with all the branch codes in V and determine the branch distance for every branch. Thus, for
large m, either a huge memory should be retained to save the $q^{k+m}$ branch codes for branch
distance evaluation or excessive computations should be performed to regenerate them at each
stage.
In VLSI array design, the global memory required for branch codes should be avoided because
it increases the routing complexity and causes excessive memory access time. Therefore, the second
approach is adopted, and we develop an efficient computation method to
simplify the branch distance evaluation for hard-decision Viterbi decoders (q = 2).
Fig. 5. State diagram of a (3, 1, 3) code.
3.1. Efficient branch code generation by RMB/LMB 
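An m-bit state code is used to identify the $2^m$ states of the decoder, and the state assignment is
arranged such that at state $S_i$ the current contents of the m registers are exactly the m-bit reversed
binary representation of i, that is,
$$S_i = [s_{i,0}, s_{i,1}, \ldots, s_{i,m-1}] = i_{[m]}, \tag{10}$$
where $i_{[m]}$ is the m-bit reversed binary representation of i and $s_{i,j}$ is the jth bit of $S_i$. We use C(i, j)
to denote the branch code of $e_{i,j}$. Evaluation of C(i, j) can be formulated as a transformation
operation according to the encoding process of the convolutional code [18]. That is,
$$C(i, j) = Q(i, j) * G, \tag{11}$$
where Q(i, j) is the concatenation of $S_i$ (the current contents of the m registers) and the left-most
k bits of $S_j$ (the k-bit input symbol causing the state transition from $S_i$ to $S_j$), i.e.
$$Q(i, j) = [s_{j,0}, s_{j,1}, \ldots, s_{j,k-1} \mid S_i]. \tag{12}$$
Matrix G is an $[(m + k) \times n]$ encoding matrix which specifies the convolutional code:
$$G = \begin{bmatrix} g_{1,1} & g_{1,2} & \cdots & g_{1,n} \\ g_{2,1} & g_{2,2} & \cdots & g_{2,n} \\ \vdots & \vdots & & \vdots \\ g_{m+k,1} & g_{m+k,2} & \cdots & g_{m+k,n} \end{bmatrix}. \tag{13}$$
The vector-matrix multiplication operator $*$ is performed in the modulo-2 system. The submatrices L and R
consist of the first k rows and the last k rows of G, respectively:
$$L = \begin{bmatrix} g_{1,1} & g_{1,2} & \cdots & g_{1,n} \\ g_{2,1} & g_{2,2} & \cdots & g_{2,n} \\ \vdots & \vdots & & \vdots \\ g_{k,1} & g_{k,2} & \cdots & g_{k,n} \end{bmatrix}, \tag{14}$$
$$R = \begin{bmatrix} g_{m+1,1} & g_{m+1,2} & \cdots & g_{m+1,n} \\ \vdots & \vdots & & \vdots \\ g_{m+k-1,1} & g_{m+k-1,2} & \cdots & g_{m+k-1,n} \\ g_{m+k,1} & g_{m+k,2} & \cdots & g_{m+k,n} \end{bmatrix}. \tag{15}$$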
Define RMB and LMB to be two sets containing $2^k - 1$ n-bit vectors:
$$\mathrm{LMB} = \{Lt(j) \mid Lt(j) = j_{[k]} * L,\ j = 1 \text{ to } 2^k - 1\}, \tag{16}$$
$$\mathrm{RMB} = \{Rt(j) \mid Rt(j) = j_{[k]} * R,\ j = 1 \text{ to } 2^k - 1\}. \tag{17}$$
We will show that via the application of LMB and RMB, $2^{m-k}(2^{2k} - 1)/2^{m+k}$ of the matrix-
multiplications for branch code generation can be replaced by vector additions, and a certain
number of AND/EOR/ADD operations can be saved.
For the code specified in Fig. 5,
$$G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad L = [1\ 1\ 0], \qquad R = [1\ 0\ 1],$$
$$Lt(1) = [1] * [110] = [110], \qquad \mathrm{LMB} = \{Lt(1)\} = \{110\}, \tag{18}$$
$$Rt(1) = [1] * [101] = [101], \qquad \mathrm{RMB} = \{Rt(1)\} = \{101\}, \tag{19}$$
$$C(0, 0) = Q(0, 0) * G = 000, \tag{20}$$
$$C(4, 0) = C(0, 0) \oplus Rt(1) = 101, \tag{21}$$
$$C(0, 1) = C(0, 0) \oplus Lt(1) = 110, \tag{22}$$
$$C(4, 1) = C(0, 0) \oplus Lt(1) \oplus Rt(1). \tag{23}$$
In (20), C(0, 0) is calculated by matrix multiplication. With C(0, 0) evaluated, C(4, 0), C(0, 1)
and C(4, 1) can be evaluated by vector additions as listed in (21)-(23). In the Appendix, we list the
16 matrix-multiplications demanded for branch distance evaluation, which can be reduced to 4
matrix-multiplications for the C(i, 2i) (i = 0 to 3) and 12 vector additions for the others. In general, with
LMB/RMB, the $2^{k+m}$ $[(m + k) \times n]$ matrix-multiplications required for calculating all the $2^{k+m}$
branch codes can be implemented by $2^{m-k}$ $[(m - k) \times n]$ matrix-multiplications and $2^{m-k}(2^{2k} - 1)$
n-bit vector additions. In Table 1, the AND/EOR operations for branch code generation demanded
by the original encoding process and the modified one are listed.
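Equations (20)-(23) can be cross-checked numerically. The sketch below is illustrative only: the function names are ours, the bit ordering follows the reversed binary representation in (10), and the entries of `G` are taken from the Appendix listing of the (3, 1, 3) code, which we treat as an assumed transcription.

```python
M, K = 3, 1                      # memory length and input width of the (3, 1, 3) code
G = [[1, 1, 0],                  # first K rows: submatrix L
     [0, 1, 1],
     [0, 1, 1],
     [1, 0, 1]]                  # last K rows: submatrix R

def rev_bits(x, width):
    """Reversed binary representation of x: least significant bit first."""
    return [(x >> b) & 1 for b in range(width)]

def branch_code(i, j):
    """C(i, j) = Q(i, j) * G modulo 2, with Q(i, j) = [first K bits of S_j | S_i]."""
    q_vec = rev_bits(j, M)[:K] + rev_bits(i, M)
    return [sum(q * g for q, g in zip(q_vec, col)) % 2 for col in zip(*G)]

def xor(a, b):
    """Bitwise modulo-2 addition of two n-bit vectors."""
    return [x ^ y for x, y in zip(a, b)]

Lt1, Rt1 = G[0], G[-1]           # Lt(1) = [1] * L and Rt(1) = [1] * R for K = 1
c00 = branch_code(0, 0)          # the only matrix multiplication for this block
print(c00, xor(c00, Rt1), xor(c00, Lt1), xor(xor(c00, Lt1), Rt1))
```

The three XOR results agree with evaluating `branch_code(4, 0)`, `branch_code(0, 1)` and `branch_code(4, 1)` directly, which is the saving the templates are meant to provide.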
RMB and LMB are derived based on geometric analysis. Each time interval of the trellis diagram
can be viewed as a bipartite graph in which $2^{m-k}$ independent components are embedded. In Fig. 6,
a single stage of the trellis diagram of an 8-state decoder and the four components extracted from
it are shown. Note that each component is exactly a $K_{2^k,2^k}$ graph [19], as illustrated in Fig. 7, and
is denoted as $BU_q = (N_q, E_q)$. It consists of a set of states $N_q$ and a set of edges $E_q$, where $N_q$ is the
union of two sets of mutually independent states. That is,
$$N_q = I_q \cup J_q, \qquad I_q = \{S_{q + r2^{m-k}} \mid r = 0 \text{ to } 2^k - 1\}, \quad J_q = \{S_{q2^k + s} \mid s = 0 \text{ to } 2^k - 1\}. \tag{24}$$
For branch code evaluation, $Q(q + r2^{m-k}, q2^k + s)$ should be available. According to (12),
$$Q(q + r2^{m-k}, q2^k + s) = [(q2^k + s)_{[k]} \mid S_{q + r2^{m-k}}]; \qquad r, s = 0 \text{ to } 2^k - 1. \tag{25}$$
In (25), $(q2^k + s)_{[k]}$ is exactly the k-bit reversed binary representation of s. The right-most k bits
of $S_{q + r2^{m-k}}$ are exactly equal to $r_{[k]}$. Hence $Q(q + r2^{m-k}, q2^k + s)$ may be decomposed into three parts:
$$Q(q + r2^{m-k}, q2^k + s) = [s_{[k]} \mid \mathrm{MID}(q, r) \mid r_{[k]}], \tag{26}$$
where MID(q, r) is the left-most (m - k) bits of $S_{q + r2^{m-k}}$ and is equal to $q_{[m-k]}$. Thus
$$\begin{aligned} C(q + r2^{m-k}, q2^k + s) &= Q(q + r2^{m-k}, q2^k + s) * G \\ &= [s_{[k]} \mid q_{[m-k]} \mid r_{[k]}] * G \\ &= s_{[k]} * L \oplus q_{[m-k]} * G' \oplus r_{[k]} * R, \end{aligned} \tag{27}$$
where G' is the submatrix of G with L and R isolated. For s = r = 0, $s_{[k]} * L$ and $r_{[k]} * R$ are both
all-zero n-bit code words. Thus,
$$C(q, q2^k) = q_{[m-k]} * G'. \tag{28}$$
For $s, r \neq 0$, according to the definition of Lt(i) and Rt(i), equation (27) can be evaluated as
$$\begin{aligned} C(q + r2^{m-k}, q2^k + s) &= s_{[k]} * L \oplus q_{[m-k]} * G' \oplus r_{[k]} * R \\ &= Lt(s) \oplus q_{[m-k]} * G' \oplus Rt(r). \end{aligned} \tag{29}$$
Table 1. Branch code generation
        Original VA            With RMB/LMB
AND     $n(m + k)2^{k+m}$      $n(m - k)2^{m-k}$
EOR     $n(m + k)2^{k+m}$      $n(m - 1)2^{m-k} + 2^{m+k}$
Fig. 6. (a) The bipartite state transition diagram; (b) the embedded butterflies.
Therefore, via the usage of RMB and LMB, the branch code generation process for each
block, $BU_q$, can be completed as follows: (1) $C(q, q2^k)$ will be evaluated by a matrix-
multiplication:
$$C(q, q2^k) = Q(q, q2^k) * G = q_{[m-k]} * G';$$
(2) for the remaining r, s = 0 to $2^k - 1$ (not both zero), $C(q + r2^{m-k}, q2^k + s)$ will be evaluated by $2^{2k} - 1$ vector additions:
$$C(q + r2^{m-k}, q2^k + s) = Lt(s) \oplus C(q, q2^k) \oplus Rt(r).$$
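The two-step procedure for each block $BU_q$ can be written out for general m and k. The following Python sketch makes several assumptions that are ours rather than the paper's: the function names, the convention that Lt(0) and Rt(0) are all-zero vectors, the requirement m > k, and a randomly generated encoding matrix `G_rand` for a hypothetical (3, 2, 4) code used only to exercise the k = 2 case.

```python
import random

def rev_bits(x, width):
    """Reversed binary representation of x: least significant bit first."""
    return [(x >> b) & 1 for b in range(width)]

def vec_mat_mod2(v, rows, n):
    """Row vector times matrix in the modulo-2 system."""
    return [sum(v[a] * rows[a][c] for a in range(len(v))) % 2 for c in range(n)]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def codes_direct(G, m, k):
    """All 2^(m+k) branch codes by one matrix multiplication each, as in (11)."""
    n, codes = len(G[0]), {}
    for i in range(2 ** m):
        for s in range(2 ** k):
            j = (i * 2 ** k) % (2 ** m) + s          # successor state of S_i
            q_vec = rev_bits(j, m)[:k] + rev_bits(i, m)
            codes[(i, j)] = vec_mat_mod2(q_vec, G, n)
    return codes

def codes_by_butterfly(G, m, k):
    """Same codes via (28) and (29): one multiplication per block BU_q."""
    n = len(G[0])
    L, G_mid, R = G[:k], G[k:m], G[m:]               # G_mid plays the role of G'
    Lt = [vec_mat_mod2(rev_bits(s, k), L, n) for s in range(2 ** k)]
    Rt = [vec_mat_mod2(rev_bits(r, k), R, n) for r in range(2 ** k)]
    codes = {}
    for q in range(2 ** (m - k)):
        base = vec_mat_mod2(rev_bits(q, m - k), G_mid, n)   # C(q, q 2^k)
        for r in range(2 ** k):
            for s in range(2 ** k):
                i, j = q + r * 2 ** (m - k), q * 2 ** k + s
                codes[(i, j)] = xor(xor(Lt[s], base), Rt[r])
    return codes

G_fig5 = [[1, 1, 0], [0, 1, 1], [0, 1, 1], [1, 0, 1]]       # from the Appendix
rng = random.Random(1)
G_rand = [[rng.randint(0, 1) for _ in range(3)] for _ in range(6)]  # hypothetical
print(codes_by_butterfly(G_fig5, 3, 1) == codes_direct(G_fig5, 3, 1))
print(codes_by_butterfly(G_rand, 4, 2) == codes_direct(G_rand, 4, 2))
```

Both comparisons hold because (27) is linear in the three parts of Q; the template sets only need to be computed once and are shared by all $2^{m-k}$ blocks.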
3.2. Homogeneous addition for branch distance evaluation 
For hard decision, the Hamming distance between r(t) and C(i, j) is evaluated as the branch
distance of $e_{i,j}$. We adopt a homogeneous addition technique to simplify the Hamming distance
evaluation, and an integrated process is established which combines the branch code generation and the
branch distance evaluation in a single matrix operation.
The integration technique is derived from the concept of homogeneous coordinates [17]. We
name it homogeneous addition; it makes possible the integration of vector additions with
matrix multiplications in a single matrix operation.
Definition. Homogeneous addition operator, @. Let Y be a k x n matrix, let X be a k-tuple
vector and Z an n-tuple vector; then the resultant value of the homogeneous addition operation is
$$X \mathbin{@} [Y \mid Z] = [X \mid 1] * \begin{bmatrix} Y \\ Z \end{bmatrix} \cdot I, \tag{30}$$
where I is an n-tuple unit vector, i.e. I = [1 1 ... 1]. The augmented '1' added to the
Fig. 7. The $K_{2^k,2^k}$ subgraph: $BU_q = (E_q, N_q)$.
Via the @-operation, the branch distance between r(t) and C(i, j) can be derived simultaneously
with the branch code generation process:
$$\begin{aligned} d(r(t), C(i, j)) &= [r(t) \oplus C(i, j)] \cdot I \\ &= [r(t) \oplus Q(i, j) * G] \cdot I \\ &= [Q(i, j) \mid 1] * \begin{bmatrix} G \\ r(t) \end{bmatrix} \cdot I \\ &= [Q(i, j)] \mathbin{@} [G \mid r(t)]. \end{aligned} \tag{31}$$
The three computation operations for branch distance evaluation are integrated in the single matrix
operation listed in (31): (1) generation of C(i, j); (2) bit comparison between C(i, j) and r(t); and
(3) bit-addition for branch distance.
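The integration in (31) can be illustrated with a small numerical check. In this Python sketch the function names are ours, `G` is the encoding matrix of the (3, 1, 3) code as read from the Appendix, and the received word `r` is an arbitrary test value; the sketch compares the single augmented product with the conventional two-step evaluation.

```python
from itertools import product

G = [[1, 1, 0], [0, 1, 1], [0, 1, 1], [1, 0, 1]]   # (3, 1, 3) code, from the Appendix

def distance_homogeneous(q_vec, G, r):
    """[Q | 1] * [G ; r] modulo 2, then the bit sum (the '. I' step)."""
    aug_vec = list(q_vec) + [1]          # Q augmented with a trailing 1
    aug_mat = G + [list(r)]              # G with r(t) appended as an extra row
    bits = [sum(a * b for a, b in zip(aug_vec, col)) % 2 for col in zip(*aug_mat)]
    return sum(bits)

def distance_two_step(q_vec, G, r):
    """Reference: generate C = Q * G first, then count differing bits."""
    code = [sum(a * b for a, b in zip(q_vec, col)) % 2 for col in zip(*G)]
    return sum(c != x for c, x in zip(code, r))

r = [1, 1, 1]                            # arbitrary received 3-bit word
print(distance_homogeneous([0, 0, 0, 1], G, r))   # Q(4, 0), whose code is 101
```

The augmented row contributes r(t) to every column sum, so the modulo-2 product already equals $r(t) \oplus C(i, j)$ and only the final bit count remains.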
Combining homogeneous addition with the application of LMB/RMB, the $[(m + k + 1) \times n]$
matrix-multiplication listed in (31) for branch distance evaluation can be reduced to
$$d(r(t), C(q, q2^k)) = [Q(q, q2^k) \mid 1] * \begin{bmatrix} G \\ r(t) \end{bmatrix} \cdot I = [q_{[m-k]}] \mathbin{@} [G' \mid r(t)], \tag{32}$$
$$d(r(t), C(q + r2^{m-k}, q2^k + s)) = [1 \mid 1 \mid 1] \mathbin{@} \left[ \begin{matrix} C(q, q2^k) \\ Lt(s) \\ Rt(r) \end{matrix} \,\middle|\, r(t) \right]. \tag{33}$$
With (31), the $2^{m+k}$ branch metrics should be calculated by $2^{m+k}$ $[(m + k + 1) \times n]$ matrix-
multiplications, while with (32) and (33) they can be implemented with $2^{m-k}$ $[(m - k + 1) \times n]$
matrix-multiplications [as listed in (32)] and $2^{m-k}(2^{2k} - 1)$ $[4 \times n]$ matrix-multiplications [as listed
in (33)].
In Table 2, the computations for branch distance evaluation demanded by the original Viterbi
algorithm are listed in comparison with those demanded by the efficient computations with
LMB/RMB and homogeneous additions. The actual numbers of AND/EOR/ADD operations
demanded for various m in the case that k = 1 and n = 3 are listed in Table 3. It is shown that for
m > 6, about half of the AND/EOR operations can be saved with LMB/RMB and homogeneous
additions.
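The operation counts of Tables 2 and 3 can be reproduced from the multiplication sizes stated above. The sketch below is our own bookkeeping, not the authors' derivation: it assumes that an $[a \times n]$ product costs $a \cdot n$ AND and $a \cdot n$ EOR operations, which gives $n2^{m-k}(2^{2k+2} + m - k - 3)$ for the modified scheme; the function name `operation_counts` is hypothetical.

```python
def operation_counts(m, k=1, n=3):
    """Return (original, reduced) AND/EOR/ADD counts for an (n, k, m) code."""
    branches = 2 ** (k + m)
    original = {
        "AND": n * (m + k) * branches,
        "EOR": n * (m + k + 1) * branches,
        "ADD": n * branches,
    }
    # Rows processed with (32) and (33): 2^(m-k) products with (m-k+1) rows
    # plus 2^(m-k) * (2^(2k) - 1) products with 4 rows.
    rows = 2 ** (m - k) * ((m - k + 1) + 4 * (2 ** (2 * k) - 1))
    reduced = {"AND": n * rows, "EOR": n * rows, "ADD": n * branches}
    return original, reduced

for m in (3, 6, 10):
    original, reduced = operation_counts(m)
    print(m, original, reduced, reduced["AND"] / original["AND"])
```

For k = 1 and n = 3 the ratio of reduced to original AND operations falls as m grows and reaches one half at m = 10, consistent with the saving quoted in the text.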
Table 2. Branch distance computation
        Original VA              With RMB/LMB and homogeneous addition
AND     $n(m + k)2^{k+m}$        $n2^{m-k}(2^{2k+2} + m - k - 3)$
EOR     $n(m + k + 1)2^{k+m}$    $n2^{m-k}(2^{2k+2} + m - k - 3)$
ADD     $n2^{k+m}$               $n2^{m+k}$
Table 3
Original VA/with RMB/LMB and homogeneous addition
m†    AND                 EOR                 ADD
3     192/180             240/180             48/48
4     480/384             576/384             96/96
5     1152/816            1344/816            192/192
6     2688/1728           3072/1728           384/384
7     6144/3648           6912/3648           768/768
8     13824/7680          15360/7680          1536/1536
9     30720/16128         33792/16128         3072/3072
10    67584/33792         73728/33792         6144/6144
11    147456/70656        159744/70656        12288/12288
12    319488/147456       344064/147456       24576/24576
13    688128/307200       737280/307200       49152/49152
14    1474560/638976      1572864/638976      98304/98304
15    3145728/1327104     3342336/1327104     196608/196608
16    6684672/2752512     7077888/2752512     393216/393216
†k = 1, n = 3.
4. CONCLUSION
For parallel processing, the step-by-step decoding processes (branch code generation and branch
distance, path distance and survivor distance evaluations) are completed with simple matrix operations
which are well suited for VLSI array processor design. The dynamic trellis search procedures are
also reorganized into matrix operations, and an arbitrary number of stages can be operated in parallel
to achieve faster decoding speed. Long path memories are not necessary and the complicated data
transfer scheme becomes negligible. It is thus predicted that computational modifications of
algorithms can be of great help for their parallelization and their parallel processor
implementations.
REFERENCES 
1. S. H. Bokhari, On the mapping problem. IEEE Trans. Comput. C30, 207-214 (1981).
2. M. S. Lam and M. Mostow, A transformational model of VLSI systolic design. IEEE Comput. February, pp. 42-43
(1985).
3. D. I. Moldovan and J. A. B. Fortes, Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Trans.
Comput. C35, 1-12 (1986).
4. P. Cappello and K. Stieglitz, Unifying VLSI array designs with geometric transformation. Proc. 1983 Int. Conf. Parallel
Processing, pp. 488-457 (1983).
5. J. A. B. Fortes, K. S. Fu and B. W. Wah, Systematic approaches to the design of algorithmic specified systolic arrays.
Proc. IEEE ICASSP, Tampa, Fl.
6. A. J. Viterbi and J. K. Omura, Principles of Digital Communication and Coding. McGraw-Hill, New York
(1979).
7. J. A. Heller and I. M. Jacobs, Viterbi decoding for satellite and space communication. IEEE Trans. Commun. Technol.
COM19, 835-848 (1971).
8. K. N. Ngan, Adaptive transform coding of video signals. IEE Proc. 129, 28-40 (1982).
9. S. Haykin and J. P. Reilly, Maximum-likelihood receiver for low-angle tracking radar, Part 1: The symmetric case. IEE
Proc. 129, 261-272 (1982).
10. J. P. Reilly and S. Haykin, Maximum-likelihood receiver for low-angle tracking radar, Part 2: The nonsymmetric case.
IEE Proc. 129, 331-340 (1982).
11. S. Mohan and A. K. Sood, A multiprocessor architecture for the (M, L)-algorithm suitable for VLSI implementation.
IEEE Trans. Commun. Technol. COM34, 1218-1224 (1986).
12. C. Y. Chang and K. Yao, Viterbi decoding by systolic array. Proc. 23rd A. Allerton Conf. Communication, Cont.
Computers, Allerton House, Monticello, Ill. pp. 430-439 (1985).
13. P. G. Gulak and T. Kailath, Locally connected VLSI architectures for the Viterbi algorithm. IEEE J. SAC 6, 527-537
(1988).
14. E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms. Computer Science Press (1978).
15. S. Y. Kung, VLSI Array Processors. Prentice-Hall, Englewood Cliffs, N.J. (1988).
16. C. Mead and L. Conway, Introduction to VLSI system. Addison-Wesley, New York (1980).
17. I. D. Faux and M. J. Pratt, Computational Geometry for Design and Manufacture. Ellis Horwood (1979).
18. S. Lin and D. J. Costello Jr, Error Control Coding. Prentice-Hall, Englewood Cliffs, N.J. (1983).
19. N. Deo, Graph Theory with Applications to Engineering and Computer Science. Prentice-Hall, Englewood Cliffs, N.J.
(1974).
APPENDIX
Computation for branch code generation:
C(0, 0) = [0, 0, 0, 0]*G = 000        C(0, 1) = [1, 0, 0, 0]*G = 110
C(1, 2) = [0, 1, 0, 0]*G = 011        C(1, 3) = [1, 1, 0, 0]*G = 101
C(2, 4) = [0, 0, 1, 0]*G = 011        C(2, 5) = [1, 0, 1, 0]*G = 101
C(3, 6) = [0, 1, 1, 0]*G = 000        C(3, 7) = [1, 1, 1, 0]*G = 110
C(4, 0) = [0, 0, 0, 1]*G = 101        C(4, 1) = [1, 0, 0, 1]*G = 011
C(5, 2) = [0, 1, 0, 1]*G = 110        C(5, 3) = [1, 1, 0, 1]*G = 000
C(6, 4) = [0, 0, 1, 1]*G = 110        C(6, 5) = [1, 0, 1, 1]*G = 000
C(7, 6) = [0, 1, 1, 1]*G = 101        C(7, 7) = [1, 1, 1, 1]*G = 011
Computations for branch code generation with RMB/LMB templates:
C(0, 0) = IN(0, 0)*G = 000            C(4, 0) = C(0, 0) ⊕ Rt(1) = 101
C(0, 1) = C(0, 0) ⊕ Lt(1) = 110       C(4, 1) = C(0, 0) ⊕ Lt(1) ⊕ Rt(1)
C(1, 2) = IN(1, 2)*G = 011            C(5, 2) = C(1, 2) ⊕ Rt(1) = 110
C(1, 3) = C(1, 2) ⊕ Lt(1) = 101       C(5, 3) = C(1, 2) ⊕ Rt(1) ⊕ Lt(1) = 000
C(2, 4) = IN(2, 4)*G = 011            C(6, 4) = C(2, 4) ⊕ Rt(1) = 110
C(2, 5) = C(2, 4) ⊕ Lt(1) = 101       C(6, 5) = C(2, 4) ⊕ Rt(1) ⊕ Lt(1) = 000
C(3, 6) = IN(3, 6)*G = 000            C(7, 6) = C(3, 6) ⊕ Rt(1) = 101
C(3, 7) = C(3, 6) ⊕ Lt(1) = 110       C(7, 7) = C(3, 6) ⊕ Rt(1) ⊕ Lt(1) = 011.