Viterbi algorithm on a hypercube: Concurrent formulation by Pollara, F.
N86-22809
TDA Progress Report 42-84, October-December 1985
Viterbi Algorithm on a Hypercube: Concurrent Formulation
F. Pollara
Communications Systems Research Section
The similarity between the Fast Fourier Transform and the Viterbi algorithm is
exploited to develop a Concurrent Viterbi Algorithm suitable for a multiprocessor
system interconnected as a hypercube. The proposed algorithm can efficiently decode
large constraint length convolutional codes, using different degrees of parallelism,
and is attractive for VLSI implementation.
I. Introduction
Concurrent computers have the potential to obtain large
increases in computational power. This is true if one can find a
concurrent decomposition of a given algorithm.
We consider the Viterbi Algorithm for decoding $(m, 1/n_0)$
convolutional codes, where $m$ is the memory (constraint
length = $m + 1$) and $1/n_0$ is the code rate, and we describe an
efficient decomposition of this problem, the Concurrent
Viterbi Algorithm (CVA), which is suitable for multiprocessor
systems with a Hypercube architecture.
There are two key requirements in the problem decomposition:
(1) Divide the problem in equal parts, in order to share
equally the execution time available in each processor.
(2) Minimize the communication between the parts, so
that each processor needs to share information only
with its neighbors.
Fox (Ref. 1) has shown that the Hypercube is a natural
topology for the binary Fast Fourier Transform (FFT), and
high efficiency can be obtained with this structure. We will
show that there is a simple correspondence between the con-
nectivity of the FFT algorithm and the trellis diagram of the
Viterbi Algorithm. Therefore, efficient methods for imple-
menting the Viterbi Algorithm on a Hypercube computer can
be developed.
II. Hypercube Computer Structure
In general, it is desirable for an interconnection structure
to have a low number of links per node (low degree of a node),
and a small internode distance. This distance $d_{xy}$ between any
two nodes $x$ and $y$ is defined as the minimum number of links
required to send a message from $x$ to $y$. The diameter $D$ of a
network with $N$ nodes is defined as $D = \max\{d_{xy} \mid 0 \le x, y < N\}$.
A Boolean $n$-cube computer, or just Hypercube (Ref. 2), is
a network of $N = 2^n$ processors placed at the vertices of an
$n$-cube, and connected by its edges. Such a structure is attractive
because both the degree of a node and the diameter, which
are equal to $n$, grow only linearly with the dimensionality of
the cube, that is, as $\log_2 N$.
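The degree and diameter claims can be checked numerically. The following is a minimal sketch, assuming that nodes are labeled with $n$-bit integers and that an edge joins two labels differing in exactly one bit; the function names are illustrative and not taken from the report.

```python
from collections import deque

def neighbors(x, n):
    """Labels that differ from x in exactly one of the n bit positions."""
    return [x ^ (1 << k) for k in range(n)]

def eccentricity(src, n):
    """Largest hop count from src to any node, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        for y in neighbors(x, n):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

def degree_and_diameter(n):
    """Return (links per node, diameter) of the n-cube with 2**n nodes."""
    nodes = range(1 << n)
    degree = max(len(neighbors(x, n)) for x in nodes)
    diameter = max(eccentricity(x, n) for x in nodes)
    return degree, diameter

for n in range(1, 7):
    # Columns: n, number of nodes N, (degree, diameter)
    print(n, 1 << n, degree_and_diameter(n))
```

Both quantities come out equal to $n$ for every cube size tried, while the node count doubles with each added dimension.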
In a multiprocessor network there is always a trade-off
between the degree of a node and the diameter, the extreme
case being the completely connected network, which has
$N - 1$ links per node and $D = 1$. In general, a low degree of a
node implies a large diameter.
The Hypercube structure seems a reasonable compromise
for practical multiprocessor systems, except for the disadvantage
of offering limited input/output capability. This negative
aspect, which is less important for computationally intensive
(vs. I/O intensive) algorithms, is furthermore mitigated by the
small diameter of the network, and the simplicity of efficient
broadcasting methods.
Two processors $x$ and $y$ will be called neighbors if their
binary labels differ only in one position, i.e.,
$$y = [x_{n-1}, x_{n-2}, \ldots, \bar{x}_k, \ldots, x_0] \quad \text{for some } k,\ 0 \le k \le n-1,$$
where $x_k \in \{0,1\}$, $x = [x_{n-1}, x_{n-2}, \ldots, x_k, \ldots, x_0]$, and
$\bar{x}_k$ is the complement of $x_k$. In this context, the distance $d$
between two processors is just the Hamming distance of
their binary labels, and neighbors have $d = 1$. In the Hypercube,
the number of nodes at distance $d$ from a given node
(node $[0, 0, \ldots, 0]$ can be considered without loss of generality)
is $N_d = \binom{n}{d}$, so that the average distance is
$$\bar{d} = \frac{1}{N-1} \sum_{d=1}^{n} d\,N_d = \frac{n}{2}\,\frac{N}{N-1} \approx \frac{n}{2}$$
III. Equivalence of Networks
We now define two basic network structures: the $m$-stage
Viterbi algorithm trellis (de Bruijn graph), shown in Fig. 1 for
$M = 2^m = 8$ states, and the FFT decimation-in-time graph of
Fig. 2.
Let $x = [x_{m-1}, \ldots, x_0]$ be the binary index of a state or
node of the graphs, and $u = [u_0, \ldots, u_k, \ldots, u_{m-1}]$, $u_k \in \{0,1\}$,
be a sequence of input bits which define one of the
two possible paths out of each node, where $k$ represents the
stage.
First, consider the elementary transformation $\sigma(x, u_k)$ which
describes the state transitions at stage $k$, and is defined as
$$\sigma(x, u_k) = \sigma([x_{m-1}, \ldots, x_0], u_k) = [u_k, x_{m-1}, \ldots, x_1] \tag{1}$$
This is a cyclic right shift of $x$, with $x_0$ replaced by $u_k$. Then
the transformation $\sigma(\cdot, \cdot)$ completely describes the transitions
of the graph in Fig. 1, and the $m$-stage transformation from a
state $x$ at stage 0 to state $y$ at stage $m$ is defined as
$$y = \sigma^{(m)}(x, u) = \sigma(\sigma(\ldots \sigma(\sigma(x, u_0), u_1) \ldots), u_{m-1}) \tag{2}$$
The elementary transformation needed to describe Fig. 2 is
given by
$$\omega(x, u_k) = [x_{m-1}, \ldots, x_{k+1}, u_k, x_{k-1}, \ldots, x_0] \tag{3}$$
which replaces $x_k$ by $u_k$. The complete $m$-stage graph of Fig. 2
is then described by the transformation $\omega^{(m)}(x, u)$,
$$y = \omega^{(m)}(x, u) = \omega(\omega(\ldots \omega(\omega(x, u_0), u_1) \ldots), u_{m-1})$$
Now, it is easy to verify that, at the $m$th stage,
$$\sigma^{(m)}(x, u) = \omega^{(m)}(x, u)$$
since both sides equal $[u_{m-1}, \ldots, u_0]$.
Therefore, the two networks lead from a given state $x$ to the
same state $y$, after $m$ stages.
The equivalence of the two networks can be further
expressed, at any stage, by defining the cyclic right shift
operator
$$\rho^{(k)}(x) = [x_{k-1}, \ldots, x_0, x_{m-1}, \ldots, x_k]$$
where $\rho^{(m)}(x) = \rho^{(0)}(x) = x$, and verifying that,
$$\sigma^{(k)}(x, u) = \rho^{(k)}(\omega^{(k)}(x, u))$$
This result shows that, if at stage $k$ the node labeled $x$ in
Fig. 2 is identified with the node labeled $\rho^{(k)}(x)$ in Fig. 1, the two networks
are functionally and topologically equivalent; that is, they are
just two different ways of drawing the same network. A given
path generated by an input sequence $u$ visits the same nodes in
Fig. 1 and Fig. 2 under this relabeling by $\rho^{(k)}(x)$.
Having established this formalism on networks, we can now
apply the above results to the study of algorithm structures, in
particular to the Viterbi algorithm and its relationship with the
radix-2 FFT.
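The equivalence can be checked exhaustively for small $m$. The sketch below assumes the integer encodings stated in its comments: the trellis step shifts the $m$-bit label right and inserts the input bit at the top, the FFT-graph step overwrites bit $k$, and the relabeling is a cyclic right shift by $k$ positions. The function names `sigma`, `omega` and `rho` are illustrative choices for this example.

```python
from itertools import product

def sigma(x, u, m):
    """Trellis step: shift x right by one and insert u as the new top bit."""
    return (u << (m - 1)) | (x >> 1)

def omega(x, u, k):
    """FFT-graph step at stage k: replace bit k of x by u."""
    return (x & ~(1 << k)) | (u << k)

def rho(x, k, m):
    """Cyclic right shift of the m-bit label x by k positions."""
    k %= m
    return ((x >> k) | (x << (m - k))) & ((1 << m) - 1)

def check_equivalence(m):
    """Compare both graphs for every start node and every input sequence."""
    for x in range(1 << m):
        for u in product((0, 1), repeat=m):
            s = w = x
            for k in range(m):
                s = sigma(s, u[k], m)
                w = omega(w, u[k], k)
                # After k + 1 steps the trellis node is the shifted FFT node.
                if s != rho(w, k + 1, m):
                    return False
            # After m steps both graphs reach the same node.
            if s != w:
                return False
    return True

print(check_equivalence(3))
```

The check passing for a given $m$ means that every path of the trellis maps onto a path of the FFT graph through the stage-dependent shift, which is the property the concurrent decoder relies on.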
IV. Concurrent Viterbi Algorithm
Consider a multiprocessor system with $N$ processors located
at the vertices of an $n$-cube and linked only by the edges of the
cube (see an example for $n = 3$ in Fig. 3(a)). Note that the
nodes are labeled by an $n$-bit binary number, so that the $i$th bit
is the coordinate of a node along the $i$th dimension.
If in Fig. 2 we collapse the horizontal dimension, we obtain
a graph which is exactly identical to that of Fig. 3, i.e., with
the same connections between nodes. This observation, as
explained in Refs. 1 and 3, suggests a natural way to imple-
ment the FFT on a Hypercube computer.
The implementation of the FFT on the Hypercube can be
stated more formally if we define the Hypercube ($m$-cube) network
by the transformation,
$$\gamma_k(x) = [x_{m-1}, \ldots, x_{k+1}, \bar{x}_k, x_{k-1}, \ldots, x_0], \qquad k = 0, \ldots, m-1 \tag{6}$$
where $\bar{x}_k$ is the complement of $x_k$, and observe that the transformation
in Eq. (3) can always be obtained by Eq. (6), since
$u_k \in \{0, 1\}$ and $u_k = x_k$ requires no communication (self-loop).
To perform the first stage of Fig. 2, let the nodes of Fig. 3
communicate along the first dimension, and so on for each
stage and dimension. In this way, the links provided by the
n-cube are just those necessary to perform the FFT. This
implementation on the n-cube is possible since the FFT
requires communication only between neighboring nodes of
the cube.
At first, the network of Fig. 1 might seem to require communication
between distant nodes on the cube. But this problem
can be easily overcome if we relabel the nodes as discussed
in Sec. III. Specifically, at stage $k$, processor $x$ will represent
state $\rho^{(k)}(x)$ of the Viterbi trellis. Processor transitions are
described by the graph of Fig. 2, while state transitions are
given by Fig. 1, as desired. Therefore a Viterbi decoder can be
efficiently implemented on the Hypercube, exactly as for the
FFT. The similarity between the FFT and Viterbi trellis was
previously pointed out by Forney (Ref. 4), but apparently not
exploited in any practical way.
Each processor, at stage $k$, receives the accumulated metric
and survivor of its neighbor along dimension $k$ (modulo $m$) of the cube,
and performs the usual comparison and update. In a practical
Viterbi decoder for a $(m, 1/n_0)$ code, the number of stages
required to obtain a performance very close to optimum is
approximately $5m$. Notice that, when the stage $k$ is a multiple
of $m$, the state and processor labels are identical, since
$\rho^{(m)}(x) = x$, so that we may easily select the decoded bit
belonging to the most likely survivor. However, in order to
minimize internode communications, it may be more advantageous
to increase slightly the number of stages and read
the decoded bit at node zero, which simplifies I/O operations
(see Sec. V).
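One way to see that neighbor-only exchanges suffice is to simulate the metric update on the cube and compare it with an ordinary trellis update. The sketch below makes several assumptions that are not in the report: it tracks path metrics only (survivors are omitted), uses Hamming branch metrics, takes processor $p$ at stage $k$ to hold the state obtained by rotating $p$ right by $k \bmod m$ bits, and uses hypothetical generator masks `gens` for a rate 1/2 code with $m = 3$. Communication is modeled by reading the neighbor's entry in a shared list.

```python
import random

def rotr(x, k, m):
    """Cyclic right shift of the m-bit label x by k positions."""
    k %= m
    return ((x >> k) | (x << (m - k))) & ((1 << m) - 1)

def branch_metric(state, u, received, gens, m):
    """Hamming distance between the encoder output and the received bits."""
    reg = (u << m) | state  # input bit followed by the m memory bits
    out = [bin(reg & g).count("1") & 1 for g in gens]
    return sum(o != r for o, r in zip(out, received))

def sequential_stage(metric, received, gens, m):
    """Reference add-compare-select over all 2**m trellis states."""
    mask = (1 << m) - 1
    new = [0] * (1 << m)
    for nxt in range(1 << m):
        u = nxt >> (m - 1)  # input bit that leads into state nxt
        preds = [((nxt << 1) & mask) | b for b in (0, 1)]
        new[nxt] = min(metric[p] + branch_metric(p, u, received, gens, m)
                       for p in preds)
    return new

def hypercube_stage(proc_metric, k, received, gens, m):
    """Same update, but each processor only uses itself and one neighbor."""
    d = k % m  # cube dimension used at this stage
    new = [0] * (1 << m)
    for p in range(1 << m):
        u = (p >> d) & 1  # bit d of the processor label acts as the input bit
        new[p] = min(proc_metric[q]
                     + branch_metric(rotr(q, d, m), u, received, gens, m)
                     for q in (p, p ^ (1 << d)))
    return new

def run(m=3, gens=(0b1101, 0b1111), stages=15, seed=1):
    """Return True if the cube metrics match the trellis metrics at every stage."""
    rng = random.Random(seed)
    seq = [0] + [10**6] * ((1 << m) - 1)  # start in state zero
    proc = list(seq)  # at stage 0 processor p holds state p
    for k in range(stages):
        received = [rng.randint(0, 1) for _ in gens]
        seq = sequential_stage(seq, received, gens, m)
        proc = hypercube_stage(proc, k, received, gens, m)
        if any(proc[p] != seq[rotr(p, k + 1, m)] for p in range(1 << m)):
            return False
    return True

print(run())
```

The default `stages=15` corresponds to the $5m$ rule of thumb for $m = 3$. A full decoder would also carry a survivor alongside each metric.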
V. Message Broadcasting
Performing the CVA requires that blocks of data be loaded
in every processor: This operation is called broadcasting. In
the Hypercube (Ref. 2), data from the host processor is directly
exchanged only through node zero (the origin of the cube).
Therefore, an efficient concurrent method is required to broad-
cast a message from node zero to all other nodes. Since the
diameter of an n-cube is n, a lower bound on the broadcasting
time is n time units (where the unit is the time to send a mes-
sage to a neighbor).
Assume that message $A$ is in node zero, at time zero. In
each subsequent time slot $k$, $k = 0, \ldots, n-1$, send messages in parallel from
each node $x = [x_{n-1}, \ldots, x_{k+1}, 0, x_{k-1}, \ldots, x_0]$ to
node $\gamma_k(x)$, its neighbor along dimension $k$ (the label $x$ with bit $k$ set to 1). After $n$ time
units, message $A$ has propagated to all nodes.
Even though this method does not minimize the number of
communications (with the advantage of a very simple index-
ing), it optimizes the total broadcasting time to n time units.
The result is clearly optimum, since it achieves the lower
bound.
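The broadcast schedule is easy to simulate. This is a minimal sketch, assuming that time slots are numbered $0, \ldots, n-1$, that every node whose bit $k$ is 0 sends whatever it currently holds (possibly nothing useful) to the node with bit $k$ set, and that all sends in a slot happen simultaneously; the function name `broadcast` is illustrative.

```python
def broadcast(n, message="A"):
    """Simulate the n-slot broadcast from node zero on an n-cube.

    Returns the final contents of every node and the total number of sends.
    """
    store = [None] * (1 << n)
    store[0] = message
    sends = 0
    for k in range(n):
        snapshot = list(store)  # all sends in a slot use the pre-slot contents
        for x in range(1 << n):
            if (x >> k) & 1 == 0:
                store[x | (1 << k)] = snapshot[x]
                sends += 1
    return store, sends

store, sends = broadcast(4)
print(all(item == "A" for item in store), sends)
```

Under these assumptions the schedule performs $n \cdot 2^{n-1}$ sends, more than the $N - 1$ that are strictly needed to reach every node, but no node has to test whether it already holds the message.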
VI. Decoders for Large Constraint Length
Existing Hypercube computers have up to 128 nodes
(n = 7), and will soon be extended to 1,024 nodes (n = 10).
Yet, in order to decode powerful convolutional codes with
m > 10, one needs to obtain algorithms which assign more
than one state per processor. This need is dictated not only by
the practical limitations on the physical size of the computer,
but also by the goal of achieving high computational efficiency.
The efficiency $\eta$ of a parallel computer is defined as,
$$\eta = \frac{\text{sequential algorithm time}}{N \times (\text{parallel algorithm time})} = \frac{N_o t_o}{N_o t_o + N_t t_t}$$
and the speed-up is given by $\eta N$, where $N$ is the number
of processors, $N_o$ is the number of parallel operations
(butterflies), $t_o$ is the time required by each operation, $N_t$ is
the number of parallel data transfers, and $t_t$ is the communication
time.
When the number of states $M$ of the decoder is larger than
the available number of processors $N$, states can be grouped in
sets of $S = 2^s$ states per processor, where $M = SN$. To see how
this is possible for $S = 2$ in the proposed CVA implementation,
consider Fig. 2 and group each pair of nodes into one processor
$P_{\lfloor i/2 \rfloor}$, for all nodes $i$, $i = 0, \ldots, 7$, where $\lfloor l \rfloor$ is the largest
integer less than or equal to $l$. Similarly, with two processors
$P_{\lfloor i/4 \rfloor}$, $i = 0, \ldots, 7$, we obtain the case $S = 4$. The extreme
cases, $S = M$ and $S = 1$, represent the completely sequential and
completely parallel decoder, respectively. Intermediate cases
represent different degrees of parallelism or a different granularity
of the algorithm, which is defined as the amount of computation
between successive data transfers. In general, the efficiency
increases with the granularity.
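The grouping can be made concrete with a few lines of code. The sketch assumes that node $i$ is assigned to processor $\lfloor i/S \rfloor$ and that stage $k$ of the FFT graph pairs nodes whose labels differ in bit $k$; a stage then needs communication only if some pair is split across two processors. The function names are illustrative.

```python
def processor_of(i, s):
    """Processor index floor(i / S) for node i, with S = 2**s nodes per processor."""
    return i >> s

def communicating_stages(m, s):
    """Stages whose butterfly partners sit in different processors."""
    stages = []
    for k in range(m):
        crosses = any(processor_of(i, s) != processor_of(i ^ (1 << k), s)
                      for i in range(1 << m))
        if crosses:
            stages.append(k)
    return stages

m = 3
for s in range(m + 1):
    groups = {}
    for i in range(1 << m):
        groups.setdefault(processor_of(i, s), []).append(i)
    print(f"S={1 << s}: groups={list(groups.values())}, "
          f"communicating stages={communicating_stages(m, s)}")
```

With this assignment the lowest $s$ stages stay inside a processor, so larger groups trade parallelism for fewer data transfers.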
During $m$ stages of the CVA, the number of parallel data
transfers is,
$$N_t = (m - s)\,\frac{M}{N} = (m - s)\,S \tag{7}$$
since only $m - s$ stages of Fig. 2 need to communicate with
neighbors ($s$ stages do only internal computations). The number
of parallel operations (butterflies) $N_o$ is given by
$$N_o = \frac{m}{2}\,\frac{M}{N} = \frac{mS}{2} \tag{8}$$
As an example, $N_t$ and $N_o$ are plotted in Fig. 4 for $M = 64$.
From Eqs. (7) and (8) we have that the efficiency $\eta$ decreases
as $S$ decreases from $M$ to 1, and, for $N = M$, $\eta$ is equal to the
ratio $t_o/(t_o + 2t_t)$, which depends on the hardware implementation.
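The trend in efficiency can be tabulated numerically. The closed forms used below are assumptions, chosen so that they reproduce the limiting value $t_o/(t_o + 2t_t)$ quoted in the text for $N = M$: transfers $(m - s)S$, butterflies $mS/2$, and efficiency $N_o t_o / (N_o t_o + N_t t_t)$. The timing values `t_o` and `t_t` are hypothetical.

```python
def transfers(m, s):
    """Assumed number of parallel data transfers over m stages: (m - s) * S."""
    return (m - s) * (1 << s)

def butterflies(m, s):
    """Assumed number of parallel butterfly operations over m stages: m * S / 2."""
    return m * (1 << s) / 2

def efficiency(m, s, t_o, t_t):
    """Assumed efficiency: compute time over compute plus communication time."""
    n_o = butterflies(m, s)
    n_t = transfers(m, s)
    return n_o * t_o / (n_o * t_o + n_t * t_t)

m, t_o, t_t = 6, 1.0, 0.5  # M = 64 states; the two times are illustrative
for s in range(m + 1):
    S, N = 1 << s, 1 << (m - s)
    print(f"S={S:2d} N={N:2d} N_t={transfers(m, s):3d} "
          f"N_o={butterflies(m, s):5.1f} eta={efficiency(m, s, t_o, t_t):.3f}")
```

Under these assumptions the efficiency rises from $t_o/(t_o + 2t_t)$ at $S = 1$ to 1 at $S = M$, in line with the statement that efficiency grows with granularity.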
The CVA with S > 1 is useful because it allows the imple-
mentation of complex decoders, avoiding the pin-limitation
constraint problem encountered in existing VLSI decoders.
This problem is due to the particular partitioning of the tradi-
tional decoder into a branch metric computation block and a
path memory storage block. This partition requires a rapidly
increasing number of pins in the VLSI chips. The proposed
CVA has instead a number of connections per node increasing
only linearly with the dimensionality of the cube, and is therefore
more suitable for VLSI implementation.
Furthermore, the CVA has been extended to high rate
$(m, k_0/n_0)$ codes with $k_0 > 1$, where $k_0$ is the number of
input bits in the encoder. This extension is possible only for
a limited range of $m$, $k_0$ and $S$, the number of states per processor.
An example for $(3, 2/n_0)$ codes is given in Fig. 5,
where $S = 2$.
Given a rate $1/n_0$ code it is known how to generate all the
punctured codes of rate $k_0/n_0$, $k_0 > 1$. Since these codes
involve only pairwise comparisons at each stage, it is certainly
possible to decode them with the CVA. It must be noted, however,
that punctured codes require more stages, and this is
equivalent to linking nodes of the Hypercube which are not
neighbors, by using multiple stages.
VII. Conclusion
The proposed CVA decoder has been implemented and
successfully tested on a 64-node Hypercube computer for
$(m, 1/n_0)$ codes, with $m = 2, \ldots, 14$. Present results confirm
the usefulness of the CVA for large constraint length
codes.
Acknowledgment
The author wishes to thank R. J. McEliece and C. L. Seitz of Caltech for providing
access to the Hypercube and E. Majani for some of the programs.
References
1. Fox, G. C., and Otto, S. W., "Algorithms for Concurrent Processors," Physics Today,
Vol. 37, No. 5, pp. 50-59, May 1984.
2. Seitz, C. L., "The Cosmic Cube," Communications of the ACM, Vol. 28, No. 1,
pp. 23-33, January 1985.
3. Pease, M. C., "An Adaptation of the Fast Fourier Transform for Parallel Processing,"
Journal of the ACM, Vol. 15, No. 2, pp. 252-264, April 1968.
4. Forney, G. D., "The Viterbi Algorithm," Proc. IEEE, Vol. 61, No. 3, pp. 268-278,
March 1973.
Fig. 1. The m-stage Viterbi algorithm trellis (m = 3)
Fig. 2. The FFT decimation-in-time graph (m = 3)
Fig. 3. The 3-cube graph: (a) 1st stage, (b) 2nd stage, (c) 3rd stage
Fig. 4. $N_o$ and $N_t$ vs. $S$ and $N$
Fig. 5. Decoder for $(3, 2/n_0)$ code, with $S = 2$
