A parallel Viterbi decoder for block cyclic and convolution codes by Reeve, Jeffrey & Amarasinghe, Kosala
A Parallel Viterbi Decoder for Block Cyclic
and Convolution Codes.
J.S.Reeve 1, K. Amarasinghe
Department of Electronics and Computer Science, University of Southampton,
Southampton SO17 1BJ, UK
Abstract
We present a parallel version of Viterbi's decoding procedure, for which we are able
to demonstrate that the resultant task graph has restricted complexity in that the
number of communications to or from any processor cannot exceed 4 for BCH codes.
The resulting algorithm works in lock step, making it suitable for implementation on
a systolic processor array. We have implemented it on a field programmable gate
array and demonstrate the perfect scaling of the algorithm for two exemplar BCH
codes. The parallelisation strategy is applicable to all cyclic codes and convolution
codes. We also present a novel method for generating the state transition diagrams
for these codes.
Key words: Viterbi decoding, BCH codes, Field Programmable Gate Array,
parallel algorithms.
1 Introduction
The Viterbi algorithm [1] was developed as an asymptotically optimal decoding
algorithm for convolution codes. It is nowadays commonly also used for decoding
block codes since the usual [2,3] algebraic decoding methods are not always
readily adaptable for soft decoding. It is also used in soft decision decoding,
in which the demodulator returns two numbers that relate to the likelihood that
the data bit is a 0 or a 1. In these circumstances Viterbi decoding of
Bose-Chaudhuri-Hocquenghem (BCH) [4,5] and convolution codes is found to be
efficient and robust. In this paper we describe a parallel Viterbi algorithm for
hard decision decoding, in which data bits are delivered as either 0 or 1 only,
as the adaptation to soft decision decoding is trivial.
1 E-mail: jsr@ecs.soton.ac.uk
Preprint submitted to Signal Processing 25 January 2005

Although the Viterbi algorithm is simple it requires $O(2^{n-k})$ words of memory,
where n is the length of the code words and k is the message length, so that
$n - k$ is the number of appended parity bits. In practice it is desirable to
select codes with the highest minimum Hamming distance that can be decoded
within a specified time, and an increased minimum Hamming distance $d_{min}$
implies an increased number of parity bits. Our parallel Viterbi decoder
necessarily distributes the memory required evenly among processing elements.
We describe our method for BCH codes as these are slightly more complex to
decode than convolution codes, principally because of the presence of feedback
in the generating shift register. Convolution codes can be treated in the
same way. This work generalises and implements work first reported in [6].
In that paper we express the Viterbi algorithm as a matrix-vector reduction
in which multiplication is replaced by addition and addition by minimisation.
This resulted in an algorithm closely related to the parallelisation reported by
Kumar et al. [7] of Floyd's [8] minimum cost path algorithm. Our new approach
allows parallel algorithms to be designed for all the block cyclic codes
and proves that the valence of each node of the resulting task graph is
restricted, thus limiting the connectivity complexity of the resulting hardware.
This is accomplished by the observation that the state transition diagram can
be generated from a simple rule, described in the next section, that we use
to show that each processor used in the parallel algorithm need communicate
with, at most, four other processors.
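As a rough illustration of the matrix-vector form mentioned above, the following
Python sketch applies one reduction step in which multiplication is replaced by
addition and addition by minimisation. The function name `min_plus_matvec` and
the small cost matrix are hypothetical, chosen only for illustration; they are
not taken from [6].

```python
INF = float("inf")  # marks a transition that does not exist

def min_plus_matvec(cost, weights):
    """One (min, +) matrix-vector step: result[i] = min_j (cost[i][j] + weights[j])."""
    return [min(c + w for c, w in zip(row, weights)) for row in cost]

# Hypothetical 2-state example: cost[i][j] is the cost of moving from state j to state i.
cost = [[0, 1],
        [INF, 2]]
weights = [3, 1]
new_weights = min_plus_matvec(cost, weights)  # [2, 3]
```

Repeating this step once per received bit, with costs given by the Hamming
distance between the received bit and the bit labelling each transition, would
give the path weights of a Viterbi decoder in this formulation.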
2 The Sequential Viterbi Algorithm as a State Transition Machine
BCH codes are a class of cyclic codes that append $n - k$ parity bits to a
message of k bits so that each code word is n bits long. The code parameters
$(n, k, d_{min})$ are of the form $n = 2^m - 1$, $n - k \le mt$, for positive
integers m and t, and the minimum Hamming distance is $d_{min} \ge 2t + 1$. The
codes are specified by their generator polynomials over GF(2), which have the
general form $G(D) = g_0 + g_1 D + \cdots + g_{n-k} D^{n-k}$. The parity bits
appended to the message in the systematic generation of the codeword
corresponding to the message polynomial $M(D)$ are the coefficients of the
remainder of
$$\frac{D^{k} M(D)}{G(D)}.$$
This encoding process is usually implemented by a shift register. The general
setup is shown in Fig. 1, in which switches $s_0$ and $s_1$ are closed and $s_2$
is open for the first k cycles while the message $m_k$ of length k is input. For
the next $n - k$ cycles switch $s_2$ is closed and switches $s_0$ and $s_1$ are open.
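The parity computation can be sketched in Python by storing a polynomial over
GF(2) as an integer whose bit i is the coefficient of $D^i$. The helper names
`gf2_mod` and `systematic_codeword` are hypothetical. The sketch assumes the
common textbook convention in which the message is shifted by the degree of
$G(D)$ before the remainder is taken, so that the resulting codeword is a
multiple of $G(D)$; the generator $1 + D + D^3$ of a (7,4) code is used as the
example.

```python
def gf2_mod(a, b):
    """Remainder of a(D) / b(D) over GF(2); bit i of an int is the coefficient of D^i.

    b must be non-zero.
    """
    deg_b = b.bit_length() - 1
    while a.bit_length() - 1 >= deg_b:
        # Cancel the leading term of a with a shifted copy of b (XOR is GF(2) addition).
        a ^= b << (a.bit_length() - 1 - deg_b)
    return a

def systematic_codeword(msg, gen):
    """Message bits followed by parity bits (assumed shift: the degree of gen)."""
    shift = gen.bit_length() - 1
    shifted = msg << shift
    return shifted | gf2_mod(shifted, gen)

G = 0b1011                                  # 1 + D + D^3
codeword = systematic_codeword(0b1000, G)   # 0b1000101
```

Under this assumption every codeword leaves a zero remainder when divided by
$G(D)$, which gives a simple consistency check on the encoder.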
In our implementation of the Viterbi decoding scheme we regard the shift
register encoder as a state transition machine in which the state R is the
number that represents the bit pattern $\{r_0 r_1 \ldots r_{n-k-1}\}$, and the number of

We partition the states among $2^m$ processors so that there are $P = 2^{n-k-m}$
states on each processor. From the generator graph of Fig. 2 for the state
transition diagram, we see that states $R$ and $R + Q$ are required to update
states $2R$ and $(2R) \oplus g$, so by placing states $R$ and $R + Q$ on the same
processor we need to send only one message (containing the Hamming weights and
paths from states $R$ and $R + Q$) to each of the processors handling the states
$2R$ and $(2R) \oplus g$. Now let $L = 2^{n-m-k-1}$, the number of states on each
processor with state number $s < Q$, assume that $L \ge 2$, and assign processor
p the states $s_p$ and $s_p + Q$ such that $s_p = pL, \ldots, (p+1)L - 1$. To
update an even state, $2s$, requires the previous weights of states $e = s$ and
$e + Q$, and to update an odd state, $2s + 1$, requires the previous weights of
states $o = \frac{(2s+1) \oplus g}{2}$ and $o + Q$. State s is on processor
$\lfloor s/L \rfloor$ when $s < Q$; $\lfloor a \rfloor$ denotes the nearest
integer less than or equal to a.
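The update rule and the placement of states can be written down directly. In
the Python sketch below the function names are hypothetical, the combination of
$2s + 1$ with g is taken to be a bitwise exclusive-or, and the values Q = 8,
L = 2 and g = 3 are assumed for a 16-state example (g = 3 is an inference that
is consistent with Table 1, not a value quoted in the text).

```python
def predecessors(state, g, Q):
    """States whose previous weights are needed to update `state`."""
    if state % 2 == 0:
        base = state // 2            # even state 2s needs s and s + Q
    else:
        base = (state ^ g) // 2      # odd state 2s+1 needs ((2s+1) XOR g)/2 and that + Q
    return base, base + Q

def processor_of(state, L, Q):
    """States s and s + Q share a processor; s < Q lives on floor(s / L)."""
    return (state % Q) // L

Q, L, g = 8, 2, 3                    # assumed 16-state example
print(predecessors(8, g, Q), processor_of(4, L, Q))   # (4, 12) 2
```

Because both predecessors of a state always live on the same processor, a
single message per destination processor is enough to carry their weights and
paths.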
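The even states, with state numbers less than Q (denoted by $e_{<Q}$), on processor
p require the previous weights of the states $e_{<Q}$ given by
$$e_p = \frac{pL}{2}, \frac{pL + 2}{2}, \ldots, \frac{(p+1)L}{2} - 1$$
and $e_p + Q$, which all lie on processor $\lfloor p/2 \rfloor$. The even states,
with state numbers greater than Q (denoted by $e_{>Q}$), require the previous
weights of the states $e_{>Q}$ given by
$$e_{p+Q/2} = \frac{pL + Q}{2}, \frac{pL + Q + 2}{2}, \ldots, \frac{(p+1)L + Q}{2} - 1$$
and $e_{p+Q/2} + Q$, which all lie on processor $\lfloor \frac{pL + Q}{2L} \rfloor$.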
The odd states, with state numbers less than Q (denoted by $o_{<Q}$), on processor
p require the previous weights of the states $o_{<Q}$ given by
$$o_p = \frac{(pL + 1) \oplus g}{2}, \frac{(pL + 3) \oplus g}{2}, \ldots, \frac{((p+1)L - 1) \oplus g}{2}$$
and $o_p + Q$, which all lie on processor
$\lfloor \frac{(pL + 1) \oplus g}{2L} \rfloor$. The odd states, with state
numbers greater than Q (denoted by $o_{>Q}$), require the previous weights of the
states $o_{>Q}$ given by
$$o_{p+Q/2} = \frac{(pL + Q + 1) \oplus g}{2}, \frac{(pL + Q + 3) \oplus g}{2}, \ldots, \frac{((p+1)L + Q - 1) \oplus g}{2}$$
and $o_{p+Q/2} + Q$, which all lie on processor
$\lfloor \frac{(pL + Q + 1) \oplus g}{2L} \rfloor$.
The above result, that the states $o_p$ (or the states $o_{p+Q/2}$) all lie on the
same processor, depends on the fact that for a contiguous set of numbers
$s = \{0, 1, \ldots, n\}$, $s \oplus g$ covers the same set of numbers. For
instance if $s = \{0, 1, 2, 3, 4, 5, 6, 7\}$ and $g = 5$, then
$s \oplus g = \{5, 4, 7, 6, 1, 0, 3, 2\}$, the same set of numbers in a
different order.
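This property is easy to check numerically. The short Python sketch below (the
helper name `xor_image` is hypothetical) reproduces the example with g = 5 and
also shows, as an additional illustration not taken from the text, that an
aligned block of four consecutive numbers is mapped onto another aligned block.

```python
def xor_image(numbers, g):
    """Apply bitwise XOR with g to every number, preserving order."""
    return [s ^ g for s in numbers]

image = xor_image(range(8), 5)        # [5, 4, 7, 6, 1, 0, 3, 2]
block = xor_image(range(8, 12), 5)    # [13, 12, 15, 14]
```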
Hence when $L \ge 2$ each processor requires inputs from at most 4 different
processors. Similarly the number of processors requiring information from any
processor cannot exceed 4. When $L = 1$ there is only a single state with state
number less than Q on each processor and the in and out degree of the task graph
are 2. When the number of processors is the same as the number of states
then our algorithm reverts to a direct implementation of the state transition
diagram for that code.
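The bound on the number of inputs can be checked by brute force. The Python
sketch below is only an illustration: the function name `source_processors` is
hypothetical, g is combined with the state number by bitwise exclusive-or, and
the parameter sets (Q = 8 with g = 3, and Q = 16 with g = 5) are assumed
examples rather than values quoted in the text.

```python
def source_processors(p, L, Q, g):
    """Set of processors whose weights processor p needs in one update step."""
    # Processor p holds states pL .. (p+1)L - 1 and the same states plus Q.
    states = [s + offset for s in range(p * L, (p + 1) * L) for offset in (0, Q)]
    sources = set()
    for t in states:
        # Even states need t/2, odd states need (t XOR g)/2; the partner
        # state (+Q) lives on the same processor, so one lookup suffices.
        base = t // 2 if t % 2 == 0 else (t ^ g) // 2
        sources.add(base // L)
    return sources

# Assumed 16-state example with L = 2 (4 processors).
for p in range(4):
    print(p, sorted(source_processors(p, L=2, Q=8, g=3)))

# Assumed 32-state example: the in-degree stays within 4 for L = 2 and L = 4.
worst = max(len(source_processors(p, L, 16, 5))
            for L in (2, 4) for p in range(16 // L))
```

For the 16-state example the loop prints the source sets {0, 2}, {0, 2}, {1, 3}
and {1, 3} for processors 0 to 3, which agrees with the processor columns of
Table 1.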
As an example consider the partitioning of the 16 states of the BCH(15,11,3) code

Fig. 4. The state transition diagram for the BCH(15,11,3) code.
For this code, $|R| = 16$, $Q = 8$ and $L = 2$. The partitioning of states among
processors is shown in Table 1, where we have used the notation $p(e_{<Q})$ to
designate the processor that has the set of states $e_{<Q}$.
p   s_p         e<Q    p(e<Q)   e>Q    p(e>Q)   o<Q    p(o<Q)   o>Q    p(o>Q)
0   0,8,1,9     0,8    0        4,12   2        1,9    0        5,13   2
1   2,10,3,11   1,9    0        5,13   2        0,8    0        4,12   2
2   4,12,5,13   2,10   1        6,14   3        3,11   1        7,15   3
3   6,14,7,15   3,11   1        7,15   3        2,10   1        6,14   3
Table 1

Fig. 7. Task Graph for the BCH(15,11,3) Code on 8 Processors
takes time proportional to $|R|/P$. This gives an overall time complexity of
$O(n|R|/P)$. The memory complexity of our method is $O(|R|)$ because the
paths and their weights must be stored for each state.
4 The Field Programmable Gate Array (FPGA) Implementation
The implementation of our decoder is best described by referring to the
particular task graph for the BCH(31,26,3) code shown in Fig. 7. Each processor
is assigned to a different block of the FPGA and operates independently. The
arcs, which represent the transmission of the weights and paths for each state
to another processor, are implemented by a shared register between the
processors. All systems were implemented in behavioural VHDL [9]. A synthesis
tool was used to construct the RTL VHDL for the decoders. This synthesised
unit was then simulated using a commercial simulation tool [10] for VHDL.
In VHDL the initial conditions, such as the location of the weights and paths
needed to update a state, are readily coded and so do not need to be calculated
for each cycle of the decoding process. The received message is fanned out to
all the processes a bit at a time and this is the logical clock for the machine.
On receiving each input bit, each processor reads the shared registers, updates
the weights and paths and writes the results to the shared registers.
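The per-bit update of weights and paths can be mimicked in software. The Python
sketch below is a sequential, hard decision illustration only and is not the
VHDL implementation: it assumes the standard systematic shift-register encoder
as the trellis, stores the low-order coefficients $g_0 \ldots g_{n-k-1}$ of the
generator in the integer `g` (bit i holds $g_i$), and uses hypothetical function
names. It does not partition the states among processors.

```python
INF = float("inf")

def encode(msg, n, k, g):
    """Systematic shift-register encoding: k message bits, then n-k parity bits."""
    r_bits = n - k
    mask = (1 << r_bits) - 1
    state, out = 0, []
    for u in msg:
        feedback = u ^ (state >> (r_bits - 1))
        state = ((state << 1) & mask) ^ (g if feedback else 0)
        out.append(u)
    for _ in range(r_bits):                    # shift the parity bits out
        out.append(state >> (r_bits - 1))
        state = (state << 1) & mask
    return out

def viterbi_decode(received, n, k, g):
    """Return the k message bits of the codeword closest in Hamming distance."""
    r_bits = n - k
    num_states = 1 << r_bits
    mask = num_states - 1
    weights = [INF] * num_states
    weights[0] = 0                             # the encoder starts in state 0
    paths = [[] for _ in range(num_states)]
    for t, bit in enumerate(received):
        new_weights = [INF] * num_states
        new_paths = [None] * num_states
        for s in range(num_states):
            if weights[s] == INF:
                continue                       # state not reachable yet
            msb = s >> (r_bits - 1)
            # Message phase: both inputs allowed; parity phase: output is forced.
            for u in ((0, 1) if t < k else (msb,)):
                nxt = ((s << 1) & mask) ^ (g if (u ^ msb) else 0)
                w = weights[s] + (u != bit)    # Hamming weight of this branch
                if w < new_weights[nxt]:
                    new_weights[nxt] = w
                    new_paths[nxt] = paths[s] + [u]
        weights, paths = new_weights, new_paths
    return paths[0][:k]                        # a valid codeword ends in state 0

# Assumed example: (7,4) code with G(D) = 1 + D + D^3, so g = 0b011.
codeword = encode([1, 0, 0, 0], 7, 4, 3)       # [1, 0, 0, 0, 1, 0, 1]
received = codeword[:]
received[2] ^= 1                               # inject a single bit error
decoded = viterbi_decode(received, 7, 4, 3)    # [1, 0, 0, 0]
```

In a parallel version the loop over states would be split among the processors,
with the `weights` and `paths` arrays playing the role of the shared registers.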
5 Results and Conclusion
No. of Processors   1      2     4     8     32    $|R| = 2^{n-k}$
BCH(7,4,3)          63     35    21    14    -     8
BCH(31,26,3)        1023   527   279   155   62    32
Table 2
Number of Cycles taken to decode BCH(7,4,3) and BCH(31,26,3) codewords.
As can be seen from Table 2, our parallelisation scheme exhibits perfect scaling
as the number P of processors is increased because the number of execution
cycles is $C = \left(\frac{|R|}{P} + 1\right) n$, as predicted. The term
$\frac{|R|}{P} n$ is the processing time and the additional term $n$ is the
communications time, which persists when $P = 1$ because we continue to use the
"shared" registers to store the weights and paths. Our parallel algorithm has
the same structure, and consequently the same time and memory complexity, for
soft as well as hard decision Viterbi decoders applied to cyclic as well as
convolution codes. The only adaptation we need to make to apply our method to
soft decision decoding is in the processor, which selects the path with the
maximum likelihood rather than comparing the Hamming weights. The possible
communications paths remain the same. Because of its perfect scaling we can
apply our algorithm to very large codes by implementing the communication
between processors as blocks of shared memory. Our parallelisation technique
is somewhat more efficient than trellis optimisation methods [11] since we
can gain perfect speed up simply by adding more processors. This is readily
achieved as all the component comparators are the same and can be replicated.
This is not the case with trellis optimisation methods as these result both in
a limited reduction in the complexity of the decoder (a factor of 8 only for the
Golay codes) and in trellis components that are not self similar and hence
cannot be replicated.
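The cycle count formula $C = (|R|/P + 1) n$ with $|R| = 2^{n-k}$ can be checked
against the entries of Table 2 with a few lines of Python. The function name
`predicted_cycles` is hypothetical, and the sketch assumes that P divides $|R|$.

```python
def predicted_cycles(n, k, processors):
    """C = (|R| / P + 1) * n with |R| = 2^(n-k) states."""
    states = 2 ** (n - k)
    return (states // processors + 1) * n

for n, k, counts in [(7, 4, [1, 2, 4, 8]), (31, 26, [1, 2, 4, 8, 32])]:
    print(n, k, [predicted_cycles(n, k, p) for p in counts])
# 7 4 [63, 35, 21, 14]
# 31 26 [1023, 527, 279, 155, 62]
```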
References
[1] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm, IEEE Transactions on Information Theory 13 (1967) 260-269.
[2] E. Berlekamp, Algebraic Coding Theory, McGraw-Hill, 1968.
[3] R. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, 1983.
[4] R. Bose, D. Ray-Chaudhuri, On a class of error-correcting binary group codes,
Information and Control 3 (1960) 68-79.
[5] A. Hocquenghem, Codes correcteurs d'erreurs, Chiffres (Paris) 2 (1959) 147-156.
[6] J. Reeve, A parallel Viterbi decoding algorithm, Concurrency and Computation
13 (2001) 95-102.
[7] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel
Computing, Design and Analysis of Algorithms, Benjamin-Cummings, 1994.
[8] R. Floyd, Algorithm 97: Shortest path, Communications of the ACM 5 (6)
(1962) 345.
[9] A. Williams, A behavioural VHDL synthesis system using data path optimisation,
Ph.D. thesis, The Department of Electronics and Computer Science, The
University of Southampton (1998).
[10] ModelSim SE/EE Plus 5.4c, Model Technology Incorporated, Portland, Oregon,
USA.
[11] A. Vardy, Handbook of Coding Theory, Elsevier Science BV, 1998, Ch. 24, pp.
1989-2118.