A systolic array implementation of a Reed-Solomon encoder and decoder. by McKenzie, Stephen Scott
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
1985













A SYSTOLIC ARRAY IMPLEMENTATION OF A
REED--SOLOMON ENCODER AND DECODER
by
Stephen Scott McKenzie
June 1985
Thesis Advisor: H. Fredricksen
Approved for public release; distribution is unlimited

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)
REPORT DOCUMENTATION PAGE (READ INSTRUCTIONS BEFORE COMPLETING FORM)
1. REPORT NUMBER 2. GOVT ACCESSION NO 3. RECIPIENT'S CATALOG NUMBER
4. TITLE (and Subtitle)
A Systolic Array Implementation of a
Reed-Solomon Encoder and Decoder
5. TYPE OF REPORT & PERIOD COVERED
Master's Thesis;
June 1985
6. PERFORMING ORG. REPORT NUMBER
7. AUTHOR(S)
Stephen Scott McKenzie
8. CONTRACT OR GRANT NUMBER(S)
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Naval Postgraduate School
Monterey, California 93943-5100
10. PROGRAM ELEMENT, PROJECT, TASK
AREA & WORK UNIT NUMBERS




June 1985
13. NUMBER OF PAGES
92




16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution is unlimited
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
18. SUPPLEMENTARY NOTES
19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Systolic Arrays, Finite Fields, Reed-Solomon Codes,
RS Encoder, RS Decoder, Systolic Multiplier, Primitive
Polynomial, Primitive Shift Register, VLSI, Pipelining,
Parallelism
20. ABSTRACT (Continue on reverse side if necessary and identify by block number)
A systolic array is a natural architecture for the
implementation of a Reed-Solomon (RS) encoder and decoder.
It possesses many of the properties desired for a special-
purpose application: simple and regular design, concurrency,
modular expansibility, fast response time, cost-effectiveness,
and high reliability. As a result, it is very well suited






20 (Continued)
This thesis takes a modular approach to the design of a
systolic array based RS encoder and decoder. Initially, the
concept of systolic arrays is discussed followed by an
introduction to finite field theory and Reed-Solomon codes.
Then it is shown how RS codes can be encoded and decoded
with primitive shift registers and implemented using a
systolic architecture. In this way, the reader can gain
valuable insight and comprehension into how these entities
are coalesced together to produce the overall implementation.
Approved for public release; distribution is unlimited
A Systolic Array Implementation of a
Reed-Solomon Encoder and Decoder
by
Stephen Scott McKenzie
Lieutenant, United States Navy
B.S., United States Naval Academy, 1979
Submitted in partial fulfillment of the
requirements for the degree of





A systolic array is a natural architecture for the
implementation of a Reed-Solomon (RS) encoder and decoder.
It possesses many of the properties desired for a special-
purpose application: simple and regular design, concurrency,
modular expansibility, fast response time, cost-effectiveness,
and high reliability. As a result, it is very well
suited for the simple and regular design essential for VLSI
implementation.
This thesis takes a modular approach to the design of a
systolic array based RS encoder and decoder. Initially, the
concept of systolic arrays is discussed followed by an
introduction to finite field theory and Reed-Solomon codes.
Then it is shown how RS codes can be encoded and decoded with
primitive shift registers and implemented using a systolic
architecture. In this way, the reader can gain valuable
insight and comprehension into how these entities are
coalesced together to produce the overall implementation.
TABLE OF CONTENTS
I. INTRODUCTION
II. SYSTOLIC ARRAYS
A. BACKGROUND
B. PRINCIPLE OF OPERATION
III. FINITE FIELD THEORY
A. BACKGROUND
B. AN EXAMPLE OF THE CREATION OF A FIELD
IV. REED-SOLOMON CODES
A. BACKGROUND
B. GENERIC ARCHITECTURE
V. IMPLEMENTATION THEORY
A. BACKGROUND
B. PRIMITIVE BINARY SHIFT REGISTER DESIGN
C. CODING THEORY
D. MINIMAL POLYNOMIALS
E. SYSTOLIC ARRAY MULTIPLIER
VI. IMPLEMENTATION
A. BINARY ENCODER
1. Encoding Process
2. Single-Error-Correcting Binary Encoder
3. Double-Error-Correcting Binary Encoder
B. REED-SOLOMON ENCODER
C. BINARY DECODER
1. Decoding Process
2. Single-Error-Correcting Binary Decoder
3. Double-Error-Correcting Binary Decoder
D. REED-SOLOMON DECODER
VII. CONCLUSION
LIST OF REFERENCES
BIBLIOGRAPHY
INITIAL DISTRIBUTION LIST
LIST OF TABLES
I. REPRESENTATION OF $GF(2^4)$
II. REGISTER CONTENTS AFTER SUCCESSIVE CLOCK SIGNALS
III. CYCLOTOMIC COSETS
IV. MINIMAL POLYNOMIALS OF ELEMENTS IN $GF(2^4)$
V. COMPUTATION OF P = AB + C IN $GF(2^4)$
VI. VERIFICATION OF THE CODE POLYNOMIAL
VII. SYNDROME CALCULATION USING LONG DIVISION
VIII. CORRECTION AND DECODING PROCESS
LIST OF FIGURES
2.1 Various Systolic Array Configurations
2.2 The Concept of a Systolic Processor Array
4.1 A Reed-Solomon Codeword
4.2 A Systolic Architecture
4.3 The Systolic Cell Structure
5.1 A 4-Stage Primitive Shift Register
5.2 The Encoding Process
5.3 A Systolic Multiplier for the Finite Field $GF(2^4)$
5.4 The Circuit of the Cell L_j
6.1 A Single-Error-Correcting Binary Encoder
6.2 A Double-Error-Correcting Binary Encoder
6.3 The RS Encoder
6.4 The Error Detection Register
6.5 The Error Correction Register
6.6 The Initial Single-Error-Correcting Binary Decoder
6.7 The Complete Single-Error-Correcting Binary Decoder
6.8 Stage I: The Syndrome Generator
6.9 Stage II: The Central Galois Field Processor
6.10 Stage III: The Chien Searcher
6.11 The Complete Double-Error-Correcting Binary Decoder
6.12 The RS Decoder Architecture
6.13 The RS Decoding Timing Chart
I. INTRODUCTION
In this volatile and technological age, it is
imperative that communication links and computer memories
transmit information reliably and quickly. However, in many
cases this is virtually impossible because noise causes the
received data to differ significantly from the original
data. To rectify this situation, error-correcting
codes have been developed to enable a system to
maintain a high degree of reliability despite the presence
of noise. To accomplish the error correction, in addition
to the data or information bits that are transmitted, some
additional redundant check bits or parity bits are also
transmitted. In this way, although the noise may introduce
some errors in either the transmitted data bits or the
transmitted check bits, there are usually still enough
uncorrupted bits available to the receiver to allow a
sophisticated decoder to correct the errors. In fact, only
a modest amount of redundancy is actually needed to ensure
that the probability of a decoding error is negligibly
small. [Ref. 1]
Nonetheless, unlike the encoders and decoders of the
1950s and 1960s, which were constrained by digital hardware
costs and virtually nonexistent chip technology, today's
encoders and decoders, coupled with significant improvements
in their associated algorithms, have become, and will
continue to be, increasingly attractive from an economic
viewpoint.
One such class of error-correcting codes, which is very
popular in communication circles and is paramount in
this author's discussion of systolic array encoders and
decoders, is the Reed-Solomon (RS) codes. These codes can
correct both random and burst errors over a communication
channel, and as such are ideal for the very low error
probabilities needed for reliable space communications. Still,
the RS codes are only as effective as the complexity of the
encoder that produces them and the decoder by which errors
are corrected. The encoder complexity is directly
proportional to the error-correcting capability of the code, the
speed of the encoding process, and the interleaving level
used, i.e., the number of original codewords which are
multiplexed together to increase the immunity of codes to
burst errors [Ref. 1]. In fact, for truly reliable space
communications there is a bona fide need to use RS codes with
a large error-correcting capability and an equally large
interleaving level. As a result, one is especially
interested in minimizing the complexity of an
RS encoder while simultaneously ensuring maximum performance
and high reliability. Clearly, what is needed for this type
of application is a special-purpose system which complements
the aforementioned attributes. Therefore, a systolic array
is a natural architecture for the simple, regular, and
cost-effective implementation of an RS encoder and decoder.
To aid simplicity and comprehension, this author has taken
each topic vital to the thesis and given it a chapter of
its own.
After systolic arrays are introduced in Chapter II, the
fundamentals of finite fields necessary for an understanding
of Reed-Solomon codes are discussed in Chapters III and IV.
In Chapter V a systolic array multiplier for finite fields
is discussed, and finally in Chapter VI the encoder and
decoder for binary codes are described as well as the encoder




It is clear today that developments in microelectronics
have had a revolutionary impact on computer design
[Ref. 2]. For example, integrated circuit technology has
significantly increased the number and complexity of
components that can now fit on a chip or a printed circuit
board. In fact, with the component density presently
doubling every one to two years, the million-transistor
chip will soon be a reality [Ref. 3]. Commensurate
with this major increase in chip density is the
utilization of highly parallel computing structures which,
almost by definition, imply a basic computational element
repeated hundreds or thousands of times. This architectural
style, which has structural properties suitable for VLSI
implementation, reduces the design problem by several orders
of magnitude. As a result, we are interested in
high-performance parallel structures that can be implemented
directly via very economical hardware devices [Ref. 2]. In
other words, cost-effectiveness has always been, and will
continue to be, a major concern in designing special-purpose
VLSI systems; their cost must be low enough to justify their
limited applicability. Furthermore, if a structure can
truly be decomposed into a few types of building blocks
which are used repetitively with simple interfaces,
tremendous savings can be achieved.
This is especially true for VLSI designs where a single
chip usually comprises hundreds of thousands of identical
components. Clearly, in order to overcome this
complexity, simple and regular designs are essential. In
fact, VLSI systems which are based on simple, regular layouts
are very likely to be modular and therefore adjustable
to various performance levels. Still, with the technological
indication of a diminishing growth rate for component speed,
any major improvement in computation speed must come from
the concurrent use of many processing elements. [Ref. 3]
The degree of concurrency in a VLSI computing structure
is largely determined by the underlying algorithm.
Consequently, massive parallelism can be achieved if the
algorithm is designed to exploit high degrees of pipelining
and multiprocessing. For instance, when a large number of
processing elements work simultaneously, coordination and
communication become significant, especially with VLSI
technology where routing costs dominate the power, time, and
area required to implement a computation. Thus, the
requirement is to design algorithms that support high
degrees of concurrency, and at the same time to employ only
simple, regular communication and control to ensure
efficient implementation. [Ref. 4]
Clearly, what is required is a special-purpose design
which employs simple and regular communication paths for
multiprocessor structures in addition to pipelining as a
general method for utilizing these structures. In short,
systolic arrays provide a realistic model of computation
which captures these concepts of pipelining, parallelism,
and interconnection structures.
According to Kung and Leiserson [Ref. 2]:
A systolic array is a collection of relatively simple
processing units, usually of the same type, which are
connected together by a simple communication network and
that operate in parallel, as depicted in Figure 2.1.
The performance advantage of a systolic array architecture
is that it uses each datum retrieved from memory
numerous times without having to store and retrieve
intermediate results, thus allowing significant speedups
relative to the memory bandwidth. Thus, a systolic
system is a network of processors which rhythmically
computes and passes data through the system. The
analogy is to the rhythmic contraction of the heart
which pulses blood through the circulatory system of the
body. Each processor in a systolic network can be
thought of as an element through which multiple streams
of data are pumped. The regular beating of these
parallel processors maintains a constant flow of data
throughout the entire network. As data items are pumped
through the network some constant-time computation is
performed and, depending on the operation, updates of
some of the items may occur. However, unlike the
closed-loop circulatory system of the body, a systolic
computing system usually has ports into which inputs
flow, and ports from which the results of the computation
are received. Thus, a systolic system can be
viewed as a pipelined system—one in which input and
output occur with every pulsation.
As a result, this makes it extremely attractive for a
wide class of compute-bound computations where multiple




[Figure 2.1 Various Systolic Array Configurations: (a) one-dimensional linear array; (b) triangular array; (c) two-dimensional square array; (d) two-dimensional hexagonal array]
B. PRINCIPLE OF OPERATION
The basic principle of a systolic array is illustrated
in Figure 2.2. As stated earlier, by replacing a single
processing element (PE) with an array of processing
elements, a higher computation throughput can be achieved
without increasing the memory bandwidth.
Suppose each processing element in Figure 2.2 operates
with a clock period of 100 nanoseconds (ns). The
conventional memory-processor organization in Figure 2.2a
has at most a performance of 5 million operations per
second (MOPS). With the same clock rate, the systolic
array processor will result in 30 MOPS performance.
This gain in processing speed can also be justified with
the fact that the number of pipeline stages has been
increased six times in Figure 2.2b. Being able to use
each input data item a number of times is just one of
the many advantages of the systolic approach. Other
advantages include modular expansibility, simple and
regular data and control flows, use of simple and
uniform cells, elimination of global broadcasting,
limited fan-in and fast response time. [Ref. 3]
With the above criteria a systolic array is a natural
architecture for the implementation of an RS encoder and
decoder which will become apparent after our introduction of





[Figure 2.2 The Concept of a Systolic Processor Array: (a) the conventional processor; (b) a systolic processor array]
III. FINITE FIELD THEORY
A. BACKGROUND
Finite or Galois fields (named after the nineteenth
century French mathematician Evariste Galois) play many
important and diverse roles in numerous applications ranging
from digital signal processing to switching theory.
However, in this thesis we are concerned with their use in the
construction of Reed-Solomon error-correcting codes. We
begin with a general analysis of the pertinent facts
regarding finite fields. In the next chapter the necessary
facts about Reed-Solomon codes are discussed.
A field is a set of elements, including 0 and 1, any
pair of which may be added or multiplied (denoted by + and
*, respectively) to give a unique result in the field. The
addition and multiplication are associative and commutative,
and multiplication distributes over addition in the usual
way: u*(v+w)=u*v+u*w. Every field element u has a unique
negative -u such that u+(-u)=0. Every nonzero field element
u has a unique reciprocal field element 1/u, such that
u*(1/u)=1. For every field element u, 0+u=u=1*u, and 0*u=0.
Thus the numbers 0 and 1 are the additive and multiplicative
identities, respectively. [Ref. 5]
The order of a field is the number of elements in the
field. If the order is infinite, we call the field an
infinite field. The rational numbers, the real numbers, and
the complex numbers are all examples of infinite fields. If
the number of elements is finite we call the field a finite
field [Ref. 5].
For any prime p and any positive integer m a Galois
field denoted $GF(p^m)$ or $GF(q)$, with $q=p^m$, exists. We can
construct a field containing $p^m$ elements as an algebra of
polynomials modulo an irreducible polynomial over GF(p) of degree m.
Addition is bit-by-bit modulo p addition.
The multiplicative group of the nonzero field elements
is cyclic, i.e., it is a group that consists of all the
powers of one of its elements, $\beta$. Multiplication is defined
as $\beta^i * \beta^j = \beta^{i+j}$, where $i+j$ is computed modulo
$(p^m-1)$ and $\beta$ is a
generator of this group. A generator of this multiplicative
group, called a primitive element, is a root of an
irreducible polynomial over the prime field GF(p). This
irreducible polynomial, called a primitive polynomial, is
the minimal polynomial of the primitive element, i.e., the
polynomial of least degree with the primitive element $\beta$ as a
root. Generally speaking, an irreducible polynomial is
analogous to a prime number: it has no nontrivial factors.
Lastly, the Galois fields that can be created by taking
residue or equivalence classes of polynomials modulo an
irreducible polynomial over GF(p) are said to be fields of
characteristic p. Thus, $GF(p^m)$ is a field of characteristic
p for each choice of positive integer m [Ref. 6].
B. AN EXAMPLE OF THE CREATION OF A FIELD
Consider the Galois field $GF(2^4)$. It has $2^4$ elements
and may be constructed as the field of polynomials over
GF(2) modulo the irreducible polynomial $1+x+x^4$. If we let $\beta$
represent a root of $1+x+x^4$, then it is also a primitive
element of the field. Field addition of the elements is
bit-by-bit modulo 2 addition while multiplication of the
elements is described using the primitivity of the element
$\beta$. Thus, $\beta^i * \beta^j = \beta^{i+j}$, where $i+j$ is reduced
modulo 15, if necessary. For example, given the field elements
$\beta^{13}$ and $\beta^{9}$, two of the 15 nonzero field elements
listed in Table I, we can easily demonstrate both operations:
$$\beta^{13}+\beta^{9} = (\beta^3+\beta^2+1)+(\beta^3+\beta) = \beta^2+\beta+1 = \beta^{10}$$
while
$$\beta^{13}*\beta^{9} = \beta^{13+9} = \beta^{22} = \beta^{22-15} = \beta^{7} = \beta^3+\beta+1.$$
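TABLE I
REPRESENTATION OF $GF(2^4)$

FIELD ELEMENT                 POLYNOMIAL IN β         4-TUPLE (1, β, β², β³)
β^0                           1                       1000
β^1                           β                       0100
β^2                           β²                      0010
β^3                           β³                      0001
β^4                           1 + β                   1100
β^5  = β(β^4)                 β + β²                  0110
β^6  = β(β^5)                 β² + β³                 0011
β^7  = β(β^6)                 1 + β + β³              1101
β^8  = β(β^7)                 1 + β²                  1010
β^9  = β(β^8)                 β + β³                  0101
β^10 = β(β^9)                 1 + β + β²              1110
β^11 = β(β^10)                β + β² + β³             0111
β^12 = β(β^11)                1 + β + β² + β³         1111
β^13 = β(β^12)                1 + β² + β³             1011
β^14 = β(β^13)                1 + β³                  1001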
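The construction above can be checked mechanically. The following is a minimal sketch, not part of the original thesis: it assumes field elements are stored as integers whose bit k is the coefficient of β^k, and the names `build_tables`, `gf_add`, `gf_mul`, and `four_tuple` are hypothetical helpers chosen for illustration. It regenerates the 4-tuples of the nonzero elements and repeats the two worked operations on β^13 and β^9.

```python
# Sketch: GF(2^4) built from the primitive polynomial 1 + x + x^4.
# Integers encode field elements; bit k holds the coefficient of beta^k.
PRIMITIVE_POLY = 0b10011  # x^4 + x + 1


def build_tables():
    """Return (EXP, LOG) with EXP[i] = beta^i and LOG[beta^i] = i."""
    exp, log = [], {}
    value = 1
    for i in range(15):
        exp.append(value)
        log[value] = i
        value <<= 1                  # multiply by beta
        if value & 0b10000:          # a beta^4 term appeared: replace it by 1 + beta
            value ^= PRIMITIVE_POLY
    return exp, log


EXP, LOG = build_tables()


def gf_add(a, b):
    """Bit-by-bit modulo 2 addition."""
    return a ^ b


def gf_mul(a, b):
    """Multiply nonzero elements by adding exponents modulo 15."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]


def four_tuple(a):
    """Coefficients of 1, beta, beta^2, beta^3, in that order."""
    return "".join(str((a >> k) & 1) for k in range(4))


for i in range(15):
    print(f"beta^{i:<2} {four_tuple(EXP[i])}")

total = gf_add(EXP[13], EXP[9])
product = gf_mul(EXP[13], EXP[9])
print(f"beta^13 + beta^9 = beta^{LOG[total]}")
print(f"beta^13 * beta^9 = beta^{LOG[product]}")
```

The exponent and logarithm tables trade a small amount of memory for constant-time multiplication, which is the software analogue of reading products off Table I.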
IV. REED- SOLOMON CODES
A. BACKGROUND
Reed-Solomon (RS) codes are Bose-Chaudhuri-Hocquenghem
(BCH) codes over GF(q) of length q-1. They are
error-correcting codes which are used in many special-purpose
applications ranging from deep-space communications and
spread spectrum to digital audio disk systems and secure
data transmissions [Ref. 7]. These codes can correct both
random and burst errors over a communication channel and
hence are ideal for the real-time, reliable operation
demanded by these applications. The complexity
of RS encoders and decoders is proportional to the
error-correcting capability of the code, the speed of the
decoding, and the interleaving depth used [Ref. 8]. For truly
reliable communications there is a very strong tendency to
use RS codes with a large error-correcting capability and an
equally large interleaving level. Hence, one is especially
interested in minimizing the complexity of RS encoders and
decoders for communications and other pertinent
applications. Toward this end, there is considerable interest in
systolic array construction and eventual VLSI implementation
of RS encoders and decoders which yield significant savings
in size, weight, and power consumption while simultaneously
providing high reliability.
In this chapter we look at a generic construction and
architecture of an RS encoder developed by Johl [Ref. 9]
and use this design as a foundation for the subsequent
discussion and implementation in the later chapters. This
implementation utilizes a systolic architecture of identical
cells arranged in a linear array, each executing a
finite-field multiplication and addition in a pipelined manner,
thereby significantly increasing the throughput rate.
Also, since the layout of the cell need only be done once
and then replicated, it is extremely attractive for eventual
VLSI implementation.
B. GENERIC ARCHITECTURE
The RS code is a block code which consists of symbols of
more than one bit. When each symbol is J bits wide, an RS
codeword has $2^J-1$ symbols. As depicted in Figure 4.1, an
RS code can be designed to be capable of correcting E errors
with each codeword consisting of I information symbols,
together with 2E parity or check symbols. As an example,
given the irreducible polynomial $1+\beta+\beta^4=0$ and its
corresponding finite field as described in Table I, we are able to
establish an important foundation vital to the development
of a generic RS encoder. This RS code consists of a total of
15 four-bit symbols for each codeword. If this particular
code should correct one error, it would need two parity


































symbols. This representation is known as an RS (15,13)
code, where the first integer depicts the total number of
symbols in the codeword, and the second integer indicates
the number of information symbols. It is the responsibility
of the encoder to use the information symbols to generate
the check or parity symbols for the codeword. The informa-






where $f_i$ is the $i$th transmitted information symbol. The
corresponding generator polynomial is known as g(x).
$$g(x) = \prod_{i=1}^{2E} (x+\beta^{i})$$
Then, the 2E parity symbols are defined as the coefficients





$$
\begin{aligned}
g(x) &= (x+\beta^{1})(x+\beta^{2}) \\
     &= x^{2}+(\beta^{1}+\beta^{2})x+\beta^{3} \\
     &= x^{2}+\beta^{5}x+\beta^{3}
\end{aligned}
$$
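Furthermore, let us assume that the thirteen information
symbols are $\beta^{6}$, $\beta^{1}$, $\beta^{8}$, $\beta^{2}$, $\beta^{4}$, $\beta^{5}$, $\beta^{12}$, $\beta^{7}$, $\beta^{9}$, $\beta^{11}$, $\beta^{14}$,
$\beta^{3}$, and $\beta^{13}$.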







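$$
\begin{aligned}
f(x) &= f_1x^{14}+f_2x^{13}+f_3x^{12}+f_4x^{11}+f_5x^{10}+f_6x^{9}+f_7x^{8}+f_8x^{7}+f_9x^{6}+f_{10}x^{5} \\
     &\quad +f_{11}x^{4}+f_{12}x^{3}+f_{13}x^{2} \\
     &= \beta^{6}x^{14}+\beta^{1}x^{13}+\beta^{8}x^{12}+\beta^{2}x^{11}+\beta^{4}x^{10}+\beta^{5}x^{9}+\beta^{12}x^{8}+\beta^{7}x^{7}+\beta^{9}x^{6}+\beta^{11}x^{5} \\
     &\quad +\beta^{14}x^{4}+\beta^{3}x^{3}+\beta^{13}x^{2}
\end{aligned}
$$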
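Performing the required division, f(x)/g(x):
$$
\begin{array}{r@{\;}l}
 & \beta^{6}x^{12}+\beta^{6}x^{11}+\beta^{0}x^{10}+\cdots+\beta^{14}x^{2}+\beta^{9}x+0 \\
\cline{2-2}
x^{2}+\beta^{5}x+\beta^{3}\ \big) & \beta^{6}x^{14}+\beta^{1}x^{13}+\beta^{8}x^{12}+\cdots+\beta^{14}x^{4}+\beta^{3}x^{3}+\beta^{13}x^{2} \\
 & \beta^{6}x^{14}+\beta^{11}x^{13}+\beta^{9}x^{12} \\
\cline{2-2}
 & \beta^{6}x^{13}+\beta^{12}x^{12}+\beta^{2}x^{11} \\
 & \beta^{6}x^{13}+\beta^{11}x^{12}+\beta^{9}x^{11} \\
\cline{2-2}
 & \beta^{0}x^{12}+\beta^{11}x^{11}+\beta^{4}x^{10} \\
 & \beta^{0}x^{12}+\beta^{5}x^{11}+\beta^{3}x^{10} \\
\cline{2-2}
 & \vdots \\
 & \beta^{14}x^{4}+\beta^{14}x^{3}+\beta^{13}x^{2} \\
 & \beta^{14}x^{4}+\beta^{4}x^{3}+\beta^{2}x^{2} \\
\cline{2-2}
 & \beta^{9}x^{3}+\beta^{14}x^{2}+0x \\
 & \beta^{9}x^{3}+\beta^{14}x^{2}+\beta^{12}x \\
\cline{2-2}
 & \beta^{12}x
\end{array}
$$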
Hence, the remainder we seek is $\beta^{12}x$, and thus the
corresponding 15-symbol codeword is
$\beta^{6}\ \beta^{1}\ \beta^{8}\ \beta^{2}\ \beta^{4}\ \beta^{5}\ \beta^{12}\ \beta^{7}\ \beta^{9}\ \beta^{11}\ \beta^{14}\ \beta^{3}\ \beta^{13}\ \beta^{12}\ 0$,
where the first thirteen symbols represent the
information symbols and the last two symbols represent the
parity symbols. [Ref. 9]
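The RS (15,13) example can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the encoder of [Ref. 9]: field elements are integers whose bit k is the coefficient of β^k, polynomials are coefficient lists with the highest-degree term first, and the helper names (`gf_mul`, `poly_mul`, `generator_poly`, `poly_divmod`, `rs_encode`) are hypothetical. It forms g(x) from its roots, divides f(x) by g(x), and appends the remainder as the parity symbols.

```python
# Sketch: RS(15,13) encoding over GF(2^4) by polynomial division.
PRIMITIVE_POLY = 0b10011  # x^4 + x + 1
EXP = [1]
for _ in range(14):
    nxt = EXP[-1] << 1
    EXP.append(nxt ^ PRIMITIVE_POLY if nxt & 0b10000 else nxt)
LOG = {value: i for i, value in enumerate(EXP)}


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]


def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def generator_poly(num_errors):
    """g(x) = product of (x + beta^i) for i = 1 .. 2E."""
    g = [1]
    for i in range(1, 2 * num_errors + 1):
        g = poly_mul(g, [1, EXP[i % 15]])
    return g


def poly_divmod(dividend, divisor):
    """Long division by a monic divisor; returns (quotient, remainder)."""
    work = list(dividend)
    steps = len(dividend) - len(divisor) + 1
    quotient = []
    for i in range(steps):
        q = work[i]                      # divisor is monic, so q is the quotient term
        quotient.append(q)
        for j, d in enumerate(divisor):
            work[i + j] ^= gf_mul(q, d)  # subtraction equals addition in GF(2^m)
    return quotient, work[steps:]


def rs_encode(info, num_errors):
    g = generator_poly(num_errors)
    quotient, parity = poly_divmod(info + [0] * (2 * num_errors), g)
    return info + parity, quotient


INFO_EXPONENTS = (6, 1, 8, 2, 4, 5, 12, 7, 9, 11, 14, 3, 13)
info = [EXP[e] for e in INFO_EXPONENTS]
codeword, quotient = rs_encode(info, 1)
print("parity exponents:", [LOG.get(s, "zero") for s in codeword[-2:]])
```

Appending zeros to the information symbols corresponds to the factor x^2 already present in f(x); the quotient is discarded by an encoder and is returned here only so it can be compared with the long division.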
The architecture of the systolic implementation consists
of a regular array of identical cells. Division is
performed in a pipelined manner by simultaneously entering the
highest order terms of the f(x) and g(x) polynomials on
the leftmost cell and generating the appropriate codeword
on the far right, as depicted in Figure 4.2. In fact, a
codeword can immediately follow the previous one without any
interruption in the pipeline flow. Likewise, the control is
also systolic. One control bit pipeline path will signal
the start of a new codeword; another will signal the start
of the division operation. Meanwhile, each cell of the
array will hold one term of the quotient. As a result, if d
represents the difference in degrees between two
polynomials, then
$$d = \deg f(x) - \deg g(x)$$
and thus d+1 cells are required. For example,
$$\deg f(x) = 14$$
$$\deg g(x) = 2$$































From our previous calculation, the quotient was
$(\beta^{6}x^{12}+\beta^{6}x^{11}+\beta^{0}x^{10}+\cdots+\beta^{14}x^{2}+\beta^{9}x+0)$.
Since it consists of thirteen
terms, thirteen cells would be needed. In general,
$$\deg f(x) = 2^{J}-2$$
$$\deg g(x) = 2E$$
$$d = 2^{J}-2E-2$$
and so the total number of cells required is d+1 or $2^{J}-2E-1$.
[Ref. 9]
The operation of each cell is simple and regular.
Essentially, it accomplishes one line of the normal division
by initially determining the specific term of the quotient,
multiplying by the divisor, subtracting the result from the
dividend, and finally passing along the divisor and partial
result to the next cell. More specifically, there are three
J-bit data paths and two 1-bit control paths, as shown in
Figure 4.3. The function of the C data path is to allow the
information symbols to pass through the array unchanged
while the other two data paths, A and B, are for the
dividend and divisor, respectively. The register Q is set
at the start of the division, and remains the same throughout
the polynomial division of one block. The register B is
used as a temporary storage device. While a control bit
accompanies the first byte of information to signal the
start of a new codeword, a preceding start bit, one-half the














[Figure 4.3 The Systolic Cell Structure. Legend: A: used for dividend; B: used for divisor; C: used for information symbols; Q: division register; S: start register; CONTROL: used to start division; symbols for the finite field multiplier and finite field adder]
each cell. In short, the above architecture is simply a
pipelined parallel processor which is composed of a systolic
array of identical cells, each performing a finite-field
multiplication and addition. Since the layout is simple and
regular, it is easily replicated and economical to produce.
[Ref. 9]
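The cell-by-cell division described above can be modelled functionally. The following is a sketch only, under loud assumptions: it is not cycle-accurate, it ignores the C data path and the control bits, and the names `cell` and `systolic_divide` are hypothetical. Each call to `cell` performs one line of the long division (fix the quotient term, multiply the divisor, subtract, pass the partial result on), and d+1 cells are chained.

```python
# Sketch: functional model of the linear array, one long-division line per cell.
PRIMITIVE_POLY = 0b10011  # x^4 + x + 1
EXP = [1]
for _ in range(14):
    nxt = EXP[-1] << 1
    EXP.append(nxt ^ PRIMITIVE_POLY if nxt & 0b10000 else nxt)
LOG = {value: i for i, value in enumerate(EXP)}


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 15]


def cell(partial, divisor):
    """One cell: latch the quotient term, then pass on the reduced dividend.

    partial and divisor are coefficient lists, highest degree first;
    the divisor is assumed monic.
    """
    q = partial[0]                                   # held in the Q register
    head = [a ^ gf_mul(q, b) for a, b in zip(partial, divisor)]
    return q, head[1:] + partial[len(divisor):]      # leading term is now zero


def systolic_divide(dividend, divisor):
    num_cells = len(dividend) - len(divisor) + 1     # d + 1
    quotient, partial = [], list(dividend)
    for _ in range(num_cells):
        q, partial = cell(partial, divisor)
        quotient.append(q)
    return quotient, partial, num_cells


g = [1, EXP[5], EXP[3]]                              # x^2 + beta^5 x + beta^3
info = [EXP[e] for e in (6, 1, 8, 2, 4, 5, 12, 7, 9, 11, 14, 3, 13)]
quotient, remainder, num_cells = systolic_divide(info + [0, 0], g)
print(num_cells, [LOG.get(s, "zero") for s in remainder])
```

In hardware the cells operate concurrently on successive codewords; the sequential loop here only demonstrates that chaining d+1 identical steps yields the same remainder as the long division.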
In Chapter VI the encoder and decoder for an RS code are
described in greater detail with the encoding and decoding




In this chapter we look at the theoretical concepts
behind the systolic implementation of an encoder and
decoder. We then apply these concepts to the actual
implementation in the subsequent chapter. There, the binary case
is initially presented because of its simple architecture
and ease of understanding. It is then followed by the more
intricate and complex Reed-Solomon case.
We also, in this chapter, discuss in depth the design of
a systolic array multiplier used in the RS encoder. Unlike
the binary case, which deals only with the elements 0 and 1
in the complete codeword, the Reed-Solomon codeword will
contain symbols which lie in a larger field than GF(2). As
a result, the systolic array multiplier is considerably more
detailed and complicated than the binary design, which
simply uses a primitive binary shift register scheme.
B. PRIMITIVE BINARY SHIFT REGISTER DESIGN
A primitive binary shift register is a series of registers,
each capable of containing a zero or a one. The
contents of the register all shift on a designated time
signal supplied by an external clock. The contents of the
newest stage of the register are defined as a function of the
current contents of the register. Because these shift
registers utilize this feedback property, they are commonly
referred to as feedback shift registers, or as primitive shift
registers since the feedback is usually described by a
primitive polynomial [Ref. 10]. For example, the diagram in
Figure 5.1 describes a primitive shift register composed of
four registers, labeled $1$, $x$, $x^2$, $x^3$, and one modulo 2 adder
situated between registers $1$ and $x$. Each register is
capable of storing one bit of binary information, i.e., a
"1" or a "0". The all-zero contents of the register is
typically prohibited. This restriction is placed on the
primitive shift register to ensure a change of state when a
new clock signal is received. The register is allowed to
step from state to state; the length of a primitive
cycle is therefore independent of its initial state and is equal
to $2^m-1$. The primitive shift register of Figure 5.1 will
move through 15 distinct binary patterns before repeating
(see Table II). This primitive shift register is said to
have a cycle length of $2^4-1$ or 15. Moreover, since all
nonzero patterns are included in the cycle, it is called a
maximum-length cycle. In general, a primitive shift register
composed of m stages will generate a maximum-length
cycle of period $2^m-1$. It is possible for each value of m to
determine a primitive feedback function for the shift
register so that a maximum-length shift register sequence of
































REGISTER CONTENTS AFTER SUCCESSIVE CLOCK SIGNALS



































Maximum-length cycles and maximum-length sequences have
broad applications in data communication systems and
computer simulation, while primitive shift registers designed as
division circuits have applications in coding theory
[Ref. 10]. It is the objective of this chapter to utilize
the concepts of the latter to propose an RS encoder and
decoder.
In order to generate a maximum-length cycle or sequence
we need to understand the necessary component connections
given a primitive polynomial. That is, given an arbitrary
primitive polynomial, how do we design the shift register?
For the example of Figure 5.1, assume $p(x)=1+x+x^4$ is a
primitive polynomial over GF(2). We can consider $GF(2^4)$ as
an algebra of polynomials modulo $p(x)=1+x+x^4$ and design a
register to produce a pattern cycle of length $2^4-1$. Using
four delay units (since we need a register unit for the
coefficient of each term $x^t$ with $0 \le t \le 3$) we need only decide
how the primitive polynomial affects the feedback to know
where to place the modulo 2 adder components and where to
make the necessary circuit connections. The feedback is the
coefficient of $x^4$, but in this polynomial algebra $x^4=1+x$.
Thus, the feedback goes to the registers which contain the
coefficients of the $x^0$ and $x^1$ terms. Making these
connections and supplying the modulo 2 adder component where we
have two inputs to the register, we arrive at the shift
register given in Figure 5.1. Then each step of the
register is equivalent to multiplying the contents of the
register by the primitive element $\beta$. Thus, the sequence of
contents are the powers of $\beta$ modulo $(1+\beta+\beta^4)$. In this way
multiplication of the elements of the field is produced
simply as described in Chapter III and the powers of $\beta$ are
as given in Table I of Chapter III.
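The feedback connections just derived can be simulated directly. This is a minimal sketch, assuming the register contents are held as a tuple of the coefficients of 1, x, x^2, x^3 in that order (the thesis's Table II may list the stages in a different order); the names `step` and `cycle_from` are hypothetical.

```python
# Sketch: the 4-stage primitive shift register for p(x) = 1 + x + x^4.
# A state holds the coefficients (c0, c1, c2, c3) of 1, x, x^2, x^3.


def step(state):
    """One clock signal: multiply the register contents by x modulo p(x)."""
    c0, c1, c2, c3 = state
    # The x^3 stage feeds back into stage 1 and, through the modulo 2 adder,
    # into stage x, because x^4 = 1 + x in this polynomial algebra.
    return (c3, c0 ^ c3, c1, c2)


def cycle_from(state):
    """Collect successive register contents until a state repeats."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen


cycle = cycle_from((1, 0, 0, 0))
for clock, contents in enumerate(cycle):
    print(clock, "".join(map(str, contents)))
print("cycle length:", len(cycle))
```

Starting from any other nonzero state produces the same 15 patterns in a rotated order, while the all-zero state maps to itself, which is why it is excluded.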
C. CODING THEORY
Suppose that we wish to transmit a sequence of binary
digits across a noisy channel. If we send a one, a one will
probably be received, and if we send a zero, a zero will also
probably be received. Occasionally, the channel noise will
cause a transmitted one to be received as a zero or a trans-
mitted zero to be received as a one. Although we are unable
to prevent the channel from generating such errors, we can
reduce their undesirable effects with the use of coding
[Ref. 5]. The basic idea is simple. A set of k message
digits which we wish to transmit is concatenated to r check
digits. The entire block of n=k+r channel digits then forms
the transmitted codeword. Assuming that the channel noise
changes sufficiently few of these n transmitted channel
digits, the redundancy afforded by r check digits provides
the receiver with sufficient information to detect and
correct the channel errors. Figure 5.2 illustrates the
basic idea of the encoding process for an (n,k) encoder with





































that the message digits appear at the far right. The error
correcting capability of the generated code depends upon the
number of check bits added. To illustrate, the binary code
constructed using the encoder of Figure 5.2 is capable of
correcting one error when, for example, n=2^m-1 and k=2^m-1-m for
each integer m ≥ 2, the so-called Hamming
single-error-correcting code.
D. MINIMAL POLYNOMIALS
In order for a code to correct every pattern of t or
fewer channel errors, the codewords must be generated by a
polynomial which is the product of at least t
distinct minimal polynomials [Ref. 5]. Occasionally, extra
error correcting capability is possessed by words of
a code beyond the designed capacity of the code. To
understand this situation and the general error correcting
capacity of the code, it is necessary that we discuss
some of the mathematical concepts and properties of
minimal polynomials before discussing the actual
implementation.
A minimal polynomial for a primitive element β over
GF(p) is the lowest degree irreducible monic (has leading
coefficient 1) polynomial M(x) with coefficients from GF(p)
such that M(β)=0 [Ref. 11]. For example, the Galois field
GF(2^4) is constructed using the primitive element β, the
root of the irreducible polynomial 1+x+x^4. Then the minimal






β^3    1+x+x^2+x^3+x^4
β^5    1+x+x^2
β^7    1+x^3+x^4
Furthermore, in GF(2^m) β^i and β^(2i) have the same minimal
polynomial. In general, if β^i is a root of a minimal polynomial
then so is β^(pi) (where p is the characteristic of
the ground field GF(p); in this case p=2). To illustrate,
let us substitute the elements β and β^2 into our minimal
polynomial M(x)=1+x+x^4. Substituting β for x we obtain 1+β+β^4.
In GF(2^4) β^4=β+1, so that M(β)=0. Likewise, upon substituting
β^2 for x in the same minimal polynomial we obtain
1+β^2+β^8, which in GF(2^4) is also zero as can be seen in
Table I of Chapter III. Elements of the field with the same
minimal polynomial are called conjugates. In the same way,
the imaginary roots i and -i are referred to as conjugate
complex numbers; they both have the same minimal polynomial
x^2+1 over the reals [Ref. 11].
From our preceding discussion, it is clear that β, β^2,
(β^2)^2 = β^4, and (β^4)^2 = β^8 all have the same minimal polynomial
1+x+x^4. Likewise β^3, β^6, β^12, and β^24=β^9 also have the same
minimal polynomial 1+x+x^2+x^3+x^4. We see that the powers of
β fall into disjoint sets, called cyclotomic cosets. In
fact, all β^j which are elements of the same cyclotomic coset
have the same minimal polynomial. The cyclotomic coset
containing s is

    C_s = \{ s, 2s, 2^2 s, \ldots, 2^{m_s - 1} s \}

where m_s is the smallest positive integer such that 2^{m_s} s ≡
s (mod 2^m-1) [Ref. 11]. For example, the cyclotomic cosets
modulo 15 are

C0 = {0}
C1 = {1,2,4,8}
C3 = {3,6,12,9}
C5 = {5,10}
C7 = {7,14,13,11}
Other cyclotomic coset decompositions for various values of
m are listed in Table III.
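The coset decomposition is easy to reproduce by repeated multiplication by 2 modulo 2^m-1. The following sketch is an illustration only; the function name `cyclotomic_cosets` and the choice to key each coset by its smallest member are our own assumptions.

```python
def cyclotomic_cosets(m: int, p: int = 2) -> dict[int, list[int]]:
    """Cyclotomic cosets modulo p^m - 1, keyed by their smallest element."""
    n = p**m - 1
    seen: set[int] = set()
    cosets: dict[int, list[int]] = {}
    for s in range(n):
        if s in seen:
            continue
        coset, j = [], s
        while j not in coset:        # keep multiplying by p until the cycle closes
            coset.append(j)
            j = (j * p) % n
        cosets[s] = coset
        seen.update(coset)
    return cosets
```

Calling `cyclotomic_cosets(4)` gives the five cosets modulo 15, and `cyclotomic_cosets(5)` and `cyclotomic_cosets(6)` give the decompositions for GF(2^5) and GF(2^6).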
If we let M^(i)(x) represent the minimal polynomial of
β^i ∈ GF(p^m), it follows that if i is in the cyclotomic coset
C_s, then






OVER GF(2^3)
C0 = {0}
C1 = {1,2,4}
C3 = {3,6,5}

OVER GF(2^5)
C0 = {0}
C1 = {1,2,4,8,16}
C3 = {3,6,12,24,17}
C5 = {5,10,20,9,18}
C7 = {7,14,28,25,19}
C11 = {11,22,13,26,21}
C15 = {15,30,29,27,23}

OVER GF(2^6)
C0 = {0}
C1 = {1,2,4,8,16,32}
C3 = {3,6,12,24,48,33}
C5 = {5,10,20,40,17,34}
C7 = {7,14,28,56,49,35}
C9 = {9,18,36}
C11 = {11,22,44,25,50,37}
C13 = {13,26,52,41,19,38}
C15 = {15,30,60,57,51,39}
C21 = {21,42}
C23 = {23,46,29,58,53,43}
C27 = {27,54,45}
C31 = {31,62,61,59,55,47}
which is analogous to the generator polynomial g(x) in our
generic architecture of the previous chapter. Moreover, by
utilizing various techniques beyond the scope of this
thesis, we may determine all the minimal polynomials of
elements in GF(2^4), as depicted in Table IV. Using this
table we may construct all the Reed-Solomon codes of block
length 15 which correct t or fewer channel errors. These
codes have the following generator polynomials:
t=1  g(x) = M^(1)(x) = 1+x+x^4
t=2  g(x) = M^(1)(x)*M^(3)(x) = 1+x^4+x^6+x^7+x^8
t=3  g(x) = M^(1)(x)*M^(3)(x)*M^(5)(x) = 1+x+x^2+x^4+x^5+x^8+x^10
Hence, the t-error correcting RS code of block length n is
then the cyclic code whose generator polynomial is the




β^(2t) [Ref. 5]. Of noteworthy interest is the
fact that an RS code over GF(2^5) which is designed to
correct up to 4 errors is also able to correct 5 errors.
This is because M^(9)(x), the minimal polynomial of β^9, is
identical to M^(5)(x), the minimal polynomial of β^5. Similarly,
the 6 error-correcting RS code is identical to the 7
error-correcting code just as the 8-to-14 error correcting
codes of length 31 are all identical to the 15
error-correcting code. In a similar way, codes over GF(2^6) and
GF(2^7) are sometimes able to correct more errors than they
are designed to correct. The ability to correct these extra
TABLE IV
MINIMAL POLYNOMIALS OF ELEMENTS IN GF(2^4)
M^(1)(x) = M^(2)(x) = M^(4)(x) = M^(8)(x) = 1+x+x^4
M^(3)(x) = M^(6)(x) = M^(12)(x) = M^(9)(x) = 1+x+x^2+x^3+x^4
M^(5)(x) = M^(10)(x) = 1+x+x^2
M^(7)(x) = M^(14)(x) = M^(13)(x) = M^(11)(x) = 1+x^3+x^4
error patterns depends upon finding higher powers of β which
belong to cyclotomic cosets for the smaller powers of β
which belong to the code for the designed error correcting
distance. The tables of cyclotomic cosets for GF(2^5) and
GF(2^6) show that β^9 belongs to the coset of β^5, β^17 belongs
to the coset of β^5 and β^19 belongs to the coset of β^13, etc.
See [Ref. 11] for further discussion
of the error correcting capabilities of given error
correcting codes.
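The entries of Table IV and the generator polynomials listed earlier in this section can be verified numerically. The sketch below rests on the standard fact that the minimal polynomial of β^s is the product of (x + β^j) over all j in the cyclotomic coset of s; that formula, the helper names, and the integer encoding of field elements (bit i = coefficient of β^i) are assumptions of this illustration rather than material taken from the references.

```python
P = 0b10011                      # 1 + x + x^4, bit i = coefficient of x^i


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^4) by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= P
    return r


def beta_pow(j: int) -> int:
    """Return beta^j, where beta is represented by the integer 2."""
    r = 1
    for _ in range(j % 15):
        r = gf_mul(r, 2)
    return r


def minimal_poly(coset: list[int]) -> list[int]:
    """Coefficients, lowest degree first, of the product of (x + beta^j)."""
    poly = [1]
    for j in coset:
        root = beta_pow(j)
        shifted = [0] + poly                              # poly * x
        scaled = [gf_mul(c, root) for c in poly] + [0]    # poly * beta^j
        poly = [s ^ t for s, t in zip(shifted, scaled)]
    return poly


def gf2_poly_mul(a: list[int], b: list[int]) -> list[int]:
    """Product of two polynomials with coefficients in GF(2)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] ^= ai & bk
    return out
```

For example, `minimal_poly([3, 6, 12, 9])` returns the coefficient list of 1+x+x^2+x^3+x^4, and multiplying it by the list for 1+x+x^4 with `gf2_poly_mul` reproduces the t=2 generator polynomial.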
E. SYSTOLIC ARRAY MULTIPLIER
As mentioned earlier in this chapter, the systolic array
multiplier used in the generation of Reed-Solomon codewords
is much more complex than in the binary case. In this
section, we discuss the design of a systolic array multiplier
developed by Yeh, Reed, and Truong [Ref. 7] to assist
us in our implementation of an RS encoder.
According to [Ref. 7] several circuits have been proposed
to realize multiplication in GF(2^m). Unfortunately,
these circuits are not suited for use in VLSI systems, due
to irregular wire routing, complicated control problems,
nonmodular structure and lack of concurrency. The systolic
array multiplier of [Ref. 7] performs multiplication in
the field GF(2^m) and overcomes some of these unwanted
attributes.
The systolic architecture is developed for performing
the product-sum computation, AB+C, in the finite field
GF(2^m) of 2^m elements, where A, B, and C are arbitrary
elements of the field. The multiplier is a serial-in,
serial-out, one-dimensional systolic array which requires m
basic cells. To perform an isolated computation the multiplier
requires 3m time units; however, the average time per
computation is only m time units if a number of computations
are carried out consecutively. Because the architecture is
simple and regular and possesses the desirable properties of
concurrency and modularity, it is well suited for VLSI
implementation. [Ref. 7]
Consider the nonzero elements of GF(2^m). They can be
represented as the powers of β, a primitive element of
the field as discussed in Chapter III. Since F(β)=0,
β^m = f_{m-1}β^{m-1}+...+f_1β+f_0, where the coefficients f_i are
determined by the polynomial f(x) which β satisfies.
Therefore an element of GF(2^m) is of the form
a_{m-1}β^{m-1}+...+a_1β+a_0 where a_i ∈ GF(2) for 0 ≤ i ≤ m-1. In
the following discussion, the polynomial representation is
used to represent the finite field GF(2^m).
Let A = a_{m-1}β^{m-1}+...+a_1β+a_0 and B = b_{m-1}β^{m-1}+...+b_1β+b_0 be
two elements in GF(2^m). Then A+B = s_{m-1}β^{m-1}+...+s_1β+s_0,
where s_i = a_i+b_i (mod 2) for 0 ≤ i ≤ m-1. Therefore addition
in GF(2^m) reduces to modulo 2 addition of corresponding coefficients.
Suppose P = p_{m-1}β^{m-1}+...+p_1β+p_0 is the product of A and B,
i.e., P=AB. Then P can be written as follows:
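
    P = AB = \sum_{k=0}^{m-1} b_k (A\beta^k)                                        (1)

where Aβ^k = a_{m-1}^{(k)}β^{m-1}+...+a_1^{(k)}β+a_0^{(k)} for 0 ≤ k ≤ m-1. From equation
(1), P is the modulo 2 sum of those terms Aβ^k for which b_k=1.
The computation of Aβ^k can be performed recursively on k
for 0 ≤ k ≤ m-1. Initially for k=0, Aβ^0=A, i.e., a_n^{(0)}=a_n
for 0 ≤ n ≤ m-1. For 1 ≤ k ≤ m-1,

    A\beta^k = (A\beta^{k-1})\beta
             = a_{m-1}^{(k-1)} \beta^m + \sum_{n=1}^{m-1} a_{n-1}^{(k-1)} \beta^n    (2)

Substituting β^m = f_{m-1}β^{m-1}+...+f_1β+f_0 into equation (2),
yields

    A\beta^k = \sum_{n=1}^{m-1} \left( a_{n-1}^{(k-1)} + a_{m-1}^{(k-1)} f_n \right) \beta^n + a_{m-1}^{(k-1)} f_0

so that a_n^{(k)} = a_{n-1}^{(k-1)} + a_{m-1}^{(k-1)} f_n for 1 ≤ n ≤ m-1
and a_0^{(k)} = a_{m-1}^{(k-1)} f_0.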





Table V indicates the step-by-step procedure for computing
P=AB+C in GF(2^4). In Table V a_n^(k), b_n, c_n, f_n, and
p_n are the n-th bits of Aβ^k, B, C, F, and P, respectively,
where F is the primitive polynomial and p_n^(i) is the partial
sum of p_n.
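The recursion behind Table V can be exercised in software before it is committed to hardware. The sketch below is a bit-level model of the arithmetic only (it does not model the cell timing or the control signals); the function name and the convention of passing bit lists lowest degree first are our own assumptions.

```python
def product_sum(a: list[int], b: list[int], c: list[int], f: list[int]) -> list[int]:
    """Return the bits of P = A*B + C in GF(2^m).

    All arguments are bit lists of length m, lowest degree first.
    f holds f_0..f_(m-1), where beta^m = f_(m-1) beta^(m-1) + ... + f_0.
    """
    m = len(a)
    p = list(c)                  # p^(0) = C
    ak = list(a)                 # bits of A * beta^0
    for k in range(m):
        # p^(k+1) = p^(k) + b_k * (A beta^k)
        p = [pn ^ (an & b[k]) for pn, an in zip(p, ak)]
        # A beta^(k+1): a_0 <- a_(m-1) f_0, a_n <- a_(n-1) + a_(m-1) f_n
        top = ak[-1]
        ak = [top & f[0]] + [ak[n - 1] ^ (top & f[n]) for n in range(1, m)]
    return p
```

For GF(2^4) with β^4 = 1+β we have f = [1, 1, 0, 0]. Taking A = β^4 = [1, 1, 0, 0], B = β^5 = [0, 1, 1, 0] and C = 0 gives [0, 1, 0, 1], which is β^9 = β+β^3 as expected.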
Figure 5.3 depicts the systolic multiplier for our given
finite field. The primitive polynomial is F=f_3β^3+f_2β^2+f_1β+
f_0. Input d_n receives the bit b_n of B. The n-th bits c_n,
a_n and f_n, of C, A, and F, respectively, are received
serially at inputs e_0, g_0, and h_0. Two control signals,
START (0001) and END (0111) are used in the design with
inputs r_0 and t_0 receiving the signals, respectively.
Output e_4 serially transmits the n-th bit, p_n, of the
result P out of the system. The order of the inputs and the
outputs is also shown in Figure 5.3. The flip-flops (FF)
associated with inputs t_0 and h_0 are used for the purpose of
synchronization.
The circuit of cell L_i is shown in Figure 5.4. The
operation of the flip-flops in the system is synchronized
implicitly by a clock signal. When r_i*=1, u_i=g_i* at the
next time unit (through switch SW). Additionally, when
TABLE V

Step   Partial sums                        Bits of Aβ^k
1      p3(0) = c3                          a3(0) = a3
2      p3(1) = p3(0)+a3(0)b0               a2(0) = a2
       p2(0) = c2
3      p2(1) = p2(0)+a2(0)b0               a1(0) = a1
       p1(0) = c1                          a3(1) = a2(0)+a3(0)f3
4      p3(2) = p3(1)+a3(1)b1               a0(0) = a0
       p1(1) = p1(0)+a1(0)b0               a2(1) = a1(0)+a3(0)f2
       p0(0) = c0
5      p2(2) = p2(1)+a2(1)b1               a1(1) = a0(0)+a3(0)f1
       p0(1) = p0(0)+a0(0)b0               a3(2) = a2(1)+a3(1)f3
6      p3(3) = p3(2)+a3(2)b2               a0(1) = a3(0)f0
       p1(2) = p1(1)+a1(1)b1               a2(2) = a1(1)+a3(1)f2
7      p2(3) = p2(2)+a2(2)b2               a1(2) = a0(1)+a3(1)f1
       p0(2) = p0(1)+a0(1)b1               a3(3) = a2(2)+a3(2)f3
8      p3 = p3(4) = p3(3)+a3(3)b3          a0(2) = a3(1)f0
       p1(3) = p1(2)+a1(2)b2               a2(3) = a1(2)+a3(2)f2
9      p2 = p2(4) = p2(3)+a2(3)b3          a1(3) = a0(2)+a3(2)f1
       p0(3) = p0(2)+a0(2)b2
10     p1 = p1(4) = p1(3)+a1(3)b3          a0(3) = a3(2)f0

































































































r_i*=0, u_i retains its value. Two principal operations of
the system are the following:
e_{i+1} ← (g_i* d_i) ⊕ e_i*
g_{i+1}* ← (u_i h_i*) ⊕ (g_i* t_i*)
where 0 ≤ i ≤ 3, ⊕ denotes the Exclusive-OR operation, i.e.,
modulo-2 addition, and the backwards arrow denotes the
substitution operation.
A comparison of the procedure in Table V and the
structure in Figures 5.3 and 5.4 yields the following facts:
The signal u_i in L_i is equal to a_3^(i) in Aβ^i. The signal
g_i is equal to a_n^(i) in Aβ^i for some n. The signal e_i* is
equal to the partial sum of AB+C.
The multiplier in Figure 5.3 can be generalized to the
finite field GF(2^m) by simply concatenating m identical
cells. Furthermore, additional registers and control signals
would be required if the b_i's are fed serially into the




In this section we discuss the encoding process for a
binary code and utilize a primitive shift register design to
implement both a single-error-correcting binary encoder and
a double-error-correcting binary encoder.
1. Encoding Process
As discussed in Chapter V, an (n,k) code can be
generated with a polynomial of degree n-k. If the polynomial
is primitive of degree r and n=2^r-1, the code can be
encoded and decoded with primitive shift registers. Hence,
we restrict our attention solely to the case of primitive
polynomials.
We illustrate this procedure by generating the
(15,11) binary code using the primitive polynomial p(x)=
1+x+x^4. Here n-k=4, r=4, n=2^4-1=15, and k=11. The encoding
process for the 11-bit message 10101010101 proceeds as in
the example below.
Example of Encoding Process:
Message = 10101010101
1) Represent the message as a polynomial:
       m(x) = 1+x^2+x^4+x^6+x^8+x^10
2) Multiply m(x) by x^(n-k) to shift the message digits to the far right:
       x^4 m(x) = x^4+x^6+x^8+x^10+x^12+x^14
3) Calculate the remainder when x^(n-k) m(x) is divided by p(x):
       r(x) = 1+x+x^3
4) Form the code polynomial as the sum x^(n-k) m(x)+r(x), a multiple of p(x):
       c(x) = 1+x+x^3+x^4+x^6+x^8+x^10+x^12+x^14
Code Word = 110110101010101
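The four steps of the example can be reproduced with ordinary integer bit operations. The sketch below is an illustration only; the function names are hypothetical, digit strings are read lowest degree first as in the example, and a polynomial is stored as an integer whose bit i is the coefficient of x^i.

```python
def poly_from_digits(digits: str) -> int:
    """Pack a digit string (lowest degree first) into an integer."""
    return sum(int(d) << i for i, d in enumerate(digits))


def poly_mod(dividend: int, divisor: int) -> int:
    """Remainder of polynomial division over GF(2)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # cancel the leading term of the dividend
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend


def encode(message: str, p: int = 0b10011, r: int = 4) -> str:
    """Systematic codeword: r check digits followed by the message digits."""
    shifted = poly_from_digits(message) << r        # x^(n-k) m(x)
    codeword = shifted ^ poly_mod(shifted, p)       # add the remainder r(x)
    n = len(message) + r
    return "".join(str((codeword >> i) & 1) for i in range(n))
```

With the default p(x)=1+x+x^4, `encode("10101010101")` returns the code word of the example, and dividing that code word by p(x) leaves a zero remainder.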
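Note that codewords in this code are formed as multiples of
the primitive generating polynomial p(x). As p(x) is of
degree r there are n-r=k information symbols which can be
chosen freely and then r check symbols are chosen so that
the resulting codeword satisfies this criterion, namely that
the codewords are multiples of the generator polynomial. In
other words, the check digits are the coefficients of the
remainder r(x) upon division of x^(n-k) m(x) by p(x) as shown
below.

                     x^10 + x^8 + x^7 + x^5 + x^4 + x^3 + 1
              ______________________________________________
x^4 + x + 1 | x^14 + x^12 + x^10 + x^8 + x^6 + x^4
              x^14 + x^11 + x^10
              ------------------
              x^12 + x^11 + x^8 + x^6 + x^4
              x^12 + x^9 + x^8
              ----------------
              x^11 + x^9 + x^6 + x^4
              x^11 + x^8 + x^7
              ----------------
              x^9 + x^8 + x^7 + x^6 + x^4
              x^9 + x^6 + x^5
              ---------------
              x^8 + x^7 + x^5 + x^4
              x^8 + x^5 + x^4
              ---------------
              x^7
              x^7 + x^4 + x^3
              ---------------
              x^4 + x^3
              x^4 + x + 1
              -----------
              x^3 + x + 1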

































2. Single-Error-Correcting Binary Encoder
By utilizing the previously discussed concepts, we
may now describe the encoding process of the binary (15,11)
code as implemented in a primitive shift register shown in
Figure 6.1. By simply feeding in the message m(x) at the
x^4 stage we are able to simulate the effect of multiplying
m(x) by x^4. The switch remains in position 1 as m(x) is fed
completely into the shift register. The shift register
computes the remainder when x^4 m(x) is divided by p(x) as the
shift register is in essence a division circuit. The
register contents after the information bits have all been
fed into the register are the remainder after division of the
shifted information polynomial x^4 m(x) by the generator
polynomial p(x). In
the example the remainder is 1101=1+x+x^3. The switch is
then changed to position 2 to allow the check digits to
follow the message digits producing the coded output
110110101010101 for the example given. [Ref. 10]
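The division performed by the register can be imitated one clock at a time. The sketch below is a software model under stated assumptions, not a description of Figure 6.1: we assume the message digits enter highest degree first, the function names are hypothetical, and the state is an integer whose bit i is the coefficient of x^i.

```python
def lfsr_remainder(message: str, taps: int = 0b0011, r: int = 4) -> int:
    """Remainder of x^r m(x) / p(x), computed one input digit per shift.

    message is written lowest degree first; taps encodes x^r = 1 + x.
    """
    state = 0
    for bit in reversed(message):                 # highest degree enters first
        feedback = int(bit) ^ (state >> (r - 1))  # input added at the x^r stage
        state = (state << 1) & ((1 << r) - 1)     # shift the register
        if feedback:
            state ^= taps                         # feed back into x^0 and x^1
    return state


def lfsr_encode(message: str, r: int = 4) -> str:
    """Check digits (lowest degree first) followed by the message digits."""
    check = lfsr_remainder(message, r=r)
    check_digits = "".join(str((check >> i) & 1) for i in range(r))
    return check_digits + message
```

For the message 10101010101 the register ends in the state 1+x+x^3, i.e. the digits 1101, and `lfsr_encode` returns 110110101010101.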
3. Double-Error-Correcting Binary Encoder
To design a binary encoder for a code which corrects
up to two errors, additional redundancy must be
added. Since we are now concerned with correction of up to
two errors the generator polynomial is the product of the
two distinct minimal polynomials M^(1)(x) and M^(3)(x) as
described in the previous chapter. Their product is the
polynomial 1+x^4+x^6+x^7+x^8. The implementation of the encoder

































































Figure 6.1 A Single-Error-Correcting Binary Encoder
single-error counterpart. The encoder is presented in
Figure 6.2. Now n=15 and k=15-8=7 so that there are a
smaller number of codewords (2^7) in this more powerful code.
As the error correcting capability of the code increases,
the number of information bits correspondingly decreases.
B. REED-SOLOMON ENCODER
In this section we draw upon the work of Liu [Ref. 8]
and our acquired knowledge of finite field theory and Reed-
Solomon codes to produce an RS encoder.
As discussed in Chapter IV, an RS codeword has (2^J-1)
symbols each of which is J bits wide. Of the (2^J-1) symbols
there are (2^J-1-2E) information symbols and 2E parity-check
symbols, where E is the number of symbol errors the RS code
is able to correct. If we treat the (2^J-1-2E) information
symbols as the coefficients of the polynomial

    f(x) = \sum_{i=1}^{2^J-1-2E} f_i x^{2^J-1-i}
         = f_1 x^{2^J-2} + f_2 x^{2^J-3} + \cdots + f_{2^J-1-2E} x^{2E}

then the 2E parity-check symbols can be obtained as the
coefficients of the remainder of f(x)/g(x) where g(x) is the
generator polynomial of the code. Usually, g(x) is defined
as
2E 2E












































and the g_i's are the coefficients of g(x) with g_{2E}=1.
A diagram of the RS encoder which generates the
remainder of f(x)/g(x) is given in Figure 6.3. It is
composed of 2E systolic array multipliers, 2E "exclusive-or"
adders, and 2E shift registers. The coefficients of the
generator polynomial g(x) are fed into their respective
systolic multipliers where the finite field multiplication
A*B occurs, as discussed in Chapter V. Upon completion the
partial product is "exclusive-or'ed" with the contents C
of the previous shift register and distributed down the line
to the next shift register in a pipeline fashion. The
switches are normally in the "ON" position until the last
information symbol goes into the encoder. At this moment
all the switches are turned to the "OFF" position and the
encoder behaves like a long shift register. The output of
the encoder is then taken from the output of the last shift
register. [Ref. 8]
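The arithmetic carried out by the encoder (the remainder of f(x)/g(x) over GF(2^J)) can be sketched in software. The example below is an illustration under our own assumptions and does not model the systolic timing: we take J=4 with β^4=1+β, E=2, and assume the generator polynomial has the roots β^1,...,β^2E; the helper names and the ordering of coefficient lists (lowest degree first) are hypothetical.

```python
P = 0b10011                      # GF(2^4) built from 1 + x + x^4


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^4)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= P
    return r


def beta_pow(j: int) -> int:
    r = 1
    for _ in range(j % 15):
        r = gf_mul(r, 2)         # 2 is the integer form of beta
    return r


def rs_generator(E: int) -> list[int]:
    """g(x) with assumed roots beta^1..beta^(2E); monic, lowest degree first."""
    g = [1]
    for i in range(1, 2 * E + 1):
        root = beta_pow(i)
        g = [s ^ t for s, t in zip([0] + g, [gf_mul(c, root) for c in g] + [0])]
    return g


def poly_rem(f: list[int], g: list[int]) -> list[int]:
    """Remainder of f(x)/g(x) for monic g, coefficients in GF(2^4)."""
    f = list(f)
    for i in range(len(f) - 1, len(g) - 2, -1):
        q = f[i]
        if q:
            for k, gk in enumerate(g):
                f[i - len(g) + 1 + k] ^= gf_mul(q, gk)
    return f[:len(g) - 1]


def rs_encode(info: list[int], E: int) -> list[int]:
    """2E parity symbols followed by the information symbols."""
    shifted = [0] * (2 * E) + list(info)      # information sits above x^(2E)
    return poly_rem(shifted, rs_generator(E)) + list(info)


def poly_eval(poly: list[int], x: int) -> int:
    acc = 0
    for c in reversed(poly):                  # Horner's rule
        acc = gf_mul(acc, x) ^ c
    return acc
```

With eleven information symbols and E=2, `rs_encode` returns a 15-symbol word that evaluates to zero at each assumed root, which is the property the parity-check symbols are chosen to enforce.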
C. BINARY DECODER
In this section we discuss the decoding process and
design a single-error-correcting binary decoder and a
double-error-correcting binary decoder both of which can be

























































The decoding process is, in general, much more
complicated than the encoding process. Not only must we
deal with the detection of errors but also with their
correction. As a result, we must be able to design a
decoder which simultaneously detects and corrects errors.
Error detection is usually much easier than error
correction. Recall that a code polynomial is a multiple of
the generating polynomial p(x). In other words, the
received polynomial u(x) will be a code polynomial if and
only if the remainder upon division of u(x) by p(x) is zero,
i.e., u(x) ≡ 0 modulo p(x). An example is given in
Table VI. The register contents after u(x) is fed com-
pletely into the detecting division register will contain
u(x) modulo p(x). If any of the register contents are
nonzero, u(x) is not a valid codeword. Thus the shift
register acts as an error detector by performing a division
of u(x) by p(x). In fact, the nonzero contents not only
indicate that an error has occurred in transmission, but
those contents also indicate the error pattern needed to
correct the error and the location of the error in the
transmitted codeword. [Ref. 10]
2. Single-Error-Correcting Binary Decoder
Because of the complexity of the decoding process,
we will initially design an error detection register
followed by its error correction counterpart and then
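TABLE VI
VERIFICATION OF THE CODE POLYNOMIAL

                     x^10 + x^8 + x^7 + x^5 + x^4 + x^3 + 1
              ___________________________________________________
x^4 + x + 1 | x^14 + x^12 + x^10 + x^8 + x^6 + x^4 + x^3 + x + 1
              x^14 + x^11 + x^10
              ------------------
              x^12 + x^11 + x^8 + x^6 + x^4 + x^3 + x + 1
              x^12 + x^9 + x^8
              ----------------
              x^11 + x^9 + x^6 + x^4 + x^3 + x + 1
              x^11 + x^8 + x^7
              ----------------
              x^9 + x^8 + x^7 + x^6 + x^4 + x^3 + x + 1
              x^9 + x^6 + x^5
              ---------------
              x^8 + x^7 + x^5 + x^4 + x^3 + x + 1
              x^8 + x^5 + x^4
              ---------------
              x^7 + x^3 + x + 1
              x^7 + x^4 + x^3
              ---------------
              x^4 + x + 1
              x^4 + x + 1
              -----------
              0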





synthesize them together to implement the complete decoder.
To begin, we utilize the error detection register of Figure
6.4. It is identical to the encoding register of Figure 6.1
except that the received codeword is input to the decoder at
the left end of the register. If the received word is
110111101010101, then the nonzero contents 0110 after
division indicate that an error has occurred in transmission.
In order to correct the received word we need to
know the error position.
The received word can be viewed as a polynomial u(x)
which can be written as the sum of the code polynomial c(x)
and an error polynomial e(x), namely u(x) = c(x) + e(x).
The error polynomial e(x) has ones in its error positions
and zeros elsewhere, and addition is term by term modulo 2.
Since the codewords c(x) are generated as multiples
of the generator polynomial g(x) and since β is a root of
g(x), the code polynomials evaluated at β are equal to zero,
namely c(β) = 0. Thus u(β) = c(β) + e(β) = e(β). Since we
assume in this sub-section that only single errors have
occurred in transmission we can also assume that if an error
occurs then e(x) is a power of x, say e(x) = x^i for some i.
Thus u(β) = e(β) = β^i.
In order to correct the error we need to compute
u(β) which is called the syndrome of the received word and
then find the specific value i for which u(β) = β^i. The




































Figure 6.4 The Error Detection Register
set c(x) = u(x) + x^i to obtain the code polynomial c(x), a
multiple of p(x), which is "nearest" to the received polynomial
u(x). The primitive shift register facilitates this
task because while it is computing u(x) modulo p(x) it also
leaves the coefficients of u(β) = β^i in the shift register.
[Ref. 10]
For example, in Figure 6.5 the primitive shift register
computes u(x)=1+x+x^3+x^4+x^5+x^6+x^8+x^10+x^12+x^14 modulo
p(x)=1+x+x^4 and the syndrome is 0110 = x+x^2. Note from
Table I of Chapter III that 0110 is the 4-digit representation
of β^5. Hence the error in the received polynomial
occurs in the position of x^5. Therefore, the code polynomial
is c(x) = u(x)+e(x) = (1+x+x^3+x^4+x^5+x^6+x^8+x^10+x^12+x^14)
+x^5 = 1+x+x^3+x^4+x^6+x^8+x^10+x^12+x^14. The corrected codeword
is 110110101010101 and the corrected information symbols are
10101010101. The same procedure is also illustrated in
Table VII by the actual long division process.
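The syndrome calculation and the search for the exponent i can be sketched in a few lines. The example below is only an illustration (the function names are hypothetical, words are digit strings written lowest degree first, and at most one error is assumed to have occurred).

```python
P = 0b10011  # p(x) = 1 + x + x^4


def syndrome(word: str) -> int:
    """u(x) modulo p(x); bit i of the result is the coefficient of x^i."""
    s = 0
    for bit in reversed(word):          # Horner's rule from the highest degree
        s = (s << 1) ^ int(bit)
        if s & 0b10000:
            s ^= P
    return s


def correct_single_error(word: str) -> str:
    """Complement the digit u_i for which beta^i equals the syndrome."""
    s = syndrome(word)
    if s == 0:
        return word                     # already a code word
    power, i = 1, 0
    while power != s:                   # step through beta^0, beta^1, ...
        power <<= 1
        if power & 0b10000:
            power ^= P
        i += 1
    flipped = "1" if word[i] == "0" else "0"
    return word[:i] + flipped + word[i + 1:]
```

For the received word 110111101010101 the syndrome is x+x^2, the search stops at i=5, and the corrected word is 110110101010101.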
We now examine the primitive shift register decoding
process which performs the error correction. After the
syndrome is computed by the primitive shift register
division process, an additional primitive shift register of
the same type can be used to correct the error without
reference to a table of powers of the primitive element β.
The correcting register shown in Figure 6.6 is basically the
same primitive shift register used throughout this chapter
































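TABLE VII
SYNDROME CALCULATION USING LONG DIVISION

                     x^10 + x^8 + x^7 + x^5 + x^4 + x^3 + x + 1
              _________________________________________________________
x^4 + x + 1 | x^14 + x^12 + x^10 + x^8 + x^6 + x^5 + x^4 + x^3 + x + 1
              x^14 + x^11 + x^10
              ------------------
              x^12 + x^11 + x^8 + x^6 + x^5 + x^4 + x^3 + x + 1
              x^12 + x^9 + x^8
              ----------------
              x^11 + x^9 + x^6 + x^5 + x^4 + x^3 + x + 1
              x^11 + x^8 + x^7
              ----------------
              x^9 + x^8 + x^7 + x^6 + x^5 + x^4 + x^3 + x + 1
              x^9 + x^6 + x^5
              ---------------
              x^8 + x^7 + x^4 + x^3 + x + 1
              x^8 + x^5 + x^4
              ---------------
              x^7 + x^5 + x^3 + x + 1
              x^7 + x^4 + x^3
              ---------------
              x^5 + x^4 + x + 1
              x^5 + x^2 + x
              -------------
              x^4 + x^2 + 1
              x^4 + x + 1
              -----------
              x^2 + x

SYNDROME: x+x^2 = 0110 = β^5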

























































Figure 6.6 The Initial Single-Error-Correcting Binary Decoder
each of the four registers. If the correcting register
is set initially at 0100, the 4-digit representation for
the element β, then, as it shifts, the output is the same
cycle as the 4-digit representation listing of the
β^i (i=1,2,3,...,15) in Table I since a shift in the primitive
shift register is the same as multiplication by β. No
matter which state the register is set to initially the
correcting register will output elements of that maximum
cycle in the same cycle order as long as the register
continues to shift. If the register is set at β^i, it will
be in state β^(i+j) after j shifts. [Ref. 10]
Figure 6.7 (the complete single-error-correcting
binary decoder) shows the received word of our example,
namely 110111101010101 whose polynomial form is 1+x+x^3+x^4+x^5
+x^6+x^8+x^10+x^12+x^14, in a storage register and the syndrome
0110 in the correcting register. From our previous discussion
we know that the error occurs in u_5. Thus, if the
detector register has output 1 as u_5 leaves the storage
register and 0 otherwise, the word 110111101010101 will be
corrected after fifteen shifts to read 110110101010101. We
illustrate how the correcting register is used to accomplish
this task by listing the new states of the correcting
register, and the outputs from the storage register and


























































































































































































































































Note from Table VIII that the incorrect digit u_5
leaves the storage register when the correcting register is
in state 1000. If the detector is made to produce an output
1 when it detects 1000 and 0 otherwise, then u_5 will be
properly corrected. In general, if the syndrome is β^i, then
the error occurs in the coefficient of x^i, namely u_i, where
the received polynomial has the form

    u(x) = \sum_{i=0}^{n-1} u_i x^i

If j is such that u_{n-j} = u_i, then u_i leaves the storage







Since β^(i+j)=β^n=1, the detector will correct the digit u_i and
the received word will be corrected to the nearest code word
after the decoder completes this process. [Ref. 10]
To recapitulate the correction process, the detect-
ing register computes the syndrome of the received word. As
each digit of u(x) enters the detecting register it simul-
taneously enters the storage register. When the syndrome is
determined, it is transferred to the correcting register for
the error-correcting procedure just described.
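The interplay of the storage register, the correcting register, and the detector can be traced with a small model. The sketch below is an illustration under our own conventions: the function names are hypothetical, the state is an integer whose bit i is the coefficient of x^i (so the detector pattern 1000 is the integer 1), and we assume that the digit u_(n-j) leaves the storage register on the j-th shift.

```python
P = 0b10011  # p(x) = 1 + x + x^4


def shift(state: int) -> int:
    """One shift of the correcting register, i.e. multiplication by beta."""
    state <<= 1
    if state & 0b10000:
        state ^= P
    return state


def correct_with_register(word: str, syndrome: int) -> str:
    """Run n shifts; complement the digit leaving storage when the detector fires."""
    n = len(word)
    out = list(word)
    state = syndrome                    # correcting register loaded with beta^i
    for j in range(1, n + 1):
        state = shift(state)            # now holds beta^(i+j)
        if state == 1:                  # detector sees 1000, the element 1
            i = n - j                   # u_(n-j) is leaving the storage register
            out[i] = "1" if out[i] == "0" else "0"
    return "".join(out)
```

Loading the syndrome 0110 (the integer 6) and the received word 110111101010101, the detector fires on the tenth shift, when u_5 leaves the storage register, and the output is 110110101010101. A zero syndrome never reaches the state 1000, so a valid code word passes through unchanged.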
3. Double-Error-Correcting Binary Decoder
To implement a double-error-correcting binary
decoder we begin with a general analysis of the three stages
that comprise decoding. The first stage is the Syndrome
Generator stage. The syndrome is defined as the nonzero
remainder of the received polynomial when it is divided by
the given primitive shift register. The second stage, the
Central Galois Field Processor, finds the error locator
polynomial σ(z) (usually accomplished by using Berlekamp's
iterative algorithm or Massey's linear feedback shift
register synthesis algorithm). At this stage the polynomial
is determined which defines the location of the errors that
have occurred in transmission. Finally, the third stage, the
Chien Searcher stage, finds the roots of σ(z) to determine
which digits should be corrected. Note, in the binary
code, correction is trivial when the location of the errors
is determined, i.e., the bit in error need only be
complemented. [Ref. 11]
Using our previous double-error-correcting generator
polynomial 1+x^4+x^6+x^7+x^8, which is the product
(1+x+x^4)(1+x+x^2+x^3+x^4), we are able to produce Stage I of
the decoding process as illustrated by the division process
in Figure 6.8. Similarly, we are also able to produce
Stages II and III (Figures 6.9 and 6.10, respectively) along
with a block diagram of the complete decoder in Figure
6.11.
The operation of the decoder is relatively straightforward,
as in the previous section. Utilizing a buffer
capable of storing 2n digits, the Chien Searcher is in the
process of computing σ(z) in order to determine whether or
not the next digit to leave the buffer should be corrected.
The Syndrome Generator at the same time computes the
syndrome of the received word while the Central Galois Field
Processor finds the error-locator polynomial for the
buffered word. Once the coefficients of the error-locator
polynomial are read out of the Central Galois Field
Processor and into the Chien Searcher, the syndrome or the
nonzero remainder of the next block of received words is
read back into the Central Galois Field Processor for continual
operation. See [Ref. 5] for further details of the
multiple error correction process.
If the Central Galois Field Processor operates so
fast that it is able to compute the error location before
all of the new received word arrives, then the buffer size
may be reduced. In general, the buffer is made big enough
to accommodate the expected worst case for the time to





















































































































> "V / S




































































compute the locations of the two errors. However, for
example, suppose that the Central Galois Field Processor is
able to compute the error locator in half the time required
for n digits to be received from the channel. In that case,
the buffer need only be capable of storing 3n/2 digits.
After a complete word is received, the central processor
computes its error location by the time the beginning of
this word is ready to leave the buffer. The error locator
is then fed into the Chien Searcher, and the central pro-
cessor sits idle until the rest of the incoming word is
received. See [Ref. 5] for details.
Although the above discussion pertains strictly to a
binary decoder capable of correcting two errors, it can be
generalized to correct t or fewer errors. By expanding the
hardware in Stages I and III to accommodate the additional
shift register size required by t distinct minimal poly-
nomials, we are able to implement the decoder with approxi-
mately the same effectiveness. Likewise, the same procedure
of utilizing the product of t distinct minimal polynomials
would also be used in the design of a multiple-
error-correction binary encoder.
D. REED-SOLOMON DECODER
As with any multiple-error detection and correction
process, the decoding of RS codes is very complex. As a
result, the known decoding procedures as discussed by Liu
[Ref. 12] will be presented in this section to obtain a
repetitive and recursive technique which is suitable for
systolic array development and eventual VLSI implementation.
Recall that the information symbols of an RS code are
treated as the coefficients of the polynomial f(x). If we
let

    f(x) = f_0 + f_1 x + \cdots + f_{N-1} x^{N-1}

be the transmitted code vector (where N = codeword length),
and let

    r(x) = r_0 + r_1 x + \cdots + r_{N-1} x^{N-1}

be the received code vector over a noisy channel, then the
error pattern added by the channel is

    e(x) = r(x) - f(x) = e_0 + e_1 x + \cdots + e_{N-1} x^{N-1}.

The first step of the decoding procedure is to store the
received code vector r_j into the buffer register and then
compute the syndrome S_i using the equation

    S_i = \sum_{j=0}^{N-1} r_j \beta^{(k+i)j}                                   (5)

where 0 ≤ i ≤ 2E-1. Since r_j = f_j+e_j, equation (5) can be
expressed as

    S_i = \sum_{j=0}^{N-1} (f_j + e_j) \beta^{(k+i)j}
        = \sum_{j=0}^{N-1} f_j \beta^{(k+i)j} + \sum_{j=0}^{N-1} e_j \beta^{(k+i)j}
        = F_{k+i} + E_{k+i}                                                    (6)

In the above equation

    F_{k+i} = \sum_{j=0}^{N-1} f_j \beta^{(k+i)j} = 0                           (7)

    E_{k+i} = \sum_{j=0}^{N-1} e_j \beta^{(k+i)j}                               (8)

Note that in equation (8) E_{k+i} is the finite field transform
of the e_j's.
The second step of the decoding procedure is to compute
σ_l for 1 ≤ l ≤ v (where v = number of errors) using the
equation

    S_i + \sum_{l=1}^{v} S_{i-l} \sigma_l = 0   for 0 ≤ i ≤ 2E-1
from the syndromes computed in the previous step. This
can be accomplished using Berlekamp's iterative algorithm or
Massey's linear feedback shift register (LFSR) synthesis
algorithm. [Ref. 12]
Upon obtaining the σ_l's, the third step of the decoding
procedure is to use the recursive equation

    E_{k+i} + \sum_{l=1}^{v} E_{k+i-l} \sigma_l = 0   for 2E ≤ i ≤ N-1

where

    E_{k+i} = E_{k+i-N}   for k+i ≥ N

to compute the remaining E_{k+i} for 2E ≤ i ≤ N-1.
After determining the transform of the error pattern
E_{k+i} for 0 ≤ k+i ≤ N-1, by equation (8), we can then apply
the inverse transform

    e_j = (N)^{-1} \sum_{k+i=0}^{N-1} E_{k+i} \beta^{-(k+i)j}

for j=0,1,2,...,N-1. Then the corrected codeword is
obtained by subtracting the error pattern e_j from the stored
code vector r_j in the buffer register.
In summary, the decoding of an RS code is composed of
the following five steps:
1) Compute the syndrome S_i using the equation

       S_i = \sum_{j=0}^{N-1} r_j \beta^{(k+i)j}   for 0 ≤ i ≤ 2E-1

2) Use Berlekamp's iterative algorithm or Massey's
LFSR synthesis algorithm to determine the coefficients
σ_l of the error locator polynomial from the known S_i =
E_{k+i} for i=0,1,2,...,2E-1.
3) Compute the remaining E_{k+i} using the recursive equation

       E_{k+i} + \sum_{l=1}^{v} E_{k+i-l} \sigma_l = 0

for 2E ≤ i ≤ N-1.
4) Compute the inverse transform

       e_j = (N)^{-1} \sum_{k+i=0}^{N-1} E_{k+i} \beta^{-(k+i)j}

to obtain the error pattern, where (N)^{-1} is the
inverse of N.
5) Subtract the error pattern e_j from the received code
vector r_j in the buffer memory to obtain the corrected
codeword.
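The five steps can be traced end to end on a small case. The sketch below is an illustration under several assumptions of our own: the field is GF(2^4) with β^4=1+β, so N=15; we take k=1 and E=2; and step 2 uses a direct solution of the two linear equations for σ_1 and σ_2, a simple stand-in that is valid for at most two errors and is not Berlekamp's or Massey's algorithm. All function and variable names are hypothetical. In a field of characteristic 2 the factor (N)^-1 equals 1 because N=15 is odd.

```python
N, K0, E = 15, 1, 2              # block length, assumed k, correctable errors
EXP = [1] * N                    # EXP[i] = beta^i as a 4-bit integer
for i in range(1, N):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {v: i for i, v in enumerate(EXP)}


def mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % N]


def div(a: int, b: int) -> int:
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % N]


def transform(vec: list[int], t: int) -> int:
    """Finite field transform: sum over j of vec[j] * beta^(t*j)."""
    acc = 0
    for j, c in enumerate(vec):
        acc ^= mul(c, EXP[(t * j) % N])
    return acc


def locator(S: list[int]) -> list[int]:
    """Step 2 stand-in: solve directly for sigma (at most two errors)."""
    det = mul(S[1], S[1]) ^ mul(S[0], S[2])
    if det:
        s1 = div(mul(S[1], S[2]) ^ mul(S[0], S[3]), det)
        s2 = div(mul(S[2], S[2]) ^ mul(S[1], S[3]), det)
        return [s1, s2]
    if S[0]:
        return [div(S[1], S[0])]
    return []


def decode(r: list[int]) -> list[int]:
    S = [transform(r, K0 + i) for i in range(2 * E)]        # step 1
    sigma = locator(S)                                      # step 2
    Etr = {(K0 + i) % N: S[i] for i in range(2 * E)}        # known E_(k+i)
    for i in range(2 * E, N):                               # step 3
        t = (K0 + i) % N
        acc = 0
        for l, s in enumerate(sigma, start=1):
            acc ^= mul(s, Etr[(t - l) % N])
        Etr[t] = acc
    e = [0] * N
    for j in range(N):                                      # step 4
        for t in range(N):
            e[j] ^= mul(Etr[t], EXP[(-t * j) % N])
    return [rj ^ ej for rj, ej in zip(r, e)]                # step 5
```

Because the syndromes depend only on the error pattern, the all-zero code word with one or two corrupted symbols is a convenient first check: `decode` should return fifteen zeros.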
Note that in steps 1, 4, and 5, the processing time
is proportional to N*J, while in steps 2 and 3 the
processing time is proportional to 2E*J and (N-2E)*J,
respectively. Hence, a natural partition for pipeline
processing is to divide the decoder system into three pipeline
stages. Stage 1 is used to perform step 1, stage 2 is
used to perform steps 2-4, and stage 3 is used to perform
step 5. To obtain a uniform throughput of one decoded
symbol per symbol clock cycle, each pipeline stage is
required to complete its computations in N symbol clock
cycles. As always, the throughput of the system is determined
by the slowest stage in the pipeline. [Ref. 12]
The RS decoder architecture using the above pipeline
decoding technique is shown in Figure 6.12. The timing
chart of the decoder is shown in Figure 6.13. In both
figures, note that the first 2E input symbols of the inverse
transform, which are $S_0, S_1, \ldots, S_{2E-1}$, can be
processed in parallel with the Berlekamp/Massey LFSR
synthesis algorithm. The remaining N-2E input symbols of the
inverse transform are obtained from the remaining transform.
Each of these N-2E input symbols is processed by the inverse
transform circuit immediately after its generation. In
stage 3, the buffer memory is read out symbol-by-symbol and
"Exclusive-OR'ed" with the output of the inverse transform.
A triple-buffered memory is required to store the three
active codewords in the pipeline. [Ref. 12]
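The need for three buffers follows from the pipeline occupancy, which a small sketch makes explicit. This is an illustrative model with hypothetical names, assuming one codeword enters the pipeline per block period of N symbol clock cycles.

```python
def pipeline_schedule(num_codewords, num_stages=3):
    """For each block period t, list the codeword index held by each stage
    (None if the stage is idle). Codeword c is in stage s during period c + s."""
    periods = []
    for t in range(num_codewords + num_stages - 1):
        periods.append([t - s if 0 <= t - s < num_codewords else None
                        for s in range(num_stages)])
    return periods

def buffer_slot(codeword_index, num_buffers=3):
    """Buffer assigned to a codeword when the buffers are reused in rotation."""
    return codeword_index % num_buffers
```

In steady state each period holds three consecutive codewords, one per stage, and rotating through three buffers guarantees that a received vector is not overwritten before stage 3 has read it out.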
[Figure 6.12. RS decoder architecture using the pipeline decoding technique]

[Figure 6.13. Timing chart of the RS decoder]


















In this thesis we have taken a modular approach to the
systolic implementation of a Reed-Solomon encoder and
decoder. By first discussing the theory behind systolic
arrays and finite fields, we have shown how they play an
integral part in the overall implementation. The binary
case is presented first because of its simple architecture
and ease of understanding. It is then followed by the
design of a systolic multiplier and an RS encoder and
decoder.
The multiplier requires m basic cells for the finite
field $GF(2^m)$. Because of its simple control methodology,
regular interconnection pattern, and modular structure, it is
highly suited for VLSI implementation. The encoder using
the systolic multiplier offers the advantages of lower power
consumption, minimal size, and high reliability. The decoder,
being modular in design, is also highly suited for a systolic
architecture; thus the decoding speed can easily be
increased by using a distributed processing scheme. In
this way, several decoders can operate in parallel,
while each individual decoder operates in a
pipelined fashion.
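The statement that the multiplier needs m basic cells can be related to a bit-serial product in $GF(2^m)$, in which each of the m bits of one operand contributes one identical step. The Python sketch below is an assumption-laden software analogue, not the cell design of the thesis: the function name `gf2m_multiply` and the most-significant-bit-first ordering are illustrative choices.

```python
def gf2m_multiply(a, b, m, prim_poly):
    """Multiply a and b in GF(2^m) using m identical steps, one per bit of b
    (most significant bit first). prim_poly includes the x^m term."""
    acc = 0
    for i in reversed(range(m)):
        acc <<= 1                  # multiply the partial product by alpha
        if acc >> m:               # reduce modulo the primitive polynomial
            acc ^= prim_poly
        if (b >> i) & 1:           # add a when the current bit of b is set
            acc ^= a
    return acc
```

For example, with m = 4 and the primitive polynomial x^4 + x + 1 (written 0b10011), `gf2m_multiply(8, 2, 4, 0b10011)` computes alpha^3 times alpha and returns 3, the binary form of alpha + 1. Each loop iteration performs the same shift, conditional reduction, and conditional add, which is the kind of regularity that makes a cell-per-bit layout possible.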
The design of both the RS encoder and decoder is simple
and regular. They can be constructed using a systolic array
of identical cells with every interconnection path occurring
between adjacent cells. This makes implementation in VLSI
extremely attractive since the layout of the cell need only
be done once and then replicated.
It is hoped that with this thesis as a guide, an
interested electrical engineering student could implement
the encoder or decoder in hardware. By building the
four-cell binary encoder first, the student would establish a
firm foundation vital to the development of the more
complicated RS encoder. This process could then be expanded
to produce an encoder of eight or sixteen cells, or the more
general case of $2^m$ cells.
LIST OF REFERENCES
1. Berlekamp, E. R., "Technology of Error-Correcting
Codes," Proc. IEEE, Vol. 68, May 1980, pp. 567-593.
2. Kung, H. T. and Leiserson, C. E., "Systolic Arrays
(for VLSI)," Sparse Matrix Proc. 1978, Society for
Industrial and Applied Mathematics, 1979, pp. 256-282.
3. Hwang, K. and Briggs, F. A., Computer Architecture and
Parallel Processing, McGraw-Hill, New York, 1984, pp.
768-770.
4. Kung, H. T., "Why Systolic Architectures?" Computer,
Vol. 15, pp. 37-46, January 1982.
5. Berlekamp, E. R., Algebraic Coding Theory, McGraw-Hill,
New York, 1968, pp. 87-88.
6. Peterson, W. W., Error-Correcting Codes, Cambridge, MA:
MIT Press, 1961, pp. 97-100.
7. Yeh, C. S., Reed, I. S., and Truong, T. K., "A Systolic
Multiplier for Finite Fields of GF(2^m)," IEEE Trans.
Comput., Vol. C-33, pp. 357-360, April 1984.
8. Liu, K. Y., "Architecture for VLSI Design of Reed-
Solomon Encoders," IEEE Trans. Comput., Vol. C-31,
pp. 170-175, February 1982.
9. Johl, J. T., "VLSI Design for Reed-Solomon Encoder,"
IEEE Proc. of the 1984 Custom Integrated Circuits
Conference, pp. 615-618, May 1984.
10. Fellin, J. A., "Primitive Shift Registers,"
Applications of Abstract Algebra and Finite Field
Theory to Computer Design and Data Communications
System, pp. 1-34, November 1981.
11. MacWilliams, F. J. and Sloane, N. J. A., The Theory of
Error-Correcting Codes, North-Holland, New York, 1977,
pp. 294-295.
12. Liu, K. Y., "Architecture for VLSI Design of Reed-
Solomon Decoders," IEEE Trans. Comput., Vol. C-33,
pp. 178-189, February 1984.
BIBLIOGRAPHY
Berlekamp, E. R., "Bit-Serial Reed-Solomon Encoders," IEEE
Trans. Inform. Theory, Vol. IT-28, pp. 869-874, November
1982.
Blahut, R. E., "Fast Decoding Algorithms for Reed-Solomon
Codes," Secure Digital Communications, G. Longo (ed.),
Springer-Verlag Wien-New York, November 1983, pp. 281-316.
Brent, R. P. and Kung, H. T., "Systolic VLSI Arrays for
Polynomial GCD Computation," CMU Technical Report, March
1982.
Bromley, K., Symanski, J. M., and Whitehouse, H. J.,
"Systolic Array Processor Developments," VLSI Systems and
Computations, H. T. Kung, R. F. Sproull, and G. L. Steele,
Jr., (eds.), Carnegie-Mellon University, Computer Science
Press, October 1981, pp. 273-284.
National Aeronautics and Space Administration Report
32-1275, Error Correction for Deep Space Network Teletype
Circuits, by H. M. Fredricksen, 1 June 1963.
Laws, B. A. and Rushforth, C. K., "A Cellular-Array
Multiplier for GF(2^m)," IEEE Trans. Comput., Vol. C-20,
pp. 1573-1578, December 1971.
Mandelbaum, D., "On Decoding of Reed-Solomon Codes," IEEE
Trans. Inform. Theory, Vol. IT-17, pp. 707-712, November
1971.
Michelson, A., "A Fast Transform in Some Galois Fields and
an Application to Decoding Reed-Solomon Codes," Proc. IEEE




1. Defense Technical Information Center 2
Cameron Station
Alexandria, Virginia 22304-6145
2. Library, Code 0142 2
Naval Postgraduate School
Monterey, California 93943-5100





4. Prof. Harold M. Fredricksen 3




5. LTCOL Alan A. Ross 1
Code 52Rs
Department of Computer Science
Naval Postgraduate School
Monterey, California 93943-5100
6. LT Stephen S. McKenzie 2
2301-5 th Avenue, #4LL












