A VLSI synthesis of a Reed-Solomon processor for digital communication systems by Chose, Philemon John




The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
A VLSI Synthesis of a Reed-Solomon Processor
for Digital Communication Systems
By
© Philemon John Chose, B.Eng.
A thesis submitted to the School of Graduate Studies
in partial fulfillment of the requirements for the degree of
Master of Engineering
Faculty of Engineering and Applied Science
Memorial University of Newfoundland
July, 1998
St. John's, Newfoundland, Canada
Abstract
The Reed-Solomon codes have been widely used in digital communication systems such as computer networks, satellites, VCRs, mobile communications and high-definition television (HDTV) in order to protect digital data against erasures, random errors and burst errors during transmission. Since the encoding and decoding algorithms for such codes are computationally intensive, special purpose hardware implementations are often required to meet the real-time requirements.
One motivation for this thesis is to investigate and introduce reconfigurable Galois field arithmetic structures which exploit the symmetric properties of available architectures. Another is to design and implement an RS encoder/decoder ASIC which can support a wide family of RS codes.
An m-programmable Galois field multiplier which uses the standard basis representation of the elements is first introduced. It is then demonstrated that the exponentiator can be used to implement a fast inverter which outperforms the available inverters in GF(2^m). Using these basic structures, an ASIC design and synthesis of a reconfigurable Reed-Solomon encoder/decoder processor which implements a large family of RS codes is proposed. The design is parameterized in terms of the block length n, Galois field symbol size m, and error correction capability t for the various RS codes. The design has been captured using the VHDL
hardware description language and mapped onto CMOS standard cells available in
the 0.8-μm BiCMOS design kits for Cadence and Synopsys tools. The experimental
chip contains 218,206 logic gates and supports values of the Galois field symbol size
m =3,4,5,6,7,8 and error correction capability t = 1,2,3, ... , 16. Thus, the block
length n is variable from 7 to 255. Error correction t and Galois field symbol size
m are pin-selectable.
Since low design complexity and high throughput are desired in the VLSI chip, the algebraic decoding technique has been investigated instead of time-domain or transform-domain decoding. The encoder uses a self-reciprocal generator polynomial which structures the codewords in a systematic form. At the beginning of the decoding process, received words are stored in the first-in-first-out (FIFO) buffer as they enter the syndrome module. The Berlekamp-Massey algorithm is used to determine both the error locator and error evaluator polynomials. The Chien search and Forney's algorithm operate sequentially to solve for the error locations and error values, respectively. The error values are exclusive-ORed with the buffered messages in order to correct the errors as the processed data leave the chip.
Acknowledgements
I would like to take this opportunity to thank my supervisors Dr. P. Gillard, Dr. R. Venkatesan and Dr. R. Donnelly, whose constant encouragement and suggestions kept me on the right track.
The enthusiastic support from Dr. J.J. Sharp, Associate Dean of Graduate Studies and Research, and Dr. R. Seshadri, Dean of Engineering and Applied Science, is greatly appreciated. I would also like to thank Mr. Michael Rendell of the Computer Science Department and Mr. Brent Veitch of the Canadian Microelectronics Corporation for their help with the Synopsys and Cadence suite of software tools.
The financial support of the Faculty of Engineering and Applied Science of
Memorial University of Newfoundland, and Natural Sciences and Engineering Re-
search Council of Canada is gratefully acknowledged.
I am especially indebted to my wife, Michele, for her love and emotional support
during my graduate studies.
Contents
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Symbols and Acronyms
1 Introduction
1.1 Statement of the Problem
1.2 VLSI Architectures for Implementing Galois Field Arithmetic
1.2.1 An Overview of Galois Field Arithmetic
1.2.2 Multipliers
1.2.3 Dividers and Inverters
1.2.4 Exponentiators
1.2.5 Summary
1.3 Scope of the Work
1.4 Organization of the Thesis
2 Theoretical Background on Reed-Solomon Codes
2.1 General RS Code Definition
2.2 Encoding
2.3 Algebraic Decoding
2.3.1 Berlekamp-Massey Algorithm
2.3.2 Euclid's Algorithm
2.3.3 Chien Search
2.3.4 The Forney Algorithm
2.4 Time-Domain Decoding
2.4.1 Error Locator and Evaluator Polynomials
2.4.2 Error Evaluation
2.5 Error Correction
2.6 Algebraic vs. Time-Domain Decoding Algorithms
2.7 RS Encoder/Decoder Architectures
2.8 Summary
3 Proposed VLSI Arithmetic Architectures
3.1 m-Programmable Galois Field Multiplier
3.2 m-Programmable Exponentiator/Inverter
3.3 Discussion and Summary
4 Synthesis of the Reed-Solomon Encoder/Decoder ASIC
4.1 Design Flow, Functional Verification and Test
4.2 Chip Architecture
4.3 Modules
4.3.1 Encoder
4.3.2 Syndrome
4.3.3 Berlekamp
4.3.4 Error Magnitude Evaluation
4.3.4.1 The Chien Search
4.3.4.2 The Forney Algorithm
4.3.5 Error Correction and Verification
4.3.6 First-In-First-Out Buffer
4.3.7 Finite State Machine
4.4 Testing and Results
4.5 Discussion and Summary
5 Conclusion and Future Work
5.1 Galois Field Arithmetic Architectures
5.2 VLSI Reed-Solomon Encoder/Decoder
5.3 Future Work
References
List of Figures
2.1 A Digital Communication System
2.2 Algebraic Decoder
2.3 Time-Domain Decoder
3.1 A Parallel-In-Parallel-Out Multiplier for GF(2^m)
3.2 Symbolic Architecture of the Programmable Multiplier
3.3 A General Exponentiation Architecture
3.4 Exponentiation/Inverse Architecture for GF(2^8)
3.5 Symbolic Architecture of the Programmable Exponentiator/Inverter
4.1 Design Flow
4.2 Symbolic diagram of the RS Encoder/Decoder
4.3 RS Encoder
4.4 Symbolic diagram of the RS encoder
4.5 Systolic Array to compute Syndrome Polynomial
4.6 Symbolic diagram of the Syndrome Module
4.7 Symbolic diagram of the Berlekamp Module
4.8 The Chien Search Hardware
4.9 Error Evaluation Polynomial Circuit
4.10 Derivative Circuit
4.11 Block Diagram of the Error Magnitude Evaluation
4.12 Symbolic diagram of the Error Magnitude Evaluation
4.13 Symbolic diagram of FIFO
4.14 RS Encoder/Decoder Finite State Machine
4.15 Symbolic diagram of the Finite State Machine
4.16 Gate Level Simulations of the RS Encoder
4.17 Gate Level Simulations of the Syndrome Module
4.18 Gate Level Simulations of the overall RS ASIC (Encoding)
4.19 Gate Level Simulations of the overall RS ASIC (Decoding)
4.20 Gate Level Simulations of the RS ASIC (Decoding Failure)
List of Tables
1.1 Elements generated by P(x) = 1 + x + x^3
3.1 Comparison of the programmable Unpipelined and Pipelined Multiplier
3.2 Features of the programmable Exponentiator/Inverter
3.3 Comparison of Exponentiator/Inverter with other Inverters for fixed
4.1 RS Encoder/Decoder Characteristics
4.2 RS Encoder/Decoder I/O Pins
4.3 Encoder I/O Pins
4.4 Syndrome I/O Pins
4.5 Berlekamp Module I/O Pins
4.6 I/O Pins of the Error Magnitude Evaluation
4.7 I/O Pins of the FIFO Module
4.8 I/O Pins of the FSM Module
4.9 RS Encoder/Decoder Data Rates
4.10 RS Encoder/Decoder Modules and Equivalent Gate Count
5.1 Projected RS Encoder/Decoder Data Rates
List of Symbols and Acronyms
ASIC Application specific integrated circuit
BCH Bose-Chaudhuri-Hocquenghem
BiCMOS Bipolar complementary metal oxide semiconductor
BJT Bipolar junction transistor
C(x) Codeword polynomial
CAD Computer aided design
CMOS Complementary metal oxide semiconductor
CMC Canadian Microelectronics Corporation
CODEC Coding and decoding
DDA Division-and-accumulation
E(x) Error polynomial
FEC Forward error correction
FIFO First-in first-out
FSM Finite state machine
GF Galois field
G(x) Generator polynomial
HDTV High-definition television
I/O Input output
IC Integrated circuit
k Number of message symbols
m Symbol size of a Galois field element
M(x) Message polynomial
MUX Multiplexer
n Code length of an RS code
NASA National Aeronautics and Space Administration
NMOS n-type metal oxide semiconductor
P(x) Primitive polynomial
ROM Read-only memory
RS Reed-Solomon
RTL Register transfer level
S(x) Syndrome polynomial
t Error correcting capability of an RS code
VHDL Very high speed integrated circuit hardware description language
VLSI Very large scale integration
Λ(x) Error locator polynomial
Λ'(x) First derivative of the error locator polynomial
σ(x) Error locator polynomial
Ω(x) Error evaluator polynomial
Chapter 1
Introduction
1.1 Statement of the Problem
The Reed-Solomon (RS) codes are widely used in digital communication systems to increase the reliability and efficiency of the communication channel. They work in finite or Galois field arithmetic and have the ability to efficiently protect digital data against erasures, random errors and burst errors during transmission. The interest in RS codes was primarily theoretical until the concept of concatenated coding, which uses a convolutional code/RS code channel system, was formulated and first introduced in [1]. Their success is now reflected in modern-day digital audio or compact discs, computer networks, deep space telecommunication systems, spread-spectrum systems, computer memories, VCRs and high-definition television (HDTV) applications [2][3][4].
Since the encoding and decoding algorithms for such codes are computationally
intensive, special purpose hardware implementations are often required to meet the
real-time processing requirements. The choice of a specific RS code depends on the
characteristics of the communication channel. As such, RS encoders and decoders
are traditionally designed with fixed values of the error correction capability t, block
length n and symbol size m. The reason for choosing fixed design parameters is
that the exponentiator, multiplier, divider and inverter have different designs for
different values of m [2][5][6]. Such an approach is evidently inflexible and hence
inefficient because the system has to be redesigned if the channel characteristics
ultimately change. Moreover, the design complexity increases with the error correction capability of the code, thus making it impractical to implement the system using off-the-shelf discrete integrated circuit components. However, rapid advances in VLSI technologies may offer attractive solutions because of higher reliability, better performance, smaller area, lighter weight and lower power consumption [7].
Hence, there is a direct need for fast finite field arithmetic circuits which operate in GF(2^m), where m is variable. The lack of circuits able to operate with different symbol sizes of m bits has been a limiting factor in past efforts to implement universal and possibly reconfigurable RS hardware. If such arithmetic circuits could be developed, it would be possible to design highly efficient single-chip encoders/decoders whose total design cost is amortized over a wide application base. Hence, the motivations
for this thesis are:
(1) to investigate and introduce programmable Galois field arithmetic structures
which exploit the symmetric properties of available architectures. It appears that
very little work has been done in the literature to develop reconfigurable hardware
which can operate in finite fields.
(2) to design and implement an RS encoder/decoder ASIC which can support a
wide family of RS codes whose symbol size m and error correction capability t can
be varied directly in hardware. Such a general-purpose ASIC would be suitable for
a wide variety of digital communication systems which require different RS codes.
The choice of m and t depends on the application and is usually based on the overall correction performance and throughput of the code. The specific Galois field symbol size m = 8 has been standardized by the European Space Agency and the National Aeronautics and Space Administration for satellite communication [2]. The error correction circuits for advanced train control systems, mobile radio systems, magnetic recording systems, data communications and digital signal processing based modems use m = 5 [2][5].
1.2 VLSI Architectures for Implementing Galois
Field Arithmetic
The following subsections present an overview of Galois field arithmetic and a literature review of the various arithmetic operations.
1.2.1 An Overview of Galois Field Arithmetic
Recently, Galois fields or finite fields have received great attention because of their widespread applications in error control coding using linear block codes. They have also been extensively used in digital signal processing, pseudo-random number generation, and encryption and decryption protocols in cryptography. The design of efficient multiplier, inverter and exponentiation circuits for Galois field arithmetic is needed for these applications. These circuits should have low complexity, short computation delay and low latency when used in high-performance systems [2].
A finite field or Galois field, designated GF(p), is a finite set of elements with defined rules for arithmetic. These rules are not algebraically different from those used in arithmetic with ordinary numbers, except that only a finite set of elements is involved. All finite fields have the following properties:
1. Multiplication and addition are the two operations defined for combining the elements.
2. The result of adding or multiplying two elements is always an element contained in the field.
3. The field always contains the multiplicative identity element 1 and the additive identity element 0 such that a + 0 = a and a · 1 = a for any element a.
4. Every element a has an additive inverse element (-a) such that a + (-a) = 0, and every non-zero element a has a multiplicative inverse element a^{-1} such that a · a^{-1} = 1. The existence of these elements permits subtraction and division (by a non-zero element) to be performed.
5. The associative [a + (b + c) = (a + b) + c and a · (b · c) = (a · b) · c], commutative (a + b = b + a and a · b = b · a), and distributive [a · (b + c) = a · b + a · c] laws apply.
GF(p^m) is an extension field of the ground field GF(p), where m is a positive integer. For p = 2, GF(2^m) is an extension field of the ground field GF(2) of two elements {0, 1}. GF(2^m) is a vector space of dimension m over GF(2) and hence is represented using a basis of m linearly independent vectors. The finite field GF(2^m) contains 2^m elements, 2^m - 1 of which are non-zero. All finite fields contain a zero element and an element, called a generator or primitive element, such that every non-zero element in the field can be expressed as a power of this element.
In order to introduce the mathematical concepts of the trace and dual basis, the following definitions are necessary [8][9].
Definition 1: The trace of an element β which belongs to GF(2^m) is defined as
$$\mathrm{Tr}(\beta) = \sum_{i=0}^{m-1} \beta^{2^i} \tag{1.1}$$
Definition 2: A basis {μ_j} in GF(2^m) is a set of m linearly independent elements in GF(2^m), where 0 ≤ j ≤ m - 1.
Definition 3: Two bases {μ_j} and {λ_k} are the dual of one another if
$$\mathrm{Tr}(\mu_j \lambda_k) = \begin{cases} 1, & \text{if } j = k \\ 0, & \text{if } j \neq k \end{cases} \tag{1.2}$$
where 0 ≤ j ≤ m - 1 and 0 ≤ k ≤ m - 1.
The elements of GF(2^m) are usually expressed as powers of the primitive element α, where α is defined as a root of the primitive polynomial $P(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1 x + 1$, where $f_i \in \{0, 1\}$. Each element z of GF(2^m) can also be written in [8][9][10]
• the standard basis as $z = a_{m-1}\alpha^{m-1} + \cdots + a_2\alpha^2 + a_1\alpha + a_0$;
• the normal basis as $z = a_{m-1}\alpha^{2^{m-1}} + \cdots + a_2\alpha^{2^2} + a_1\alpha^{2^1} + a_0\alpha$;
• the dual basis {λ_k} as $z = \sum_{k=0}^{m-1} z_k \lambda_k = \sum_{k=0}^{m-1} \mathrm{Tr}(z\mu_k)\lambda_k$,
where $a_i \in GF(2)$ and $z_k = \mathrm{Tr}(z\mu_k)$ is the k-th coefficient of the dual basis.
The standard basis is commonly used in implementing algebraic Reed-Solomon
decoders in hardware. Since multiplication is the most dominant arithmetic opera·
tion, the standard. basis multiplier is often preferred for its lowest design complexity
compared to the nonna! and dual based multipliers [6][111. It does not require basis
conversion and thus can be more easily matched to any input/output system.
As an example, the power, polynomial and 3-tuple representations of the Galois
field elements generated by the primitive polynomial P(x) = 1 + x + x^3 are
tabulated in Table 1.1. The non-zero elements are generated using a 3-stage linear
feedback shift register initialized to 001, with taps defined by the coefficients of the
primitive polynomial P(x).
Power    Polynomial          3-Tuple
0        0                   000
α^0      α^0                 001
α^1      α^1                 010
α^2      α^2                 100
α^3      α^1 + α^0           011
α^4      α^2 + α^1           110
α^5      α^2 + α^1 + α^0     111
α^6      α^2 + α^0           101
α^7      α^0                 001
Table 1.1: Elements generated by P(x) = 1 + x + x^3
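Table 1.1 can be regenerated in software by modelling the shift register. The sketch below assumes a Galois (internal-XOR) register configuration, in which each clock multiplies the register contents by α and a 1 shifted out of the top stage is fed back through the taps f_{m-1}, ..., f_1, f_0 of P(x); the thesis does not specify the configuration, and the function name lfsr_elements is hypothetical.

```python
def lfsr_elements(m: int, taps: int) -> list[int]:
    """Return alpha^0 .. alpha^(2^m - 1) as m-bit integers (bit i = coefficient of alpha^i).

    `taps` holds the low-order coefficients f_{m-1} ... f_1 f_0 of
    P(x) = x^m + f_{m-1} x^{m-1} + ... + f_1 x + f_0 (the x^m term is implied).
    """
    mask = (1 << m) - 1
    state = 1                           # register initialized to 0...01 (alpha^0)
    elements = []
    for _ in range(2 ** m):             # 2^m - 1 distinct states, then the wrap-around
        elements.append(state)
        carry = (state >> (m - 1)) & 1  # bit leaving the x^(m-1) stage
        state = (state << 1) & mask     # multiply by alpha (shift up one stage)
        if carry:
            state ^= taps               # replace alpha^m using P(alpha) = 0
    return elements


# P(x) = 1 + x + x^3: low-order taps f_2 f_1 f_0 = 011
table = lfsr_elements(3, 0b011)
for k, e in enumerate(table):
    print(f"alpha^{k}: {e:03b}")
```

Because P(x) is primitive, the register visits all 2^m - 1 non-zero states before returning to 001, which is why the row for α^7 repeats α^0.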
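Examples (using Table 1.1; since α^7 = α^0 = 1, exponents are reduced modulo 7):
Addition: α^1 + α^2 = α^4 (010 + 100 = 110)
Multiplication: α^3 · α^6 = α^9 = α^2
Division: α^5 / α^3 = α^2
Exponentiation: (α^6)^3 = α^18 = α^4
Inversion: (α^2)^{-1} = α^{-2} = α^{7-2} = α^5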
Addition and subtraction in finite fields are relatively straightforward, but multiplication, division, exponentiation and inversion are not. Using a symbol size m, addition and subtraction are identical in GF(2^m) and can be realized using m two-input exclusive-OR gates. However, since the more complex operations are extensively used in RS encoding and decoding algorithms, the development of their hardware structures has received considerable attention.
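The worked examples for Table 1.1 follow from exponent arithmetic modulo 2^m - 1 = 7. A minimal Python sketch, assuming table-lookup (log/antilog) arithmetic over the field of Table 1.1 rather than any particular hardware structure, reproduces them; the table and function names are illustrative.

```python
ORDER = 7  # number of non-zero elements, 2^3 - 1; alpha^7 = alpha^0 = 1

# Antilog table from Table 1.1: EXP[k] is alpha^k as a 3-tuple (bit 2 = alpha^2).
EXP = [0b001, 0b010, 0b100, 0b011, 0b110, 0b111, 0b101]
LOG = {v: k for k, v in enumerate(EXP)}  # 3-tuple -> exponent k


def add(a: int, b: int) -> int:
    return a ^ b  # coefficient-wise addition over GF(2) is XOR (also subtraction)


def mul(a: int, b: int) -> int:
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % ORDER]


def div(a: int, b: int) -> int:
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % ORDER]


def power(a: int, n: int) -> int:
    """a^n for n >= 1."""
    return 0 if a == 0 else EXP[(LOG[a] * n) % ORDER]


def inv(a: int) -> int:
    return div(1, a)


alpha = EXP  # alpha[k] is the 3-tuple of alpha^k
print(add(alpha[1], alpha[2]) == alpha[4])  # True
print(mul(alpha[3], alpha[6]) == alpha[2])  # True
print(div(alpha[5], alpha[3]) == alpha[2])  # True
print(power(alpha[6], 3) == alpha[4])       # True
print(inv(alpha[2]) == alpha[5])            # True
```

Lookup tables of this kind need 2^m entries each, which is the ROM cost discussed in the following subsections.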
1.2.2 Multipliers
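For arbitrary elements $A(x) = \sum_{k=0}^{m-1} a_k x^k$ and $B(x) = \sum_{k=0}^{m-1} b_k x^k$ in GF(2^m), and the primitive polynomial $P(x) = \sum_{k=0}^{m} p_k x^k$, the product C(x) of A(x) and B(x) is given by
$$\begin{aligned} C(x) &= A(x)B(x) \bmod P(x) \\ &= \Big[\sum_{k=0}^{m-1} A(x)\, b_k x^k\Big] \bmod P(x) \\ &= \big(\cdots\big(A(x)b_{m-1}x + A(x)b_{m-2}\big)x + \cdots\big)x + A(x)b_0 \\ &= c_{m-1}x^{m-1} + c_{m-2}x^{m-2} + \cdots + c_1 x + c_0 \end{aligned} \tag{1.3}$$
where, in the nested form, the partial result is reduced modulo P(x) after each multiplication by x.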
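The nested form of Eq. (1.3) maps directly onto a most-significant-bit-first loop: at each step the partial product is multiplied by x, reduced modulo P(x), and A(x) is added if the current bit of B(x) is 1. The following Python sketch is a software model of that recurrence, not of any specific circuit reviewed below; the function name is hypothetical and the field polynomial is passed as an integer that includes the x^m term.

```python
def gf_mul_msb_first(a: int, b: int, m: int, poly: int) -> int:
    """Return A(x)B(x) mod P(x) for m-bit standard-basis operands.

    Evaluates Eq. (1.3) in nested (Horner) form, scanning b_{m-1} down to b_0.
    `poly` holds the coefficients of P(x), including the x^m term.
    """
    c = 0
    for i in reversed(range(m)):
        c <<= 1                 # multiply the partial product by x
        if c >> m:              # an x^m term appeared: reduce modulo P(x)
            c ^= poly
        if (b >> i) & 1:        # add A(x) * b_i
            c ^= a
    return c


# GF(2^3), P(x) = 1 + x + x^3: alpha^3 (011) times alpha^6 (101)
print(format(gf_mul_msb_first(0b011, 0b101, 3, 0b1011), "03b"))  # 100 = alpha^2
```

Each iteration needs only a shift, a conditional XOR with P(x) and a conditional XOR with A(x), which is why bit-serial hardware versions of this recurrence take m clock cycles per product.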
A direct implementation of multiplication by combinational logic was proposed by Bartee and Schneider [12]. A canonical basis is used to represent the elements of the field. Depending on the primitive element, this implementation requires as many as (m^3 - m) two-input adders over GF(2). This approach has a high circuit complexity and also lacks the regularity suited to full-custom VLSI design.
A cellular array multiplier was originally conceived by Laws and Rushforth in 1971 [13]. The array requires approximately 2m gate delays, a considerable
improvement over the traditional linear feedback register type multiplier which
computes the desired product sequentially in m clock cycles. A simple parity check
circuit is incorporated in the design.
In 1984, Yeh et al. [14] presented systolic multipliers for performing multiplication of arbitrary elements in GF(2^m) in O(m) time and area suitable for VLSI implementation. In the design, the elements in the field are represented in the conventional manner. The throughput rate is one product every m clock cycles for the serial-in serial-out one-dimensional systolic array and one product per clock cycle for the parallel-in parallel-out two-dimensional systolic array. Both designs have a latency of 2m clock cycles.
In 1985, Wang et al. [10] developed a pipeline structure to implement the multiplication algorithm proposed by Massey and Omura for Galois fields based on the normal basis representation. By taking advantage of the squaring property of the normal basis representation, the same pipeline structure is reconfigured to compute the inverse elements in GF(2^m). The throughput rate for the multiplier is one product per clock cycle after an initial delay of m clock cycles. Since the design is dependent on the primitive polynomial used to generate the field elements, the number of XOR gates in the product function increases enormously for large m. Hence, the pipeline structure is only practical for small m.
In 1986, Scott et al. [15] presented a bit-slice architecture of a serial-in serial-out multiplier well suited for VLSI implementation. The multiplier has a latency of m clock cycles and yields a computation time and implementation area of O(m).
It is shown that the architecture is attractive for use in data encryption systems
where data are segmented into long blocks to achieve high security and maximum
throughput.
A parallel-in parallel-out systolic array and a serial-in serial-out systolic array for fast multiplication in the finite fields GF(2^m) with the standard basis representation were presented by Wang and Lin in 1991 [11]. The architectures are regular, concurrent and have unidirectional data flow, which is highly desirable when designing high-speed VLSI systems. It is further shown that the proposed parallel implementation can more easily incorporate fault tolerance compared to previously published designs. The serial-in serial-out array only requires one control signal instead of two as in [14]. If the input data pass in continuously, the parallel-in parallel-out array yields output results at a rate of one output per clock cycle after a latency of 3m cycles. It is worth noting that the minimum clock period is governed by the propagation delay of an AND gate in series with an XOR gate. All the operations of each basic cell are pipelined in such a manner that each cell performs a small fraction of the multiplication and passes the data to the neighbouring cells for further processing. Under the same operating conditions, the serial-in serial-out array yields output results at a rate of one per m cycles after an initial delay of 3m cycles.
A bit-serial systolic divider circuit and multiplier over GF(2^m) were presented by Hasan and Bhargava in 1992 [16]. The design is based on the Gauss-Jordan elimination algorithm and completely eliminates global data communications and the dependency of the time step duration on m. The division algorithm requires the formulation of the supporting elements and the corresponding coefficient matrix by using a one-dimensional systolic array. The resulting system of 2m - 1 simultaneous linear equations in 2m - 1 unknowns is solved using a two-dimensional systolic array. With minor modifications, the same structure is used to perform multiplication over GF(2^m) in a computational time of 3m - 1 time steps. The proposed inverter/divider requires three processors and a control signal, and consists of 2.5m^2 + 11.5m - 6 registers, 4m^2 + 12m - 5 AND gates, 1.5m^2 + 7.5m - 2 OR gates, and 0.5m^2 + 1.5m - 1 XOR gates. The structure has a computational time of 5m - 1 time steps and is independent of the irreducible polynomial.
A division algorithm and a bit-serial multiplication algorithm were presented by Hasan and Bhargava in 1992 [17]. Using the coordinates of supporting elements, division over GF(q^m) is performed by solving a system of m linear equations over GF(q) when the field elements are represented by polynomials. It is further shown that division can be performed with a lower order of computational complexity by solving a Wiener-Hopf equation of degree m. The discrete-time Wiener-Hopf equation is defined as a system of m linear inhomogeneous equations with m unknowns [17].
Structures for parallel multipliers derived from irreducible all-one and equally spaced polynomials were developed by Hasan et al. in 1992 [18]. It is shown that the three basic modules of an all-one polynomial based parallel multiplier of a small field can be used to construct all the corresponding equally spaced polynomial based multipliers of larger fields. A normal basis parallel-type multiplier for finite fields GF(2^m) generated by irreducible all-one polynomials was recently presented by Hasan et al. in 1993 [19]. It is a modified version of the Massey-Omura multiplier.
A systolic power-sum circuit designed to implement the function AB^2 + C, where A, B and C are elements of the field, was presented by Wei in 1994 [20]. By adding one multiplexer and one demultiplexer, the power-sum circuit is configured to compute eight different types of computations, namely AB, AB + C, A^2, A^2 + C, AB^2, AB^2 + C, A^3 and A^3 + C. All these computations are needed in decoding multiple-error-correcting BCH and Reed-Solomon codes in cases where the coefficients of the error locator polynomial are solved algebraically.
A bit-serial multiplier which has the same hardware requirements as the traditional Berlekamp multiplier was recently presented by Fenn et al. in 1995 [21]. In the design, the variable multiplier is represented over the dual basis and the constant multiplicand is represented over the polynomial basis. The reverse is true for the traditional constant Berlekamp multiplier. It is shown that constant multipliers based on the proposed approach can operate at a higher frequency than those based on the traditional Berlekamp multiplier.
1.2.3 Dividers and Inverters
Finding the inverse of an element over GF(2^m) is computationally intensive in hardware and still remains an active area of research. Finite field inversion and division are critical in decoding Reed-Solomon and BCH codes. During the decoding process, the Berlekamp-Massey and Forney algorithms often employ these arithmetic operations. The derived algorithms for decoding double-error-correcting Reed-Solomon codes require the same functions as well. Thus, the latency and throughput of the inverters and dividers may dictate the overall speed of the decoder.
The traditional method for computing the inverse of elements in GF(2^m) uses read-only memory (ROM), Fermat's theorem or Euclid's algorithm. The size of the ROM is m·2^m bits. The coordinates of an element are used as the address of the location in the ROM where the corresponding inverse is stored. Since m can in principle take any value from 3 upwards and the ROM size grows exponentially with m, these methods are inefficient for VLSI implementation if large values of m are required. In recent years, several algorithms and their corresponding VLSI architectures for computing the inverse elements have been presented in the literature. For an arbitrary non-zero element A in the finite field GF(2^m), Fermat's theorem (A^{2^m - 1} = 1) gives the inverse as A^{-1} = A^{2^m - 2}. Rewriting the exponent 2^m - 2 as 2^1 + 2^2 + 2^3 + ... + 2^{m-1} allows the inverse operation to be expressed as [10]
$$A^{-1} = A^{2^m-2} = A^{2^1} \cdot A^{2^2} \cdot A^{2^3} \cdots A^{2^{m-1}} \tag{1.4}$$
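Eq. (1.4) replaces a general exponentiation by m - 1 squarings and m - 2 non-trivial multiplications (the first product is by 1). A software sketch of this product-of-squares inversion, assuming standard-basis operands and a shift-and-add multiplier (function names hypothetical, not the circuit of [10]):

```python
def gf_mul(a: int, b: int, m: int, poly: int) -> int:
    """Shift-and-add standard-basis multiplication modulo P(x) (poly includes x^m)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r


def gf_inv(a: int, m: int, poly: int) -> int:
    """A^{-1} = A^(2^1) * A^(2^2) * ... * A^(2^(m-1)) = A^(2^m - 2), Eq. (1.4)."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    result, sq = 1, a
    for _ in range(1, m):
        sq = gf_mul(sq, sq, m, poly)          # sq = A^(2^i)
        result = gf_mul(result, sq, m, poly)  # accumulate the product of squares
    return result


print(format(gf_inv(0b100, 3, 0b1011), "03b"))  # 111: (alpha^2)^-1 = alpha^5
```

In the normal basis each squaring is only a cyclic shift, which is what makes this formulation attractive for the architectures reviewed next.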
In 1985, Wang et al. [10] invented a parallel-in serial-out circuit for solving Equation (1.4) based on the Massey-Omura multiplier. In their design, the normal basis representation of the elements in the form (α, α^2, α^{2^2}, ..., α^{2^{m-1}}) is used. The method is impractical for large values of m since the number of XOR gates in the product function correspondingly becomes large. Since squaring is a cyclic shift operation in the normal basis, the inverse function is found in m clock cycles.
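The cyclic-shift property of squaring is easy to confirm in a small field. The sketch below assumes GF(2^3) with P(x) = 1 + x + x^3. For this polynomial the conjugates α, α^2, α^4 of α itself are linearly dependent (their sum is Tr(α) = 0), so the sketch searches for an element β whose conjugates β, β^2, β^4 do form a normal basis, converts elements to normal-basis coordinates by brute force (practical only for small m), and checks that squaring rotates those coordinates by one position. All names are illustrative.

```python
M, POLY = 3, 0b1011  # GF(2^3), P(x) = x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Standard-basis product a*b mod P(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def conjugates(beta: int) -> list[int]:
    """[beta, beta^2, beta^4, ..., beta^(2^(m-1))]."""
    out = [beta]
    for _ in range(M - 1):
        out.append(gf_mul(out[-1], out[-1]))
    return out


def independent(vecs: list[int]) -> bool:
    """True if the vectors are linearly independent over GF(2)."""
    span = {0}
    for v in vecs:
        span |= {s ^ v for s in span}
    return len(span) == 1 << len(vecs)


beta = next(b for b in range(1, 1 << M) if independent(conjugates(b)))
basis = conjugates(beta)  # normal basis {beta, beta^2, beta^4}


def to_normal(z: int) -> list[int]:
    """Coordinates (a_0, ..., a_{m-1}) with z = sum a_i * beta^(2^i)."""
    for bits in range(1 << M):
        acc = 0
        for i in range(M):
            if (bits >> i) & 1:
                acc ^= basis[i]
        if acc == z:
            return [(bits >> i) & 1 for i in range(M)]
    raise ValueError("basis does not span the field")


for z in range(1 << M):
    a = to_normal(z)
    # Squaring maps beta^(2^i) to beta^(2^(i+1)), and beta^(2^m) = beta: a cyclic shift.
    assert to_normal(gf_mul(z, z)) == [a[-1]] + a[:-1]
print("squaring is a cyclic shift of normal-basis coordinates")
```

In hardware the same rotation is pure wiring, so a normal-basis squarer costs no gates.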
In 1989, Feng [22] developed a serial-in parallel-out architecture based on the normal basis representation of the finite field elements. The algorithm has a computational complexity of O(m log_2 m). A throughput rate and latency of m(q + p) clock cycles, where p is the number of ones in the binary expression of m - 1 and q is the integer lower bound ⌊log_2 m⌋, are needed to compute the inverse elements.
In 1993, Wang and Li [23] presented a serial-in serial-out systolic array architecture for computing the inverse of an element in GF(2^m). In the analysis, the standard basis representation of the field elements is used. The design for GF(2^m) mimics the systolic array based on the Gauss-Jordan elimination algorithm for solving a system of 2m - 1 linear equations over GF(2) [24]. The proposed inversion circuitry has a latency of 7m - 3 clock cycles and a maximum throughput of one result every 2m - 1 clock cycles. Without any modifications in hardware, the multiply-and-divide operation can easily be performed. The logic design of the architecture is independent of the primitive polynomial used to generate the field elements. All the operations of the serial-in serial-out systolic array are pipelined in such a manner that each cell completes a small fraction of the computations and passes the data to the neighbouring cells. The entire systolic array is made up of main array cells and m boundary cells, where m is the symbol size of the Galois field.
A fast normal basis inversion circuit was presented by Fenn et al. in 1996 [25]. The hardware scheme uses two registers, a multiplier, a squarer and a generator device in GF(2^m). It exploits the properties of Fermat's theorem in order to progressively generate the solution over a sequence of clock cycles. The inverter is shown to be more efficient for odd values of m, and its features make it suitable for double-error-correcting Reed-Solomon codes. The same design was recently reexamined and improved by Yen in 1997 [26]. It is demonstrated that the number of clock cycles per iteration can be further reduced, and Yen's algorithm clearly outperforms the algorithm by Fenn et al. for large values of m. Another modification to the algorithm by Fenn et al. was reported by Calvo and Torres in 1997 [27], in which the generator and squarer devices have been totally eliminated from the original circuit.
In 1997, Hasan [28] presented an algorithm to perform sequential computation
of division-and-accumulation (DDA) over $GF(2^m)$. The algorithm can also be used
for conventional rational numbers. It is shown that where a direct DDA requires $n$
multiplications and $n$ inversions, the new algorithm requires only $3n+1$
multiplications and one inversion. This is advantageous in fields where a division
is at least three times more complex than a multiplication. The DDA structure is
suitable for the systolic Reed-Solomon encoder [29], where it can efficiently
compute the parity symbols during the encoding process.
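Hasan's algorithm itself is not reproduced here. The sketch below only illustrates how a count of $3n+1$ multiplications and a single inversion can arise: the running sum is kept as one fraction $N/D$ and updated as $N/D + a_i/b_i = (N b_i + a_i D)/(D b_i)$, which costs three multiplications per term, followed by one inversion and one multiplication at the end. Field addition is XOR. The field $GF(2^4)$ with $x^4 + x + 1$ and all function names are assumptions for illustration.

```python
M, PRIM = 4, 0b10011  # GF(2^4) from x^4 + x + 1 (assumed)


def gf_mul(a: int, b: int) -> int:
    """Shift-and-XOR multiplication modulo PRIM."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
    return result


def gf_inv(a: int) -> int:
    """Exhaustive search is adequate for a 16-element field in a sketch (a != 0)."""
    return next(c for c in range(1, 1 << M) if gf_mul(a, c) == 1)


def dda_direct(nums, dens):
    """sum(a_i / b_i) using one inversion and one multiplication per term."""
    total = 0
    for a, b in zip(nums, dens):
        total ^= gf_mul(a, gf_inv(b))
    return total


def dda_one_inversion(nums, dens):
    """Same sum kept as a single fraction num/den; returns (sum, multiplications)."""
    num, den, mults = 0, 1, 0
    for a, b in zip(nums, dens):
        num = gf_mul(num, b) ^ gf_mul(a, den)  # N*b_i + a_i*D
        den = gf_mul(den, b)                   # D*b_i
        mults += 3
    return gf_mul(num, gf_inv(den)), mults + 1  # one inversion, one final multiplication


nums, dens = [3, 7, 1, 12], [5, 9, 14, 2]
total, mults = dda_one_inversion(nums, dens)
assert total == dda_direct(nums, dens)
print(mults)  # 13, i.e. 3n + 1 for n = 4
```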
1.2.4 Exponentiators
Exponentiation is extensively used in cryptosystems and error-correcting codes.
The conventional approach for finding the exponent of an element in $GF(2^m)$ uses
read-only memory or table lookup. Since $m$ can range from 3 upward without bound,
this would require storing $2^m$ elements, each $m$ bits wide, which becomes
inefficient as $m$ grows large. In recent years, several exponentiation algorithms
and their corresponding VLSI architectures have been proposed.
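For an arbitrary element $\beta$ in the finite field $GF(2^m)$ and an integer $N$
($1 \le N \le 2^m-1$), the exponentiation function is defined as $\delta = \beta^N$.
Clearly, $\delta$ is in $GF(2^m)$. If $N$ is represented in binary form as
$n_0, n_1, n_2, \ldots, n_{m-1}$ such that $N = \sum_{i=0}^{m-1} n_i 2^i$, then
$\delta = \beta^N$ can be expressed as follows [30][31]:

$$
\begin{aligned}
\delta = \beta^N &= \beta^{\sum_{i=0}^{m-1} n_i 2^i} \\
&= (\beta)^{n_0} \cdot (\beta^2)^{n_1} \cdot (\beta^4)^{n_2} \cdots \left(\beta^{2^{m-1}}\right)^{n_{m-1}} \\
&= \prod_{i=0}^{m-1} \left(\beta^{2^i}\right)^{n_i} && (1.5) \\
&= \prod_{i=0}^{m-1} E_i && (1.6)
\end{aligned}
$$

where

$$
E_i = \begin{cases} \beta^{2^i} & \text{if } n_i = 1 \\ 1 & \text{if } n_i = 0 \end{cases} \tag{1.7}
$$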
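Equations (1.5) to (1.7) translate directly into the right-to-left binary method: scan the bits $n_0, n_1, \ldots$ of $N$, keep the running square $\beta^{2^i}$, and multiply it into the result only when $n_i = 1$. The sketch below assumes $GF(2^4)$ generated by $x^4 + x + 1$ and uses log/antilog tables for multiplication; the hardware architectures surveyed here compute the same product with dedicated multiplier and squarer circuits rather than tables.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1              # number of nonzero elements
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):            # tabulate alpha^k
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    """Table-based multiplication in GF(2^m)."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_pow(beta, n):
    """delta = beta^N = product of E_i, with E_i = beta^(2^i) if n_i = 1, else 1."""
    delta = 1
    square = beta                      # beta^(2^0)
    for i in range(M):                 # N <= 2^m - 1 has at most m bits
        if (n >> i) & 1:               # n_i = 1: multiply in E_i = beta^(2^i)
            delta = gmul(delta, square)
        square = gmul(square, square)  # beta^(2^(i+1))
    return delta


print(gf_pow(2, 4))   # 3, because alpha^4 = alpha + 1
print(gf_pow(2, 15))  # 1, because beta^(2^m - 1) = 1
```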
In 1988, Scott et al. [32] proposed several sequential and parallel VLSI
architectures for computing the product terms of the exponent in $GF(2^m)$. As
described in [32], the designs are targeted at applications that use $GF(2^m)$ for
large values of $m$. Both standard-basis and normal-basis exponentiation are
considered. The sequential exponentiation unit requires $O(m^2)$ clock cycles,
assuming repeated use of a multiplier with a throughput of one multiplication every
$m$ clock cycles. The fully parallel computation of the product terms yields one
exponentiation every $m$ clock cycles, assuming the use of $m-1$ multipliers whose
combined minimum latency is $m + 2m\log_2 m$ clock cycles. A multiplier latency of
$2m$ clock cycles is assumed.
A VLSI design and implementation of an exponentiation circuit was also presented
by Wang and Pei in 1990 [30]. The architecture can be used to generate
pseudorandom number sequences in spread-spectrum and cryptographic systems, and in
digital signal processing applications such as noise generation. Elements in the
finite field are represented in the normal basis. In this design, the
exponentiation of an element is found in $m$ clock cycles. The architectural
details and VLSI layout of the chip for $GF(2^4)$ are illustrated extensively.
In 1993, Arazi [33] presented two efficient exponentiation circuits which can be
adopted for smartcard applications. They operate over the standard basis
representation of elements in $GF(2^m)$. In one scheme, the algorithm is completed
in $2m$ clock cycles instead of $m$. The shift registers can be implemented with
dynamic instead of static registers, owing to the limited space in a
smartcard-mounted chip. The second scheme is simpler and uses duplicates of the
same cell to compute exponentiation in $6m^2$ clock cycles.
A parallel-in-parallel-out bit-level systolic array architecture with unidirectional
dataflow for computing exponentiation was first presented by Wang in 1994 [31].
Using the systolic multiplier proposed by Wang and Li in [11], two-level
pipelining is employed to achieve a maximum throughput of one output every clock
cycle after an initial delay of $2m^2 + m$ cycles. Unidirectional dataflow is highly
desirable in designing high-speed systems, and the design can easily incorporate
fault tolerance.
An exponentiation algorithm based on a pattern matching and recognition technique
was recently presented by Kovac and Ranganathan in 1996 [34]. Unlike the
conventional methods, which use repeated multiplications, the algorithm can perform
the exponentiation operation on the fly. In the analysis, the nonzero elements of
the Galois field $GF(2^m)$ are represented in the standard basis. The elements are
divided into subsets, where each subset corresponds to a pattern. More details on
the related theorems and proofs are given in [34]. To obtain high speed and maximum
throughput, the authors propose a systolic architecture which uses a multistage
linear pipeline and parallelism. Once the pipeline is filled, a new result is
obtained every clock cycle following a latency of $2m$ clock cycles. The
architecture is recommended for applications that use $GF(2^m)$ with $m \le 8$,
and the hardware allows the programming of different primitive irreducible
polynomials of degree $m \le 8$. The design issues related to the CMOS VLSI
implementation of the chip, which performs exponentiation over $GF(2^m)$, are
enumerated extensively. A maximum computational rate of 40 million
exponentiations per second at a clock frequency of 40 MHz is possible.
1.2.5 Summary
An overview of Galois field arithmetic operations has been presented. The
multiplication, inverse, division and exponentiation operations in $GF(2^m)$ have
been described extensively. The traditional methods for evaluating these functions
use ROM, Fermat's theorem or Euclid's algorithm. However, these techniques are
inefficient for VLSI implementation when large values of $m$ are required. Since
the latency and throughput of the arithmetic units may dictate the overall speed of
the global system, the development of more efficient algorithms and their
corresponding VLSI architectures remains an active area of research.
1.3 Scope of the Work
In this thesis I propose an $m$-programmable Galois field multiplier which uses the
standard basis representation of the elements. A structure is also designed to
implement both the exponent and inverse functions over $GF(2^m)$, where $m$ is
variable. The requirement to operate with different symbol sizes of $m$ bits has
been a limiting factor in past attempts to implement universal and reconfigurable
encoders/decoders [2][5][6].
By using the proposed arithmetic circuits, coupled with a multiplexing technique
to select different RS code parameters $m$ and $t$, an ASIC synthesis of a testable
RS encoder/decoder which implements a wide family of RS codes in $GF(2^m)$ is
developed. Unlike the chips customized for a specific $m$ and $t$ reported in
[35]-[51], it is reconfigurable and supports Galois field symbol sizes
$m = 3, 4, 5, 6, 7, 8$ and error correction capability $t$ ranging from 1 to 16.
This means the total cost of such a design is amortized over a wide application
base. Since low design complexity and high throughput are desired in the
experimental VLSI chip, the algebraic decoding technique is preferred over the
time-domain or transform-domain methods.
Gate arrays, standard cells and full-custom design are three potential VLSI
technologies that could have been used to implement the RS encoder/decoder chip.
A CMOS standard-cell design methodology using hardware description language (HDL)
logic synthesis was chosen because it allows easy mapping and optimization of the
logic-level design into integrated circuit (IC) layout with state-of-the-art VLSI
CAD tools. The design has been simulated at a frequency of 50 MHz and contains
218,206 logic gates.
1.4 Organization of the Thesis
The remaining chapters of the thesis are organized as follows:
In Chapter 2, the mathematical background and theoretical details necessary for
understanding Reed-Solomon codes are described.
Chapter 3 proposes an $m$-programmable Galois field multiplier which uses the
standard basis representation of the elements. Using this multiplier, it is shown
that the exponentiation and inverse operations can both be performed on the same
reconfigurable hardware.
Chapter 4 discusses the design methodology, VLSI synthesis and operational
features of a new programmable Reed-Solomon encoder/decoder processor.
Chapter 5 highlights the major conclusions of this research and recommendations
for possible future work.
Chapter 2
Theoretical Background on
Reed-Solomon Codes
In this chapter, the RS encoding and decoding algorithms are first explained. A
survey of existing RS encoder and decoder architectures, usually designed for a
fixed $m$, is then given.
2.1 General RS Code Definition
Discovered by I.S. Reed and G.S. Solomon in 1960, Reed-Solomon codes are an
important subclass of nonbinary BCH codes. They are among the most versatile
and powerful error control codes, commonly used to correct both random and burst
errors in digital communications and magnetic storage systems ranging from the
digital audio disc to the Voyager spacecraft. A general block diagram of a digital
communication system is shown in Figure 2.1.
The interest in RS codes was primarily theoretical until the concept of
concatenated codes was formulated and first introduced by Forney in 1966 [1].
Concatenated coding has since been adopted by the U.S. National Aeronautics and
Space Administration (NASA) for interplanetary space missions. It uses a
convolutional/RS channel encoding and decoding system.

Figure 2.1: A Digital Communication System
For any positive integer $m \ge 3$ and error-correcting capability $t \ge 1$, there
exists a $t$-error-correcting RS code over the Galois field $GF(2^m)$ with the
following parameters [52]-[57]:
Block length: $n = 2^m - 1$ symbols
Number of parity-check symbols: $n - k = 2t$
Minimum distance: $d_{min} = 2t + 1$
where $k$ is the number of data (message) symbols.
An $(n, k, t)$ RS code has a generator polynomial $G(x)$ of degree $n - k$, often
written as $G(x) = (x + \alpha)(x + \alpha^2)\cdots(x + \alpha^{2t})$.
2.2 Encoding
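The generator polynomial $G(x)$ of an RS code has the form

$$
G(x) = \prod_{i=b}^{b+2t-1} (x - \alpha^i) = \sum_{i=0}^{2t} g_i x^i = g_0 + g_1 x + \cdots + g_{2t} x^{2t} \tag{2.1}
$$

where $b$ is a nonnegative integer often chosen to be 1. The number of distinct
coefficients of $G(x)$ can be reduced by almost half by choosing $b = 2^{m-1} - t$,
which satisfies the relationship [8]

$$
2b + 2t = 2^m \tag{2.2}
$$

With this choice the $2t$ roots $\alpha^b, \ldots, \alpha^{b+2t-1}$ are closed under
inversion, because $b + (b + 2t - 1) = 2^m - 1$, so $G(x)$ equals its own reciprocal
polynomial and its coefficients are symmetric, $g_i = g_{2t-i}$.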
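The claim that choosing $b = 2^{m-1} - t$ roughly halves the number of distinct coefficients can be checked numerically. The sketch below assumes $GF(2^4)$ with primitive polynomial $x^4 + x + 1$ and $t = 2$, and builds $G(x)$ from Equation (2.1) for $b = 1$ and for $b = 2^{m-1} - t = 6$. Coefficients are listed from $g_0$ to $g_{2t}$, with each field element encoded as the integer whose bits are its polynomial-basis coordinates; in $GF(2^m)$ subtraction equals addition (XOR), so $x - \alpha^i = x + \alpha^i$. The helper names are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):            # tabulate alpha^k
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def generator_poly(b, t):
    """Coefficients [g_0, ..., g_2t] of G(x) = prod_{i=b}^{b+2t-1} (x + alpha^i)."""
    g = [1]
    for i in range(b, b + 2 * t):
        root = EXP[i % N]
        shifted = [0] + g                          # x * g(x)
        scaled = [gmul(root, c) for c in g] + [0]  # alpha^i * g(x)
        g = [s ^ r for s, r in zip(shifted, scaled)]
    return g


t = 2
b_sym = (1 << (M - 1)) - t          # b = 2^(m-1) - t = 6
assert 2 * b_sym + 2 * t == 1 << M  # relationship (2.2)
print(generator_poly(1, t))         # [7, 8, 12, 13, 1]: alpha^10, alpha^3, alpha^6, alpha^13, 1
print(generator_poly(b_sym, t))     # [1, 8, 2, 8, 1]: 1, alpha^3, alpha, alpha^3, 1
```

With $b = 6$ the constant multipliers collapse to $g_1 = g_3 = \alpha^3$ and $g_2 = \alpha$, while $g_0 = g_4 = 1$ need no multiplier at all; with $b = 1$ all four non-leading coefficients differ.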
There are two ways to encode the message $M(x)$. In nonsystematic encoding,
the codeword $C(x)$ is generated simply as

$$
C(x) = M(x)G(x) \tag{2.3}
$$

Thus the message $M(x)$ is not explicitly present in the codeword $C(x)$.
In systematic encoding, given a message polynomial $M(x)$ and generator
polynomial $G(x)$, the codeword $C(x)$ is generated as follows:
(1) multiply the message $M(x)$ by $x^{2t}$ to obtain $x^{2t}M(x)$;
(2) divide $x^{2t}M(x)$ by $G(x)$ to obtain the remainder polynomial $R(x)$, and form
the codeword

$$
C(x) = x^{2t}M(x) + R(x) = Q(x)G(x) \tag{2.4}
$$

where $Q(x)$ is the quotient and $R(x) = r_0 + r_1 x + r_2 x^2 + \cdots + r_{2t-1}x^{2t-1}$ is the
remainder or parity polynomial.
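The two systematic-encoding steps amount to a polynomial long division, which an LFSR encoder carries out one symbol per clock. The software sketch below assumes $GF(2^4)$ with $x^4 + x + 1$, $t = 2$ (an RS(15, 11) code) and $b = 1$; coefficient lists run from lowest to highest degree, and every helper name is illustrative rather than taken from the thesis.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def generator_poly(b, t):
    g = [1]
    for i in range(b, b + 2 * t):
        root = EXP[i % N]
        g = [s ^ r for s, r in zip([0] + g, [gmul(root, c) for c in g] + [0])]
    return g


def poly_mod(dividend, divisor):
    """Remainder of dividend / divisor for a monic divisor (lowest degree first)."""
    rem = list(dividend)
    d = len(divisor) - 1
    for k in range(len(rem) - 1, d - 1, -1):  # cancel terms from the top down
        coef = rem[k]
        if coef:
            for j in range(d + 1):
                rem[k - d + j] ^= gmul(coef, divisor[j])
    return rem[:d]


def rs_encode_systematic(msg, g):
    """C(x) = x^{2t} M(x) + R(x), where R(x) = x^{2t} M(x) mod G(x)."""
    two_t = len(g) - 1
    shifted = [0] * two_t + list(msg)   # step (1): x^{2t} M(x)
    parity = poly_mod(shifted, g)       # step (2): remainder R(x)
    return parity + list(msg)           # parity in the low-order positions


def poly_eval(p, x):
    """Horner evaluation of p(x)."""
    acc = 0
    for coef in reversed(p):
        acc = gmul(acc, x) ^ coef
    return acc


g = generator_poly(1, 2)
msg = list(range(1, 12))                # 11 message symbols
codeword = rs_encode_systematic(msg, g)
assert codeword[4:] == msg              # message appears unchanged
assert all(poly_eval(codeword, EXP[j]) == 0 for j in range(1, 5))  # C(alpha^j) = 0
```

The final check follows from Equation (2.4): since $C(x) = Q(x)G(x)$, the codeword vanishes at every root of $G(x)$.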
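Circuits for performing division by $G(x)$, or by any arbitrary polynomial, are well
known; the usual form is a linear feedback shift register (LFSR) whose feedback
taps are the coefficients of $G(x)$. The number of distinct constant multipliers
$g_0, g_1, \ldots, g_{2t}$ in such a circuit can be reduced almost by half by choosing
$b = 2^{m-1} - t$.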
Maki and Owsley [58] presented the VLSI design and implementation of the
parallel Berlekamp architecture, which has speed performance equivalent to the
conventional encoder, at a hardware cost 8 times that of the serial Berlekamp
architecture. The serial and parallel VLSI architectures by Berlekamp perform
encoding in the dual or trace-orthogonal basis representation of the field elements.
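A transmitted codeword $C(x)$ may be corrupted in a noisy channel. The received
polynomial $R(x)$ (here $R(x)$ denotes the received word, not the parity polynomial
of Equation (2.4)) can be expressed as the sum of the transmitted codeword $C(x)$
and the error polynomial $E(x)$:

$$
R(x) = C(x) + E(x) = r_{n-1}x^{n-1} + \cdots + r_1 x + r_0 \tag{2.5}
$$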
The following sections describe available techniques which can be used to find
and correct the errors in the received polynomial R(x).
2.3 Algebraic Decoding
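The first task of an algebraic decoder is to determine the syndrome polynomial
$S(x)$ from $R(x)$. The coefficients of the syndrome polynomial are given by [54]

$$
S_j = R(\alpha^j) = E(\alpha^j) = \sum_{i=0}^{n-1} r_i \alpha^{ij} \tag{2.6}
$$

where $1 \le j \le 2t$ for nonsymmetric coefficients of $G(x)$, or
$2^{m-1} - t \le j \le 2^{m-1} + t - 1$ for symmetric coefficients of $G(x)$. The
equality $R(\alpha^j) = E(\alpha^j)$ holds because every codeword vanishes at the roots
$\alpha^j$ of $G(x)$.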
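Equation (2.6) is usually evaluated with Horner's rule, one received symbol per clock, a form that maps naturally onto a multiply-and-accumulate cell. The sketch below assumes $GF(2^4)$ with $x^4 + x + 1$, $t = 2$ and $b = 1$, and uses $G(x)$ itself as a test codeword (it is $Q(x)G(x)$ with $Q(x) = 1$). It also checks Equation (2.8): the syndromes of the corrupted word equal $\sum_l Y_l X_l^j$. Function names and the chosen error pattern are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def generator_poly(b, t):
    g = [1]
    for i in range(b, b + 2 * t):
        root = EXP[i % N]
        g = [s ^ r for s, r in zip([0] + g, [gmul(root, c) for c in g] + [0])]
    return g


def syndromes(r, t, b=1):
    """S_j = r(alpha^j) for j = b, ..., b + 2t - 1, with r = [r_0, ..., r_{n-1}]."""
    out = []
    for j in range(b, b + 2 * t):
        x = EXP[j % N]
        acc = 0
        for coef in reversed(r):        # Horner: ((r_{n-1} x + r_{n-2}) x + ...) + r_0
            acc = gmul(acc, x) ^ coef
        out.append(acc)
    return out


t = 2
codeword = generator_poly(1, t) + [0] * (N - 2 * t - 1)  # length n = 15
assert syndromes(codeword, t) == [0] * (2 * t)

received = list(codeword)
received[3] ^= 5     # Y_1 = 5 at location i_1 = 3
received[10] ^= 9    # Y_2 = 9 at location i_2 = 10
expected = [gmul(5, EXP[3 * j % N]) ^ gmul(9, EXP[10 * j % N]) for j in range(1, 2 * t + 1)]
assert syndromes(received, t) == expected  # S_j = sum_l Y_l X_l^j
```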
After the syndromes have been evaluated, the error values $e_0, e_1, \ldots, e_{n-1}$ can
be found. If $v$ errors actually occur in $R(x)$ at the unknown locations
$i_1, i_2, \ldots, i_v$, the error polynomial can be expressed as

$$
E(x) = Y_1 x^{i_1} + Y_2 x^{i_2} + \cdots + Y_v x^{i_v} \tag{2.7}
$$

where $Y_l$ is the magnitude of the $l$th error at location $i_l$.
Prior to decoding, the values of $v$, $i_1, \ldots, i_v$ and $Y_1, \ldots, Y_v$ are unknown.
If $X_l = \alpha^{i_l}$ is the field element associated with the error location $i_l$, then
the syndrome coefficients are given by

$$
S_j = \sum_{l=1}^{v} Y_l X_l^j \tag{2.8}
$$

for $j = 1, 2, \ldots, 2t$ or $j = 2^{m-1} - t, \ldots, 2^{m-1} + t - 1$, where $Y_l$ is the error
value and $X_l$ is the error location of the $l$th error symbol.
An expansion of Equation (2.8) gives the following set of $2t$ simultaneous
equations in $v$ unknown error locations $X_1, \ldots, X_v$ and $v$ unknown error
magnitudes $Y_1, \ldots, Y_v$:

$$
\begin{aligned}
S_1 &= Y_1X_1 + Y_2X_2 + \cdots + Y_vX_v \\
S_2 &= Y_1X_1^2 + Y_2X_2^2 + \cdots + Y_vX_v^2 \\
S_3 &= Y_1X_1^3 + Y_2X_2^3 + \cdots + Y_vX_v^3 \\
&\;\;\vdots \\
S_{2t} &= Y_1X_1^{2t} + Y_2X_2^{2t} + \cdots + Y_vX_v^{2t}
\end{aligned}
$$

The above set of equations must have at least one solution because of the way
the syndromes are defined, and the solution is unique provided $v \le t$. Thus the
decoder's task is to find the unknowns, given the syndromes. This is equivalent to
solving a system of nonlinear equations.
Clearly, the direct solution of this system of nonlinear equations is too difficult
for large values of $v$. Instead, intermediate variables can be computed from the
syndrome coefficients $S_j$, from which the error locations $X_1, \ldots, X_v$ can be
determined. The error-locator polynomial is introduced as

$$
\Lambda(x) = \Lambda_v x^v + \Lambda_{v-1}x^{v-1} + \cdots + \Lambda_1 x + 1 \tag{2.9}
$$

The polynomial is defined with roots at the inverse error locations $X_l^{-1}$ for
$l = 1, 2, \ldots, v$. The error location numbers $X_l$ indicate errors at locations $i_l$
for $l = 1, 2, \ldots, v$. That is to say,

$$
\Lambda(x) = \prod_{l=1}^{v}(1 - xX_l) = (1 - xX_1)(1 - xX_2)\cdots(1 - xX_v) \tag{2.10}
$$

where $X_l = \alpha^{i_l}$.
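To determine the coefficients of $\Lambda(x)$ from the syndromes, equate Equations
(2.9) and (2.10), multiply both sides by $Y_l X_l^{j+v}$ and set $x = X_l^{-1}$. The
product form (2.10) then becomes zero, giving

$$
0 = Y_l X_l^{j+v}\left(1 + \Lambda_1 X_l^{-1} + \Lambda_2 X_l^{-2} + \cdots + \Lambda_{v-1}X_l^{-(v-1)} + \Lambda_v X_l^{-v}\right)
$$

$$
Y_l\left(X_l^{j+v} + \Lambda_1 X_l^{j+v-1} + \cdots + \Lambda_v X_l^{j}\right) = 0
$$

Such an equation holds for each $l$ and each $j$. Summing these equations from
$l = 1$ to $l = v$, for each $j$, gives

$$
\sum_{l=1}^{v} Y_l\left(X_l^{j+v} + \Lambda_1 X_l^{j+v-1} + \cdots + \Lambda_v X_l^{j}\right) = 0
$$

$$
\sum_{l=1}^{v} Y_l X_l^{j+v} + \Lambda_1 \sum_{l=1}^{v} Y_l X_l^{j+v-1} + \cdots + \Lambda_v \sum_{l=1}^{v} Y_l X_l^{j} = 0
$$

By Equation (2.8), the individual sums are syndromes, and thus the equation becomes

$$
\Lambda_1 S_{j+v-1} + \Lambda_2 S_{j+v-2} + \cdots + \Lambda_v S_j = -S_{j+v}, \qquad j = 1, 2, \ldots, v \tag{2.11}
$$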
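This set of linear equations relates the syndromes to the coefficients of the
error-locator polynomial $\Lambda(x)$. It can also be expressed in matrix form as

$$
\underbrace{\begin{bmatrix}
S_1 & S_2 & S_3 & \cdots & S_{v-1} & S_v \\
S_2 & S_3 & S_4 & \cdots & S_v & S_{v+1} \\
S_3 & S_4 & S_5 & \cdots & S_{v+1} & S_{v+2} \\
\vdots & & & & & \vdots \\
S_v & S_{v+1} & S_{v+2} & \cdots & S_{2v-2} & S_{2v-1}
\end{bmatrix}}_{\mathbf{A}}
\begin{bmatrix} \Lambda_v \\ \Lambda_{v-1} \\ \Lambda_{v-2} \\ \vdots \\ \Lambda_1 \end{bmatrix}
=
\begin{bmatrix} -S_{v+1} \\ -S_{v+2} \\ -S_{v+3} \\ \vdots \\ -S_{2v} \end{bmatrix} \tag{2.12}
$$

The above system of equations has a unique solution for $\Lambda$, which can be
obtained by inverting the matrix $\mathbf{A}$ if $\mathbf{A}$ is nonsingular. The matrix
$\mathbf{A}$ is nonsingular when $v$ equals the actual number of errors and $v \le t$ [54].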
Peterson's direct-solution algorithm solves Equation (2.12) for the error-locator
polynomial $\Lambda(x)$ as follows [54]: as a trial value, $v$ is set to the error
correction capability of the code $t$, and the determinant of the matrix $\mathbf{A}$ is
computed. If the determinant is nonzero, it can be shown that this is the correct
value of $v$. Otherwise, if it is zero, the trial value of $v$ is reduced by 1 and the
process is repeated until a nonzero determinant is obtained. After the determinant
has been obtained, the coefficients of $\Lambda(x)$ are determined from Equation (2.12),
using this value of $v$, by standard techniques of linear algebra.
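A compact sketch of Peterson's procedure, assuming the field $GF(2^4)$ built from $x^4 + x + 1$: for $v = t, t-1, \ldots$ it forms the matrix of Equation (2.12) and tries to solve it by Gauss-Jordan elimination over the field. Failing to find a pivot plays the role of a zero determinant, so the test is equivalent to, though not literally, the determinant computation described above. Syndromes are passed as the list $[S_1, \ldots, S_{2t}]$, and since $-1 = 1$ in $GF(2^m)$ the minus signs on the right-hand side disappear. The function names are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def ginv(a):
    return EXP[N - LOG[a]]    # a must be nonzero


def solve_gf(A, y):
    """Gauss-Jordan elimination over GF(2^m); returns None when A is singular."""
    n = len(A)
    aug = [list(row) + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            return None                                 # zero determinant
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = ginv(aug[col][col])
        aug[col] = [gmul(inv, e) for e in aug[col]]     # normalize the pivot row
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [e ^ gmul(f, p) for e, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def peterson(S, t):
    """S = [S_1, ..., S_2t]; returns [1, Lambda_1, ..., Lambda_v]."""
    for v in range(t, 0, -1):
        A = [[S[r + c] for c in range(v)] for r in range(v)]  # entry S_{r+c+1}
        rhs = [S[v + r] for r in range(v)]                    # -S_{v+r+1} = S_{v+r+1}
        sol = solve_gf(A, rhs)                                # [Lambda_v, ..., Lambda_1]
        if sol is not None:
            return [1] + sol[::-1]
    return [1]                                                # no errors detected


# Two errors: Y = 5 at location 3 and Y = 9 at location 10 (X_1 = alpha^3, X_2 = alpha^10).
S = [gmul(5, EXP[3 * j % N]) ^ gmul(9, EXP[10 * j % N]) for j in range(1, 5)]
print(peterson(S, 2))   # [1, 15, 13], i.e. (1 + alpha^3 x)(1 + alpha^10 x)
```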
Peterson's direct-solution algorithm is inefficient for codes with a large
error-correcting capability $t$: the number of computations necessary to invert a
$v \times v$ matrix is proportional to $v^3$. In most applications, designers prefer to
use codes that correct a large number of errors. The following subsections detail
two efficient decoding methods: the Berlekamp-Massey algorithm and Euclid's
algorithm.
2.3.1 Berlekamp-Massey Algorithm
The Berlekamp-Massey algorithm relies on the fact that the matrix in Equation
(2.12) is not arbitrary in form; rather, it is highly structured. This structure is
used to obtain the vector $\Lambda$ by a method that is conceptually more complicated
but computationally much simpler [54][59][60].
If the vector $\Lambda$ is known, then the first row of the above matrix equation
defines $S_{v+1}$ in terms of $S_1, \ldots, S_v$. The second row defines $S_{v+2}$ in terms
of $S_2, \ldots, S_{v+1}$, and so forth. This sequential process can be summarized by
the recursive relation

$$
S_j = -\sum_{i=1}^{v} \Lambda_i S_{j-i}, \qquad j = v+1, \ldots, 2v \tag{2.13}
$$

For fixed $\Lambda$, this is the equation of an autoregressive filter. It can be
implemented as a linear-feedback shift register with taps given by the
coefficients of $\Lambda$.
Using this argument, the problem has been reduced to the design of a
linear-feedback shift register that generates the known sequence of syndromes.
Many such shift registers exist, but it is desirable to find the shortest
linear-feedback shift register with this property. This gives the least-weight
error pattern, with a polynomial $\Lambda(x)$ of smallest degree $v$. The polynomial of
smallest degree $v$ is unique, since the $v \times v$ matrix of the original problem is
invertible.
Any procedure for designing the autoregressive filter is also a method for solving
the matrix equation for the vector $\Lambda$. The procedure applies in any field and
does not assume any special properties of the sequence $S_1, S_2, \ldots, S_{2t}$. To
design the required shift register, the shift register length $L$ and the feedback
connection polynomial $\Lambda(x)$ must be determined. $\Lambda(x)$ has the form

$$
\Lambda(x) = 1 + \Lambda_1 x + \Lambda_2 x^2 + \cdots + \Lambda_L x^L \tag{2.14}
$$

where $\deg \Lambda(x) \le L$.
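The Berlekamp-Massey algorithm uses the initial conditions $\Lambda^{(0)}(x) = 1$,
$B^{(0)}(x) = 1$ and $L_0 = 0$ to compute $\Lambda^{(2t)}(x)$ as follows, for $r = 1, \ldots, 2t$:

$$
\Delta_r = \sum_{j=0}^{L_{r-1}} \Lambda_j^{(r-1)} S_{r-j} \tag{2.15}
$$

$$
L_r = \delta_r (r - L_{r-1}) + (1 - \delta_r) L_{r-1} \tag{2.16}
$$

$$
\begin{bmatrix} \Lambda^{(r)}(x) \\ B^{(r)}(x) \end{bmatrix}
=
\begin{bmatrix} 1 & -\Delta_r x \\ \Delta_r^{-1}\delta_r & (1 - \delta_r)x \end{bmatrix}
\begin{bmatrix} \Lambda^{(r-1)}(x) \\ B^{(r-1)}(x) \end{bmatrix} \tag{2.17}
$$

$$
\delta_r = \begin{cases} 1 & \text{if } \Delta_r \ne 0 \text{ and } 2L_{r-1} \le r - 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.18}
$$

At the end of the $2t$ iterations, the smallest-degree polynomial $\Lambda^{(2t)}(x)$ with
$\Lambda_0^{(2t)} = 1$ satisfying

$$
S_r + \sum_{j=1}^{L_{2t}} \Lambda_j^{(2t)} S_{r-j} = 0, \qquad r = L_{2t} + 1, \ldots, 2t
$$

is obtained. Then, if the error-evaluator polynomial $\Omega(x)$ is defined by the
relation

$$
S(x)\Lambda(x) = \Omega(x) \bmod x^{2t} \tag{2.19}
$$

$\Omega(x)$ can be used to solve for the error magnitudes $Y_1, \ldots, Y_v$.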
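The iteration can be sketched directly in Python from the discrepancy (2.15), the length update (2.16) and the decision rule (2.18), together with the standard shift-register update $\Lambda^{(r)} = \Lambda^{(r-1)} - \Delta_r x B^{(r-1)}$ and $B^{(r)} = \delta_r \Delta_r^{-1}\Lambda^{(r-1)} + (1 - \delta_r)xB^{(r-1)}$. The sketch assumes $GF(2^4)$ with $x^4 + x + 1$ and takes the syndromes as the list $[S_1, \ldots, S_{2t}]$; because $-1 = 1$ in $GF(2^m)$, the subtraction becomes an XOR. It follows the shift-register formulation rather than any particular hardware architecture, and the names are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def ginv(a):
    return EXP[N - LOG[a]]    # a must be nonzero


def berlekamp_massey(S):
    """S = [S_1, ..., S_2t]; returns (Lambda, L) with Lambda = [1, Lambda_1, ..., Lambda_L]."""
    lam, B, L = [1], [1], 0                    # Lambda^(0) = 1, B^(0) = 1, L_0 = 0
    for r in range(1, len(S) + 1):
        # Discrepancy: Delta_r = sum_{j=0}^{L_{r-1}} Lambda_j S_{r-j}
        delta = 0
        for j in range(min(L, len(lam) - 1) + 1):
            delta ^= gmul(lam[j], S[r - j - 1])
        xB = [0] + B                           # x * B^(r-1)(x)
        size = max(len(lam), len(xB))
        lam_p = lam + [0] * (size - len(lam))
        xB_p = xB + [0] * (size - len(xB))
        new_lam = [a ^ gmul(delta, b) for a, b in zip(lam_p, xB_p)]  # Lambda - Delta x B
        if delta != 0 and 2 * L <= r - 1:      # delta_r = 1
            inv = ginv(delta)
            B = [gmul(inv, c) for c in lam]    # B^(r) = Delta_r^-1 Lambda^(r-1)
            L = r - L                          # L_r = r - L_{r-1}
        else:                                  # delta_r = 0
            B = xB                             # B^(r) = x B^(r-1)
        lam = new_lam
    while len(lam) > 1 and lam[-1] == 0:       # drop high-order zeros
        lam.pop()
    return lam, L


# Two errors: Y = 5 at location 3 and Y = 9 at location 10.
S = [gmul(5, EXP[3 * j % N]) ^ gmul(9, EXP[10 * j % N]) for j in range(1, 5)]
print(berlekamp_massey(S))   # ([1, 15, 13], 2)
```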
2.3.2 Euclid's Algorithm
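Euclid's algorithm is a recursive procedure for calculating the greatest common
divisor (GCD) of two polynomials [61]. In a slightly expanded version, the
algorithm always produces polynomials $a(x)$ and $b(x)$ satisfying

$$
\mathrm{GCD}[s(x), t(x)] = a(x)s(x) + b(x)t(x)
$$

Euclid's algorithm uses the initial conditions $R^{(0)}(x) = x^{2t}$,
$T^{(0)}(x) = \sum_{j=1}^{2t} S_j x^{j-1}$, and

$$
\mathbf{A}^{(0)}(x) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{2.20}
$$

to compute $\mathbf{A}^{(r)}(x)$ as follows:

$$
Q^{(r)}(x) = \left\lfloor \frac{R^{(r)}(x)}{T^{(r)}(x)} \right\rfloor \tag{2.21}
$$

$$
\mathbf{A}^{(r+1)}(x) = \begin{bmatrix} 0 & 1 \\ 1 & -Q^{(r)}(x) \end{bmatrix} \mathbf{A}^{(r)}(x) \tag{2.22}
$$

$$
\begin{bmatrix} R^{(r+1)}(x) \\ T^{(r+1)}(x) \end{bmatrix}
=
\begin{bmatrix} 0 & 1 \\ 1 & -Q^{(r)}(x) \end{bmatrix}
\begin{bmatrix} R^{(r)}(x) \\ T^{(r)}(x) \end{bmatrix} \tag{2.23}
$$

where $\lfloor \cdot \rfloor$ denotes the quotient of polynomial division. The algorithm
stops when the degree of $T^{(r)}(x)$ is less than $t$. At the end of the iterations,
the error evaluator and error locator polynomials are found using

$$
\Omega(x) = \Delta^{-1} T^{(r)}(x) \tag{2.24}
$$

$$
\Lambda(x) = \Delta^{-1} A_{22}^{(r)}(x) \tag{2.25}
$$

respectively, where $\Delta = A_{22}^{(r)}(0)$ and $A_{22}^{(r)}$ is the element of the matrix
$\mathbf{A}^{(r)}$ in the second row and second column.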
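A software sketch of Equations (2.20) to (2.25), assuming $GF(2^4)$ generated by $x^4 + x + 1$. Rather than storing the whole matrix $\mathbf{A}^{(r)}(x)$, it keeps only the second-column entries $A_{12}^{(r)}(x)$ and $A_{22}^{(r)}(x)$, which track the multipliers of $S(x)$, since only $A_{22}^{(r)}(x)$ is needed for $\Lambda(x)$. It uses genuine polynomial division, and therefore field inversions; it is not the inversion-free modified algorithm of Shao et al. Helper names are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def ginv(a):
    return EXP[N - LOG[a]]    # a must be nonzero


def trim(p):
    """Drop high-order zero coefficients, keeping at least one entry."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p


def deg(p):
    p = trim(p)
    return -1 if p == [0] else len(p) - 1


def poly_add(a, b):
    size = max(len(a), len(b))
    return trim([(a[i] if i < len(a) else 0) ^ (b[i] if i < len(b) else 0) for i in range(size)])


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gmul(x, y)
    return trim(out)


def poly_divmod(a, b):
    """Quotient and remainder of a(x) / b(x), lowest degree first; b must be nonzero."""
    a, b = trim(a), trim(b)
    q = [0] * max(len(a) - len(b) + 1, 1)
    lead_inv = ginv(b[-1])
    while deg(a) >= deg(b):
        shift = len(a) - len(b)
        coef = gmul(a[-1], lead_inv)
        q[shift] = coef
        for i, c in enumerate(b):
            a[shift + i] ^= gmul(coef, c)
        a = trim(a)
    return trim(q), a


def euclid_key_equation(S, t):
    """Solve S(x) Lambda(x) = Omega(x) mod x^{2t}, with S(x) = sum_j S_j x^{j-1}."""
    R, T = [0] * (2 * t) + [1], trim(S)    # R^(0) = x^{2t}, T^(0) = S(x)
    A12, A22 = [0], [1]                    # second column of A^(0) = identity
    while deg(T) >= t:
        Q, rem = poly_divmod(R, T)         # Q^(r) and R^(r) - Q^(r) T^(r)
        R, T = T, rem
        A12, A22 = A22, poly_add(A12, poly_mul(Q, A22))  # minus equals plus
    delta_inv = ginv(A22[0])               # Delta = A22(0)
    return [gmul(delta_inv, c) for c in A22], [gmul(delta_inv, c) for c in T]


# Two errors: Y = 5 at location 3 and Y = 9 at location 10.
S = [gmul(5, EXP[3 * j % N]) ^ gmul(9, EXP[10 * j % N]) for j in range(1, 5)]
lam, omega = euclid_key_equation(S, 2)
print(lam, omega)   # [1, 15, 13] [4, 3]
```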
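This algorithm has been modified by Shao et al. to avoid computing the inverse
elements of the Galois field [36][62]. The modified Euclid's algorithm recursively
finds the $i$th remainder $R_i(x)$ and the quantities $\gamma_i(x)$ and $\lambda_i(x)$ that
satisfy the relation

$$
\gamma_i(x)A(x) + \lambda_i(x)S(x) = R_i(x)
$$

and stops when the degree of the remainder polynomial $R_i(x)$ is less than $t$,
where $A(x) = x^{2t}$ and $S(x) = \sum_{k=1}^{2t} S_k x^{2t-k}$.
Using the initial conditions $R_0(x) = A(x)$, $Q_0(x) = S(x)$, $\lambda_0(x) = 0$,
$\mu_0(x) = 1$, $\gamma_0(x) = 1$ and $\eta_0(x) = 0$, it computes $R_i(x)$, $\lambda_i(x)$ and
$\gamma_i(x)$ as follows:

$$
\begin{aligned}
R_i(x) &= [\sigma_{i-1}b_{i-1}R_{i-1}(x) + \bar\sigma_{i-1}a_{i-1}Q_{i-1}(x)] - x^{|l_{i-1}|}[\sigma_{i-1}a_{i-1}Q_{i-1}(x) + \bar\sigma_{i-1}b_{i-1}R_{i-1}(x)] && (2.26) \\
\lambda_i(x) &= [\sigma_{i-1}b_{i-1}\lambda_{i-1}(x) + \bar\sigma_{i-1}a_{i-1}\mu_{i-1}(x)] - x^{|l_{i-1}|}[\sigma_{i-1}a_{i-1}\mu_{i-1}(x) + \bar\sigma_{i-1}b_{i-1}\lambda_{i-1}(x)] && (2.27) \\
\gamma_i(x) &= [\sigma_{i-1}b_{i-1}\gamma_{i-1}(x) + \bar\sigma_{i-1}a_{i-1}\eta_{i-1}(x)] - x^{|l_{i-1}|}[\sigma_{i-1}a_{i-1}\eta_{i-1}(x) + \bar\sigma_{i-1}b_{i-1}\gamma_{i-1}(x)] && (2.28) \\
Q_i(x) &= \sigma_{i-1}Q_{i-1}(x) + \bar\sigma_{i-1}R_{i-1}(x) && (2.29) \\
\mu_i(x) &= \sigma_{i-1}\mu_{i-1}(x) + \bar\sigma_{i-1}\lambda_{i-1}(x) && (2.30) \\
\eta_i(x) &= \sigma_{i-1}\eta_{i-1}(x) + \bar\sigma_{i-1}\gamma_{i-1}(x) && (2.31)
\end{aligned}
$$

where $a_{i-1}$ and $b_{i-1}$ are the leading coefficients of $R_{i-1}(x)$ and $Q_{i-1}(x)$
respectively, $l_{i-1} = \deg[R_{i-1}(x)] - \deg[Q_{i-1}(x)]$, $\sigma_{i-1} = 1$ if $l_{i-1} \ge 0$
and $\sigma_{i-1} = 0$ if $l_{i-1} < 0$, and $\bar\sigma_{i-1} = 1 - \sigma_{i-1}$.
The iterations stop when $\deg[R_i(x)] < t$, after which the error locator
polynomial is $\Lambda(x) = \lambda_i(x)$ and the error evaluator polynomial is
$\Omega(x) = R_i(x)$.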
Once the error locator polynomial $\Lambda(x)$ and the error evaluator polynomial
$\Omega(x)$ have been determined using the above techniques, the error locations and
error values (magnitudes) can be found using the Chien search and the Forney
algorithm. These methods are described in the following subsections.
2.3.3 Chien Search
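Once the coefficients of the error locator polynomial $\Lambda_1, \ldots, \Lambda_v$ have been
found, the roots of $\Lambda(x)$ can be computed using the Chien search. The Chien
search is a systematic means of evaluating the error locator polynomial at all
nonzero elements of the field $GF(2^m)$ [63]. Each element $\alpha^i$,
$i = 0, 1, \ldots, 2^m - 2$, is evaluated as

$$
\Lambda(\alpha^i) = 1 + \Lambda_1\alpha^i + \Lambda_2\alpha^{2i} + \cdots + \Lambda_v\alpha^{vi} \tag{2.32}
$$

to check whether $\Lambda(\alpha^i) = 0$. By Equation (2.10), a root $\alpha^i$ identifies an
error location number $X_l = \alpha^{-i}$.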
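In hardware the Chien search is commonly built as one register per coefficient, the $j$th holding $\Lambda_j$ times a power of $\alpha$ and multiplied by a constant once per clock, with the register contents summed to test for a root. The sketch below mimics that structure in $GF(2^4)$ (assumed primitive polynomial $x^4 + x + 1$). For convenience it steps through $\alpha^{-i}$ instead of $\alpha^{i}$, so a zero at step $i$ directly reports error location $i$, since $X_l = \alpha^{i_l}$ and the roots of $\Lambda(x)$ are the $X_l^{-1}$. Names are illustrative.

```python
from functools import reduce
from operator import xor

M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def chien_search(lam):
    """Return every location i (0 <= i < n) with Lambda(alpha^-i) = 0."""
    regs = list(lam)                          # reg_j = Lambda_j * alpha^(-i*j), starting at i = 0
    locations = []
    for i in range(N):
        if reduce(xor, regs, 0) == 0:         # sum of registers = Lambda(alpha^-i)
            locations.append(i)
        # one clock: reg_j <- reg_j * alpha^(-j)
        regs = [gmul(reg, EXP[(-j) % N]) for j, reg in enumerate(regs)]
    return locations


# Lambda(x) = (1 + alpha^3 x)(1 + alpha^10 x) = 1 + alpha^12 x + alpha^13 x^2
print(chien_search([1, 15, 13]))  # [3, 10]
```

If the number of roots found differs from $\deg \Lambda(x)$, the received word contains more errors than the code can correct, which is how a decoder of this kind usually flags a decoding failure.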
2.3.4 The Forney Algorithm
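The Forney algorithm is an efficient method often used to compute the error
magnitudes. The error evaluator polynomial $\Omega(x)$ is defined as [59]

$$
\Omega(x) = S(x)\Lambda(x) \bmod x^{2t} \tag{2.33}
$$

where

$$
\Lambda(x) = \Lambda_v x^v + \Lambda_{v-1}x^{v-1} + \cdots + \Lambda_1 x + 1 = \prod_{l=1}^{v}(1 - xX_l)
$$

and

$$
S(x) = \sum_{j=1}^{2t} S_j x^j = \sum_{j=1}^{2t}\sum_{l=1}^{v} Y_l X_l^j x^j
$$

Equation (2.33) can now be expanded as

$$
\Omega(x) = x\sum_{l=1}^{v} Y_l X_l \prod_{i \ne l}(1 - X_i x) \tag{2.34}
$$

Instead of using matrix inversion to find the error magnitudes, the Forney
algorithm calculates them as

$$
Y_l = -\frac{X_l\,\Omega(X_l^{-1})}{\Lambda'(X_l^{-1})} \tag{2.35}
$$

where the derivative of $\Lambda(x)$ is

$$
\Lambda'(x) = -\sum_{l=1}^{v} X_l \prod_{i \ne l}(1 - xX_i) \tag{2.36}
$$

and hence

$$
\Lambda'(X_l^{-1}) = -X_l \prod_{i \ne l}(1 - X_i X_l^{-1}) \tag{2.37}
$$

Substituting $x = X_l^{-1}$ into Equation (2.34) leaves only the $l$th term,
$\Omega(X_l^{-1}) = Y_l \prod_{i \ne l}(1 - X_i X_l^{-1})$; dividing by Equation (2.37) gives
Equation (2.35).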
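A sketch of the Forney computation in $GF(2^4)$, again assuming the primitive polynomial $x^4 + x + 1$. Evaluating Equations (2.34) and (2.37) at $x = X_l^{-1}$ gives $Y_l = -X_l\,\Omega(X_l^{-1})/\Lambda'(X_l^{-1})$; in characteristic 2 the minus signs vanish, and the formal derivative $\Lambda'(x) = \sum_j j\Lambda_j x^{j-1}$ keeps only the odd-degree terms of $\Lambda(x)$. The sketch uses $S(x) = \sum_{j=1}^{2t} S_j x^j$ as in the text; the error locations, which a Chien search would normally supply, are passed in directly, and the names are illustrative.

```python
M, PRIM = 4, 0b10011          # GF(2^4) generated by x^4 + x + 1 (assumed)
N = (1 << M) - 1
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
val = 1
for k in range(N):
    EXP[k] = EXP[k + N] = val
    LOG[val] = k
    val <<= 1
    if val & (1 << M):
        val ^= PRIM


def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def ginv(a):
    return EXP[N - LOG[a]]    # a must be nonzero


def poly_eval(p, x):
    """Horner evaluation of p(x), coefficients lowest degree first."""
    acc = 0
    for coef in reversed(p):
        acc = gmul(acc, x) ^ coef
    return acc


def forney(S, lam, locations):
    """Error magnitudes Y_l = X_l * Omega(X_l^-1) / Lambda'(X_l^-1) (signs drop in GF(2^m))."""
    two_t = len(S)
    s_poly = [0] + list(S)                    # S(x) = sum_{j=1}^{2t} S_j x^j
    omega = [0] * two_t                       # Omega(x) = S(x) Lambda(x) mod x^{2t}
    for i, s in enumerate(s_poly):
        for j, l in enumerate(lam):
            if i + j < two_t:
                omega[i + j] ^= gmul(s, l)
    d_lam = [lam[j] if j % 2 else 0 for j in range(1, len(lam))]  # formal derivative
    values = []
    for loc in locations:
        X, X_inv = EXP[loc % N], EXP[(-loc) % N]
        num = gmul(X, poly_eval(omega, X_inv))
        values.append(gmul(num, ginv(poly_eval(d_lam, X_inv))))
    return values


# Two errors: Y = 5 at location 3 and Y = 9 at location 10.
S = [gmul(5, EXP[3 * j % N]) ^ gmul(9, EXP[10 * j % N]) for j in range(1, 5)]
print(forney(S, [1, 15, 13], [3, 10]))  # [5, 9]
```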
Figure 2.2: Algebraic Decoder
In summary, the algebraic decoding algorithm works as follows:
Step 1: Calculate the syndromes according to Equation (2.6).
Step 2: Perform the Berlekamp-Massey algorithm or Euclid's algorithm to obtain the
error locator polynomial $\Lambda(x)$. Also find the error evaluator polynomial $\Omega(x)$.
Step 3: Perform the Chien search to find the roots of $\Lambda(x)$.
Step 4: Find the error values $Y_l$, and hence $E(x)$, according to Equation (2.35).
Step 5: Correct the received word: $C(x) = R(x) + E(x)$.
The structure of the algebraic decoder is shown in Figure 2.2.
2.4 Time-Domain Decoding
2.4.1 Error Locator and Evaluator Polynomials
The time-domain decoding algorithm was first proposed by Blahut [64]. It is
explained in detail in [5][6][54] and is only summarized in this subsection.
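The time-domain algorithm uses the initial conditions $\lambda_i^{(0)} = b_i^{(0)} = w_i^{(0)} = 1$
and $\lambda_i'^{(0)} = b_i'^{(0)} = w_i'^{(0)} = 0$ for all $i$ to compute the following set
of recursive equations:

$$
\Delta_r = \sum_{i=0}^{n-1} \alpha^{ir} \lambda_i^{(r-1)} r_i
$$

$$
L_r = \delta_r (r - L_{r-1}) + (1 - \delta_r) L_{r-1}
$$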
[ j:;] [a'\5 (l-~)~'-. ~ ~] [ ~:~:: ]~(" ~ '0' -t;. 1 -(;.a"' ~(."')b~(r) 0 (1 - Or) 6.;l6,. (1 - 6,.)0-( b;Cr- 1)
[ ~:: ] = [6;116,. (l__~~:'-i] [~:=:: J
for $i = 0, \ldots, n-1$ and $r = 1, 2, \ldots, 2t$, with $L_0 = 0$; $\delta_r = 1$ if both
$\Delta_r \ne 0$ and $2L_{r-1} \le r - 1$, and $\delta_r = 0$ otherwise.
2.4.2 Error Evaluation
(2.38)
(2.39)
(2.40)
(2.41)
Using the error locator vector $\lambda$, the vector $\lambda' = \lambda'^{(2t)}$ and the error
evaluator vector $w = w^{(2t)}$, the error magnitudes are computed as
~:~~. ~~;~
The structure of the time-domain decoder is shown in Figure 2.3.
Figure 2.3: Time-Domain Decoder
2.5 Error Correction
Once $E(x)$ is known, the corrected codeword $C(x)$ can be obtained from
$C(x) = R(x) + E(x)$, since addition and subtraction coincide in $GF(2^m)$.
2.6 Algebraic vs. Time-Domain Decoding Algorithms
Based on the above discussion, the fundamental differences between the algebraic
and time-domain decoding algorithms are listed below:
(1) The time-domain algorithm has one major computational step. Unlike the
algebraic decoding algorithm, it does not compute the syndromes or perform the
Chien search to find the error locations.
(2) The time-domain algorithm deals with vectors that have $n$ components, while
the algebraic algorithm uses vectors of different lengths and polynomials of
different degrees in its various steps.
(3) When the error correction capability $t$ of the code changes, the operations in
the time-domain algorithm remain essentially the same, while those in the
algebraic algorithm depend on $t$.
(4) Although more complex to design, the algebraic decoding technique is
recommended for high-speed applications. The major drawback of the time-domain
algorithm is its high computation count, which arises because it must operate on
the complete data sequence of length $n$, while the algebraic algorithm needs to
work only on the syndrome sequence of length $2t = n - k$ $m$-bit symbols.
2.7 RS Encoder/Decoder Architectures
In 1984, Blahut [64] presented two architectures for universal RS decoders
based on the time-domain algorithm. The decoders work directly on the received
data to generate the error sequence. They are attractive for VLSI design since
only one major computational step is required: unlike algebraic decoders, they
need neither syndrome evaluation nor the Chien search. Such a decoder can be
used to decode any RS or BCH codeword up to the limits of the storage registers
associated with the chip. Within these limits, it can correct any number of random
errors and erasures, depending on the received data. Shayan et al. restructured
the time-domain algorithm to implement a versatile time-domain decoder [5] and a
cellular decoder [6], which operate in a Galois field $GF(2^m)$ with a fixed $m$.
Conceptual models for the logic structures of the RS encoder and decoder chips
were presented in [65][66]. The encoder is constructed by cascading and
interconnecting a group of VLSI chips. The decoder architecture is based on the
repetitive and recursive properties of RS decoding procedures.
Truong et al. [8][9] reported a single-chip VLSI RS encoder implemented in NMOS
technology. The encoding algorithm is a bit-serial multiplication algorithm
developed by Berlekamp for the encoding of RS codes using a dual basis over a
Galois field. Whereas conventional RS encoders for long codes often require lookup
tables to multiply two field elements, Berlekamp's algorithm requires only
shifting and exclusive-OR operations.
Shao et al. [62] developed a pipeline structure for a transform decoder, similar to
a systolic array, to decode RS codes. The error locator polynomial is computed by
the modified Euclid's algorithm, which avoids computing inverse elements. The
modified Euclid's algorithm architecture is based on the pipeline architecture
suggested by Brent and Kung [67] to compute the greatest common divisor of two
polynomials.
A full-custom CMOS implementation of an RS encoder was proposed by Maki et al.
in 1986 [35]. Domino logic was used to reduce the transistor count. The
architecture's operational speed and silicon area are invariant to the field
polynomial, the generator polynomial, and whether it operates in the dual basis or
the normal basis. With $k$ encoder chips operating in parallel, a $(k-1)$-fault-tolerant
system can be constructed.
A pipelined RS decoder based on the transform decoding algorithm presented
earlier by the same authors is described in [36][37]. The transform decoding
technique is replaced by a time-domain algorithm to permit efficient pipeline
processing with reduced circuitry. By using multiplexing, the proposed Euclid's
algorithm architecture maintains the throughput rate with little additional
complexity.
In 1990, Tong [38] presented an 8-error-correcting RS encoder-decoder. The
encoder and decoder can independently process 40 Mbytes of data per second. The
chip was designed using a standard ASIC methodology and fabricated in a 1.JJm
CMOS compact-array technology.
In 1991, Seroussi [29] presented a systolic architecture for an RS encoder. The
architecture completely eliminates the global feedback signal found in
conventional encoder architectures, which use a linear feedback shift register
(LFSR). The encoding algorithm is based on the Cauchy representation of the
generator matrix of the code. The architecture is suitable for very high-speed
applications, where global signals and the need for global synchronization may
restrict the achievable switching speed of the encoder.
A full-custom CMOS VLSI implementation of a Reed-Solomon decoder for the
Hubble Space Telescope and television applications was presented by Whitaker et
al. in 1991 [3][39]. The architecture is similar to others presented in [40][41]. It
is implemented in a 1.6 μm double-metal CMOS technology and operates at a data
rate of 80 Mbits/s using a 10 MHz system/data clock. In these designs, Euclid's
algorithm is used to determine both the error location and error magnitude
polynomials.
To address the problem of multiple notations and multiple algorithms often
faced by designers, high-level synthesis has been used to study the different BCH
and RS decoding algorithms [42]. Special VHDL packages are created to describe
the various operations on Galois fields, and a VHDL synthesis tool then allows
efficient exploration of various architectures in order to select an optimum one.
Methods for reducing the computation count of the time-domain algorithm for RS
decoding were presented by Choomchuang and Arambepola in 1993 [43]. An
architecture for an error correction circuit suitable for high-rate data decoding of
RS codes was proposed in [44]. The operational steps for multiple-error decoding
are reduced by a 4-stage pipeline and a superscalar Galois field processor. The
experimental chip achieves 16 Mbytes/s of data decoding, sufficient for compressed
high-definition as well as standard-definition TV video signals.
The use of high-level synthesis techniques to realize a high-speed Reed-Solomon
CODEC was reported by Cools et al. in 1994 [45]. High-level synthesis allows rapid
design exploration over a large range of architectures, and an error-free transfer
is guaranteed between all levels of the design process. The design was captured
using a combination of Mentor Graphics tools and a Cathedral-} compiler. The
architectural design phase concentrates on the composition of the data path and
the global cycle count; logic synthesis performs local optimizations in terms of
hardware and timing; and the place-and-route tools compose the final layout.
A low-circuit-complexity architecture for a Reed-Solomon encoder suitable for
satellites and pocket-size wireless terminals was presented by Hasan and Bhargava
in 1995 [46]. The encoder uses the triangular basis multiplication algorithm.
Using pipelined and bit-serial operations, the encoder can obtain code rates
ranging from unity down to a minimum value determined by the associated hardware
circuitry.
In 1995, Chen et al. [47] presented a three-stage pipelined VLSI architecture of a
Reed-Solomon decoder. The decoder has an erasure function and uses the modified
Euclid's algorithm to solve the key equation. The block length is variable, and the
hardware complexity is shown to depend only on the number of parity-check
bytes. The modified Euclid's algorithm allows the error evaluator and error locator
vectors to be determined sequentially using a smaller amount of hardware. The
algorithm state machine and architecture were verified using the Verilog hardware
description language.
In 1995, Iwamura et al. [48] proposed a class of systolic arrays to perform binary
RS decoding procedures, including erasure correction. Such an RS decoder is
suitable for VLSI implementation since the arrays consist of simple processing
elements of the same type.
In 1997, Hsu and Wang [49] presented a pipelined VLSI architecture of a
Reed-Solomon decoder which combines a modified time-domain Berlekamp-Massey
algorithm with the remainder decoding concept. For a $t$-error-correcting RS code
with block length $n$, only $2t$ consecutive symbols, instead of $n$, are required to
determine the discrepancy value during the decoding process.
A VLSI architecture for an area-efficient Reed-Solomon product-code encoder
and decoder was published by Kwon and Shin in 1997 [4]. The architecture uses
functional block sharing to implement the encoder and the modified syndrome and
erasure locator polynomial evaluations. The modified Euclid's algorithm is used to
determine the error/erasure locator and error/erasure evaluator polynomials. The
architecture is recommended for encoding and decoding audio and video signals
over $GF(256)$.
Rapid prototyping was used to implement a Reed-Solomon decoder in [50]. Erasure correction is supported. The chip includes two 256-byte ROMs, a look-up table for the inverses of the elements of $GF(2^8)$, and one 512-byte RAM or buffer registers.
A Reed-Solomon decoder which operates over $GF(2^8)$ was presented by Saodt in [51]. The ASIC is targeted at military anti-jamming applications in microwave links. It uses FIFO buffers that are external to the chip.
2.8 Summary
The various decoding algorithms for Reed-Solomon codes have been presented. A survey of existing RS encoder and decoder architectures, which are usually designed for a fixed m, has been given. Universal RS decoder architectures based on the time-domain algorithms first appeared in 1984. Versatile time-domain and cellular decoders were subsequently derived from them. They require only one major computational step in locating the error patterns. Single-chip RS decoders that implement the algebraic and transform decoding algorithms have also been reported. The Berlekamp-Massey or Euclid's algorithm is often used to find the error-location and error-magnitude polynomials.
Chapter 3
Proposed VLSI Arithmetic
Architectures
This chapter introduces and describes an approach which exploits the symmetric properties of available VLSI arithmetic architectures to perform multiplication, exponentiation and inverse operations in $GF(2^m)$. Traditionally, such operations are performed using hardware which has been designed to function over $GF(2^m)$ for a fixed value of m. The requirement to operate with different symbol sizes of m bits recurs throughout the design of the RS encoder and decoder circuits. VLSI chips reported in the literature always use a fixed block length n and a fixed symbol size m, since the exponentiation, multiplication and division circuits in Galois fields have different designs for different values of m. One of the major contributions of this thesis is to demonstrate that the parameters m and n can be made variable without a significant increase in hardware.
The proposed approach defines a standard symbol of $\bar{m}$ bits, which readily allows any symbol from $GF(2^m)$ with $m \le \bar{m}$ to be represented as an $\bar{m}$-bit symbol whose $(\bar{m} - m)$ most significant bits have been set to zero. This principle allows all arithmetic functions in the Galois field with symbol size $m \le \bar{m}$ to be implemented as subsets of the $m = \bar{m}$ case with a small penalty in hardware. To illustrate the concept, an m-programmable Galois field multiplier which uses the standard basis representation of the elements is first proposed, where $m \le 8$. Using this multiplier, it is shown that the exponentiation and inverse functions can be implemented using the same hardware structure. The resulting circuits are systolic and have simple, regular communication and control structures. They also allow unidirectional data flow, which is advantageous over systems with contraflowing data streams [68][69]. These circuits will be used in the design of an m- and t-programmable RS encoder/decoder, which is described in Chapter 4. The choice of a fixed symbol size m = 8 is fairly common in a wide range of practical applications [2][3][4][8][35][39][45][47][65][66], but in this thesis m is made variable over the values m = 3, 4, 5, 6, 7 and 8 as an illustration. The architecture can easily be extended to accommodate larger values of m.
3.1 m-Programmable Galois Field Multiplier
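For arbitrary elements $A(x) = \sum_{k=0}^{m-1} a_k x^k$ and $B(x) = \sum_{k=0}^{m-1} b_k x^k$ in $GF(2^m)$, and the primitive polynomial $P(x) = \sum_{k=0}^{m} p_k x^k$ (with $p_m = 1$), the product $C(x)$ of $A(x)$ multiplied by $B(x)$ is given by
$$
\begin{aligned}
C(x) &= A(x)B(x) \bmod P(x) \\
&= \Big[\sum_{k=0}^{m-1} A(x)\, b_k x^k\Big] \bmod P(x) \\
&= \Big(\cdots\big(A(x)b_{m-1}x + A(x)b_{m-2}\big)x + \cdots\Big)x + A(x)b_0 \bmod P(x) \\
&= c_{m-1}x^{m-1} + c_{m-2}x^{m-2} + \cdots + c_1 x + c_0
\end{aligned}
\qquad (3.1)
$$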
As described in [15], the product $C(x)$ defined in Equation (3.1) can be computed recursively as
$$
\begin{aligned}
T_0(x) &= 0 \\
T_i(x) &= [T_{i-1}(x)\,x] \bmod P(x) + A(x)\,b_{m-i}, \qquad i = 1, 2, \ldots, m \qquad (3.2) \\
C(x) &= T_m(x) \qquad (3.3)
\end{aligned}
$$
Denoting the most significant bit (MSB) of $T_i(x)$ as $M_i$ (with $M_0 = 0$), the recurrence relation can be rewritten as
$$T_i(x) = T_{i-1}(x)\,x + P(x)\,M_{i-1} + A(x)\,b_{m-i}, \qquad i = 1, 2, \ldots, m \qquad (3.4)$$
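The above computation can be implemented using a parallel-in parallel-out two-dimensional systolic array with $m \times m$ basic cells. Each cell at position $(i, k)$ performs the logic operation [11]
$$t_{i,k} = t_{i-1,k+1} \oplus (p_{m-k} \cdot M_{i-1}) \oplus (a_{m-k} \cdot b_{m-i}) \qquad (3.5)$$
where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, m$. Here $t_{i,k}$ is the coefficient of $x^{m-k}$ in $T_i(x)$, so that $M_i = t_{i,1}$, and the boundary values $t_{0,k}$ and $t_{i-1,m+1}$ are zero.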
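Equation (3.5) can be checked in software before committing it to gates. The sketch below is a bit-level Python model written under stated assumptions: field elements are integers whose bit j is the coefficient of $x^j$, and the primitive polynomial is passed as an integer that includes its $x^m$ term (0x11D for $x^8 + x^4 + x^3 + x^2 + 1$). The function names `cell_array_multiply` and `reference_multiply` are illustrative only; the model evaluates the $m \times m$ array row by row and compares it with a plain shift-and-add multiplier.

```python
def bit(value: int, j: int) -> int:
    """Return bit j of an integer (the coefficient of x^j)."""
    return (value >> j) & 1


def cell_array_multiply(a: int, b: int, m: int, p: int) -> int:
    """GF(2^m) product a*b computed by an m x m array of cells, Eq. (3.5)."""
    t = [0] * (m + 2)          # t[k] = t_{i-1,k}; t[0] unused, t[m+1] = 0
    msb = 0                    # M_0 = 0
    for i in range(1, m + 1):          # row i consumes bit b_{m-i}
        row = [0] * (m + 2)
        for k in range(1, m + 1):      # column k produces the x^(m-k) term
            row[k] = (t[k + 1]
                      ^ (bit(p, m - k) & msb)
                      ^ (bit(a, m - k) & bit(b, m - i)))
        t = row
        msb = t[1]                     # M_i = t_{i,1}, the MSB of T_i(x)
    return sum(t[k] << (m - k) for k in range(1, m + 1))


def reference_multiply(a: int, b: int, m: int, p: int) -> int:
    """Shift-and-add multiplication with reduction modulo P(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:       # degree reached m: subtract (XOR) P(x)
            a ^= p
    return result


P8 = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1
print(hex(cell_array_multiply(0x80, 0x02, 8, P8)))  # 0x1d, since x^7 * x = x^8
```

Because each row only needs the previous row and the MSB $M_{i-1}$, the same loop structure maps one-to-one onto the rows of Figure 3.1.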
In the case where $m = \bar{m} = 8$, the systolic array with $8 \times 8$ basic cells is shown in Figure 3.1. As shown in the figure, the coefficients of $A(x)$ and $P(x)$ enter the array from the top, whereas those of $B(x)$ enter from the left-hand side, such that the operation defined in Equation (3.5) is performed at the $i$th row. The array consists of 279 logic gates, as reported by the Synopsys synthesis tools.
Figure 3.1: A Parallel-In-Parallel-Out Multiplier for $GF(2^8)$
The Boolean equations $t_{i,k}$, as defined by Equation (3.5), for all 64 cells of the $GF(2^8)$ multiplier are as follows.
m = 8: primitive polynomial $P(x) = x^8 + x^4 + x^3 + x^2 + 1$
Row 1 ($i = 1$, $M_0 = 0$):
$$t_{1,k} = b_7 \cdot a_{8-k}, \qquad k = 1, 2, \ldots, 8$$
Rows 2 to 8 ($i = 2, 3, \ldots, 8$, with $M_{i-1} = t_{i-1,1}$):
$$
\begin{aligned}
t_{i,1} &= (b_{8-i} \cdot a_7) \oplus t_{i-1,2} \\
t_{i,2} &= (b_{8-i} \cdot a_6) \oplus t_{i-1,3} \\
t_{i,3} &= (b_{8-i} \cdot a_5) \oplus t_{i-1,4} \\
t_{i,4} &= (b_{8-i} \cdot a_4) \oplus M_{i-1} \oplus t_{i-1,5} \\
t_{i,5} &= (b_{8-i} \cdot a_3) \oplus M_{i-1} \oplus t_{i-1,6} \\
t_{i,6} &= (b_{8-i} \cdot a_2) \oplus M_{i-1} \oplus t_{i-1,7} \\
t_{i,7} &= (b_{8-i} \cdot a_1) \oplus t_{i-1,8} \\
t_{i,8} &= (b_{8-i} \cdot a_0) \oplus M_{i-1}
\end{aligned}
$$
For example, row 2 uses $b_6$ and $M_1 = t_{1,1}$, and row 8 uses $b_0$ and $M_7 = t_{7,1}$.
The Boolean equations for the cases where m < 8 follow from Equation (3.5) in the same way. Row 1 of each $m \times m$ array is $t_{1,k} = b_{m-1} \cdot a_{m-k}$ (with $M_0 = 0$), and rows $i = 2, \ldots, m$ are
$$t_{i,k} = (b_{m-i} \cdot a_{m-k}) \oplus (p_{m-k} \cdot M_{i-1}) \oplus t_{i-1,k+1}, \qquad M_{i-1} = t_{i-1,1}, \quad t_{i-1,m+1} = 0,$$
so that only the columns $k$ with $p_{m-k} = 1$ receive the feedback term $M_{i-1}$:
m = 3: $P(x) = x^3 + x + 1$; $M_{i-1}$ enters columns 2 and 3
m = 4: $P(x) = x^4 + x + 1$; $M_{i-1}$ enters columns 3 and 4
m = 5: $P(x) = x^5 + x^2 + 1$; $M_{i-1}$ enters columns 3 and 5
m = 6: $P(x) = x^6 + x + 1$; $M_{i-1}$ enters columns 5 and 6
m = 7: $P(x) = x^7 + x^3 + 1$; $M_{i-1}$ enters columns 4 and 7
For example, for m = 5 the equations of rows $i = 2, \ldots, 5$ are
$$
\begin{aligned}
t_{i,1} &= (b_{5-i} \cdot a_4) \oplus t_{i-1,2} \\
t_{i,2} &= (b_{5-i} \cdot a_3) \oplus t_{i-1,3} \\
t_{i,3} &= (b_{5-i} \cdot a_2) \oplus M_{i-1} \oplus t_{i-1,4} \\
t_{i,4} &= (b_{5-i} \cdot a_1) \oplus t_{i-1,5} \\
t_{i,5} &= (b_{5-i} \cdot a_0) \oplus M_{i-1}
\end{aligned}
$$
Careful examination of the Boolean equations for all cases m = 3, 4, 5, 6, 7, 8 shows that a two-input AND gate and a two-input or three-input XOR gate are required to implement the function $t_{i,k}$ of each cell. It is thus possible to reuse a subset of the available $8 \times 8$ cells in Figure 3.1 to realize the logic function of the $m \times m$ cells for $3 \le m < 8$. Because of the sequential nature of the multiplication algorithm, and because each symbol is represented as an eight-bit symbol whose $8 - m$ most significant bits have been set to zero, the logic function $t_{i,k}$ of the $m \times m$ cells for m < 8 has been realized using the cells which occupy the square with corners $(9 - m, 9 - m)$, $(9 - m, 8)$, $(8, 8)$ and $(8, 9 - m)$. Where necessary, redundant terms have been added to the Boolean equations of some of the $8 \times 8$ cells in rows 2 to 8. Each row of the $GF(2^8)$ multiplier uses a local controller which sets or clears the redundant terms in order to correctly implement the desired function $t_{i,k}$ for $m \le 8$ using the same hardware. Each controller has been modelled as a multiplexer. The emphasis here is on hardware reusability.
It should be noted that in the circuit implementation, the control variables $M_j$, $M_{j5}$, $M_{j6}$, $M_{j7}$ and $M_{j\_temp}$ have been introduced into the Boolean equations defined in Equation (3.5) for m = 8 as overrides, to allow the multiplier to be programmed for m = 3, 4, 5, 6, 7, 8. Implementations of the various overriding local cell equations are detailed below in algorithmic form.
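The control variable $M_j$ replaces $M_1$ in row 2 in cells (2,4), (2,5), (2,6) and (2,8), modifying them as follows:
$$
\begin{aligned}
t_{2,4} &= (b_6 \cdot a_4) \oplus M_j \oplus t_{1,5} \\
t_{2,5} &= (b_6 \cdot a_3) \oplus M_j \oplus t_{1,6} \\
t_{2,6} &= (b_6 \cdot a_2) \oplus M_j \oplus t_{1,7} \\
t_{2,8} &= (b_6 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = t_{1,1}$;
else if m = 7 it sets $t_{1,3} = t_{1,4} = t_{1,5} = t_{1,6} = t_{1,7} = t_{1,8} = M_j = 0$;
end if;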
Accordingly, row 2 of the $GF(2^8)$ multiplier also correctly implements row 1 of the function $t_{i,k}$ of the $GF(2^7)$ multiplier, whose Boolean equations are defined in Equation (3.5).
The control variables $M_{j7}$ and $M_j$ have been introduced to row 3 in cells (3,4), (3,5), (3,6) and (3,8), modifying them as follows:
$$
\begin{aligned}
t_{3,4} &= (b_5 \cdot a_4) \oplus M_{j7} \oplus t_{2,5} \\
t_{3,5} &= (b_5 \cdot a_3) \oplus M_j \oplus t_{2,6} \\
t_{3,6} &= (b_5 \cdot a_2) \oplus M_{j7} \oplus t_{2,7} \\
t_{3,8} &= (b_5 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_{j7} = M_j = t_{2,1}$;
else if m = 7 it sets $M_j = t_{2,2}$, $M_{j7} = 0$;
else if m = 6 it sets $t_{2,4} = t_{2,5} = t_{2,6} = t_{2,7} = t_{2,8} = M_j = M_{j7} = 0$;
end if;
Accordingly, row 3 of the $GF(2^8)$ multiplier correctly implements rows 2 and 1 of the function $t_{i,k}$ of the $GF(2^7)$ and $GF(2^6)$ multipliers, respectively.
Control variables $M_{j7}$, $M_{j\_temp}$, $M_{j6}$ and $M_j$ have been introduced to row 4 in cells (4,4), (4,5), (4,6), (4,7) and (4,8) as follows:
$$
\begin{aligned}
t_{4,4} &= (b_4 \cdot a_4) \oplus M_{j7} \oplus t_{3,5} \\
t_{4,5} &= (b_4 \cdot a_3) \oplus M_{j\_temp} \oplus t_{3,6} \\
t_{4,6} &= (b_4 \cdot a_2) \oplus M_{j7} \oplus t_{3,7} \\
t_{4,7} &= (b_4 \cdot a_1) \oplus t_{3,8} \oplus M_{j6} \\
t_{4,8} &= (b_4 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = M_{j7} = M_{j\_temp} = t_{3,1}$, $M_{j6} = 0$;
else if m = 7 it sets $M_j = M_{j\_temp} = t_{3,2}$, $M_{j6} = M_{j7} = 0$;
else if m = 6 it sets $M_j = M_{j6} = t_{3,3}$, $M_{j7} = M_{j\_temp} = 0$;
else if $m \le 5$ it sets $t_{3,5} = t_{3,6} = t_{3,7} = t_{3,8} = M_{j6} = M_{j7} = M_{j\_temp} = M_j = 0$;
end if;
The above procedure permits implementation of the logic functions of row 1 of the $GF(2^5)$ multiplier, row 2 of the $GF(2^6)$ multiplier, row 3 of the $GF(2^7)$ multiplier and row 4 of the $GF(2^8)$ multiplier using the same hardware.
Control variables $M_{j7}$, $M_{j\_temp}$, $M_{j5}$, $M_{j6}$ and $M_j$ have been introduced to row 5 in cells (5,4), (5,5), (5,6), (5,7) and (5,8) as follows:
$$
\begin{aligned}
t_{5,4} &= (b_3 \cdot a_4) \oplus M_{j7} \oplus t_{4,5} \\
t_{5,5} &= (b_3 \cdot a_3) \oplus M_{j\_temp} \oplus t_{4,6} \\
t_{5,6} &= (b_3 \cdot a_2) \oplus M_{j7} \oplus M_{j5} \oplus t_{4,7} \\
t_{5,7} &= (b_3 \cdot a_1) \oplus t_{4,8} \oplus M_{j6} \\
t_{5,8} &= (b_3 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = M_{j7} = M_{j\_temp} = t_{4,1}$, $M_{j5} = M_{j6} = 0$;
else if m = 7 it sets $M_j = M_{j\_temp} = t_{4,2}$, $M_{j5} = M_{j6} = M_{j7} = 0$;
else if m = 6 it sets $M_j = M_{j6} = t_{4,3}$, $M_{j5} = M_{j7} = M_{j\_temp} = 0$;
else if m = 5 it sets $M_j = M_{j5} = t_{4,4}$, $M_{j6} = M_{j7} = M_{j\_temp} = 0$;
else if m = 4 it sets $M_{j7} = M_{j\_temp} = M_{j5} = M_j = M_{j6} = t_{4,6} = t_{4,7} = t_{4,8} = 0$;
end if;
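Control variables $M_{j7}$, $M_{j\_temp}$, $M_{j5}$, $M_{j6}$ and $M_j$ have been introduced to row 6 in cells (6,4), (6,5), (6,6), (6,7) and (6,8) as follows:
$$
\begin{aligned}
t_{6,4} &= (b_2 \cdot a_4) \oplus M_{j7} \oplus t_{5,5} \\
t_{6,5} &= (b_2 \cdot a_3) \oplus M_{j\_temp} \oplus t_{5,6} \\
t_{6,6} &= (b_2 \cdot a_2) \oplus M_{j7} \oplus M_{j5} \oplus t_{5,7} \\
t_{6,7} &= (b_2 \cdot a_1) \oplus t_{5,8} \oplus M_{j6} \\
t_{6,8} &= (b_2 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = M_{j7} = M_{j\_temp} = t_{5,1}$, $M_{j5} = M_{j6} = 0$;
else if m = 7 it sets $M_j = M_{j\_temp} = t_{5,2}$, $M_{j5} = M_{j6} = M_{j7} = 0$;
else if m = 6 it sets $M_j = M_{j6} = t_{5,3}$, $M_{j5} = M_{j7} = M_{j\_temp} = 0$;
else if m = 5 it sets $M_j = M_{j5} = t_{5,4}$, $M_{j6} = M_{j7} = M_{j\_temp} = 0$;
else if m = 4 it sets $M_{j7} = M_{j\_temp} = 0$, $M_j = M_{j6} = t_{5,5}$;
else if m = 3 it sets $M_{j7} = M_{j5} = M_{j6} = M_j = t_{5,7} = t_{5,8} = 0$;
end if;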
Control variables $M_{j7}$, $M_{j\_temp}$, $M_{j5}$, $M_{j6}$ and $M_j$ have been introduced to row 7 in cells (7,4), (7,5), (7,6), (7,7) and (7,8) as follows:
$$
\begin{aligned}
t_{7,4} &= (b_1 \cdot a_4) \oplus M_{j7} \oplus t_{6,5} \\
t_{7,5} &= (b_1 \cdot a_3) \oplus M_{j\_temp} \oplus t_{6,6} \\
t_{7,6} &= (b_1 \cdot a_2) \oplus M_{j7} \oplus M_{j5} \oplus t_{6,7} \\
t_{7,7} &= (b_1 \cdot a_1) \oplus t_{6,8} \oplus M_{j6} \\
t_{7,8} &= (b_1 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = M_{j7} = M_{j\_temp} = t_{6,1}$, $M_{j5} = M_{j6} = 0$;
else if m = 7 it sets $M_j = M_{j\_temp} = t_{6,2}$, $M_{j5} = M_{j6} = M_{j7} = 0$;
else if m = 6 it sets $M_j = M_{j6} = t_{6,3}$, $M_{j5} = M_{j7} = M_{j\_temp} = 0$;
else if m = 5 it sets $M_j = M_{j5} = t_{6,4}$, $M_{j6} = M_{j7} = M_{j\_temp} = 0$;
else if m = 4 it sets $M_{j7} = M_{j\_temp} = 0$, $M_j = M_{j6} = t_{6,5}$;
else if m = 3 it sets $M_{j7} = M_{j5} = 0$, $M_j = M_{j6} = t_{6,6}$;
end if;
Finally, the control variables $M_{j7}$, $M_{j\_temp}$, $M_{j5}$, $M_{j6}$ and $M_j$ have been introduced to row 8 in cells (8,4), (8,5), (8,6), (8,7) and (8,8) as follows:
$$
\begin{aligned}
t_{8,4} &= (b_0 \cdot a_4) \oplus M_{j7} \oplus t_{7,5} \\
t_{8,5} &= (b_0 \cdot a_3) \oplus M_{j\_temp} \oplus t_{7,6} \\
t_{8,6} &= (b_0 \cdot a_2) \oplus M_{j7} \oplus M_{j5} \oplus t_{7,7} \\
t_{8,7} &= (b_0 \cdot a_1) \oplus t_{7,8} \oplus M_{j6} \\
t_{8,8} &= (b_0 \cdot a_0) \oplus M_j
\end{aligned}
$$
The local controller then operates as follows:
if m = 8 it sets $M_j = M_{j7} = M_{j\_temp} = t_{7,1}$, $M_{j5} = M_{j6} = 0$;
else if m = 7 it sets $M_j = M_{j\_temp} = t_{7,2}$, $M_{j5} = M_{j6} = M_{j7} = 0$;
else if m = 6 it sets $M_j = M_{j6} = t_{7,3}$, $M_{j5} = M_{j7} = M_{j\_temp} = 0$;
else if m = 5 it sets $M_j = M_{j5} = t_{7,4}$, $M_{j6} = M_{j7} = M_{j\_temp} = 0$;
else if m = 4 it sets $M_{j7} = M_{j\_temp} = 0$, $M_j = M_{j6} = t_{7,5}$;
else if m = 3 it sets $M_{j7} = M_{j5} = 0$, $M_j = M_{j6} = t_{7,6}$;
end if;
Another controller assigns the output (product) elements as follows:
if m = 7 it sets $t_{8,1} = 0$;
else if m = 6 it sets $t_{8,1} = t_{8,2} = 0$;
else if m = 5 it sets $t_{8,1} = t_{8,2} = t_{8,3} = 0$;
else if m = 4 it sets $t_{8,1} = t_{8,2} = t_{8,3} = t_{8,4} = 0$;
else if m = 3 it sets $t_{8,1} = t_{8,2} = t_{8,3} = t_{8,4} = t_{8,5} = 0$;
end if;
followed by
$c_7 = t_{8,1}$, $c_6 = t_{8,2}$, $c_5 = t_{8,3}$, $c_4 = t_{8,4}$, $c_3 = t_{8,5}$, $c_2 = t_{8,6}$, $c_1 = t_{8,7}$, $c_0 = t_{8,8}$,
according to Figure 3.1.
Based on the above analysis, it can be seen that a two-input AND gate and a two-input or three-input XOR gate implement the function $t_{i,k}$. Registers and D flip-flops have been placed between adjacent rows to allow pipelined processing of data between neighbouring cells. The pipelined version of this m-programmable multiplier outputs the product C at a rate of one output per cycle after an initial delay of m cycles. The clock period is governed by the propagation delay of a signal through a multiplexer, a 2-input AND gate and a 2-input or 3-input XOR gate.
The resulting $GF(2^m)$ multiplier is systolic and has a simple, regular communication and control structure. It also allows unidirectional data flow, which is advantageous over a system with contraflowing data streams. Most fault-tolerance schemes which are suitable for linear arrays route information around faulty cells [68][69]. This can introduce significant transmission delays between cells. In unidirectional data flow arrays, latches are often inserted in all data streams which are rerouted around a faulty cell, at the expense of increased system latency. This does not change the required data interactions, since the relative delays between all data paths are zero. The technique is not suitable for arrays with contraflowing data streams, because the relative delay between paths would be non-zero and hence data interactions may be corrupted.
The symbolic architecture of the multiplier is shown in Figure 3.2. A and B are the 8-bit elements to be multiplied; clk is the clock signal; m is the symbol size; test_se, test_si and test_so are the test ports; C is the 8-bit product of A and B.
A comparison of the unpipelined and pipelined multipliers is shown in Table 3.1. The number of gates with and without a scan chain, the number of detected faults and the fault coverage are generated automatically by the Synopsys synthesis tools. The maximum clock frequency is estimated by interactively simulating the VHDL gate-level netlist file, using repeated functional verification and timing analysis. The pipelined version has a higher gate count because of the registers added between neighbouring cells. Both versions of the multiplier have 100% fault coverage, which ensures high quality and ease of testing after fabrication. The multiplexed scan chain improves the controllability and observability of the internal circuit nodes, thereby reducing the complexity of test generation.
Figure 3.2: Symbolic Architecture of the Programmable Multiplier
Circuit Properties                  Unpipelined   Pipelined
Latency (cycles)                    1             m
Throughput rate (outputs/cycle)     1             1
Number of Gates                     517           2,583
Number of Gates with Scan Chain     551           3,419
Number of Detected Faults           2,050         8,892
Maximum Clock Frequency (MHz)       60            200
Fault Coverage (%)                  100           100
Table 3.1: Comparison of the Programmable Unpipelined and Pipelined Multiplier
The design procedure can be summarized as follows:
1. For a selected $\bar{m}$, derive the Boolean equations $t_{i,k}$ for all $\bar{m}^2$ cells. Also derive the Boolean equations for $m < \bar{m}$ using $t_{i,k} = t_{i-1,k+1} \oplus (p_{m-k} \cdot M_{i-1}) \oplus (a_{m-k} \cdot b_{m-i})$, where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, m$.
2. Beginning with $m = \bar{m} - 1$ and adding control variables to each cell where appropriate, restrict the implementation of $t_{i,k}$ to the square with corners $(\bar{m}+1-m, \bar{m}+1-m)$, $(\bar{m}+1-m, \bar{m})$, $(\bar{m}, \bar{m})$ and $(\bar{m}, \bar{m}+1-m)$. Repeat the procedure for all $m = \bar{m}-2, \bar{m}-3, \ldots, 4, 3$.
3. Add registers between neighbouring cells to obtain the pipelined version of the multiplier.
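The reuse of the lower-right $m \times m$ square can likewise be sketched in software. The Python model below is a behavioural simplification, not the gate-level controller described above: instead of the per-cell variables $M_j$, $M_{j5}$, $M_{j6}$, $M_{j7}$ and $M_{j\_temp}$, a single hypothetical control rule takes the feedback bit from column $9 - m$ and enables the feedback taps only inside the square, and the zero-padded upper bits of A and B keep the rows above the square at zero. The primitive polynomials are the ones used in this thesis; `programmable_multiply` and `reference_multiply` are illustrative names.

```python
M_BAR = 8  # standard symbol size

# Primitive polynomials for each supported m (bit j = coefficient of x^j)
PRIMITIVE = {
    3: 0b1011,        # x^3 + x + 1
    4: 0b10011,       # x^4 + x + 1
    5: 0b100101,      # x^5 + x^2 + 1
    6: 0b1000011,     # x^6 + x + 1
    7: 0b10001001,    # x^7 + x^3 + 1
    8: 0b100011101,   # x^8 + x^4 + x^3 + x^2 + 1
}


def bit(value: int, j: int) -> int:
    return (value >> j) & 1


def programmable_multiply(a: int, b: int, m: int) -> int:
    """Multiply zero-padded m-bit symbols on a fixed M_BAR x M_BAR cell array."""
    p = PRIMITIVE[m]
    first = M_BAR + 1 - m              # first row/column of the active square
    t = [0] * (M_BAR + 2)              # t[k] = t_{i-1,k}; t[M_BAR+1] = 0
    for i in range(1, M_BAR + 1):
        feedback = t[first]            # MSB of the previous row of the square
        row = [0] * (M_BAR + 2)
        for k in range(1, M_BAR + 1):
            # local control: feedback taps are enabled only inside the square
            tap = bit(p, M_BAR - k) if k >= first else 0
            row[k] = (t[k + 1]
                      ^ (tap & feedback)
                      ^ (bit(a, M_BAR - k) & bit(b, M_BAR - i)))
        t = row
    # output control: drop the columns left of the square, c_{8-k} = t_{8,k}
    return sum(t[k] << (M_BAR - k) for k in range(first, M_BAR + 1))


def reference_multiply(a: int, b: int, m: int) -> int:
    """Shift-and-add multiplication in GF(2^m) for comparison."""
    p, result = PRIMITIVE[m], 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return result
```

For example, `programmable_multiply(0b100, 0b010, 3)` returns `0b011`, since $x^2 \cdot x = x^3 = x + 1$ modulo $x^3 + x + 1$. Comparing it exhaustively against `reference_multiply` for m = 3 to 7 is a quick way to check that the embedded square reproduces every smaller field.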
3.2 m-Programmable Exponentiator/Inverter
Definition 1: For an arbitrary nonzero element A in the finite field $GF(2^m)$, the inverse of A is given by $A^{-1} = A^{2^m-2}$ [10]. Rewriting the exponent $2^m - 2$ as $2^1 + 2^2 + 2^3 + \cdots + 2^{m-1}$ allows the inverse operation to be expressed as
$$A^{-1} = A^{2^1} \cdot A^{2^2} \cdots A^{2^{m-1}} = \prod_{i=1}^{m-1} A^{2^i} \qquad (3.6)$$
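Definition 2: For an arbitrary element A in the finite field $GF(2^m)$ and an integer N ($1 \le N \le 2^m - 1$), the exponentiation function is defined as $\delta = A^N$. Clearly, $\delta$ is in $GF(2^m)$. If N is represented in binary form as $(n_0, n_1, n_2, \ldots, n_{m-1})$ such that $N = \sum_{i=0}^{m-1} n_i 2^i$, then $\delta = A^N$ can be expressed as follows [30][31]:
$$
\begin{aligned}
\delta = A^N &= A^{\sum_{i=0}^{m-1} n_i 2^i} \\
&= (A)^{n_0} \cdot (A^2)^{n_1} \cdot (A^{2^2})^{n_2} \cdots (A^{2^{m-1}})^{n_{m-1}} \\
&= \prod_{i=0}^{m-1} \big(A^{2^i}\big)^{n_i} \\
&= \prod_{i=0}^{m-1} E_i
\end{aligned}
\qquad (3.7)
$$
where
$$E_i = A^{2^i} \quad \text{if } n_i = 1 \qquad (3.8)$$
$$E_i = 1 \quad \text{if } n_i = 0 \qquad (3.9)$$
Assuming the temporary result is $R_k = \prod_{i=0}^{k} E_i$, the following recursion is derived, where I denotes the identity element:
$$
\begin{aligned}
R_0 &= I \cdot E_0, \\
R_1 &= R_0 \cdot E_1, \\
&\;\;\vdots \\
R_k &= R_{k-1} \cdot E_k, \\
&\;\;\vdots \\
R_{m-1} &= R_{m-2} \cdot E_{m-1} = A^N
\end{aligned}
\qquad (3.10)
$$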
By using the definition of the exponentiation function, an alternative method can be derived to evaluate the inverse of an element in $GF(2^m)$. Equations (3.6) and (3.7) show that if $N = (n_0, n_1, n_2, \ldots, n_{m-1})$ with $N = \sum_{i=0}^{m-1} n_i 2^i$ as in Definition 2, then the inverse function is a special case of exponentiation. They are equivalent, that is $A^N = A^{-1}$, if and only if the following conditions are satisfied:
$$n_0 = 0 \quad \text{and} \quad n_1 = n_2 = \cdots = n_{m-1} = 1 \qquad (3.11)$$
These conditions always hold for the inverse, since Equation (3.6) can be restructured in the form of Equation (3.7) as
$$A^{-1} = A^{\sum_{i=0}^{m-1} n_i 2^i} \quad \text{with} \quad N = (n_0, n_1, n_2, \ldots, n_{m-1}) = (0, 1, 1, \ldots, 1) \qquad (3.12)$$
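Hence, similar to the exponentiation function, the inverse element can be computed as
$$
\begin{aligned}
A^{-1} &= A^{\sum_{i=0}^{m-1} n_i 2^i} \\
&= (A)^0 \cdot (A^2)^1 \cdot (A^{2^2})^1 \cdots (A^{2^{m-1}})^1 \\
&= \prod_{i=1}^{m-1} A^{2^i} \\
&= \prod_{i=0}^{m-1} E_i
\end{aligned}
\qquad (3.13)
$$
where
$$E_i = A^{2^i} \ \text{ if } i \ne 0, \qquad E_0 = 1 \ \text{ if } i = 0 \qquad (3.14)$$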
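Let the temporary result be $R_k = \prod_{i=0}^{k} E_i$; then the following recursion is also obtained:
$$
\begin{aligned}
R_0 &= I \cdot E_0, \\
R_k &= R_{k-1} \cdot E_k, \qquad k = 1, 2, \ldots, m-2, \\
R_{m-1} &= R_{m-2} \cdot E_{m-1} = A^{-1}
\end{aligned}
\qquad (3.15)
$$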
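The recursions (3.10) and (3.15) can be traced with a short behavioural model. The Python sketch below assumes the same integer encoding of field elements as before (bit j is the coefficient of $x^j$, the polynomial includes its $x^m$ term) and identity $I = 1$; it mirrors the structure of squarings, selection of $E_i$ and running products $R_k$, but ignores pipelining and delay elements. The names `gf_mul`, `exponentiate`, `to_bits` and `inverse` are illustrative.

```python
def gf_mul(a: int, b: int, m: int, p: int) -> int:
    """Shift-and-add multiplication in GF(2^m) modulo P(x) (p includes x^m)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= p
    return result


def exponentiate(a: int, n_bits: list, m: int, p: int) -> list:
    """Return [R_0, ..., R_{m-1}] for N = sum(n_i * 2^i), Eqs. (3.7)-(3.10)."""
    identity = 1
    squares = [a]                              # A^(2^i) by repeated squaring
    for _ in range(1, m):
        squares.append(gf_mul(squares[-1], squares[-1], m, p))
    # selection: E_i = A^(2^i) if n_i = 1, otherwise the identity element
    e = [sq if n else identity for sq, n in zip(squares, n_bits)]
    r = [gf_mul(identity, e[0], m, p)]         # R_0 = I * E_0
    for k in range(1, m):
        r.append(gf_mul(r[-1], e[k], m, p))    # R_k = R_{k-1} * E_k
    return r


def to_bits(n: int, m: int) -> list:
    """Binary digits (n_0, ..., n_{m-1}) of the exponent N, LSB first."""
    return [(n >> i) & 1 for i in range(m)]


def inverse(a: int, m: int, p: int) -> int:
    """A^{-1} as the special case N = (0, 1, ..., 1) of Eq. (3.11)."""
    return exponentiate(a, [0] + [1] * (m - 1), m, p)[-1]


print(hex(inverse(0x02, 8, 0x11D)))  # 0x8e, i.e. x^7 + x^3 + x^2 + x
```

Only the exponent bits differ between the two modes, which is why a single datapath can serve as both exponentiator and inverter.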
Figure 3.3: A General Exponentiation Architecture
From a hardware implementation point of view, the exponentiation architecture can also be used to compute the inverse element. It can be implemented using registers to hold the data, control circuitry, and either repeated use of a single multiplier or several multipliers in parallel. According to the above analysis, multiplication stands out as the most critical arithmetic operation. Thus, the ideal multiplier circuit structure must be modular, easily expandable and require a simple control scheme. A global system diagram comprising the three main components is depicted in Figure 3.3.
In the case where m = 8, one only needs to set $N = (n_0, n_1, n_2, n_3, n_4, n_5, n_6, n_7) = (0, 1, 1, 1, 1, 1, 1, 1)$ in order to evaluate inverse elements using the same exponentiation hardware. A structure for computing exponentiation or inverse elements of $GF(2^8)$ is shown in Figure 3.4. It is an extended version of the array described in [31].
The word-level systolic array consists of 14 multipliers ($MUL_1$ to $MUL_{14}$), eight 8-bit multiplexers ($MUX_0$ to $MUX_7$), 28 1-bit one-cycle delay elements ($D_1$) and one 8-bit one-cycle delay element ($D_8$).
The multipliers on the left bank ($MUL_1$ to $MUL_7$) evaluate $A^{2^i}$ for $i = 1, 2, \ldots, m-1$, while those on the right bank ($MUL_8$ to $MUL_{14}$) evaluate $R_i = R_{i-1} \cdot E_i$ for $i = 1$ to $m - 1$. The multiplexers ($MUX_0$ to $MUX_7$) select $A^{2^i}$ if $n_i = 1$, or the 8-bit identity element $I = 00000001$ if $n_i = 0$, as the output $E_i$. Thus the output $\delta = A^N$ is available as $R_7$. The m-programmable $GF(2^m)$ multiplier has been used to implement each $MUL_i$, so that the output is also accessible at the points $R_{m-1}$ for m = 3, 4, 5, 6, 7, 8, as specified in Equations (3.10) and (3.15). The output points $R_2$ to $R_7$ are directly connected to an independent module which assigns them to $\delta = A^N$ based on the word size m.
Figure 3.4: Exponentiation/Inverse Architecture for $GF(2^8)$
Circuit Properties                  Value
Latency (cycles)                    m
Throughput rate (outputs/cycle)     1
Number of Gates                     7,735
Number of Gates with Scan Chain     8,821
Number of Detected Faults           29,705
Fault Coverage (%)                  99.8
Table 3.2: Features of the Programmable Exponentiator/Inverter
The symbolic structure of the combined exponentiation/inverse architecture is shown in Figure 3.5. CHIPmode configures the chip to operate as an exponentiator or an inverter; ExponentIn is the port for the exponent; GF_ELEMENT is the Galois field element A; clkIn is the clock signal; m is the symbol size; test_se, test_si and test_so are the test ports; VALUE is the 8-bit inverse or exponentiation of GF_ELEMENT.
Its circuit properties are shown in Table 3.2. The number of gates with and without a scan chain, the number of detected faults and the fault coverage are generated automatically by the Synopsys synthesis tools.
Figure 3.5: Symbolic Architecture of the Programmable Exponentiator/Inverter
Table 3.3 compares the programmable exponentiator/inverter, operating as an inverter for fixed m, with other inversion circuits. The algorithm in [22] has a computational complexity of $O(m \log_2 m)$. In the table, p is the number of ones in the binary expression of $m - 1$, and q is the lower bound on $\log_2 m$. If the input data are fed in continuously, the VHDL gate-level simulations show that the parallel-in parallel-out inverter can produce results at a rate of one output per clock cycle after a latency of m cycles. The new inverter is flexible and clearly outperforms the circuits proposed in [22][23].
Since the standard basis is commonly used in implementing algebraic RS decoders in hardware, the exponentiator/inverter is based on the standard basis. In a normal basis, squaring, and hence the computation of the powers $A^{2^i}$, can be implemented easily via cyclic shifts; however, additional circuitry would then be required to convert between the normal and standard basis representations of the Galois field elements, making the design of the RS decoder more complex.
Circuit Properties            Proposed Inverter   Inverter in [23]   Inverter in [22]
Latency                       m                   7m - 3             m(p + q)
Throughput                    1                   2m - 1             m(p + q)
Computational Complexity      O(1)                O(m)               O(m log2 m)
Regularity                    High                High               Moderate
Dependence on Primitive
  Polynomial                  Yes                 No                 Yes
Basis                         Standard            Standard           Normal
I/O Format                    Parallel-In         Serial-In          Serial-In
                              Parallel-Out        Serial-Out         Serial-In
Table 3.3: Comparison of the Exponentiator/Inverter with Other Inverters for Fixed m
3.3 Discussion and Summary
An m-programmable Galois field multiplier, which can operate in $GF(2^m)$ where m is variable, has been presented. It uses a simple controller in some of the basic cells. The cases m = 3, 4, 5, 6, 7 and 8 have been considered. Its pipelined version has a speedup factor of about four. Using this multiplier and a modified version of the word-level systolic array for exponentiation discussed in [31], it has been shown that both the inverse and exponentiation functions can be evaluated using the same hardware structure. The results show that the proposed method of performing inversion of Galois field elements is more efficient and faster than available circuits. These arithmetic circuits are systolic and have simple, regular communication and control structures. They also allow unidirectional data flow, which is advantageous over systems with contraflowing data streams [68][69]. A very high fault coverage has been obtained by using a full scan test methodology which uses multiplexed flip-flops. This means the circuits will be easy to test using automatic test equipment after fabrication [70]. All the gate-level simulations for the proposed architectures have been performed using the Synopsys VHDL System Simulator. These programmable arithmetic circuits are easily expandable, and hence can be tailored for a wide range of applications requiring a variable symbol size m.
Chapter 4
Synthesis of the Reed-Solomon
Encoder/Decoder ASIC
This chapter presents the design methodology, circuit synthesis and functional verification of the major modules of the RS encoder/decoder ASIC.
4.1 Design Flow, Functional Verification and Test
The circuit synthesis of the RS encoder/decoder ASIC was realized using the 0.8-µm BiCMOS design kits for Synopsys and Cadence tools licensed by the Canadian Microelectronics Corporation (CMC).
The design flow made use of a 0.8-µm CMOS standard cell library, provided in the BiCMOS fabrication software, which did not include any bipolar junction transistors (BJTs). It supported a top-down VLSI design methodology in which the functional abstraction of the digital IC could be initially specified using the VHDL hardware description language. The circuit models were described using the register transfer level (RTL) subset of the VHDL constructs. Once logic simulation was completed and verified, the RTL circuit models were synthesized to obtain the gate-level (structural) circuit models using the Synopsys suite of tools. The design could then be imported into the Cadence environment as a Verilog gate-level netlist file, generated automatically by the Synopsys tools. It could then be automatically placed and routed in order to create the IC masks required for the fabrication process. These steps were independent of each other and are summarized in Figure 4.1.
Figure 4.1: Design Flow (VHDL RTL code, RTL simulation, logic synthesis, scan-chain insertion (DFT), gate-level simulation, place and route)
As shown in Figure 4.1, the modelling, verification and implementation processes were integrated. The integrated design flow reduced the amount of code that had to be maintained and the risk of inconsistencies between models. Thus, an error-free transfer was ensured between all levels of the design process, and different architectural styles could be explored rapidly. The Synopsys tools focus on the composition of the datapath and the global cycle count, while the Cadence suite of tools concentrates on the creation of the integrated circuit layout.
Once the behavioral model of the RS encoder and decoder had been captured using the VHDL hardware description language, each block was partitioned into smaller modules, which were modelled separately using a subset of the VHDL constructs suitable for logic synthesis. The size of each synthesizable module varied from 45 to a maximum of 20,000 gates, although a moderate gate count (250 to 5,000 gates per module) was recommended in order to reduce the compile time [71]. Larger modules were characterized by sequential processes with heavy dataflow dependencies. They required large CPU time, large amounts of memory and logic synthesis run times of up to three days. The functional correctness of each VHDL RTL model was verified using an interactive UNIX-based RS encoder/decoder simulator written in C [72][73].
Hierarchical compile is the simplest method for compiling a hierarchical design [74]. However, a bottom-up compile strategy, whereby individual modules were compiled first followed by the higher-level modules, was adopted. In this way, once a module had been compiled, it was assigned a dont_touch attribute so that Synopsys did not need to compile or read it again. The bottom-up compile method worked well when the entire chip was synthesized into gates, since the entire design did not need to be stored in memory. Unlike the hierarchical compile, this led to significant savings in CPU time and swap space. The design rules were checked and an initial fault coverage was reported for each module as it was developed. The report helped identify blocks with design rule violations or unacceptable fault coverage so that testability problems could be fixed at an early stage. Testability analysis was then performed on the top-level core design before scan insertion, because test design rule violations could be introduced by the interconnect between the hierarchical blocks. At the time, a partial or full scan test methodology which used multiplexed flip-flops was the only design-for-testability (DFT) style supported in the 0.8-µm BiCMOS technology.
The architecture of the chip and its major modules are described in the following
section.
4.2 Chip Architecture
This thesis implements an algebraic encoder/decoder chip using CMOS standard cells. The standard algebraic decoder for RS codes is described in detail in Chapter 2. The complex arithmetic operations needed in the encoder and decoder generally require the m-programmable multiplier and exponentiator/inverter proposed in Chapter 3. As previously reported, the first step in the decoding algorithm is to calculate the syndrome polynomial S(x), which contains the information needed to correct correctable errors or to detect uncorrectable errors. The Berlekamp-Massey algorithm or Euclid's algorithm can be used to determine the error-locator polynomial [3]. The Berlekamp-Massey algorithm was selected because its low design complexity makes it suitable for VLSI synthesis. Another module determines the error-magnitude polynomial using the relationship between the syndrome and error-locator polynomials, i.e., $\Omega(x) = S(x)\sigma(x) \bmod x^{2t}$ or $\Omega(x) = S(x)\Lambda(x) \bmod x^{2t}$. Once the locations and magnitudes of the errors have been determined using the Chien search and the Forney algorithm respectively, the received messages can be corrected.
71
Sizes of the field               GF(2^8), GF(2^7), GF(2^6), GF(2^5), GF(2^4), GF(2^3)
Primitive polynomials, P(x)      $x^8 + x^4 + x^3 + x^2 + 1$  (m = 8)
                                 $x^7 + x^3 + 1$  (m = 7)
                                 $x^6 + x + 1$  (m = 6)
                                 $x^5 + x^2 + 1$  (m = 5)
                                 $x^4 + x + 1$  (m = 4)
                                 $x^3 + x + 1$  (m = 3)
Generator polynomial, G(x)       $(x + \alpha)(x + \alpha^2) \cdots (x + \alpha^{2t})$
Error correction capability, t   t = 1, 2, 3, ..., 16
Encoder latency                  n
Decoder latency                  2n + 2t + m + 3
Gate count                       218,206
Technology                       0.8-µm BiCMOS
Table 4.1: RS Encoder/Decoder Characteristics
The symbolic architecture of the programmable RS encoder/decoder is shown
in Figure 4.2. Galois field symbol sizes $m = 3, 4, 5, 6, 7, 8$ and error
correction capabilities $t$ from 1 to 16 are supported. Thus the block length
$n = 2^m - 1$ varies from 7 to 255 symbols. The device contains 218,206 gates, where
a gate is defined as a 2-input NAND gate. Its characteristics are given in Table 4.1.
The size and description of each I/O signal are given in Table 4.2.
Implementation details of the main modules of the RS encoder/decoder are
described in the following subsections. All the required Galois field arithmetic
operations are accessible to global entities as components or functions defined
in a VHDL package. Each module has been modelled in VHDL using component
instantiation.
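To make the m-programmable multiplier concrete, the following Python sketch is a behavioural model of multiplication in $GF(2^m)$ using the standard (polynomial) basis, with the primitive polynomial of Table 4.1 selected by $m$. It is an illustration under these assumptions, not the Chapter 3 hardware structure; `PRIM_POLY`, `gf_mul` and `alpha_order` are hypothetical names.

```python
# Primitive polynomials P(x) of Table 4.1 as bit masks (bit k = coefficient of x^k).
PRIM_POLY = {
    3: 0b1011,         # x^3 + x + 1
    4: 0b10011,        # x^4 + x + 1
    5: 0b100101,       # x^5 + x^2 + 1
    6: 0b1000011,      # x^6 + x + 1
    7: 0b10001001,     # x^7 + x^3 + 1
    8: 0b100011101,    # x^8 + x^4 + x^3 + x^2 + 1
}


def gf_mul(a: int, b: int, m: int) -> int:
    """Multiply two GF(2^m) elements given as m-bit integers (shift-and-add)."""
    poly = PRIM_POLY[m]
    result = 0
    while b:
        if b & 1:             # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1               # a <- a * x
        if (a >> m) & 1:      # degree m reached: reduce modulo P(x)
            a ^= poly
    return result


def alpha_order(m: int) -> int:
    """Multiplicative order of alpha = x (the integer 2) in GF(2^m)."""
    value, k = 2, 1
    while value != 1:
        value = gf_mul(value, 2, m)
        k += 1
    return k


if __name__ == "__main__":
    for m in sorted(PRIM_POLY):
        print(f"m = {m}: order of alpha = {alpha_order(m)}, n = {2**m - 1}")
```

Since $m$ only selects which $P(x)$ is XORed in during reduction, one routine covers all six field sizes. `alpha_order(m)` returning $2^m - 1$ confirms that each tabulated $P(x)$ is primitive, so $\alpha = x$ generates all $n = 2^m - 1$ nonzero field elements.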
Figure 4.2: Symbolic diagram of the RS Encoder/Decoder
Signal            Description
Clock             clock signal
Data              input port for the received word from the channel
Message           input port for the message symbols
Mode_port         selects the encoding or decoding mode
Reset             reset signal
Start             starts the encoding or decoding process
m                 Galois field symbol size
t                 error correction capability
Codeword          output port for the codeword polynomial to the channel
Data_Correct      output port for the corrected data
ErrorPosition     output port for the error positions from the Chien search
ErrorValue        output port for the error value/magnitude from the Forney algorithm
FinishedDecode    goes high when the decoding process is complete
FinishedEncode    goes high when the encoding process is complete
NoErrors          goes high when there is a functional error
Table 4.2: RS Encoder/Decoder I/O Pins
4.3 Modules
4.3.1 Encoder
Let the message polynomial be $M(x) = c_{2t}x^{2t} + c_{2t+1}x^{2t+1} + \cdots + c_{n-1}x^{n-1}$ and the
parity check polynomial be $P(x) = c_0 + c_1x + \cdots + c_{2t-1}x^{2t-1}$. Then the encoded
RS code polynomial, often called the codeword, can be expressed as

$$C(x) = M(x) + P(x) = Q(x)G(x) \qquad (4.1)$$
$$\therefore\; M(x) = Q(x)G(x) - P(x)$$

The product $Q(x)G(x)$ expresses the requirement that a valid code polynomial $C(x)$ be
a multiple of the generator polynomial $G(x)$. Hence, the encoder must find $P(x)$
from $M(x)$ and $G(x)$. This is achieved by polynomial division: dividing
$M(x)$ by $G(x)$ gives the remainder polynomial $R(x)$ such that

$$M(x) = Q(x)G(x) + R(x) \qquad (4.2)$$

where $Q(x)$ is the quotient.
In this thesis, the RS encoder uses a conventional architecture to perform the
division of $M(x)$ by $G(x)$ to obtain the parity check polynomial $P(x) = -R(x)$
defined in Equation (4.1); since addition and subtraction coincide in $GF(2^m)$, $P(x) = R(x)$.
Its structure is shown in Figure 4.3.
In the figure, the $G(i)$'s are the symmetrical coefficients of $G(x)$. Initially all the
registers are cleared and both switches are set to position A. The message symbols
$c_{n-1}, \ldots, c_{2t}$ are fed into the division circuit and are also transmitted from the encoder
symbol by symbol, one per clock cycle. Immediately after $k = n - 2t$ clock cycles, both switches
are set to position B to allow the parity check symbols to be serially shifted out of
the encoder to form the complete $C(x)$. The shifting process takes $2t$ clock cycles.
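The division circuit of Figure 4.3 can be sketched behaviourally in Python as follows. This is a minimal model under stated assumptions: field arithmetic uses the Table 4.1 polynomials; the generator roots are taken as $\alpha^b, \ldots, \alpha^{b+2t-1}$ with $b = 2^{m-1} - t$, the index range used for the syndromes in Equation (4.3) (passing `first_root=1` gives the $\alpha, \ldots, \alpha^{2t}$ form listed in Table 4.1); and `rs_encode`, `generator_poly` and the helpers are hypothetical names. Message symbols are supplied as $c_{n-1}, \ldots, c_{2t}$, the order in which they enter the circuit.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis (shift-and-add)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_pow(a, e, m):
    """a^e in GF(2^m) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def generator_poly(m, t, first_root=None):
    """[g_0, ..., g_2t] of G(x) = (x + alpha^b)(x + alpha^(b+1))...(x + alpha^(b+2t-1))."""
    n = 2**m - 1
    b = 2**(m - 1) - t if first_root is None else first_root
    g = [1]
    for j in range(b, b + 2 * t):
        root = gf_pow(2, j % n, m)
        # (x + root) * g(x): x*g(x) term XOR root*g(x) term, coefficient by coefficient
        g = [xg ^ gf_mul(root, gk, m) for xg, gk in zip([0] + g, g + [0])]
    return g


def rs_encode(message, m, t, first_root=None):
    """Systematic encoding with a 2t-stage LFSR that divides M(x) by G(x).

    message: [c_{n-1}, ..., c_{2t}] in the order the symbols enter the circuit.
    Returns [c_{n-1}, ..., c_{2t}, c_{2t-1}, ..., c_0].
    """
    g = generator_poly(m, t, first_root)
    reg = [0] * (2 * t)                      # reg[i] holds the coefficient of x^i
    for symbol in message:                   # switches in position A (k clocks)
        feedback = symbol ^ reg[-1]
        for i in range(2 * t - 1, 0, -1):
            reg[i] = reg[i - 1] ^ gf_mul(g[i], feedback, m)
        reg[0] = gf_mul(g[0], feedback, m)
    # Switches in position B: shift out the remainder R(x), which equals P(x) = -R(x)
    # because subtraction is XOR in GF(2^m); this takes 2t clocks.
    return list(message) + reg[::-1]
```

With the all-zero message used in Section 4.4, the sketch returns the all-zero codeword, as expected for a linear code.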
Figure 4.3: RS Encoder
The encoder module consists of 9,644 gates. It has been designed as a 32-stage
8-bit linear feedback shift register (LFSR) whose components include 16 m-programmable
Galois field multipliers, a modulo-255 counter and 8-bit registers. The parameters $m$
and $t$ can take the values 3, 4, 5, 6, 7, 8 and 1, 2, 3, ..., 16, respectively. If $m$ had been fixed
as in [35][65], the multipliers for the coefficients $G(i)$ and the feedback connection
could have been designed as constant multipliers consisting of a tree of XOR gates.
This version of the encoder is also available for fixed $m = 8$ and fixed $t = 16$. The
VHDL code for any constant multiplier in $GF(2^m)$ can be automatically generated
using a C++ program written by the author.
Figure 4.4 shows a symbolic diagram of the RS encoder. Calculating the
coefficients of $G(x)$ by hand is tedious, so it was automated using
a C program available in [72][73]. The coefficient values depend heavily on $t$ and $m$
and are stored in the appropriate registers during chip initialization.
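A small Python sketch of the coefficient calculation, written only to illustrate the idea (it is not the program of [72][73]). It assumes the Table 4.1 field polynomials and generator roots $\alpha^b, \ldots, \alpha^{b+2t-1}$ with $b = 2^{m-1} - t$, as in Equation (4.3); the function names are hypothetical. With this choice the root set is closed under inversion ($\alpha^{-j} = \alpha^{n-j}$) and the product of the roots is 1, so $G(x)$ is self-reciprocal: $g_k = g_{2t-k}$.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_pow(a, e, m):
    """a^e in GF(2^m) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def generator_coefficients(m, t, first_root=None):
    """Return [g_0, g_1, ..., g_2t] for G(x) = prod_{j=b}^{b+2t-1} (x + alpha^j)."""
    n = 2**m - 1
    b = 2**(m - 1) - t if first_root is None else first_root
    g = [1]
    for j in range(b, b + 2 * t):
        root = gf_pow(2, j % n, m)
        g = [xg ^ gf_mul(root, gk, m) for xg, gk in zip([0] + g, g + [0])]
    return g


def is_self_reciprocal(g):
    """True when g_k == g_{2t-k} for every k, i.e. the coefficients are symmetric."""
    return g == g[::-1]


if __name__ == "__main__":
    g = generator_coefficients(8, 16)
    print(" ".join(f"{c:02X}" for c in g))
    print("symmetric:", is_self_reciprocal(g))
```

Because $g_0 = g_{2t} = 1$ and $g_k = g_{2t-k}$, only $g_1, \ldots, g_t$ are distinct non-trivial coefficients (16 when $t = 16$), which is consistent with the encoder using 16 programmable multipliers.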
The bit size and description of each I/O signal are given in Table 4.3.
Figure 4.4: Symbolic diagram of the RS encoder
Signal              Description
Message             input port for the message symbols
StartEncoding       starts the encoding process
clk                 clock signal
m                   Galois field symbol size
rst                 reset signal
t                   error correction capability
Codeword            output port for the codeword polynomial to the channel
FinishedEncoding    goes high when encoding is complete
Table 4.3: Encoder I/O Pins
4.3.2 Syndrome
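As noted earlier, the $2t$ syndromes, i.e., the syndrome polynomial coefficients, are
computed as

$$S_j = \sum_{i=0}^{n-1} r_i\alpha^{ij}, \qquad 2^{m-1} - t \le j \le 2^{m-1} + t - 1 \qquad (4.3)$$

where $r_i$ ($0 \le i \le n - 1$) are the coefficients of the received polynomial $R(x)$.
By using Horner's rule, Equation (4.3) can be rewritten as

$$S_j = \Big(\cdots\big((r_{n-1}\alpha^j + r_{n-2})\alpha^j + r_{n-3}\big)\alpha^j + \cdots + r_1\Big)\alpha^j + r_0 \qquad (4.4)$$

Figure 4.5 shows a block diagram for computing the syndrome values defined in
Equation (4.4).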
As shown in the figure, each cell implements the following register transfer
relation:
Figure 4.5: Systolic Array to compute Syndrome Polynomial
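$$\begin{aligned}
\text{cell 1:}&\quad B_1 \leftarrow A_1 \oplus B_1\alpha^{2^{m-1}-t} \\
\text{cell 2:}&\quad B_2 \leftarrow A_2 \oplus B_2\alpha^{2^{m-1}-t+1} \\
\text{cell 3:}&\quad B_3 \leftarrow A_3 \oplus B_3\alpha^{2^{m-1}-t+2} \\
&\quad\vdots \\
\text{or, in general:}&\quad B_k \leftarrow A_k \oplus B_k\alpha^{j}
\end{aligned} \qquad (4.5)$$

for $2^{m-1} - t \le j \le 2^{m-1} + t - 1$ and $1 \le k \le 2t$,
where $\leftarrow$ denotes the operation "is replaced by".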
Based on the values of $t$ and $m$, the corresponding constants $\alpha^j$ are stored in the
cell registers during chip initialization. After the complete $R(x)$ has entered, the
required syndromes $S_j$ are contained in the registers $B_k$. After $n$ clock cycles, the $2t$
syndromes are shifted out in parallel and fed into the Berlekamp-Massey module.
A maximum of 32 syndrome cells are supported by the device.
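The systolic array of Figure 4.5 can be modelled in Python as below: every received symbol is broadcast to all $2t$ cells, and each cell applies the Horner step of Equation (4.5) with its own constant $\alpha^j$. This is a behavioural sketch, assuming the Table 4.1 field polynomials and the syndrome index range of Equation (4.3); `syndromes_systolic` and the helper names are hypothetical.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_pow(a, e, m):
    """a^e in GF(2^m) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def syndromes_systolic(received, m, t):
    """Model of the 2t syndrome cells of Equation (4.5).

    received: [r_{n-1}, ..., r_1, r_0], i.e. the order in which symbols arrive.
    Returns [S_b, S_{b+1}, ..., S_{b+2t-1}] with b = 2^(m-1) - t.
    """
    n = 2**m - 1
    b = 2**(m - 1) - t
    consts = [gf_pow(2, (b + k) % n, m) for k in range(2 * t)]  # alpha^j of each cell
    regs = [0] * (2 * t)                                        # B_1 ... B_2t
    for symbol in received:                                     # one symbol per clock
        # Every cell: B_k <- A_k XOR B_k * alpha^j (one Horner step of Equation 4.4)
        regs = [symbol ^ gf_mul(reg, c, m) for reg, c in zip(regs, consts)]
    return regs
```

After the last of the $n$ symbols has been clocked in, `regs[k]` equals $S_{b+k}$, which can be checked against the direct sum of Equation (4.3).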
The symbolic structure of the syndrome module is shown in Figure 4.6. It
consists of 22,515 gates. The size and description of each I/O signal are given in
Table 4.4.
Figure 4.6: Symbolic diagram of the Syndrome Module
Signal              Bit Size   Description
Codeword            8          input port for the received word from the channel
ComputeSyndrome                starts the decoding process
clk                            clock signal
m                              Galois field symbol size
rst                            reset signal
t                              error correction capability
ALPHA_CONSTANT                 offset term required in the Forney algorithm
FinishedSyndrome               goes high after the syndromes have been calculated
NoErrors            1          goes high when all syndromes are zero
SyndromeValue_XX    256        32 syndrome values
Table 4.4: Syndrome I/O Pins
4.3.3 Berlekamp
The Berlekamp-Massey module implements the following algorithm [53][54][57]. As
shown below, minor modifications to the original Berlekamp-Massey algorithm are
necessary to facilitate RTL logic synthesis. $\sigma(X)$, $\delta(X)$ and $\beta(X)$ are 16-byte
registers. $L$ and $\gamma$ can have maximum integer values of 16 and 32, respectively, since
$t_{max} = 16$.
The Massey-Berlekamp Algorithm
Step 1: If Reset = 1 then          -- initialize all the flip-flops and registers
            γ = 0; σ(X) = 0; L = 0; β(X) = 1; δ(X) = 0; Δ = 0;
        else
Step 2:     For γ = 1 to 2*t clock cycles
                -- compute the error discrepancy Δ and Δ⁻¹
                Δ = Σ_{i=0}^{L} σ_i S_{γ−i};
                if Δ ≠ 0 then
                    δ(X) ← σ(X) − Δ·X·β(X);
                    if 2L < γ then
                        β(X) ← Δ⁻¹·σ(X);
                        σ(X) ← δ(X);
                        L ← γ − L;
                    else
                        β(X) ← X·β(X);
                        σ(X) ← δ(X);
                    end if;
                else
                    β(X) ← X·β(X);
                end if;
            end for;
        end if;
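A Python sketch of Steps 1 and 2 above, assuming the Table 4.1 field polynomials and a software list representation in which $\sigma_0 = 1$ is stored explicitly. The syndromes are passed as $[S_1, \ldots, S_{2t}]$, meaning the $2t$ values of Equation (4.3) in order, and $\Delta^{-1}$ is formed by exponentiation, as the hardware does with its exponentiator/inverter. The names are hypothetical.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_inv(a, m):
    """Inverse by exponentiation: a^(2^m - 2), computed by square-and-multiply."""
    result, e = 1, 2**m - 2
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def berlekamp_massey(S, m):
    """S: [S_1, ..., S_2t]. Returns (sigma, L), sigma[i] = coefficient of X^i, sigma[0] = 1."""
    sigma, beta, L = [1], [1], 0
    for gamma in range(1, len(S) + 1):
        # Discrepancy: Delta = sum_{i=0}^{L} sigma_i * S_{gamma-i}
        delta = 0
        for i in range(min(L + 1, len(sigma))):
            delta ^= gf_mul(sigma[i], S[gamma - 1 - i], m)
        x_beta = [0] + beta                                  # X * beta(X)
        if delta != 0:
            # delta(X) <- sigma(X) - Delta * X * beta(X); subtraction is XOR
            size = max(len(sigma), len(x_beta))
            padded_sigma = sigma + [0] * (size - len(sigma))
            padded_xb = x_beta + [0] * (size - len(x_beta))
            new_sigma = [s ^ gf_mul(delta, c, m) for s, c in zip(padded_sigma, padded_xb)]
            if 2 * L < gamma:
                inv = gf_inv(delta, m)
                beta = [gf_mul(inv, c, m) for c in sigma]    # Delta^-1 * old sigma(X)
                L = gamma - L
            else:
                beta = x_beta
            sigma = new_sigma
        else:
            beta = x_beta
    while len(sigma) > 1 and sigma[-1] == 0:                 # drop unused high-order zeros
        sigma.pop()
    return sigma, L
```

For a correctable pattern of $\nu \le t$ errors, the returned $L$ equals $\nu$ and the roots of $\sigma(X)$ are the inverses of the error locators.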
After the error locator polynomial $\sigma(X)$ has been determined, the error evaluator
polynomial $\Omega(X)$ is found using the relationship $\Omega(X) = S(X)\sigma(X) \bmod X^{2t}$.
The 16 coefficients of $\Omega(X)$ are determined in parallel after $\gamma = 2t$ cycles, as shown
below:
Step 3: Error Evaluation Polynomial
        If γ = 2*t cycles then      -- compute the 16 error evaluator polynomial
                                    -- coefficients in parallel
        end if;
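The parallel computation announced in Step 3 can be illustrated with the Python sketch below of $\Omega(X) = S(X)\sigma(X) \bmod X^{2t}$, assuming $S(X) = S_1 + S_2X + \cdots + S_{2t}X^{2t-1}$ with the syndromes in the order produced by the syndrome module and the Table 4.1 field polynomials. It is an assumption-based illustration rather than the thesis's Step 3 code; `error_evaluator` is a hypothetical name.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def error_evaluator(S, sigma, m):
    """Omega(X) = S(X) * sigma(X) mod X^(2t).

    S: [S_1, ..., S_2t]; sigma: error locator coefficients with sigma[0] = 1.
    Omega_k = sum_{i=0}^{k} sigma_i * S_{k+1-i} depends only on sigma and S,
    so every coefficient can be formed at the same time.
    """
    omega = []
    for k in range(len(S)):
        acc = 0
        for i in range(min(k, len(sigma) - 1) + 1):
            acc ^= gf_mul(sigma[i], S[k - i], m)
        omega.append(acc)
    return omega
```

When $\nu \le t$ errors occur, $\Omega(X)$ has degree at most $\nu - 1$, so only its first $t$ coefficients (16 for $t_{max} = 16$) can be nonzero.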
The symbolic structure of the Berlekamp module is shown in Figure 4.7. Its
major components include 170 multipliers, an exponentiator/inverter, a 32-byte
register, a 16-to-1 multiplexer, a 32-integer counter, a 16-byte shift register and four
16-byte registers. It requires $2t + 1$ clock cycles to determine $\sigma(X)$ and $\Omega(X)$. The total
gate count is 107,015. Had gate count been a design constraint, a more area-efficient design
could reduce the number of multipliers by almost one-third, but at the
expense of speed: all 16 coefficients of $\Omega(X)$ could be determined sequentially
using the last expression, $\Omega_{16} = S_{16} \oplus \sigma_1S_{15} \oplus \sigma_2S_{14} \oplus \cdots \oplus \sigma_{15}S_1$. In this case,
a minimum of $2t + 17$ clock cycles would be needed to compute $\sigma(X)$ and $\Omega(X)$.
The size and description of each I/O signal are given in Table 4.5.
Figure 4.7: Symbolic diagram of the Berlekamp Module
Signal              Bit Size   Description
CHIPmode            1          configures the exponentiator/inverter to operate as an inverter
Exponent            8          set to FE (hexadecimal) for inversion
Syndrome_XX         256        32 syndrome values
StartBerlekamp      1          starts the evaluation of $\sigma$ and $\Omega$
clk                 1          clock signal
m                   4          Galois field symbol size
rst                 1          reset signal
t                   5          error correction capability
FinishedBerlekamp   1          goes high after $\sigma$ and $\Omega$ have been found
OMEGA_XX            128        16 $\Omega$ coefficients
SIGMA_XX            128        16 $\sigma$ coefficients
Table 4.5: Berlekamp Module I/O Pins
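Setting Exponent to FE (hexadecimal) for inversion in $GF(2^8)$ is consistent with computing the inverse as $a^{-1} = a^{2^m-2}$, because $a^{2^m-1} = 1$ for every nonzero $a$ and $2^8 - 2 = 254 = \mathrm{FE}_{16}$. A Python sketch of inversion by exponentiation follows. Square-and-multiply is used here for brevity and is not necessarily how the Chapter 3 exponentiator/inverter schedules its operations; the field polynomials are those of Table 4.1 and the names are hypothetical.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_exponentiate(a, e, m):
    """a^e in GF(2^m) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def gf_inverse(a, m):
    """Inverse via a^(2^m - 2): since a^(2^m - 1) = 1, a * a^(2^m - 2) = 1."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse in GF(2^m)")
    return gf_exponentiate(a, 2**m - 2, m)


if __name__ == "__main__":
    print(f"Exponent input for m = 8: {2**8 - 2:02X}")
```

Running the script prints `Exponent input for m = 8: FE`. The same routine serves every $m$, which is why no per-field inverse ROM is required.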
4.3.4 Error Magnitude Evaluation
4.3.4.1 The Chien Search
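The Chien search evaluates the error locator polynomial

$$\Lambda(x) = \prod_{l=1}^{\nu}(1 - xX_l) = \Lambda_\nu x^\nu + \Lambda_{\nu-1}x^{\nu-1} + \cdots + \Lambda_1x + 1 \qquad (4.6)$$

at all elements of the Galois field $GF(2^m)$, where $X_l = \alpha^{i_l}$.
The actual error locations are indicated by $\alpha^{i_1}, \alpha^{i_2}, \ldots, \alpha^{i_\nu}$, so $\Lambda(x)$
has roots $\alpha^{-i_1}, \alpha^{-i_2}, \ldots, \alpha^{-i_\nu}$. Hence an error occurs at position $i$ if and only if
$\Lambda(\alpha^{-i}) = 0$, or

$$\Lambda(\alpha^{-i}) = 1 + \sum_{l=1}^{\nu}\Lambda_l\alpha^{-il} = 0 \iff \sum_{l=1}^{\nu}\Lambda_l\alpha^{-il} = 1 \qquad (4.7)$$

where the equivalence holds because $-1 = 1$ in $GF(2^m)$.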
Equation (4.7) is implemented by the Chien search circuit
shown in Figure 4.8. It consists of an $mt$-input exclusive-or gate, $t$ multipliers and
$t$ registers. In this design, a maximum of 16 multipliers and 16 registers can be used.
Figure 4.8: The Chien Search Hardware
The Chien search operates as follows:
Each coefficient $\Lambda_l$ of $\Lambda(x)$ is repeatedly multiplied by $\alpha^l$, where $\alpha$ is the primitive
element of $GF(2^m)$. Each set of products is then summed by the $mt$-input
exclusive-or gate to obtain $\text{Output} = \sum_{l=1}^{t}\Lambda_l\alpha^{il}$ at step $i$.
If $\alpha^i$ is a root of $\Lambda(x)$, then Output = 1, and an error is indicated at the
coordinate associated with $\alpha^{-i} = \alpha^{n-i}$. Otherwise, there is no error at that coordinate.
4.3.4.2 The Forney Algorithm
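If $\alpha^{-i_l}$ is a zero of $\Lambda(x)$, then the error magnitude module computes the error value
at location $i_l$ using [54]

$$e_{i_l} = \alpha^{i_l j}\,\frac{\Omega(\alpha^{-i_l})}{\Lambda'(\alpha^{-i_l})} \qquad (4.8)$$

where $j = 2^m + t - 2^{m-1}$, $\Lambda'(x)$ is the first derivative of $\Lambda(x)$ with respect to $x$,
and $\alpha^{i_l j}$ is the offset term.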
The calculations indicated by Equation (4.8) are almost identical to those required
in the Chien search, and the circuit shown in Figure 4.8 can be modified
to evaluate each of the polynomials $\Omega(\alpha^{-i_l})$ and $\Lambda'(\alpha^{-i_l})$. The structure of the
circuit that evaluates the error evaluator polynomial $\Omega(x)$ at $x = \alpha^{-i}$ is shown in
Figure 4.9. It consists of an $mt$-input exclusive-or gate, $t$ $m$-bit registers and $t$
multipliers. Unlike the Chien search circuit, this circuit does not have a one-detector
at its output [53].
Figure 4.9: Error Evaluation Polynomial Circuit
Figure 4.10: Derivative Circuit
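The first derivative of $\Lambda(x)$, i.e., $\Lambda'(x)$, required in the Forney algorithm is given
by

$$\Lambda'(x) = \Lambda_1 + \Lambda_3x^2 + \Lambda_5x^4 + \Lambda_7x^6 + \cdots = \sum_{k=1}^{\lfloor (t+1)/2 \rfloor}\Lambda_{2k-1}x^{2k-2} \qquad (4.9)$$

where $k$ is an integer.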
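Equation (4.9) follows from the formal derivative over a field of characteristic 2, sketched below in Python. No field multiplication is needed: the coefficient $i\Lambda_i$ is $\Lambda_i$ added to itself $i$ times, which vanishes for even $i$. `formal_derivative` is a hypothetical name.

```python
def formal_derivative(lam):
    """Formal derivative of Lambda(x) = sum_i lam[i] x^i over GF(2^m).

    The coefficient of x^(i-1) is i * lam[i], i.e. lam[i] added to itself i times.
    In characteristic 2 this is lam[i] for odd i and 0 for even i, so only
    Lambda_1, Lambda_3, Lambda_5, ... survive, as in Equation (4.9).
    """
    return [c if i % 2 == 1 else 0 for i, c in enumerate(lam)][1:]
```

For example, `formal_derivative([1, 5, 7, 9, 11])` returns `[5, 0, 9, 0]`, i.e. $\Lambda_1 + \Lambda_3x^2$.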
Evaluating $\Lambda'(\alpha^{-i})$ in this way suggests the implementation shown in Figure 4.10.
The derivative circuit consists of an $m(2k-2)$-input exclusive-or gate, $2k-2$
multipliers and $2k-2$ registers, where $k$ is an integer. The offset term $\alpha^{i_l j}$ and
the inverse of $\Lambda'(\alpha^{-i_l})$ specified in the Forney algorithm are evaluated using
the proposed exponentiator/inverter, configured to operate as an exponentiator and an
inverter, respectively. The block diagram of the Forney algorithm is shown in Figure
4.11.
Figure 4.11: Block Diagram of the Error Magnitude Evaluation
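A Python sketch of the error value computation of Figure 4.11, following Equation (4.8): one exponentiation for the offset term, one inversion of $\Lambda'(\alpha^{-i_l})$ and two multiplications. It assumes the Table 4.1 field polynomials, syndromes taken over the index range of Equation (4.3), and the offset exponent $j = 2^m + t - 2^{m-1}$, which is congruent modulo $n$ to $1 - (2^{m-1} - t)$, the usual Forney offset for a code whose first consecutive root is $\alpha^{2^{m-1}-t}$. The names are hypothetical.

```python
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}   # P(x) from Table 4.1


def gf_mul(a, b, m):
    """GF(2^m) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= PRIM_POLY[m]
    return result


def gf_pow(a, e, m):
    """a^e in GF(2^m) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, m)
        a = gf_mul(a, a, m)
        e >>= 1
    return result


def poly_eval(p, x, m):
    """Evaluate p(x) = sum_k p[k] x^k by Horner's rule."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x, m) ^ c
    return acc


def forney_error_value(omega, lam, position, m, t):
    """Error value at `position` (X = alpha^position), per the assumed reading of Eq. (4.8)."""
    n = 2**m - 1
    j = 2**m + t - 2**(m - 1)                                  # offset exponent
    x_inv = gf_pow(2, (n - position) % n, m)                   # alpha^(-position)
    lam_prime = [c if i % 2 else 0 for i, c in enumerate(lam)][1:]   # Equation (4.9)
    numerator = poly_eval(omega, x_inv, m)                     # Omega(alpha^-position)
    denominator = poly_eval(lam_prime, x_inv, m)               # Lambda'(alpha^-position)
    offset = gf_pow(2, (position * j) % n, m)                  # exponentiation
    inverse = gf_pow(denominator, 2**m - 2, m)                 # inversion as exponentiation
    return gf_mul(gf_mul(offset, numerator, m), inverse, m)    # two multiplications
```

Without the offset factor, the quotient $\Omega(X_l^{-1})/\Lambda'(X_l^{-1})$ alone would equal $e\,X_l^{b-1}$ with $b = 2^{m-1} - t$; the offset term removes this dependence on the first root index.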
As shown in Figures 4.8, 4.9 and 4.10, the error locator polynomial $\Lambda(x)$, the error
evaluator polynomial $\Omega(x)$ and the first derivative of $\Lambda(x)$ are evaluated for the
same field elements. The values of $\alpha^i$, which depend on the user-defined $t$ and $m$, are
stored in the cell registers during chip initialization. The polynomials $\Lambda(x)$, $\Lambda'(x)$
and $\Omega(x)$ are evaluated in parallel, followed by one inversion, one exponentiation
and two multiplications to obtain the error values. This process takes $n + m + 2$
clock cycles. The symbolic structure of the module is shown in Figure 4.12. It
consists of 37,952 gates. The bit size and description of each I/O signal are given
in Table 4.6.

Signal                   Bit Size   Description
ALPHA_CONSTANT                      offset term in the Forney algorithm
Exponent                            exponent
Mode_EXPONENT                       configures the exponentiator/inverter to operate as an exponentiator
Mode_INVERSE                        configures the exponentiator/inverter to operate as an inverter
OMEGA_XX                 128        16 $\Omega$ coefficients
SIGMA_XX                 128        16 $\sigma$ coefficients
StartCalculateValue      1          starts the evaluation of error values
clk                      1          clock signal
m                        4          Galois field symbol size
rst                      1          reset signal
t                        5          error correction capability
Cycles                              clock cycles for error value evaluation
ERROR_VALUE                         error value/magnitude
ErrorPosition                       error position
FinishedCalculateValue              goes high after the errors are found
Table 4.6: I/O Pins of the Error Magnitude Evaluation
4.3.5 Error Correction and Verification
The error correction module consists of eight 2-input XOR gates and eight
multiplexers, equivalent to 45 gates. It performs the Galois field addition
$C(x) = R(x) + E(x)$, which XORs the error values $E(x)$ with the
buffered received symbols $R(x)$ in order to correct the errors.
The error verification module computes the syndrome values of the corrected
symbols as the processed data leave the chip through the Data_Correct output port,
to check whether they are all zero. If the syndrome values are not all zero, an error flag is
raised on the NoErrors signal, indicating that the data contain an uncorrectable
number of errors.
Figure 4.12: Symbolic diagram of the Error Magnitude Evaluation
Signal      Description
CLK         clock signal
DATA_IN     input port for the received word
RD          enables reading the symbols from the stack
RESET       reset signal
WR          enables writing the symbols to the stack
DATA_OUT    output port for the buffered symbols
Table 4.7: I/O Pins of the FIFO Module
4.3.6 First-In-First-Out Buffer
The first-in-first-out (FIFO) buffer is a 255-byte register stack which is used to
temporarily store the received code polynomial while the decoder determines the location
and magnitude of the erroneous symbols.
When a write (WR) request is generated by the finite state machine (FSM), the symbols
are pushed onto the stack if it is not full. The FIFO never writes to a full
stack, so this condition is monitored.
When a read (RD) request is generated by the FSM, the symbols are read from
the "bottom" of the stack at depth $2^m - 1$, where $m = 3, 4, 5, 6, 7, 8$. If the stack is
empty, no symbol is read.
The symbolic architecture of the FIFO is shown in Figure 4.13. It consists of
18,127 gates. The size and description of each I/O signal are given in Table 4.7.
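A behavioural Python sketch of the FIFO's write and read rules, assuming a depth of $2^m - 1$ symbols (one codeword) and modelling the "bottom" of the stack as the oldest entry. `SymbolFIFO` is a hypothetical name, and a `deque` stands in for the 255-byte register stack.

```python
from collections import deque


class SymbolFIFO:
    """Behavioural FIFO buffer holding up to `depth` received symbols."""

    def __init__(self, depth=255):
        self.depth = depth
        self._symbols = deque()

    def write(self, symbol):
        """WR request: store the symbol unless the buffer is full."""
        if len(self._symbols) >= self.depth:
            return False              # full: the write is ignored
        self._symbols.append(symbol)
        return True

    def read(self):
        """RD request: return the oldest symbol, or None when the buffer is empty."""
        if not self._symbols:
            return None               # empty: no symbol is read
        return self._symbols.popleft()
```

For $m = 4$, for instance, `SymbolFIFO(depth=2**4 - 1)` accepts 15 symbols, rejects a 16th write, and returns the symbols in arrival order.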
Figure 4.13: Symbolic diagram of FIFO
4.3.7 Finite State Machine
As shown in Figure 4.14, the finite state machine (FSM) for the RS encoder/decoder
has ten states, which can change on the rising edge of the clock. These states
are detailed below:
1. Resetting: In this state all the counters, flip-flops and registers in the RS modules
are initialized to all zeros.
2. SetMode: In this state the chip can be configured to operate as an encoder or
decoder.
3. Encoding_State1: In this state, the encoder calculates the codewords based on
the error correction capability, symbol size and message symbols. The parity check
symbols are also shifted out of the encoder.
4. Encoding_State2: The FSM module monitors whether the encoding process is
complete before it advances to the Resetting state.
5. Syndrome_State1: In this state, the Syndrome module calculates the syndrome
values based on the error correction capability, symbol size and received word. At
the same time, the received word symbols are stored in the FIFO.
6. Syndrome_State2: The FSM module monitors whether the syndrome evaluation is
complete before it advances to the Berlekamp_State1 state.
7. Berlekamp_State1: In this state, the Berlekamp module calculates the error
locator and error evaluator polynomials based on the error correction capability,
symbol size and syndrome values.
8. Berlekamp_State2: The FSM module monitors whether the error locator and error
evaluator polynomials have been calculated before it advances to the
ErrorValueCorrect_State1 state.
9. ErrorValueCorrect_State1: In this state, the Error Magnitude module finds the
error locations and error values/magnitudes and corrects erroneous symbols. An
uncorrectable error condition is also tested and reported.
10. ErrorValueCorrect_State2: The FSM module monitors whether the error correction
has been completed before returning to the Resetting state.
The finite state machine is a Moore machine, i.e., a sequential state machine
whose outputs depend only on the current state, independent of the inputs. In
other words, its functionality can be expressed as:
Next State (N) = function[current state (P), Input (I)]
Outputs (O) = function[current state (P)]
The FSM has been completely described by a single VHDL process with a
synchronous reset signal. It uses a one-hot encoding style, which requires
one positive-edge-triggered flip-flop per state, the current state being determined
by the flip-flop that is on.
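A Python sketch of the ten-state Moore machine, separating the next-state function from the output function. It rests on assumptions that the text does not state: Start moves the machine out of Resetting, a boolean `encode` input stands for the Mode_port decision, and each *_State1 state lasts one clock before the corresponding *_State2 state begins monitoring its completion flag. The flag names follow Table 4.8; the function names are hypothetical.

```python
STATES = (
    "Resetting", "SetMode",
    "Encoding_State1", "Encoding_State2",
    "Syndrome_State1", "Syndrome_State2",
    "Berlekamp_State1", "Berlekamp_State2",
    "ErrorValueCorrect_State1", "ErrorValueCorrect_State2",
)

# For each monitoring state: (completion flag awaited, next state once it is high)
WAIT_FOR = {
    "Encoding_State2": ("FinishedEncoding", "Resetting"),
    "Syndrome_State2": ("FinishedSyndrome", "Berlekamp_State1"),
    "Berlekamp_State2": ("FinishedBerlekamp", "ErrorValueCorrect_State1"),
    "ErrorValueCorrect_State2": ("FinishedCalculateValue", "Resetting"),
}


def next_state(state, inputs):
    """Next State = function[current state, inputs], applied at each rising clock edge."""
    if inputs.get("Reset"):                      # synchronous reset
        return "Resetting"
    if state == "Resetting":
        return "SetMode" if inputs.get("Start") else "Resetting"
    if state == "SetMode":
        return "Encoding_State1" if inputs.get("encode") else "Syndrome_State1"
    if state.endswith("_State1"):                # work started; move on to monitoring
        return state[:-1] + "2"
    flag, target = WAIT_FOR[state]
    return target if inputs.get(flag) else state


def one_hot(state):
    """Moore output: depends only on the state; here the one-hot flip-flop vector."""
    return tuple(int(s == state) for s in STATES)
```

In hardware the one-hot vector corresponds to the ten state flip-flops, and each Moore output is a function of those flip-flops alone.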
The symbolic architecture of the FSM is shown in Figure 4.15. It consists of
313 gates. The size and description of each I/O signal are given in Table 4.8.
Figure 4.14: RS Encoder/Decoder Finite State Machine
Signal                    Bit Size   Description
CYCLES                    8          clock cycles for error magnitude evaluation
Clock                                clock signal
FinishedBerlekamp                    notifies the FSM after $\sigma$ and $\Omega$ are found
FinishedCalculateValue               goes high after the errors are found
FinishedEncoding                     notifies the FSM when encoding is done
FinishedSyndrome                     signals the FSM after the syndromes are found
Mode_port                            selects the encoding or decoding mode
Reset                                resets the RS encoder/decoder
Start                                starts the encoding or decoding process
m                                    Galois field symbol size
t                                    error correction capability
CHIPmode                             configures the exponentiator/inverter to operate as an inverter
ComputeSyndrome                      starts the decoding process
ComputeSyndrome_verify               tests the uncorrectable error condition
Exponent                             set to FE (hexadecimal) during inversion
Mode_EXPONENT                        configures the exponentiator/inverter to operate as an exponentiator
Mode_INVERSE                         configures the exponentiator/inverter to operate as an inverter
RD                                   enables reading symbols from the stack
Reset_Berlekamp                      resets the Berlekamp module
Reset_Encoder                        resets the encoder
Reset_Syndrome                       resets the Syndrome module
Reset_Syndrome_verify                resets the syndrome module which determines uncorrectable errors
Reset_Value                          resets the Error Magnitude module
StartBerlekamp                       starts the evaluation of $\sigma$ and $\Omega$
StartCalculateValue                  starts the evaluation of error values
StartCorrectErrors                   starts the error correction
StartEncoding                        starts the encoding process
WR                                   enables writing symbols to the stack
m_out                                Galois field symbol size to modules
t_out                                error correction capability to modules
Table 4.8: I/O Pins of the FSM Module
Figure 4.15: Symbolic diagram of the Finite State Machine
4.4 Testing and Results
As previously mentioned, once the behavioral model of the RS encoder and decoder
had been captured using the VHDL hardware description language, each block was
partitioned into smaller modules, which were modelled separately using a subset
of the VHDL constructs suitable for logic synthesis. The size of each synthesizable
module varies from 45 to a maximum of 20,000 gates. Larger modules are
characterized by sequential processes which have heavy dataflow dependencies. The
functional correctness of each VHDL RTL model has been verified using an
interactive UNIX-based RS encoder/decoder simulator written in C [72][73].
This section presents partial sample simulations showing the encoding and
decoding stages of the ASIC at a frequency of 50 MHz, which corresponds to a
clock period of 20 ns. Test cases in which error correction succeeds and
in which it fails are considered for the selected values $m = 8$ and $t = 3$. The user
can reconfigure the ASIC on the fly using any combination of $m = 3, 4, 5, 6, 7, 8$
and $t = 1, 2, 3, \ldots, 16$, depending on the application. To simplify the examples, it is
assumed that a message of $k = n - 2t = 255 - 6 = 249$ zero symbols is input to
the ASIC for encoding. Each symbol is $m = 8$ bits, so the generated codeword
contains $n = 2^m - 1 = 255$ symbols.
In the first scenario, it is assumed that an error value of 00000010 (binary) occurs at
position 1 during the transmission of the codeword over the communication channel.
In the second scenario, it is assumed that 4 errors occur in the received data at
positions 1, 2, 3 and 4. In the timing diagrams, the time scale is in nanoseconds
and all signal values are represented as hexadecimal numbers.
Figure 4.16 illustrates the states of the input/output signals of the RS encoder
module described in Section 4.3.1 during the encoding stage of the ASIC. As
indicated by the FinishedEncoding signal, the encoding process takes 255 clock cycles
for the chosen values $m = 8$ and $t = 3$.
Figure 4.17 illustrates the input/output signals of the syndrome module
described in Section 4.3.2 during the first step of the decoding stage. As indicated
by the FinishedSyndrome signal, syndrome evaluation takes 255 clock cycles. The
six meaningful syndromes are indicated; all other syndrome values are zero. These
can easily be verified using the equations described in Section 4.3.2.
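The six syndromes for the first test case can be computed directly from Equation (4.3). Because the transmitted codeword is all zero, the received word has a single nonzero coefficient, $r_1 = 00000010_2 = \alpha$, so $S_j = r_1\alpha^{j}$ for $j = 125, \ldots, 130$ when $m = 8$ and $t = 3$. The Python sketch below assumes the $m = 8$ polynomial of Table 4.1; it is offered as an illustration of the calculation rather than as a transcription of Figure 4.17, and `single_error_syndromes` is a hypothetical name.

```python
PRIM_POLY_8 = 0b100011101   # x^8 + x^4 + x^3 + x^2 + 1, the m = 8 row of Table 4.1


def gf_mul(a, b):
    """GF(2^8) multiplication in the standard basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree 8 reached: reduce modulo P(x)
            a ^= PRIM_POLY_8
    return result


def gf_pow(a, e):
    """a^e in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(e % 255):
        result = gf_mul(result, a)
    return result


def single_error_syndromes(value, position, t=3):
    """S_j = value * alpha^(position * j) for the 2t indices j of Equation (4.3), m = 8.

    With an all-zero codeword, r_i is zero except r_position = value,
    so the sum in Equation (4.3) has a single term.
    """
    n = 255
    b = 2**7 - t
    return {j: gf_mul(value, gf_pow(2, position * j % n)) for j in range(b, b + 2 * t)}


if __name__ == "__main__":
    for j, s in single_error_syndromes(0x02, 1).items():
        print(f"S_{j} = {s:02X}")
```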
Figure 4.18 illustrates the simulation cycles of the overall RS ASIC during
encoding. As shown in the figure, the encoding process takes 255 clock cycles for the
chosen parameters. The relevant signals are indicated.
Figure 4.19 illustrates the simulation cycles of the overall RS ASIC during the
decoding process. As shown in the figure, the decoding process takes 553 clock
cycles for the chosen parameters. The relevant signals are indicated. The error
position is correctly identified as position 1 of the received data.
Figure 4.20 illustrates the simulation cycles of the overall RS ASIC during a
decoding process in which error correction fails. Since there are 4 errors in the received
data at positions 1, 2, 3 and 4, the decoder fails to correct them because the
actual number of errors, $\nu = 4$, is greater than the selected $t = 3$. As shown in the
figure, the error positions and error values cannot be determined by the ASIC.
Figure 4.16: Gate Level Simulations of the RS Encoder
Figure 4.17: Gate Level Simulations of the Syndrome Module
Figure 4.18: Gate Level Simulations of the overall RS ASIC (Encoding)
Figure 4.19: Gate Level Simulations of the overall RS ASIC (Decoding)
Figure 4.20: Gate Level Simulations of the RS ASIC (Decoding Failure)
4.5 Discussion and Summary
By using the proposed arithmetic circuits and a multiplexing technique which
selects the different values of $m$ and $t$ for various RS codes, a new programmable RS
encoder/decoder has been designed and implemented. Galois field symbol
sizes $m = 3, 4, 5, 6, 7, 8$ and error correction capabilities $t = 1, 2, 3, \ldots, 16$ are supported
in this example implementation. The chip contains 218,206 gates, where a gate is equivalent to
a 2-input NAND gate. The decoder implementations presented in the literature
[2][36][38][39][45][47][49][50][51] are customized for a specific $m$ and $t$ and use an
inverse ROM to perform Galois field element inversion; this design eliminates the
inverse ROM completely. A ROM was avoided because six different
ROMs would have been required, since the inverse elements in Galois fields are
different for each symbol size $m$. The same argument applies to the exponentiation
operation. Therefore, a total of twelve different ROMs would have been
required for the inversion and exponentiation operations. The constant multipliers
have also been replaced with general m-programmable multipliers throughout
the encoder, syndrome, Chien search and error value evaluation circuits.
The design is parameterized directly in VHDL in terms of the symbol size $m$
and the error correction capability $t$. The syndrome values are calculated in $n$ clock
cycles, the error locator and error evaluator polynomials in $2t + 1$ clock cycles, and
the error value calculations and error corrections in $n + m + 2$ clock cycles. The
overall numbers of clock cycles for the encoder and the decoder are $n$ and $2n + 2t + m + 3$,
respectively. Thus, the encoder can generate codewords at a sustained rate of
$\frac{m}{T} \times 10^3$ Mbit/s, whereas the decoder can process incoming data at a maximum rate
of $\frac{nm}{(2n+2t+m+3)T} \times 10^3$ Mbit/s, where $T$ is the clock period in nanoseconds and
$n = 2^m - 1$ is the block length. The VHDL gate-level simulations were performed
at 50 MHz. The estimated data rates for $t = t_{max}$ and $T = 20$ ns are shown in
Table 4.9.

m    t_max    Encoder (Mbit/s)    Decoder (Mbit/s)
3    3        150                 40
4    7        200                 59
5    15       250                 78
6    16       300                 113
7    16       350                 150
8    16       400                 184
Table 4.9: RS Encoder/Decoder Data Rates
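The rate expressions above can be evaluated with the short Python sketch below; it reproduces the entries of Table 4.9 for $T = 20$ ns after rounding to whole Mbit/s. `data_rates` and `T_MAX` are hypothetical names, and `T_MAX` simply copies the $t_{max}$ column of the table.

```python
def data_rates(m, t, clock_period_ns=20.0):
    """Return (encoder, decoder) data rates in Mbit/s for a GF(2^m) code.

    Encoder: n symbols of m bits in n clock cycles          -> m / T bits per ns.
    Decoder: n symbols of m bits in 2n + 2t + m + 3 cycles.
    One bit per ns equals 10^3 Mbit/s.
    """
    n = 2**m - 1
    encoder = 1e3 * m / clock_period_ns
    decoder = 1e3 * n * m / ((2 * n + 2 * t + m + 3) * clock_period_ns)
    return encoder, decoder


T_MAX = {3: 3, 4: 7, 5: 15, 6: 16, 7: 16, 8: 16}   # t_max column of Table 4.9

if __name__ == "__main__":
    for m, t in T_MAX.items():
        enc, dec = data_rates(m, t)
        print(f"m={m} t={t}: encoder {enc:.0f} Mbit/s, decoder {dec:.0f} Mbit/s")
```

Because the decoder needs roughly $2n$ cycles per block while the encoder needs $n$, the decoder rate stays a little below half the encoder rate for the larger fields.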
Clearly, higher data rates can be expected at higher clock frequencies or by using
more aggressive technologies such as 0.35-µm CMOS. It also appears that the
decoder datapath could be constructed with three linear pipeline stages in order to
further increase the decoding throughput [75][76]. Pipeline registers would be
required between the syndrome, Berlekamp and error magnitude modules. Thus,
using the 0.8-µm CMOS standard cells, the pipelined version of the chip should be
able to process data at three times the estimated rates in Table 4.9.
The gate counts for the various modules are shown in Table 4.10.
Module                        Gate Count
Encoder                       9,644
Syndrome                      22,515
Berlekamp                     107,015
Error Magnitude Evaluation    37,952
Error Correction              45
Error Verification            22,515
First-In-First-Out Buffer     18,127
Finite State Machine          255
Glue Components               138
Total number of gates         218,206
Table 4.10: RS Encoder/Decoder Modules and Equivalent Gate Count
Chapter 5
Conclusion and Future Work
Forward error correction (FEC) is a common technique used to improve the
reliability and efficiency of communication channels. RS codes are widely used
in modern digital communication systems to correct erasures as well as random and
burst errors during data transmission. As a contribution to the field, this thesis
introduced
(1) new parameterizable Galois field arithmetic VLSI structures;
(2) an algebraic encoder/decoder ASIC which implements a wide family of RS
codes. The design is parameterized in terms of the RS code variables $m$, $n$ and $t$.
Hence, it can be configured to operate in various communication channels which
require different RS codes.
5.1 Galois Field Arithmetic Architectures
An overview of Galois field arithmetic operations and their corresponding VLSI
implementations was presented in Chapter 1. Only the most complex operations,
namely exponentiation, inversion and multiplication, were considered. Chapter 3
introduced new m-programmable arithmetic structures which exploited the
symmetric properties of available architectures. These could be configured for the
symbol sizes $m = 3, 4, 5, 6, 7$ and 8. The standard representation of the field elements
was used. Little work appeared to have been done in the literature to develop
such structures. For this purpose, exponentiation, inversion and multiplication
circuits were investigated in detail. It was also demonstrated that inversion is a
form of exponentiation in Galois fields. An m-programmable array which evaluates
both operations was designed and simulated. It had low design complexity,
low latency, a high throughput rate and very high fault coverage compared to other
structures. The proposed exponentiator/inverter outperformed the inverters
presented in [22][23] when it was configured to compute field element inversion. All the
proposed architectures were implemented in standard cells using VHDL-based
design entry. Thus, they could be used in applications that require a variable symbol
size $m$.
5.2 VLSI Reed-Solomon Encoder/Decoder
The different RS decoding algorithms were described in Chapter 2. A survey of
the existing encoder and decoder structures was also presented. A multiplexing
technique and the proposed arithmetic circuits were used throughout the design
and implementation of the new programmable RS encoder/decoder in CMOS
standard cells. The chip supported a wide family of RS codes whose symbol size $m$
and error correction capability $t$ could be parameterized to meet different user
requirements. Unlike the decoders customized for a fixed $m$ and $t$ presented in
the literature [2][36][38][39][45][47][49][50][51], it was found to be flexible, since the
symbol size $m$, block length $n$ and error correction capability $t$ were all variable.
Constant multipliers and inverse ROMs were also completely avoided to allow ease
of reconfiguration. In the thesis, example values of the Galois field symbol size
$m = 3, 4, 5, 6, 7, 8$ and error correction capability $t = 1, 2, 3, \ldots, 16$ were supported.
The main advantage of such an ASIC is that its total design cost is amortized over
a wide application base.
The algebraic encoding/decoding technique was used. The encoder used the self-reciprocal generator polynomial, which structured the codewords in systematic form. The first step of the decoding algorithm calculated the syndrome polynomial $S(x)$. The Berlekamp-Massey algorithm then determined the error-locator polynomial $\sigma(x)$; its low design complexity made it suitable for VLSI synthesis. The error magnitude polynomial was calculated using the expression $\Omega(x) = S(x)\sigma(x) \bmod x^{2t}$. Once the locations and magnitudes of the errors had been determined using the Chien search and the Forney algorithm respectively, the received messages were corrected and verified as they left the chip.
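To make the decoding steps concrete, the following Python sketch strings together a systematic encoder and the decoder stages named above: syndromes, Berlekamp-Massey, $\Omega(x) = S(x)\sigma(x) \bmod x^{2t}$, Chien search and Forney's formula. It is a software model under stated assumptions, not the thesis's hardware datapath. The primitive polynomials are assumed, erasures are not handled, and the first root exponent b = 2^(m-1) - t is our assumption for obtaining a self-reciprocal generator: with it, the root set alpha^b, ..., alpha^(b+2t-1) is closed under inversion, so the coefficients of g(x) read the same in both directions. All function names are hypothetical.

```python
# Assumed primitive polynomials (bit i = coefficient of x^i).
PRIM_POLY = {3: 0b1011, 4: 0b10011, 5: 0b100101, 6: 0b1000011,
             7: 0b10001001, 8: 0b100011101}


class GF:
    """GF(2^m) via log/antilog tables; alpha = x is primitive for PRIM_POLY[m]."""
    def __init__(self, m):
        self.m, self.n = m, (1 << m) - 1
        self.exp, self.log = [0] * (2 * self.n), [0] * (self.n + 1)
        x = 1
        for i in range(self.n):
            self.exp[i] = self.exp[i + self.n] = x
            self.log[x] = i
            x <<= 1
            if x >> m:
                x ^= PRIM_POLY[m]
    def mul(self, a, b):
        return 0 if a == 0 or b == 0 else self.exp[self.log[a] + self.log[b]]
    def inv(self, a):
        return self.exp[self.n - self.log[a]]
    def alpha(self, i):
        return self.exp[i % self.n]


def poly_eval(f, p, x):
    """Horner evaluation; p[i] is the coefficient of x^i."""
    y = 0
    for c in reversed(p):
        y = f.mul(y, x) ^ c
    return y


def generator(f, t):
    b = (1 << (f.m - 1)) - t        # assumed first root: makes g(x) self-reciprocal
    g = [1]
    for j in range(b, b + 2 * t):
        root, nxt = f.alpha(j), [0] * (len(g) + 1)
        for i, c in enumerate(g):   # g(x) * (x + root); minus equals plus in GF(2^m)
            nxt[i] ^= f.mul(c, root)
            nxt[i + 1] ^= c
        g = nxt
    return g


def encode(f, t, msg):
    """Systematic codeword: parity in positions 0..2t-1, message above it."""
    g, rem = generator(f, t), [0] * (2 * t) + list(msg)
    for i in range(len(rem) - 1, 2 * t - 1, -1):   # x^(2t) m(x) mod monic g(x)
        coef = rem[i]
        if coef:
            for j, gj in enumerate(g):
                rem[i - 2 * t + j] ^= f.mul(coef, gj)
    return rem[:2 * t] + list(msg)


def berlekamp_massey(f, S):
    C, B, L, shift, bd = [1], [1], 0, 1, 1
    for k in range(len(S)):
        d = S[k]                                   # discrepancy at step k
        for i in range(1, L + 1):
            d ^= f.mul(C[i], S[k - i])
        if d == 0:
            shift += 1
            continue
        coef, T = f.mul(d, f.inv(bd)), C[:]
        C += [0] * max(0, len(B) + shift - len(C))
        for i, bi in enumerate(B):                 # C(x) -= (d / bd) x^shift B(x)
            C[i + shift] ^= f.mul(coef, bi)
        if 2 * L <= k:
            L, B, bd, shift = k + 1 - L, T, d, 1
        else:
            shift += 1
    return C[:L + 1], L


def decode(f, t, r):
    b = (1 << (f.m - 1)) - t
    S = [poly_eval(f, r, f.alpha(b + j)) for j in range(2 * t)]   # syndromes
    if not any(S):
        return list(r)
    sigma, L = berlekamp_massey(f, S)              # error-locator polynomial
    omega = [0] * (2 * t)                          # S(x) sigma(x) mod x^(2t)
    for i, c in enumerate(sigma):
        for j in range(2 * t - i):
            omega[i + j] ^= f.mul(c, S[j])
    dsigma = [sigma[i] if i % 2 else 0 for i in range(1, len(sigma))]  # formal derivative
    out, found = list(r), 0
    for p in range(f.n):                           # Chien search: sigma(alpha^-p) == 0
        xinv = f.alpha(-p)
        if poly_eval(f, sigma, xinv) == 0:
            # Forney: Y = X^(1-b) omega(X^-1) / sigma'(X^-1), with X = alpha^p
            y = f.mul(f.alpha(p * (1 - b)), poly_eval(f, omega, xinv))
            out[p] ^= f.mul(y, f.inv(poly_eval(f, dsigma, xinv)))
            found += 1
    if L > t or found != L:
        raise ValueError("uncorrectable: more than t symbol errors")
    return out
```

For example, with `f = GF(4)` and t = 2, `encode(f, 2, msg)` takes an 11-symbol message and returns a 15-symbol codeword; corrupting any two symbols and calling `decode` restores it. Because the syndromes are taken at alpha^(b+j) rather than alpha^(1+j), Forney's formula carries the extra factor X^(1-b), which vanishes when b = 1.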
The overall clock cycle counts for the encoder and the decoder were found to be n and 2n + 2t + m + 3 respectively. Hence, the encoder could generate codewords at a sustained data rate of $\frac{m}{\tau} \times 10^3$ Mbits/sec, whereas the decoder could process incoming data at a maximum data rate of $\frac{nm}{(2n+2t+m+3)\tau} \times 10^3$ Mbits/sec, where $\tau$ was the clock period in nanoseconds and $n = 2^m - 1$ was the block length. All the VHDL gate-level simulations were performed at a frequency of 50 MHz. The equivalent gate count was 218,206 gates.
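The rate expressions above follow from dividing the n·m bits in one block by the time taken to process it. The small Python sketch below simply encodes those formulas (the function names are hypothetical) so that they can be evaluated for any m, t and clock period, assuming n = 2^m - 1 as in the text.

```python
def rs_clock_cycles(m, t):
    """Clock cycles per block as reported: encoder n, decoder 2n + 2t + m + 3."""
    n = (1 << m) - 1
    return n, 2 * n + 2 * t + m + 3


def rs_data_rates_mbps(m, t, tau_ns):
    """Encoder and decoder data rates in Mbits/sec for a clock period of tau_ns.

    One block carries n*m bits; bits per nanosecond equal Gbits/sec,
    so multiplying by 10^3 gives Mbits/sec.
    """
    n = (1 << m) - 1
    enc_cycles, dec_cycles = rs_clock_cycles(m, t)
    block_bits = n * m
    encoder = block_bits / (enc_cycles * tau_ns) * 1e3   # reduces to m / tau * 10^3
    decoder = block_bits / (dec_cycles * tau_ns) * 1e3
    return encoder, decoder


tau = 1e3 / 50   # 50 MHz simulation clock -> 20 ns period
enc, dec = rs_data_rates_mbps(8, 16, tau)
print(f"m=8, t=16 at 50 MHz: encoder {enc:.1f} Mbits/sec, decoder {dec:.1f} Mbits/sec")
```

At the 50 MHz simulation clock, the largest supported code (m = 8, t = 16) gives 400 Mbits/sec for the encoder and about 184 Mbits/sec for the decoder, since the decoder needs 553 cycles for each 255-symbol block.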
This thesis fully demonstrated that the parameters m, n and t can indeed be
variable in RS encoder/decoder design by using the same hardware.
5.3 Future Work
As indicated in the thesis, emphasis was placed on the algebraic decoding technique alone. Other algorithms could be investigated to see whether their designs could also be parameterized in terms of m, n and t. Due to limitations in the design kit, it was not possible to investigate design issues such as power dissipation and backannotation. Backannotation would have allowed the original gate-level netlist to be annotated with parasitics extracted from the layout, so that a more accurate VHDL simulation could be performed. These simulations would confirm the timing and help estimate power dissipation as well. One direction for future research is to investigate the effects of parasitics and power dissipation for increasing values of the RS code parameters m, n and t when advanced design kits are released by CMC. The only major changes required in the current ASIC are widening the bus signals and redesigning the multiplier and exponentiator/inverter to accommodate values of m > 8. If, for instance, a maximum of m = 64 were required, such an exercise would need only a small fraction of the effort and cost of the original design.
It can be inferred from Chapter 5 that the design complexity increases with the block length, error correction capability and symbol size of the code. One could further investigate how the overall gate count, as a measure of design complexity, varies with these parameters. A relationship between the clock cycles and these design variables has already been found.
The sequential nature of the decoding algorithm suggests that the datapath may be constructed with three linear pipeline stages in order to further increase the decoding throughput rate [75][76]. A substantial portion of the decoder is always idling during the decoding process. Pipeline registers would be required between the main modules. Higher data rates in the Gbits/sec region could be expected if the pipelined version were implemented using more aggressive technologies. The technology roadmap projects a 0.18 µm CMOS technology to be available in 1999 and 0.15 µm in 2001 [77]. It is projected that the encoder and decoder could have maximum throughput rates of $\frac{m}{\tau} \times 3$ Gbits/sec and $\frac{nm}{(2n+2t+m+3)\tau} \times 3$ Gbits/sec respectively. These are shown in Table 5.1 for $\tau$ = 5 ns. To meet the specifications of a channel whose data rate is k times that of a single chip, it also seems that k chips could be configured to operate in parallel, where k = 1, 2, 3, ... is an integer. One could investigate the design issues and limitations involved. Work in this direction is also recommended.

m   t    Encoder (Gbits/sec)   Decoder (Gbits/sec)
3   3    1.80                  0.48
4   7    2.40                  0.71
5   15   3.00                  0.94
6   16   3.60                  1.36
7   16   4.20                  1.80
8   16   4.80                  2.21
Table 5.1: Projected RS Encoder/Decoder Data Rates
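As a worked check of the projected-rate expressions, take the last row of Table 5.1: m = 8, so n = 255, with t = 16 (the value that reproduces the tabulated decoder figure) and τ = 5 ns:

```latex
\begin{aligned}
R_{\text{enc}} &= \frac{m}{\tau}\times 3 = \frac{8}{5}\times 3 = 4.80~\text{Gbits/sec},\\
R_{\text{dec}} &= \frac{nm}{(2n+2t+m+3)\,\tau}\times 3
  = \frac{255\cdot 8}{(510+32+8+3)\cdot 5}\times 3
  = \frac{2040}{2765}\times 3 \approx 2.21~\text{Gbits/sec}.
\end{aligned}
```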
References
[1] G.D. Forney, Concatenated Codes, Cambridge: M.I.T. Press, 1966.
[2] S. Wicker and V.K. Bhargava, Reed-Solomon Codes and Their Applications, New York: The Institute of Electrical and Electronics Engineers, 1994.
[3] S. Whitaker, J. Canaris, and K. Cameron, "Reed Solomon VLSI Codec for Advanced Television," IEEE Trans. Circuits and Systems for Video Tech., vol. 1, no. 2, pp. 230-236, June 1991.
[4] S. Kwon and H. Shin, "An Area-Efficient VLSI Architecture of a Reed-Solomon Decoder/Encoder for Digital VCRs," IEEE Trans. Consumer Electronics, vol. 43, no. 4, pp. 1019-1027, Nov. 1997.
[5] Y.R. Shayan, T. Le-Ngoc, and V.K. Bhargava, "A Versatile Time-Domain Reed-Solomon Decoder," IEEE J. Selected Areas in Comm., vol. 8, pp. 1535-1542, Oct. 1990.
[6] Y.R. Shayan and T. Le-Ngoc, "A Cellular Structure for a Versatile Reed-Solomon Decoder," IEEE Trans. Comput., vol. 46, no. 1, pp. 80-85, Jan. 1997.
[7] W. Wolf, Modern VLSI Design: A Systems Approach, New Jersey: Prentice Hall, 1994.
[8] T.K. Truong, L.J. Deutsch, I.S. Reed, I.S. Hsu, K. Wang, and C.S. Yeh, "The VLSI Design of a Reed-Solomon Encoder Using Berlekamp's Bit-Serial Multiplier Algorithm," Third Caltech Conf. on VLSI, pp. 303-329, 1983.
[9] I.S. Hsu, I.S. Reed, T.K. Truong, K. Wang, C.S. Yeh, and L.J. Deutsch, "The VLSI Implementation of a Reed-Solomon Encoder Using Berlekamp's Bit-Serial Multiplier Algorithm," IEEE Trans. Comput., vol. C-33, no. 10, pp. 906-911, Oct. 1984.
[10] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed, "VLSI Architectures for Computing Multiplications and Inverses in GF(2^m)," IEEE Trans. Comput., vol. C-34, no. 8, pp. 709-716, Aug. 1985.
[11] C.L. Wang and J.L. Lin, "Systolic Array Implementation of Multipliers for Finite Fields GF(2^m)," IEEE Trans. Circuits Syst., vol. 38, pp. 796-800, July 1991.
[12] T.C. Bartee and D.I. Schneider, "Computation with Finite Fields," Information and Computation, vol. 6, pp. 79-88, March 1963.
[13] B.A. Laws and C.K. Rushforth, "A Cellular-Array Multiplier for GF(2^m)," IEEE Trans. Comput., pp. 1573-1578, Dec. 1971.
[14] C.S. Yeh, I.S. Reed, and T.K. Truong, "Systolic Multipliers for Finite Fields GF(2^m)," IEEE Trans. Comput., vol. C-33, no. 4, pp. 357-360, Apr. 1984.
[15] P.A. Scott, S.E. Tavares, and L.E. Peppard, "A Fast VLSI Multiplier for GF(2^m)," IEEE J. Select. Areas Commun., vol. SAC-4, no. 1, pp. 62-66, Jan. 1986.
[16] M.A. Hasan and V.K. Bhargava, "Bit-Serial Systolic Divider and Multiplier for Finite Fields GF(2^m)," IEEE Trans. Comput., vol. 41, no. 8, pp. 972-980, Aug. 1992.
[17] M.A. Hasan and V.K. Bhargava, "Division and Bit-Serial Multiplication over GF(2^m)," IEE Proc., part E, vol. 139, no. 3, pp. 230-236, May 1992.
[18] M.A. Hasan, M.Z. Wang and V.K. Bhargava, "Modular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields GF(2^m)," IEEE Trans. Comput., vol. 41, no. 8, pp. 962-971, Aug. 1992.
[19] M.A. Hasan, M.Z. Wang and V.K. Bhargava, "A Modified Massey-Omura Parallel Multiplier for a Class of Finite Fields," IEEE Trans. Comput., vol. 42, no. 10, pp. 1278-1280, Oct. 1993.
[20] W. Wei, "A Systolic Power-Sum Circuit for GF(2^m)," IEEE Trans. Comput., vol. 43, no. 2, pp. 226-229, Feb. 1994.
[21] S.T.J. Fenn, M. Benaissa, and D. Taylor, "Bit-serial Berlekamp-like Multipliers for GF(2^m)," Electronics Letters, vol. 31, no. 22, pp. 1893-1894, Oct. 1995.
[22] G.L. Feng, "A VLSI Architecture for Fast Inversion in GF(2^m)," IEEE Trans. Comput., vol. 38, no. 10, pp. 1383-1386, Oct. 1989.
[23] C.L. Wang and J.L. Lin, "A Systolic Architecture for Computing Inverses and Divisions in Finite Fields GF(2^m)," IEEE Trans. Comput., vol. 42, no. 9, pp. 1141-1146, Sept. 1993.
[24] G.I. Davida, "Inverse of Elements of a Galois Field," Electronics Letters, vol. 8, pp. 518-520, Oct. 1972.
[25] S.T.J. Fenn, M. Benaissa, and D. Taylor, "Fast Normal Basis Inversion," Electronics Letters, vol. 32, no. 17, pp. 1566-1567, Aug. 1996.
[26] S.M. Yen, "Improved Normal Basis Inversion in GF(2^m)," Electronics Letters, vol. 33, no. 3, pp. 196-197, Jan. 1997.
[27] L.J. Calvo and M. Torres, "Complexity of the Inversion in GF(2^m)," Electronics Letters, vol. 33, no. 3, pp. 194-195, Jan. 1997.
[28] M.A. Hasan, "Division-and-Accumulation over GF(2^m)," IEEE Trans. Comput., vol. 46, no. 6, pp. 705-708, June 1997.
[29] G. Seroussi, "A Systolic Reed-Solomon Encoder," IEEE Trans. Inform. Theory, vol. 37, no. 4, pp. 1217-1220, July 1991.
[30] C.C. Wang and D. Pei, "A VLSI Design for Computing Exponentiations in GF(2^m) and Its Application to Generate Pseudorandom Number Sequences," IEEE Trans. Comput., vol. 39, no. 2, pp. 258-262, Feb. 1990.
[31] C.L. Wang, "Bit-Level Systolic Array for Fast Exponentiation in GF(2^m)," IEEE Trans. Comput., vol. 43, no. 7, pp. 838-841, July 1994.
[32] P.A. Scott, S.J. Simmons, S.E. Tavares, and L.E. Peppard, "Architectures for Exponentiation in GF(2^m)," IEEE J. Select. Areas Commun., vol. 6, no. 3, pp. 578-586, Apr. 1988.
[33] B. Arazi, "Architectures for Exponentiation Over GF(2^m)," IEEE Trans. Comput., vol. 42, no. 4, pp. 494-497, Apr. 1993.
[34] M. Kovac and N. Ranganathan, "ACE: A VLSI Chip for Galois Field GF(2^m) Based Exponentiation," IEEE Trans. Circuits Syst. II: Analog and Digital Signal Processing, vol. 43, no. 4, pp. 289-297, Apr. 1996.
[35] G. Maki, P. Owsley, K. Cameron, and J. Shovic, "A VLSI Reed Solomon Encoder: An Engineering Approach," IEEE Trans. Custom Integrated Circuits Conf. Rec., pp. 177-181, May 1986.
[36] H.M. Shao and I.S. Reed, "On the VLSI Design of a Pipeline Reed-Solomon Decoder Using Systolic Arrays," IEEE Trans. Comput., vol. 37, no. 10, pp. 1273-1280, Oct. 1988.
[37] H.M. Shao, T.K. Truong, I.S. Hsu, and L.J. Deutsch, "A Single Chip VLSI Reed-Solomon Decoder," International Conf. Acoustics, Speech and Signal Processing, pp. 2151-2154, 1986.
[38] P. Tong, "A 40-MHz Encoder/Decoder Chip Generated by a Reed-Solomon Code Compiler," IEEE Trans. Custom Integrated Circuits Conf. Rec., pp. 13.5.1-13.5.4, May 1990.
[39] S. Whitaker, K. Cameron, G. Maki, J. Canaris, and P. Owsley, "VLSI Reed Solomon Processor for the Hubble Space Telescope," VLSI Signal Processing IV, IEEE Press, Chapter 35, 1991.
[40] K. Winters, P. Owsley, and G. Maki, "A VLSI Error Correction Decoder for Satellite Communication," Proc. Int'l. Conf. on Systems Engineering, pp. 37-44, Sept. 1984.
[41] G. Maki, P. Owsley, K. Cameron, and J. Venbrux, "VLSI Reed Solomon Decoder Design," IEEE Military Communications Conf. Rec., pp. 46.5.1-46.5.6, Oct. 1986.
[42] F. Mendez, "VHDL and Cyclic Corrector Codes," IEEE European Design Automation Conf. DAC, pp. 526-531, 1994.
[43] S. Choomchuay and B. Arambepola, "Time Domain Algorithms and Architectures for Reed-Solomon Decoding," IEE Proc., part I, vol. 40, no. 3, pp. 189-196, June 1993.
[44] T. Iwaki, T. Tanaka, T. Okuda, and T. Sasada, "Architecture of a High Speed Reed-Solomon Decoder," IEEE Trans. Consumer Electronics, vol. 40, no. 1, pp. 75-81, Feb. 1994.
[45] K. Cools, D. Devisch, K. Van Nieuwenhove, S. Vernalde, Bolsens, K. Chansik, O. Younguk and R. Lee, "ASIC Synthesis of a Flexible 80 Mbit/s Reed-Solomon Codec," IEEE European Design Automation Conf. DAC, pp. 658-663, 1994.
[46] M.A. Hasan and V.K. Bhargava, "Architecture for a Low Complexity Rate-Adaptive Reed-Solomon Encoder," IEEE Trans. Comput., vol. 44, no. 7, pp. 938-942, July 1995.
[47] H.W. Chen, J.C. Wu, G.S. Huang, J.C. Lee, and S.S. Chang, "A New VLSI Architecture of Reed Solomon Decoder with Erasure Function," IEEE Global Telecommunications Conf., pp. 1455-1459, 1995.
[48] K. Iwamura, Y. Dohi, and H. Imai, "A Design of a Reed-Solomon Decoder with Systolic-Array Structure," IEEE Trans. Comput., vol. 44, no. 1, pp. 118-122, Jan. 1995.
[49] J.M. Hsu and C.L. Wang, "An Area-Efficient Pipelined VLSI Architecture for Decoding of Reed-Solomon Codes Based on a Time-Domain Algorithm," IEEE Trans. Circuits and Systems for Video Tech., vol. 7, no. 6, pp. 864-871, Dec. 1997.
[50] J.L. Politano and D. Deprey, "A 30 Mbits/s (255,223) Reed-Solomon Decoder," EUROCODE Int'l Symp. on Coding Theory and Applications, pp. 385-392, Nov. 1990.
[51] Ph. Sadot, "VLSI Implementation of Error Correcting Codes," Electrical Comm., vol. 65, no. 2, pp. 161-167, Jan. 1992.
[52] S.W. Wei and C.H. Wei, "High-Speed Decoder of Reed-Solomon Codes," IEEE Trans. Commun., vol. 41, no. 11, pp. 1588-1593, Nov. 1993.
[53] G.C. Clark and J.B. Cain, Error Control Coding For Digital Communications, New York: Plenum, 1981.
[54] R.E. Blahut, Theory and Practice of Error Control Codes, Reading, Mass.: Addison Wesley, 1984.
[55] A. Michelson and A. Levesque, Error Control Techniques for Digital Communication, New York: Wiley, 1985.
[56] T.R. Rao and E. Fujiwara, Error-Control Coding For Computer Systems, New Jersey: Prentice Hall, 1989.
[57] J. Anderson and S. Mohan, Source and Channel Coding: An Algorithmic Approach, Boston: Kluwer Academic Publishers, 1991.
[58] G. Maki and P. Owsley, "Parallel Berlekamp vs. Conventional VLSI Architectures," Government Microcircuit Applications Conf. Rec., pp. &-9, Nov. 1986.
[59] S. Wicker, Error Control Systems for Digital Communication and Storage, New Jersey: Prentice Hall, 1995.
[60] J.L. Massey, "Shift-Register Synthesis and BCH Decoding," IEEE Trans. Info. Theory, vol. IT-15, pp. 122-127, Jan. 1969.
[61] Y. Sugiyama, S. Kasahara, and T. Namekawa, "A Method for Solving the Key Equation for Decoding Goppa Codes," IEEE Trans. Contr., vol. 27, pp. 87-89, Jan. 1975.
[62] H.M. Shao, T.K. Truong, L.J. Deutsch, J. Yuen and I.S. Reed, "A VLSI Design of a Pipeline Reed-Solomon Decoder," IEEE Trans. Comput., vol. C-34, no. 5, pp. 393-403, May 1985.
[63] S. Lin and D. Costello, Error Control Coding: Fundamentals and Applications, New Jersey: Prentice Hall, 1983.
[64] R.E. Blahut, "A Universal Reed-Solomon Decoder," IBM J. Research and Development, vol. 28, no. 2, pp. 150-159, Mar. 1984.
[65] K.Y. Liu, "Architecture for VLSI Design of Reed-Solomon Encoders," IEEE Trans. Comput., vol. C-31, no. 2, pp. 170-175, Feb. 1982.
[66] K.Y. Liu, "Architecture for VLSI Design of Reed-Solomon Decoders," IEEE Trans. Comput., vol. C-33, no. 2, pp. 178-189, Feb. 1984.
[67] R.P. Brent and H.T. Kung, "Systolic VLSI Arrays for Polynomial GCD Computation," IEEE Trans. Comput., vol. C-33, no. 8, pp. 731-736, Aug. 1984.
[68] H. Kung and M. Lam, "Fault Tolerance and Two-Level Pipelining in VLSI Systolic Arrays," Proc. MIT Conf. Advanced Res. VLSI, pp. 74-83, Jan. 1984.
[69] J. McCanny, R. Evans, and J. McWhirter, "Use of Unidirectional Data Flow in Bit-Level Systolic Array Chips," Electronics Letters, vol. 22, pp. 540-541, May 1986.
[70] F. Wang, Digital Circuit Testing: A Guide to DFT and Other Techniques, San Diego: Academic Press, 1991.
[71] Synopsys: Guidelines and Practices for Successful Logic Synthesis: Online Manual, 1996.
[72] Z. Young, "A Reed-Solomon Code Simulator and Periodicity Algorithm," M.Eng. thesis, Memorial University of Newfoundland, St. John's, Newfoundland, 1994.
[73] Y. Ye, "A General Purpose Reed-Solomon CODEC Simulator and New Periodicity Algorithm," M.Eng. thesis, Memorial University of Newfoundland, St. John's, Newfoundland, 1995.
[74] Synopsys: Design Compiler Family Reference Manual: Online Manual, 1996.
[75] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, New York: McGraw-Hill, 1993.
[76] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, San Francisco: Morgan Kaufmann Publishers, 1996.
[77] L. Geppert, IEEE Spectrum, New York: The Institute of Electrical and Electronics Engineers, vol. 35, no. 1, pp. 23-28, Jan. 1998.




