University of Windsor

Scholarship at UWindsor
Electronic Theses and Dissertations

Theses, Dissertations, and Major Papers

1997

Architectures and implementations for the Polynomial Ring
Engine over small residue rings
Sami Bizzan
University of Windsor

Follow this and additional works at: https://scholar.uwindsor.ca/etd
Part of the Engineering Commons

Recommended Citation
Bizzan, Sami, "Architectures and implementations for the Polynomial Ring Engine over small residue
rings" (1997). Electronic Theses and Dissertations. 8290.
https://scholar.uwindsor.ca/etd/8290

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor
students from 1954 forward. These documents are made available for personal study and research purposes only,
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution,
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or
thesis from this database. For additional inquiries, please contact the repository administrator via email
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.

Architectures and Implementations for the Polynomial Ring
Engine over Small Residue Rings

by

Sami Bizzan

A Dissertation
Submitted to the Faculty of Graduate Studies and Research through the
Department of Electrical Engineering in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy at the University of Windsor

Windsor, Ontario, Canada
1997

© 1997 Sami Bizzan

All Rights Reserved. No part of this document may be reproduced,
stored or otherwise retained in a retrieval system or transmitted in
any form, on any medium or by any means without the prior written
permission of the author

Approved By:

Dr. W. K. Jenkins (External examiner)

Dr. G. A. Jullien (Supervisor)

Dr. M. Ahmadi (Departmental Reader)

Dr. W. C. Miller (Departmental Reader)

Dr. S. Cameron (Dean of Graduate Studies)

Abstract
This work considers VLSI implementations for the recently introduced Polynomial Ring Engine
(PRE) using small residue rings. To allow for a comprehensive approach to the implementation of
the PRE mappings for DSP algorithms, this dissertation introduces novel techniques ranging from
system level architectures to transistor level considerations. The Polynomial Ring Engine
combines both classical residue mappings and new polynomial mappings. This dissertation
develops a systematic approach for generating pipelined systolic/semi-systolic structures for the
PRE mappings. An example architecture is constructed and simulated to illustrate the properties
of the new architectures.

To simultaneously achieve large computational dynamic range and high throughput rate, the basic
building blocks of the PRE architecture use transistor size profiling. Transistor sizing software is
developed for profiling the Switching Tree dynamic logic used to build the basic modulo blocks.
The software handles complex nFET structures using a simple iterative algorithm. Issues such as
convergence of the iterative technique and validity of the sizing formulae have been treated with
an appropriate mathematical analysis.

As an illustration of the use of PRE architectures for modern DSP computational problems, a
Wavelet Transform for HDTV image compression is implemented. An interesting use is made of
the PRE technique of using polynomial indeterminates as 'placeholders' for components of the
processed data. In this case we use an indeterminate to symbolically handle the irrational number,
$\sqrt{3}$, of the Daubechies mother wavelet for $N = 4$.

Finally, a multi-level fault tolerant PRE architecture is developed by combining the classical
redundant residue approach and the circuit parity check approach. The proposed architecture uses
syndromes to correct faulty residue channels and an embedded parity check to correct faulty
computational channels. The architecture offers superior fault detection and correction with online data interruption.

Acknowledgments

I would like to express my sincere thanks and appreciation to my supervisor, Dr. G. A. Jullien for
his invaluable advice, guidance, and constant encouragement throughout the progress of this
thesis. I would also like to thank Dr. N. M. Wigley and Dr. W. C. Miller for their support and
advice. Thanks go to all committee members for their time and advice.

Table of Contents

Chapter 1 Introduction
1.1 Introduction
1.2 Research Objectives and Review
1.3 Thesis Organization

Chapter 2 Algebraic Structures for Multidimensional Digital Signal Processing
2.1 Introduction
2.2 Algebraic Structures and Relations
2.2.1 Binary Operation
2.2.2 Groups
2.2.3 Homomorphism, Isomorphism, and Factor Groups
2.2.4 Rings and Fields
2.2.5 Polynomial Rings
2.2.6 Direct Product Rings
2.3 Polynomial Based Mappings
2.3.1 QRNS Mapping
2.3.2 Polynomial Residue Number System
2.3.3 Polynomial Ring Engine (PRE)
2.3.4 Comparisons Among Polynomial Mappings
2.4 Summary

Chapter 3 Small Moduli Polynomial Ring Engine
3.1 Introduction
3.2 Polynomial Number System
3.2.1 Finite Polynomial Ring Representation
3.2.2 Polynomial Mapping
3.2.3 Polynomial Mapping Implementation
3.3 Polynomial Ring Engine Implementation
3.3.1 Mapping Order
3.3.2 Constructing PRE Architectures
3.3.3 Probability of Overflow and Computational Accuracy
3.4 Summary

Chapter 4 Transistor Level PRE Synthesis
4.1 Introduction
4.2 Analytical Approach to nFET Chain Sizing
4.2.1 Single nFET Chain Sizing
4.2.2 Validity of the Analytical Approach Assumption
4.3 Complex nFET Logic Sizing
4.3.1 Switching Tree nFET Logic Structure
4.3.2 Sizing Algorithm
4.3.3 Sizing Software
4.3.4 Convergence
4.4 Sizing Results and Discussions
4.5 Circuit Modules
4.5.1 Circuit Structure
4.5.2 ROM Sizing and Simulation
4.6 Summary

Chapter 5 Some Specific PRE Architectures
5.1 PRE Architecture for a Wavelet Transform
5.1.1 Wavelet Transform
5.1.2 Considerations for HDTV
5.1.3 PRE Mapping Parameters
5.1.4 Hardware Requirement
5.2 Fault Tolerance
5.3 PRE Fault Tolerant Architectures
5.3.1 Architecture I
5.3.2 Architecture II
5.3.3 Architecture III
5.4 Results and Comparisons
5.5 Summary

Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

Appendix A The Residue Number System and its Extensions
A.1 Properties of Number Systems
A.2 Residue Number System
A.2.1 General Characteristics
A.2.2 Residue Representation
A.2.3 Representation of Negative Numbers
A.2.4 Residue Operation Identities
A.3 Residue Arithmetic Operations
A.3.1 Residue Addition and Subtraction
A.3.2 Residue Multiplication
A.3.3 Residue Division
A.4 Conversion Techniques
A.4.1 Binary to Residue Conversion
A.4.2 Mixed Radix Conversion
A.5 Base Extension
A.6 Redundant Residue Number System (RRNS)
A.7 Complex Residue Number System
A.7.1 Quadratic Residue Number System
A.7.2 Quadratic Like Residue Number System
A.7.3 Modified Quadratic Residue Number System
A.8 Summary

Appendix B NFET Chain Sizing
B.1 Introduction
B.2 Delay Model
B.2.1 Discharge Delay of an nFET Chain
B.2.2 RC Model
B.2.3 Elmore Delay Formula
B.3 R and C Approximations
B.3.1 Parasitic Capacitance Approximation
B.3.2 Channel Resistance Approximation
B.4 Simple nFET Chain Sizing
B.4.1 Typical Optimization Approach
B.4.2 Analytical Approach to nFET Chain Sizing
B.5 Results and Discussions
B.6 Summary

Appendix C Transistor Sizing Software
C.1 Introduction
C.2 Code Listing
C.2.1 Technology Block
C.2.2 Topology Block

List of Figures

Figure 1.1 Systolic Array Design Cycle
Figure 1.2 Relations of PRE Techniques
Figure 2.1 The Rings and Homomorphisms
Figure 3.1 Systolic Array for Matrix-Vector Multiplication
Figure 3.2 Generic Processing Element
Figure 3.3 Homogeneous Systolic Array for Forward Polynomial Mapping
Figure 3.4 Homogeneous Systolic Array for Multi-Indeterminate Forward PM
Figure 3.5 Staged Approach Forward PM Implementation
Figure 3.6 Root Processing Block
Figure 3.7 Complete PM Architecture for I=3, OO=2, OI=1
Figure 3.8 Stage Approach Overhead Plots
Figure 3.9 PRE mappings possibilities
Figure 3.10 Overall PRE Architecture
Figure 3.11 PRE Example Implementation
Figure 3.12 POF versus block length
Figure 4.1 Simple nFET Chain
Figure 4.1 Percentage Error Versus WO
Figure 4.2 Binary switching tree nFET dynamic logic
Figure 4.3 General switching tree path
Figure 4.4 Sizing Algorithm Flow Chart
Figure 4.5 Sizing Software Components
Figure 4.6 Technology Component Interface
Figure 4.7 Topology Component Interface (Circuit Description)
Figure 4.8 Topology Component Interface (Output Node Definition)
Figure 4.9 Topology Component Interface (Path listing and their delays)
Figure 4.10 Topology Component Interface (Computed Sizes)
Figure 4.11 A typical circuit section
Figure 4.12 Full tree path pull-down delay
Figure 4.13 ROM0 Switching Tree Graph
Figure 4.14 Circuit for Generating output bit 0 of ROM0
Figure 4.15 Bit Slice Implementation of Fixed Multiplication and Accumulation
Figure 5.1 Single Stage Forward WT for Images
Figure 5.2 Single Stage Inverse WT for Images
Figure 5.3 Architecture I
Figure 5.4 Correction ROMs
Figure 5.5 Architecture II
Figure 5.6 Bit slice cell with fault detection [32]
Figure 5.7 Architecture III
Figure B.1 Single CMOS Dynamic Gate Chain
Figure B.2 Dynamic Gate Timing Diagram
Figure B.3 Long Precharge State
Figure B.4 Worst Case Discharge State
Figure B.5 RC Model Construction
Figure B.6 CMOS Transistor Capacitance Model
Figure B.7 Typical Transistor Layout
Figure B.8 Test Circuit
Figure B.9 Delay VS Width for the Test Circuit
Figure B.10 Delay VS Area for Single Chain
Figure B.11 Single nFET Chain Sizing Profile

University of Windsor

(

List of Tables

Table 2.1 Quadratic Residue Rings and their Roots .................................................. 22
Table 3.1 Rings and Root sets .................................................................................... 41
Table 3.2 Notation Glossary ...................................................................................... 44
Table 3.3 Mapping Overhead for l=1, 00=24, and 01=12 ....................................... 53
Table 3.4 Mapping Overhead for l=2, 00=4, and 01=2 ........................................... 53
Table 3.5 Mapping Overhead for l=3, 00=2, and 01=1 ........................................... 54
Table 3.6 Architectural requirements for forward and reverse mapping ................... 60
Table 4.1 Full Tree Path Sizes, WAVERAGE=2.8µ, 0.8µ Process ............................ 84
Table 5.1 Primes in the Range 1-63 .......................................................................... 98
Table 5.2 Quadratic Residues Available in the Range 1-63 ...................................... 99
Table 5.3 Wavelet Transform Implementation Comparison .................................... 100
Table 5.4 Throughput Rates ..................................................................................... 101
Table 5.5 Comparisons among the fault tolerant architectures ................................ 114


Chapter 1
Introduction

1.1 Introduction

This century has witnessed the silicon phenomenon in a way that
was never imagined, even a few decades ago. Our civilization has
migrated from a purely industrial society, with its assembly line
processes, to an information age in which the movement of bits is
as important to the economy as the manufacturing of goods. The
rapid change can be attributed largely to the exponential growth [1]
in the ability to fabricate transistors on a silicon die. Currently,
device densities are measured in tens of thousands of transistors per
square millimeter of silicon (e.g. the PowerPC 604 chip measures
12.4x15.8mm in size and comprises 3.6 million transistors [2]).

With the availability of a strong silicon manufacturing
infrastructure, digital computing devices have become increasingly
affordable while offering enormous processing power. General computing
devices have been developed based on the binary number system
and modified Von Neumann architectures to reduce manufacturing
costs and to increase flexibility and usage base. These digital
devices are deployed to perform many routine tasks and can be
found in almost every intelligent machine we use, such as banking
machines, cars, telephone systems, and computers.

Although current general computing devices deliver very large data throughput rates, they
are insufficient for some real time digital signal processing algorithms. Let us estimate the
computational power required to implement a typical Digital Signal Processing (DSP)
algorithm. Assume that an image of 1024 x 768 pixels needs to be filtered using a
4 x 4 tap FIR filter in 1/30 of a second. Each output pixel requires 48 multiply and
accumulate operations (each pixel is made up of three primary color sub-pixels); we will
refer to these operations as M/A. Our example system therefore requires 1,132,462,080
M/A operations per second, and clearly special DSP hardware is the only solution to obtain
a capable, efficient, affordable, and compact implementation. As Yasuo Kato [11] points out:

"Application-oriented high speed processor development should
last forever by featuring one order higher speed processing capabilities than conventional microprocessors. Special architectures tuned
for specific application have advantages over conventional microprocessor systems."
Parallel processing is a methodology in which several tasks are accomplished within the same
time interval. Thus the processing speed requirement can be decreased by up to N-fold if
N sub-tasks can be processed concurrently. For the above image filtering example, if 48
M/A can be processed in a single clock cycle, then a system clocked at about 23.6 MHz is
adequate, and this requirement is well within today's technology. The subject of parallel processing
is not a new idea, and the realization of such systems is becoming more attractive with the
availability of inexpensive processing elements (PEs). More information about the subject
of parallel computers can be found in the text of Hwang and Briggs [3], Kung [4],
Zakharov [5] and their references.
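The arithmetic behind the image filtering estimate can be checked with a few lines of Python. This is only a sketch of the calculation described above; the constant and function names are illustrative and do not come from the dissertation.

```python
# Throughput estimate for the image filtering example in the text.
WIDTH, HEIGHT = 1024, 768      # image size in pixels
TAPS = 4 * 4                   # 4 x 4 tap FIR filter
COLOURS = 3                    # three primary colour sub-pixels per pixel
FRAME_RATE = 30                # one frame every 1/30 of a second

ma_per_pixel = TAPS * COLOURS
ma_per_second = WIDTH * HEIGHT * ma_per_pixel * FRAME_RATE


def required_clock_hz(parallel_ma_per_cycle):
    """Clock rate needed when this many M/A operations run per cycle."""
    return ma_per_second / parallel_ma_per_cycle


print(ma_per_pixel)                      # 48
print(ma_per_second)                     # 1132462080
print(round(required_clock_hz(48)))      # 23592960
```

With 48 M/A operations completed in every cycle, the required clock is therefore a little under 24 MHz.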

One can always build parallel processing hardware by implementing the DSP algorithm,
on silicon, the way it appears in the computational flow chart. This approach is very
expensive for the following reasons:

• The DSP algorithm has to be well defined a priori.
• Long data paths or feedback loops may limit the processing throughput rate.
• Minimal use of replications.
• Difficulty with simulation and clock distribution.

Much recent research has been directed to finding ways to use these parallel processing
elements in an orderly fashion. In the late 1970s, H. T. Kung [12] and his colleagues
observed that some algorithms can be mapped into arrays of locally connected identical
processing elements; the term systolic arrays was coined to describe such architectures.
The term is borrowed from medicine, and is intended to show the analogy between the
system that pumps blood through the cells in the body, and the manner of pumping digital
data through locally connected processing cells. Such architectures feature the important
properties of modularity, regularity, local interconnection, a high degree of pipelining, and
highly synchronized multiprocessing. Systolic array architectures are used to implement
various algorithms from signal processing, speech processing, image processing, and
matrix arithmetic.

Mapping of algorithms into systolic arrays often involves the use of uniform recurrence
equations (UREs) and dependence graphs (DGs). This type of mapping is algorithm
specific, with tasks and data distributed over interconnected PEs, and often yields
two dimensional arrays for many DSP algorithms. Although the mapping is not
guaranteed to be successful and efficient, many attempts to systematize the process have
been cited in the literature (e.g. [13]). 2-D systolic arrays are not scalable, and clock skew
is a major drawback which limits the size and speed of such architectures. Fisher and
Kung [14] point out:

"One-dimensional arrays can be clocked at a rate independent of
their size under fairly robust assumptions, while two-dimensional
arrays and other graphs with similar properties cannot."
If we consider the implementation of bit-level systolic arrays (introduced shortly after the
general concept of systolic arrays was identified [64]), then even one dimensional DSP
algorithms require 2-D systolic arrays at the bit-level. Clearly the way in which we
perform the arithmetic has a direct bearing on the connectivity of the systolic array.
Weighted magnitude arithmetic will require full 2-D connectivity at the bit-level, though
some relief from strict synchronization requirements may be obtained by using redundant
arithmetic representations. An alternative to weighted magnitude representations, in which
the computations are performed over independent modulo rings, effectively removes the
two-dimensional connectivity between rings. The exploration of such computational
strategies at the silicon level is the subject of this dissertation.

The most obvious technique to compute over independent rings is to employ the Residue
Number System (RNS) (see, for example, [6][7][9][15][16]). This representation strategy
is at least two thousand years old and certainly can be traced back to ancient China [6].
The mathematical foundations for this residue arithmetic were established by Gauss, and
the reconstruction technique became known as the Chinese Remainder Theorem, CRT, in
honor of its origins. Attempts to construct general computing processors based on the RNS
[16][17] have not been successful due to the inherent difficulties with the system's
implementation of some logical and arithmetic operations such as magnitude comparison
and general division. Moreover, the RNS imposes some restrictions on the moduli set
selection which make it difficult to achieve large dynamic range computations with small
finite rings. This problem has been alleviated recently by the introduction of a polynomial
mapping technique [18] in which polynomials are used to represent binary numbers with
replicates associated with each of the polynomial coefficients. This leads to an
enhancement of not only the computational dynamic range but also the architectural
properties as related to VLSI implementation. This specific strategy is explored in this
dissertation.
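The idea of computing over independent rings and reconstructing with the CRT can be sketched in Python as follows. The moduli set (3, 5, 7) and all function names are assumptions chosen for illustration, not taken from the dissertation, and the modular inverse via `pow(x, -1, m)` assumes Python 3.8 or later.

```python
from math import prod

MODULI = (3, 5, 7)   # hypothetical pairwise relatively prime moduli, M = 105


def to_rns(x, moduli=MODULI):
    """Forward mapping: one residue per independent ring."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Addition carried out independently in every ring."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Multiplication carried out independently in every ring."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reverse mapping by the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the inverse mod m
    return total % M


# Inner product step 8 * 9 + 4, computed without any carries between rings.
result = from_rns(rns_add(rns_mul(to_rns(8), to_rns(9)), to_rns(4)))
print(result)   # 76
```

The result is only meaningful while it stays below M = 105, which is the dynamic range limitation the paragraph above refers to.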

The redundant residue number system, RRNS, is an extension of the RNS which
provides advantages in constructing fault tolerant architectures [40][41]. Redundant
residue channels are utilized in the detection and/or correction of corrupted data resulting
from a failure in any residue channel, including the redundant ones. Two distinct approaches
have been cited in the literature. The first approach employs base extension to generate
syndromes that act as addresses to a set of correction ROMs. The output of the ROMs is
used to modify the overall result with a set of binary adders [29]. The second approach
uses multiple projections of the output to determine its legitimacy and to locate and correct
the erroneous residue module [30][31]. Circuit-level parity bits have been embedded with
the bit-slice inner product step processor, BIPSP [32], to provide a circuit level
mechanism of fault detection that can be embedded within the above two approaches.
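A much simplified view of how a redundant channel exposes a fault is sketched below. It performs only a legitimacy (range) test and is not the syndrome/ROM scheme of [29] or the projection scheme of [30][31]. The moduli, with one redundant modulus larger than the others, and all names are assumptions made for illustration; `pow(x, -1, m)` assumes Python 3.8 or later.

```python
from math import prod

INFO_MODULI = (3, 5, 7)        # hypothetical non-redundant moduli
REDUNDANT_MODULI = (11,)       # hypothetical redundant modulus, larger than the others
ALL_MODULI = INFO_MODULI + REDUNDANT_MODULI
LEGITIMATE_RANGE = prod(INFO_MODULI)   # valid results lie in [0, 105)


def encode(x):
    """Residues of x in every channel, including the redundant one."""
    return [x % m for m in ALL_MODULI]


def crt(residues, moduli):
    """Chinese Remainder Theorem reconstruction over the given moduli."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M


def is_legitimate(residues):
    """A residue vector is accepted only if it decodes into the legitimate range."""
    return crt(residues, ALL_MODULI) < LEGITIMATE_RANGE


word = encode(52)
print(is_legitimate(word))     # True
word[1] = (word[1] + 1) % 5    # corrupt one residue channel
print(is_legitimate(word))     # False
```

With a single redundant modulus this test can only detect a single faulty channel; locating and correcting it needs further redundancy, as in the cited approaches.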

The RNS, as a computational tool, attracted some interest in the 1950s for its fault tolerant
properties, which could potentially improve the reliability of vacuum tube digital
computers. This interest soon declined with the development of more reliable transistor
based computers. Not until the late 1970s and early 1980s, when the cost of digital chips
dropped, did the implementation of special purpose DSP algorithms using RNS become
feasible; INMOS has reported the use of RNS in a signal processing chip [28]. A common
implementation procedure is to use ROMs to provide parallel arrays of look-up tables for
performing the residue arithmetic [9]. Numerous RNS-based DSP algorithms have been
reported in the literature [7][23][24].
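The look-up table approach can be illustrated with a small software model in which each ROM is a precomputed table indexed by the two operand residues. The modulus 7 and the function names are illustrative assumptions only.

```python
def build_rom(modulus, op):
    """Tabulate op(a, b) mod modulus for every pair of residues."""
    return [[op(a, b) % modulus for b in range(modulus)]
            for a in range(modulus)]


MUL_ROM_7 = build_rom(7, lambda a, b: a * b)
ADD_ROM_7 = build_rom(7, lambda a, b: a + b)


def mac(acc, a, b):
    """Multiply and accumulate modulo 7 using only table look-ups."""
    return ADD_ROM_7[acc][MUL_ROM_7[a][b]]


print(mac(3, 5, 6))   # 5, since 3 + 5 * 6 = 33 and 33 mod 7 = 5
```

For a modulus of 7 each operand needs 3 bits, so each table corresponds to a ROM with 6 address bits and a 3-bit output word.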

Extensive research has been carried out in the synthesis, minimization, and layout of
ROMs. These essential digital building blocks are traditionally constructed using row and
column decoders to access a programmed bit value, similar to the method used in building
RAMs. Another distinctive approach has emerged recently using the switching tree
concept [25][26]. A full binary tree is constructed with transistors, according to the truth
table of the desired logic function, and then minimized as a graph. The tree is placed in a
dynamic logic latch in which a logic '1' is produced for each input state that allows at least
one path to form connecting the top and bottom nodes of the tree. G.A. Jullien, and his
research group, have implemented this switching tree approach and have also developed
software [25] that synthesizes minimized dynamic logic ROMs for a given truth table
content. The minimization algorithm reduces area and transistor count of the final circuit
using a graph theoretic approach [26]. This technique is particularly effective for small
ROMs (less than 9 address bits) provided that sizing of transistors in the edges of the
minimized graph is performed.
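The switching tree idea can be modelled in software as sketched below. This is not the synthesis software of [25] or the minimization algorithm of [26]; as a stand-in, the sketch only prunes branches that never conduct and shares identical subtrees, and every name in it is hypothetical.

```python
def build_tree(bits):
    """Build a switching tree from a truth table.

    bits has length 2**n; index i holds the output for the input word whose
    first variable is the most significant bit of i.  A leaf is True when the
    path reaches the bottom node and None when no path exists.  An internal
    node is a pair (child for input 0, child for input 1).
    """
    if len(bits) == 1:
        return True if bits[0] else None
    half = len(bits) // 2
    low, high = build_tree(bits[:half]), build_tree(bits[half:])
    if low is None and high is None:
        return None                      # prune branches that never conduct
    return (low, high)


def count_transistors(node, shared=None):
    """Count edges (one transistor each); pass a set to share equal subtrees."""
    if node is None or node is True:
        return 0
    if shared is not None:
        if node in shared:
            return 0                     # this subtree has already been built
        shared.add(node)
    return sum(1 + count_transistors(child, shared)
               for child in node if child is not None)


def evaluate(node, inputs):
    """Return 1 if the input word closes a path from top to bottom."""
    for bit in inputs:
        if node is None:
            return 0
        node = node[bit]
    return 1 if node is True else 0


parity = tuple(bin(i).count("1") % 2 for i in range(8))   # 3-input odd parity
tree = build_tree(parity)
print(count_transistors(tree))          # 10 edges in the pruned full tree
print(count_transistors(tree, set()))   # 8 edges once equal subtrees are shared
```

The output is a logic '1' exactly when some path connects the top and bottom nodes, which is what `evaluate` checks for a given input word.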

In the mid 1980s, Shoji showed that sizing a chain of serially connected transistors can
reduce the gate area by up to 30% and simultaneously reduce the discharge delay by about
10% [69]. Sizing transistors has become an essential part of circuit synthesis not only to
optimize the delay/area trade-off but also to reduce power consumption [77]. Yuan and
Svensson have incorporated a simple switch level delay model, TMODS [71], using an
iterative optimization technique to size a recently introduced dynamic single phase clock
latch. They have been able to design high performance dynamic logic circuits with mature
CMOS technologies [78]. A novel analytical approach to size NFET chains has been
presented in [55] which uses empirically generated assumptions to formulate a set of
equations which are then evaluated in a back-substitution fashion in order to optimally
size the chain with no iterations. RNS architectures can benefit substantially from
employing ROMs with sized transistor chains, in which the trade-off between area and
speed does not stop at the selection of the moduli set but also extends to the sized circuits.

In order to visualize the design process which we have been discussing, it is constructive
to capture the essential elements in the design cycle methodology.

Figure 1.1 shows a typical approach to implement a DSP algorithm using systolic array
architectures. The problem is characterized first by defining the function of the algorithm
using descriptive equations or a flowchart, and then determining the desired level of
performance (e.g. throughput rate, latency, signal to noise ratio, etc.). Depending on the
type of systolic array desired, mapping procedures are employed to define the systolic
array architecture, and all resulting PEs are synthesized using the desired circuit structure.
Circuit level techniques such as transistor sizing to minimize delay and circuit
modifications to minimize charge sharing, are employed. If all performance measures are
satisfied, then the architecture is built into silicon. If some of the specifications are not
satisfied, then some of the design factors are varied to achieve acceptable results.

Behavioral, functional, and timing simulations must be carried out at all stages of the
design process to ensure the validity and integrity of the design.


Figure 1.1 Systolic Array Design Cycle
[Flowchart: define the algorithm; select mapping method and define the architecture; perform circuit synthesis; apply circuit level techniques; decision step with a "No" branch; layout the architecture.]

We conclude this section by describing various characteristics of an architecture that lends
itself to VLSI implementation. The architecture should possess one or more of the
following properties [27]:

• The architecture can be built by replication of a small number of different cells.
• The architecture contains regular connectivity patterns; data and control line runs are
local and modular.
• The architecture is heavily pipelined and highly utilized; the number of active PEs in
any given pipeline period is high.


1.2 Research Objectives and Review
The main objective of this research is to develop suitable VLSI implementations of the
newly introduced Polynomial Ring Engine, PRE, mapping using small finite rings [18].
This dissertation presents many techniques relevant to the design and implementation
process of the PRE architectures, from the system level down to the transistor level. While
the mathematical theory of the PRE is documented in the literature, there have been no
significant attempts to translate these algebraic structures into silicon. This dissertation
contains a detailed set of techniques to exploit the independent computational properties
of the PRE and specifically provides contributions in the following areas:

1. A complete mathematical theory for the PRE and an accompanying RNS
tutorial.
2. A set of systematic approaches that generate semi-systolic arrays for PRE
architectures with associated circuit descriptions.
3. Development of high performance circuit modules to provide building
blocks for the PRE architectures.
4. Development of a novel transistor sizing technique, with an associated
software package, to optimize cost functions based on selected
performance-area trade-offs for switching tree PRE architectures.
5. New fault tolerant architectures that combine a mathematical system level
approach with circuit level parity checking for superior fault coverage.
6. An illustrative example of the power of the PRE technique in embedding
algebraic integers. We have used a wavelet transform implementation for
HDTV image compression that exploits the algebraic properties of
Daubechies coefficients.
Figure 1.2 shows the interaction of the techniques and approaches developed in this
dissertation to yield a complete design package for implementing small moduli ring PRE
architectures on silicon. (Note that BIPSPm refers to a specific circuit structure that
enables modulo operations to be performed in a bit-level fashion. This structure also can
be used for circuit level fault detection [32].)


Figure 1.2 Relations of PRE Techniques
[Block diagram: PRE Theory; Circuit Modules; Sizing Techniques; BIPSPm; Basic Pipelined Structures; Fault Tolerance (System Level, Circuit Level, Combination); PRE Architectures Design Package; Wavelet Transform Architecture Example.]

1.3 Thesis Organization

The thesis contains six chapters. Chapter 2 starts by briefly reviewing the algebraic
structures essential in the construction of finite integer systems. Fundamental mapping
concepts among finite algebraic structures are given. The mathematical structures of
relevant integer mapping systems based on finite polynomial rings are reviewed, including
the newly introduced polynomial ring engine. The chapter ends by comparing various
mappings in terms of implementation and capabilities.

Chapter 3 introduces various implementation strategies for the polynomial ring engine. It
starts by defining a polynomial number system as an independent mathematical structure
able to represent and map integers to parallel processing finite rings. Various novel silicon
architectures are then introduced based on the mapping requirements, such as the number
of indeterminates and roots used. The polynomial ring engine architecture is then
constructed by embedding RNS mappings within the polynomial mapping.


Transistor level techniques, which enable the silicon implementation of PRE architectures
via switching trees, are discussed in Chapter 4. We begin with a brief introduction to a
transistor sizing technique, previously developed by the author. An iterative algorithm is
then introduced to size complex nFET logic blocks; the description of a software tool to
perform this sizing is provided in an appendix. Extensive circuit simulation and transistor
sizing results are provided for the circuit module development. Also in Chapter 4, sample
layouts for a MOD 7 multiplier (one of the more complex circuit structures required by the
small ring PRE) are provided based on the BIPSP cell construction and switching tree
logic synthesis.

Chapter 5 presents results of our system building procedure by providing an example
implementation of a PRE architecture for a Daubechies wavelet transform. The
architecture is targeted for lossy image compression and conforms to HDTV standards.
Various approaches to achieve fault tolerance for the PRE architectures are discussed;
these rely on either system level or circuit level error check and recovery. A novel
architecture based on both levels is introduced and discussed.

The main body of the dissertation concludes with comments and discussions in Chapter 6.
Suggestions for further research activities along the theme of this dissertation are also
presented.

The dissertation also includes appendices that will be helpful to the interested reader.
Appendix A reviews the residue number system and its variations. Appendix B contains
details on transistor sizing. The source code listing of the transistor sizing software
package is given in Appendix C.


Chapter 2
Algebraic Structures for
Multidimensional
Digital Signal Processing

2.1 Introduction

The work in this dissertation is concerned with implementing
various concepts in number theory and abstract algebra, and in this
chapter we briefly review the mathematical concepts that we
employ.

This chapter is not intended as an introduction to the general area of
abstract algebra. To make the review concise and interesting, proofs
of theorems are left out intentionally; they can be found in any
standard introductory algebra text book. The interested reader in
abstract algebra or discrete mathematics may consult Fraleigh's
book entitled "A First Course in Abstract Algebra" [33] which
contains an excellent development of the subject. There are other
references that deal with the subject as it relates to building DSP
architectures [34] [6] [35].

We start the chapter with some basic definitions; we then discuss
the concepts of groups, rings, and fields. Polynomial rings and other
complex algebraic structures are also reviewed along with mapping
concepts such as homomorphisms and isomorphisms, which are the
basis for some recent advances in multi-dimensional computing
systems. Various developments in the field of polynomial mappings
are discussed and a brief comparison is presented.

The reader is encouraged to review Appendix A, which contains a
formal definition of the Residue Number System along with its variants.

2.2 Algebraic Structures and Relations

2.2.1 Binary Operation
Definition (Binary Operation): A binary operation $\bullet$ on a set $S$ is a rule that assigns to
each ordered pair $(a, b)$ of elements of $S$ some element of $S$.

Definition (Commutative Operation): A binary operation $\bullet$ on a set $S$ is commutative if
$a \bullet b = b \bullet a$ for all $a, b \in S$.

Definition (Associative Operation): A binary operation $\bullet$ on a set $S$ is associative if
$(a \bullet b) \bullet c = a \bullet (b \bullet c)$ for all $a, b, c \in S$.

2.2.2 Groups
Definition (Group): A group $\langle G, \bullet \rangle$ is a set $G$ on which a binary operation $\bullet$ is defined
and the following axioms hold:

1. The binary operation $\bullet$ is associative.
2. There exists an element $e \in G$ such that $a \bullet e = e \bullet a = a$ for all $a \in G$.
3. For each $a \in G$, there is an element $a' \in G$ with the property that
$a \bullet a' = a' \bullet a = e$.


The element $e$ is the identity element for $\bullet$ of $G$. Also, the element $a'$ is the inverse of $a$
with respect to the operation $\bullet$. It can be easily proven that all inverses are uniquely
defined in the group structure.

Definition (Abelian Group): A group $G$ is called Abelian$^1$ if its binary operation $\bullet$ is
commutative.

Definition (Order of G): If $G$ is a finite group, then the order $|G|$ of $G$ is the number of
elements in $G$. In general, for any finite set $S$, $|S|$ denotes the number of elements in $S$. (If
$S$ is infinite then $|S|$ denotes the cardinal number of $S$.)

Definition (Cyclic group, generator): The cyclic group of finite order $n$, denoted by $C_n$, is
a group consisting of the elements $e, a, a^2, \ldots, a^{n-1}$ with multiplication$^2$ subject to the
condition $a^n = e$ (here, of course, $a^2 = a \cdot a$, $a^3 = a \cdot a \cdot a$, etc.). An element $a$ is
called a generator of $C_n$.

The following theorem and definition are needed for Theorem 2.3.

Theorem 2.1 (Division algorithm for $Z$): If $m$ is a positive integer and $n$ is any
integer, then there exist unique integers $q$ and $r$ such that

\[ n = mq + r \quad \text{and} \quad 0 \le r < m. \]

$q$ is referred to as the quotient and $r$ as the non-negative remainder when $n$ is divided by
$m$.

1. Named after the mathematician N. H. Abel (1802-1829).
2. Note the use of multiplication notation to express the binary operation.

Definition (Modulo operation): Let $n$ be a fixed positive integer and let $h$ and $k$ be any
integers. The remainder $r$ when $h + k$ is divided by $n$, as in the division algorithm, is the sum
of $h$ and $k$ modulo $n$.

Theorem 2.2 Every cyclic group is abelian.

Theorem 2.3 The set $\{0, 1, 2, \ldots, n-1\}$ is a cyclic group $Z_n$ of $n$ elements under
addition modulo $n$.
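The division algorithm, addition modulo $n$, and the cyclic structure of $Z_n$ can be explored with the short Python sketch below. The function names are illustrative; the only library behaviour relied upon is that `divmod` returns a remainder in the range 0 to n - 1 whenever n is positive.

```python
def add_mod(h, k, n):
    """Sum of h and k modulo n: the remainder r with 0 <= r < n."""
    _quotient, remainder = divmod(h + k, n)   # non-negative remainder for n > 0
    return remainder


def generated_by(a, n):
    """Elements of Z_n reached by repeatedly adding a, starting from 0."""
    elements, x = [], 0
    while True:
        elements.append(x)
        x = add_mod(x, a, n)
        if x == 0:
            return elements


def generators(n):
    """All elements that generate the whole group Z_n."""
    return [a for a in range(n) if len(generated_by(a, n)) == n]


print(add_mod(-7, 3, 5))     # 1, since -4 = 5 * (-1) + 1
print(generated_by(1, 6))    # [0, 1, 2, 3, 4, 5]
print(generators(12))        # [1, 5, 7, 11]
print(generators(7))         # [1, 2, 3, 4, 5, 6]
```

The last line shows a group of prime order in which every element other than the identity is a generator.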

Theorem 2.4 Every group of prime order $p$ is cyclic.

Definition (Greatest common divisor gcd): Let $r$ and $s$ be any two positive integers. The
positive generator $d$ of the cyclic group

\[ C = \{ nr + ms \mid n, m \in Z \} \]

under addition is the greatest common divisor of $r$ and $s$.
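Since $d$ belongs to the set $\{nr + ms\}$, it can be written as an integer combination of $r$ and $s$. The extended Euclidean algorithm, a standard method that is not described in the text, finds such a combination; the sketch below uses illustrative names.

```python
from math import gcd


def extended_gcd(r, s):
    """Return (d, n, m) with d = gcd(r, s) = n*r + m*s."""
    old_d, d = r, s
    old_n, n = 1, 0
    old_m, m = 0, 1
    while d != 0:
        q = old_d // d
        old_d, d = d, old_d - q * d
        old_n, n = n, old_n - q * n
        old_m, m = m, old_m - q * m
    return old_d, old_n, old_m


d, n, m = extended_gcd(12, 42)
print(d, n * 12 + m * 42)    # 6 6
print(d == gcd(12, 42))      # True
```

Every element of the set is a multiple of $d$, which is why $d$ generates the group.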

Groups have all the structural properties needed to solve equations of the form $ax = b$ or
$ya = b$, as the following theorem states.

Theorem 2.5 If $G$ is a group with binary operation $\bullet$, and if $a$ and $b$ are any elements
of $G$, then the equations $a \bullet x = b$ and $y \bullet a = b$ have unique solutions in $G$.

2.2.3 Homomorphism, Isomorphism, and Factor Groups
Definition (Function or mapping): A function or mapping $\varphi$ from a set $A$ into a set $B$ is
a rule which assigns to each element $a$ of $A$ exactly one element $b$ of $B$. We say that $\varphi$
maps $a$ to $b$, and that $\varphi$ maps $A$ into $B$. The mapping is denoted symbolically by

\[ \varphi : A \to B \]

Definition (One to one and onto): A function from a set $A$ into a set $B$ is one to one if
each element of $B$ has at most one element of $A$ mapped onto it, and is onto $B$ if each
element of $B$ has at least one element of $A$ mapped onto it.

Definition (Homomorphism): The mapping $\varphi : G \to G'$ is a homomorphism from $G$ to $G'$
if for every $a, b \in G$ we have $\varphi(a \bullet b) = \varphi(a) \bullet \varphi(b)$.

Theorem 2.6 Let $\varphi$ be a homomorphism of a group $G$ into a group $G'$.

1. The map preserves identity (i.e., $e$ in $G$ maps to $e'$ in $G'$).
2. The map preserves inverses (i.e., if $a \in G$, then $\varphi(a^{-1}) = \varphi(a)^{-1}$).
3. If $H$ is a subgroup of $G$, then $\varphi(H)$ is a subgroup of $G'$.
4. If $K'$ is a subgroup of $G'$, then $\varphi^{-1}(K')$ is a subgroup of $G$.

Definition (Kernel): Let $\varphi : G \to G'$ be a group homomorphism. The subgroup $\varphi^{-1}(e')$,
consisting of all elements of $G$ mapped by $\varphi$ into the identity $e'$ of $G'$, is the kernel of $\varphi$.

Definition (Isomorphism): An isomorphism $\varphi : G \to G'$ is a homomorphism that is
one to one and onto $G'$; this is denoted by $G \cong G'$.

What this means is that the two groups have the same structural features, and one can be
made to look exactly like the other by a renaming of the elements.

Definition (Cosets): Let $H$ be a subgroup of a group $G$. The subset $aH = \{ ah \mid h \in H \}$ of
$G$ is the left coset of $H$ containing $a$, and $Ha = \{ ha \mid h \in H \}$ is the right coset of $H$
containing $a$.

Theorem 2.7 Let $H$ be a subgroup of $G$ whose left and right cosets coincide. The
cosets of $H$ form a group $G/H$ under the binary operation $(aH)(bH) = (ab)H$.

Definition (Factor Group): The group $G/H$ is called the factor group (or quotient
group) of $G$ modulo $H$.
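As a small concrete case, the cosets of the subgroup $H = \{0, 3\}$ of $Z_6$ (written additively) can be enumerated and combined as sketched below. The choice of group and subgroup, and all names, are assumptions made for illustration.

```python
N = 6
H = frozenset({0, 3})          # subgroup of Z_6 under addition modulo 6


def coset(a):
    """The coset of H containing a (left and right cosets coincide here)."""
    return frozenset((a + h) % N for h in H)


cosets = {coset(a) for a in range(N)}


def coset_add(A, B):
    """(aH)(bH) = (ab)H in additive notation, checked for every representative."""
    results = {coset(a + b) for a in A for b in B}
    assert len(results) == 1   # the result does not depend on the representatives
    return next(iter(results))


print(sorted(sorted(c) for c in cosets))          # [[0, 3], [1, 4], [2, 5]]
print(sorted(coset_add(coset(1), coset(2))))      # [0, 3]
```

The three cosets behave like $Z_3$ under this operation, with $H$ itself acting as the identity.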

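Theorem 2.8 Let $G_1, G_2, \ldots, G_n$ be groups. For the ordered n-tuples $(a_1, a_2, \ldots, a_n)$
and $(b_1, b_2, \ldots, b_n)$ in $\prod_{i=1}^{n} G_i$, define the product component-wise as
$(a_1 b_1, a_2 b_2, \ldots, a_n b_n)$. Then $\prod_{i=1}^{n} G_i$ is a group under this operation,
called the finite direct product of groups.

Theorem 2.9 (Fundamental Theorem of Finitely Generated Abelian Groups):
Every finitely generated abelian group $G$ is isomorphic to a direct product of cyclic groups
in the form

\[ Z_{p_1^{r_1}} \times Z_{p_2^{r_2}} \times \cdots \times Z_{p_n^{r_n}} \times Z \times Z \times \cdots \times Z, \]

where the $p_i$ are primes, not necessarily distinct.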

The above two theorems form the basis of techniques which allow computations in large
order groups (large dynamic range) to be performed in pair-wise (parallel) fashion over
smaller order groups. The challenge is to find a suitable (implementable) isomorphic map.
Groups are defined for a single binary operation, while most DSP algorithms require two
operations, namely addition and multiplication. The following section introduces
algebraic structures more appropriate for this purpose.
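A minimal example of such an isomorphic map is the decomposition of $Z_6$ into $Z_2 \times Z_3$. The sketch below checks by exhaustion that the map is one to one, onto, and preserves addition; the map `phi` and the helper names are illustrative assumptions.

```python
from itertools import product


def phi(x):
    """Candidate isomorphism from Z_6 to Z_2 x Z_3."""
    return (x % 2, x % 3)


def add_pair(u, v):
    """Component-wise addition in Z_2 x Z_3."""
    return ((u[0] + v[0]) % 2, (u[1] + v[1]) % 3)


images = [phi(x) for x in range(6)]
one_to_one = len(set(images)) == 6
onto = set(images) == set(product(range(2), range(3)))
homomorphic = all(phi((a + b) % 6) == add_pair(phi(a), phi(b))
                  for a in range(6) for b in range(6))

print(one_to_one, onto, homomorphic)   # True True True
```

Additions in the order 6 group can therefore be carried out as two independent additions in the smaller groups.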


2.2.4 Rings and Fields
Definition (Ring): A ring $(R, +, \bullet)$ is a set $R$ along with two binary operations $+$ and $\bullet$, called
addition and multiplication respectively, defined in $R$ such that the following hold:

1. $(R, +)$ is an abelian group.
2. Multiplication is associative.
3. The left and right distributive laws hold. For all $a, b, c \in R$:
$a \bullet (b + c) = a \bullet b + a \bullet c$, $(b + c) \bullet a = b \bullet a + c \bullet a$.

Definition (Commutative ring, unity): If multiplication in $R$ is commutative, then $R$ is a
commutative ring. If the ring contains a unity $1$ such that $a \bullet 1 = 1 \bullet a = a$ for all $a \in R$,
then $R$ is a ring with unity.

Definition (Unit, field): Let R be a ring with unity. An element u in R is a unit of R if it has
a multiplicative inverse in R. If every nonzero element of R is a unit, then R is a division
ring. A field is a commutative division ring.

Rings are sufficient for most DSP architectures since they support multiply and
accumulate operations. Fields contain multiplicative inverses for all nonzero elements,
which makes them suitable for solving linear equations of the form ax + b = 0 and for
building other general DSP architectures. Fields also allow multiplication of non-zero
elements to be mapped to addition (finite field logarithms), which provides an
implementation advantage.
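The mapping of multiplication to addition can be sketched for the field $Z_7$ using index (logarithm) tables built from a generator of its multiplicative group. The choice of field, the generator 3, and the names below are assumptions for illustration only.

```python
P = 7          # a prime, so Z_7 is a field
G = 3          # assumed generator of the multiplicative group of Z_7

ANTILOG = [pow(G, k, P) for k in range(P - 1)]          # k -> G**k mod P
LOG = {value: k for k, value in enumerate(ANTILOG)}     # G**k mod P -> k


def mul_via_logs(a, b):
    """Multiply in Z_7 by adding indices modulo P - 1."""
    if a == 0 or b == 0:
        return 0          # zero has no logarithm and is handled separately
    return ANTILOG[(LOG[a] + LOG[b]) % (P - 1)]


print(ANTILOG)               # [1, 3, 2, 6, 4, 5]
print(mul_via_logs(5, 6))    # 2, since 30 mod 7 = 2
```

A multiplier is thus replaced by two small look-up tables and a modulo 6 adder, at the cost of treating zero as a special case.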

Definition (Homomorphism of rings): A map $\varphi : R \to R'$ is a homomorphism if for all
$a, b \in R$, the following properties are satisfied:

1. $\varphi(a + b) = \varphi(a) + \varphi(b)$
2. $\varphi(a \bullet b) = \varphi(a) \bullet \varphi(b)$

Definition (Isomorphism of rings): An isomorphism $\varphi : R \to R'$ is a homomorphism
that is one to one and onto $R'$.

Definition (Divisors of 0): If $a$ and $b$ are two nonzero elements of a ring $R$ such that
$ab = 0$, then $a$ and $b$ are divisors of 0.

Theorem 2.10 In the ring $Z_n$, the divisors of 0 are precisely the nonzero elements that are not
relatively prime to $n$.

Theorem 2.11 If $p$ is prime, then $Z_p$ is a field.

Theorem 2.12 A field has no divisors of 0.
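These three theorems can be checked by brute force for small moduli. The sketch below is an illustration with hypothetical helper names, not a proof.

```python
from math import gcd


def zero_divisors(n):
    """Nonzero a in Z_n with a * b = 0 (mod n) for some nonzero b."""
    return [a for a in range(1, n)
            if any((a * b) % n == 0 for b in range(1, n))]


def units(n):
    """Elements of Z_n that have a multiplicative inverse."""
    return [a for a in range(1, n)
            if any((a * b) % n == 1 for b in range(1, n))]


print(zero_divisors(12))   # [2, 3, 4, 6, 8, 9, 10]
print(zero_divisors(7))    # []
print(units(7))            # [1, 2, 3, 4, 5, 6]
```

For the prime modulus 7 every nonzero element is a unit and there are no divisors of 0, as expected for a field.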

Notation (Finite rings of integers): Let $m$ be any positive integer. The ring of integers
modulo $m$ is denoted by $R(m)$ and is given by

\[ R(m) = \{ S : \oplus_m, \otimes_m \} \quad \text{and} \quad S = \{ 0, 1, \ldots, m-1 \} \tag{2.1} \]

where $a \oplus_m b$ and $a \otimes_m b$ denote modulo $m$ addition and multiplication, respectively.
We can extend the notion of addition and multiplication from the elements of $S$ to all of the
integers.

2.2.5 Polynomial Rings
Definition (Polynomial, degree): A polynomial $f(x)$ with coefficients in the ring $R$ is an
infinite formal sum,

\[ f(x) = \sum_{i=0}^{\infty} a_i x^i = a_0 + a_1 x + \cdots + a_n x^n + \cdots \tag{2.2} \]

where $a_i \in R$ and $a_i = 0$ for all but a finite number of values of $i$. The degree of the
polynomial, $\deg f(x)$, is the largest $i$ for which $a_i \ne 0$. If $\deg f(x)$ is $n$ and $a_n = 1$, the
polynomial is called a monic polynomial. Polynomials of the form $a_0$ are known as
constant polynomials.

Theorem 2.13 (Polynomial Rings)$^1$: The set $R[x]$ of all polynomials in an indeterminate
$x$, with coefficients in a ring $R$, is a ring under polynomial addition and multiplication.
If $R$ is commutative, so is $R[x]$, and if $R$ has 1 as unity, then 1 is also unity for $R[x]$.

The above theorem extends easily to polynomials with more than one indeterminate with
the usual addition and multiplication.

Theorem 2.14 (Division Theorem for Polynomials): Let $F$ be a field and let
$a(x), b(x) \in F[x]$, with $b(x) \ne 0$. Then there are unique polynomials $q(x)$ and $r(x)$ in
$F[x]$ such that

\[ a(x) = b(x) q(x) + r(x) \tag{2.3} \]

where either $\deg r(x) < \deg b(x)$ or $r(x)$ is the zero polynomial.
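Polynomial long division over a prime field $Z_p$ can be sketched as follows. Coefficients are stored lowest degree first; this representation and the function names are assumptions made for illustration, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
def trim(poly):
    """Drop zero coefficients of highest degree (stored lowest degree first)."""
    poly = list(poly)
    while poly and poly[-1] == 0:
        poly.pop()
    return poly


def poly_divmod(a, b, p):
    """Return (q, r) with a = b*q + r over Z_p and deg r < deg b (or r = 0)."""
    a, b = trim(c % p for c in a), trim(c % p for c in b)
    if not b:
        raise ZeroDivisionError("b(x) must be nonzero")
    inv_lead = pow(b[-1], -1, p)          # inverse of the leading coefficient
    q = [0] * max(len(a) - len(b) + 1, 0)
    r = a
    while len(r) >= len(b):
        shift = len(r) - len(b)
        coeff = (r[-1] * inv_lead) % p
        q[shift] = coeff
        for i, bc in enumerate(b):        # subtract coeff * x**shift * b(x)
            r[i + shift] = (r[i + shift] - coeff * bc) % p
        r = trim(r)
    return trim(q), r


# (x^2 + 1) divided by (x + 1) over Z_7
print(poly_divmod([1, 0, 1], [1, 1], 7))   # ([6, 1], [2])
```

The empty list stands for the zero polynomial, so an exact division returns an empty remainder.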

Definition (Divisor, gcd): $g(x)$ is a divisor of $f(x)$ in $F[x]$ if there is a polynomial $h(x)$
in $F[x]$ such that $f(x) = g(x) h(x)$. Given any two polynomials $a(x)$ and $b(x)$ in
$F[x]$, the greatest common divisor (gcd) of $a(x)$ and $b(x)$ is $d(x)$ such that

1. $d(x)$ is a divisor of $a(x)$ and $b(x)$,
2. any divisor of $a(x)$ and $b(x)$ is also a divisor of $d(x)$.

Footnote 1: If the polynomial coefficients belong to a field, F, then F[x] does not form a polynomial field since the polynomial x is not a unit (there exists no $f(x) \in F[x]$ such that x f(x) = 1) [33].

Theorem 2.15 Let F be a field and d(x) be the gcd of the polynomials a(x) and b(x) in F[x]. Then there are polynomials $\lambda(x)$ and $\mu(x)$ in F[x], such that

$$d(x) = \lambda(x)a(x) + \mu(x)b(x) \tag{2.4}$$

Definition (Irreducible polynomial): For all $f(x), g(x), h(x) \in F[x]$, the polynomial f(x) is irreducible if it is not a constant polynomial, and if f(x) = g(x)h(x) implies that either g(x) or h(x) is a constant polynomial.

Theorem 2.16 Any non-constant polynomial f(x) in F[x] can be expressed as a product of irreducible polynomials. If there are two such factorizations,

$$f(x) = g_1(x)\,g_2(x) \cdots g_r(x) = h_1(x)\,h_2(x) \cdots h_s(x),$$

then r = s and the order of the factors can be rearranged such that for all i, $g_i(x) = a_i h_i(x)$, where $a_i$ is a non-zero constant polynomial.

Theorem 2.17 (Factor Theorem): Let F be a field and f(x) be a polynomial in F[x]. Then x - a is a divisor of f(x) in F[x] if and only if f(a) = 0 in F. a is called a root of f(x).

Theorem 2.18 If F is a field and f(x) is a polynomial of degree $n \ge 1$ in F[x], then the equation f(x) = 0 has at most n roots in F.

2.2.6 Direct Product Rings
If $R_1$ and $R_2$ are any two rings then we can define the cross-product ring $R_1 \times R_2$ as the set of pairs $(s_1, s_2) \in S_1 \times S_2$, with addition and multiplication defined component-wise, as in Eqn. (2.5).

$$\begin{aligned}
(a_1, a_2) \oplus_{R_1 \times R_2} (b_1, b_2) &= (a_1 \oplus_{R_1} b_1,\; a_2 \oplus_{R_2} b_2) \\
(a_1, a_2) \otimes_{R_1 \times R_2} (b_1, b_2) &= (a_1 \otimes_{R_1} b_1,\; a_2 \otimes_{R_2} b_2)
\end{aligned} \tag{2.5}$$

An isomorphism between R(M) and the direct product of $\{R(m_k)\}$, where $M = \prod_{i=1}^{L} m_i$ and $M, m_i, L \in Z$, means that calculations over R(M) can be effectively carried out over each $R(m_k)$, independently and in parallel. A final mapping to R(M) is performed for the output of the DSP algorithm. We have therefore broken down a calculation over a large dynamic range, M, to a set of L calculations over small dynamic ranges given by the $\{m_k\}$. This is the main advantage of using a system such as the RNS over a conventional weighted value numbering system (e.g. binary).
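The isomorphism can be checked exhaustively for a small case. The sketch below assumes the illustrative moduli set {3, 5, 7} (so M = 105); the helper names are hypothetical and only serve this example.

```python
from itertools import product
from math import prod

MODULI = (3, 5, 7)      # pairwise relatively prime, chosen for illustration
M = prod(MODULI)        # dynamic range, here 105


def to_residues(x):
    """Map an element of R(M) to its tuple of residues in the direct product ring."""
    return tuple(x % m for m in MODULI)


def add(u, v):
    """Component-wise addition, one small modulus per channel."""
    return tuple((a + b) % m for a, b, m in zip(u, v, MODULI))


def mul(u, v):
    """Component-wise multiplication, one small modulus per channel."""
    return tuple((a * b) % m for a, b, m in zip(u, v, MODULI))


# One-to-one: every element of R(M) receives a distinct residue tuple.
is_bijective = len({to_residues(x) for x in range(M)}) == M

# Structure preserving: both ring operations commute with the map.
is_homomorphic = all(
    to_residues((x + y) % M) == add(to_residues(x), to_residues(y))
    and to_residues((x * y) % M) == mul(to_residues(x), to_residues(y))
    for x, y in product(range(M), repeat=2)
)
print(is_bijective, is_homomorphic)
```

Each channel only ever handles values below its own modulus, which is the source of the small-wordlength arithmetic described above.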

2.3 Polynomial Based Mappings

Recently polynomials have been given a special importance in the construction of number
theoretic techniques. The common objective of these techniques is to remove the need for
partial product processing associated with the polynomial multiplications. The target
application and specifics of the polynomial mappings led to three distinct approaches cited
in the literature. The first implicitly uses polynomials to represent the real and imaginary
parts of Gaussian integers and is well known as the QRNS [39] [42]. The second approach
uses polynomials to represent sequences of numbers and then performs partial product
free polynomial multiplication to implement DSP operations such as correlation and
convolution [50] [51] [52]. The third approach, which is the subject of this dissertation,
uses finite polynomials mapped from a weighted (polynomial) representation of a single
data sample [19]. The following is a brief description of each polynomial approach. We
conclude this chapter with a comparison and summary of these approaches.


2.3.1 QRNS Mapping
This system was the first to implicitly embed polynomial mapping along with the
traditional residue number system. Here we will show that the general approach for the
QRNS is in fact a special case of polynomial mapping.

We start by considering the following special polynomial:

$$x^2 + 1 \equiv 0 \pmod{m} \tag{2.6}$$

where m is some modulus for which Eqn. (2.6) has real integer roots. Table 2.1 shows a list of available moduli in the range 1-127 that support the quadratic residues along with their roots.
Table 2.1. Quadratic Residue Rings and their Roots

Modulus   First Root   Second Root   Third Root   Fourth Root
5         2            3
10*       3            7
13        5            8
17        4            13
25        7            18
26*       5            21
29        12           17
34*       13           21
37        6            31
41        9            32
50*       7            43
53        23           30
58*       17           41
61        11           50
65        8            18            47           57
73        27           46
74*       31           43
82*       9            73
85        13           38            47           72
89        34           55
97        22           75
101       10           91
106*      23           83
109       33           76
113       15           98
122*      11           111
125       57           68

Table 2.1 exhibits the following well-known features: (a) roots appear in pairs, r and -r; (b) if all prime divisors of the modulus have the form 4k+1, the ring is a quadratic residue ring; (c) the prime 2 can be a prime factor of a quadratic residue modulus. The moduli marked with an asterisk in Table 2.1 contain the factor 2. Quadratic residue moduli that are odd primes form Galois Fields, which enable the use of index calculus (finite field logarithms) to implement modular arithmetic.
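The entries of Table 2.1 can be reproduced by a brute-force search for the roots of Eqn. (2.6). This is only a sketch for checking the table; the function name `roots_of_minus_one` is illustrative. The search also reports the modulus 2 (root 1), which the table does not list.

```python
def roots_of_minus_one(m):
    """Return all x in Z_m with x^2 + 1 = 0 mod m, in increasing order."""
    return [x for x in range(m) if (x * x + 1) % m == 0]


# Every modulus in the range 2..127 that has at least one root.
quadratic_moduli = {m: roots_of_minus_one(m)
                    for m in range(2, 128) if roots_of_minus_one(m)}

for modulus, roots in quadratic_moduli.items():
    print(modulus, roots)
```

Because the roots come in pairs r and m - r, a modulus with four roots (65 and 85 in the table) has two such pairs.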

Let the first order polynomial, below, represent any Gaussian integer:

$$a_R + a_I x \tag{2.7}$$

The indeterminate, x, plays the role of the complex number $i = \sqrt{-1}$. The quotient ring of polynomials, with respect to the ideal generated by the quadratic equation Eqn. (2.6), written in terms of its roots in the ring $Z_m$, is isomorphic to the direct product ring given by Eqn. (2.8):

$$Z_m[x] / ((x - r)(x + r)) \cong Z_m \times Z_m \tag{2.8}$$

An evaluation map can be used to map the elements of the quotient ring into the direct product ring; it is defined by:

$$A = a_R + r\,a_I, \qquad A^* = a_R - r\,a_I \tag{2.9}$$

The reverse map can be deduced from Eqn. (2.9) and is given by:

$$a_R = 2^{-1}(A + A^*), \qquad a_I = 2^{-1} r^{-1}(A - A^*), \qquad \text{where } 2^{-1}, r^{-1} \in Z_m \tag{2.10}$$

Note that we deliberately use the notation that appears in the QRNS literature. We see that
Eqn. (2.9) and (2.10) describe the forward and inverse QRNS mapping, and that this is
indeed a polynomial mapping based on the existence of the isomorphism between the
quotient ring of polynomials and the direct product ring, given by Eqn. (2.8).
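A minimal sketch of the forward and inverse QRNS maps follows, assuming the modulus 13 and its first root 5 from Table 2.1. The function names are illustrative, and the modular inverse uses Python's three-argument `pow` (Python 3.8 or later).

```python
M = 13   # quadratic residue modulus taken from Table 2.1
R = 5    # first root of x^2 + 1 mod 13 (5*5 + 1 = 26)


def qrns_forward(a_real, a_imag, m=M, r=R):
    """Map the Gaussian integer a_real + i*a_imag to the pair (A, A*)."""
    return ((a_real + r * a_imag) % m, (a_real - r * a_imag) % m)


def qrns_inverse(a, a_star, m=M, r=R):
    """Recover (a_real, a_imag) from (A, A*) using the inverses of 2 and r in Z_m."""
    inv2 = pow(2, -1, m)
    inv_r = pow(r, -1, m)
    return (inv2 * (a + a_star) % m, inv2 * inv_r * (a - a_star) % m)


def qrns_multiply(u, v, m=M):
    """Complex multiplication becomes two independent modular multiplications."""
    return (u[0] * v[0] % m, u[1] * v[1] % m)


# (2 + 3i)(1 + 2i) = -4 + 7i, i.e. (9, 7) modulo 13
result = qrns_inverse(*qrns_multiply(qrns_forward(2, 3), qrns_forward(1, 2)))
print(result)
```

The result is only meaningful modulo 13 in this toy setting; a practical system would use several such moduli to obtain a useful dynamic range.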

2.3.2 Polynomial Residue Number System
The PRNS is implemented to provide multi-channel independent processing in which convolutions and correlations are computed by the use of modulo multipliers only (no adders are required). As is the case with the QRNS, this polynomial mapping relies on monic polynomials to generate the ideal, as given by Eqn. (2.11).

$$x^N \pm 1 \equiv 0 \pmod{m} \tag{2.11}$$

Here, m is some modulus for which Eqn. (2.11) has real integer roots. Polynomials of order N-1 are used to 'code' N data samples by equating each coefficient with one sample.

$$p(x) = \sum_{k=0}^{N-1} a_k x^k \tag{2.12}$$

The isomorphism between the quotient ring of polynomials and the direct product ring is given by

$$Z_m[x] / (x^N \pm 1) \cong \underbrace{Z_m \times Z_m \times \cdots \times Z_m}_{N} \tag{2.13}$$

The forward isomorphic mapping can be carried out by evaluating the polynomial p(x) that represents the data at the N roots, $r_1, \ldots, r_N$, of the ideal polynomial $x^N \pm 1$; this is given by:

$$p(x) \mapsto (p(r_1), p(r_2), \ldots, p(r_N)) \tag{2.14}$$

The reverse mapping can be easily obtained by solving N equations in N unknowns.
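The PRNS idea can be sketched for a small case, assuming N = 4, the modulus 17 and the ideal polynomial x^4 - 1, whose roots in Z_17 are the powers of 4. The function names are illustrative. For brevity the reverse map uses the closed-form inverse that exists when the roots are powers of a single root of unity, instead of solving the N equations explicitly.

```python
M = 17   # modulus (illustrative)
N = 4    # number of samples per polynomial
W = 4    # 4 has order 4 in Z_17, so 1, 4, 16, 13 are the roots of x^4 - 1


def prns_forward(a):
    """Evaluate p(x) = sum a[k] x^k at the N roots W^0, ..., W^(N-1)."""
    return [sum(a[k] * pow(W, j * k, M) for k in range(N)) % M for j in range(N)]


def prns_inverse(c):
    """Recover the coefficients from the N evaluations."""
    n_inv = pow(N, -1, M)
    w_inv = pow(W, -1, M)
    return [n_inv * sum(c[j] * pow(w_inv, j * k, M) for j in range(N)) % M
            for k in range(N)]


def cyclic_convolution(a, b):
    """Reference result: multiply the polynomials modulo x^N - 1 and modulo M."""
    out = [0] * N
    for i in range(N):
        for j in range(N):
            out[(i + j) % N] = (out[(i + j) % N] + a[i] * b[j]) % M
    return out


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
pointwise = [(x * y) % M for x, y in zip(prns_forward(a), prns_forward(b))]
print(prns_inverse(pointwise), cyclic_convolution(a, b))
```

Between the forward and reverse maps only N independent modular multiplications are performed, which is the property the text describes.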

2.3.3 Polynomial Ring Engine (PRE)
A detailed mathematical description of this system is included here since it forms the basis on which this dissertation is constructed. This system was initially proposed to relax the constraint that pair-wise relatively prime moduli have to be added to the existing modulus set of an RNS to increase its computational dynamic range. In fact, with the PRE, a modulus set as small as {3, 5, 7} is capable of providing large practical dynamic ranges [19]. In the mapping, a single indeterminate polynomial is made to represent a data sample by grouping the binary bits of the sample into the polynomial coefficients and recognizing the indeterminate, x, as a base. The work has been generalized to multiple indeterminates, each representing some power of the base of the weighted representation [19]. This data conversion technique is not unique, and the many trade-offs among hardware area, dynamic range, and throughput rate are a major driving force for the research work presented in this dissertation.

The progression of ring mappings required by the PRE is shown in Figure 2.1. The rings are denoted by boxes, and the mappings by arrows. For simplicity in the diagram, the set of indeterminates $\{x_1, x_2, \ldots, x_n\}$ is represented by the vector x. Encoding of the data begins at the top with the input integers, Z, and continues downward to the bottom. The algorithmic computation is performed in the direct product ring $\prod_{i,j} Z_{m_i,j}$; the answers are then decoded in the reverse direction.

Figure 2.1 The Rings and Homomorphisms
[Diagram: the rings $Z$, $Z_M[x]^D$, $\prod_i Z_{m_i}[x]^D$, $\prod_i \{Z_{m_i}[x]^D / (g_{m_i}(x))\}$ and $\prod_{i,j} Z_{m_i,j}$, from top to bottom, linked by the map pairs ($\varphi$, $\varphi'$), ($\Phi$, $\Phi^{-1}$), (q, q') and ($\Psi$, $\Psi^{-1}$).]

The map $\varphi$ and $\varphi'$
$\varphi$ is a map that represents integers as polynomials. This map is not a homomorphism but it is nevertheless sufficient to find for each input integer a polynomial which will represent the data throughout the mapping stages. It turns out that there are many ways to perform this map based on the following major factors: polynomial order, data bit distribution, and the use of single or multi-indeterminate polynomials. The trade-off among these factors is usually dictated by the computational dynamic range and the nature of the DSP algorithm (i.e., the number of multiplications).

$\varphi'$ is an evaluation homomorphism whereby the output polynomials are evaluated with their respective weights. This map preserves the value of the DSP computation and is uniquely mapped to Z. Eqn. (2.15) formally defines this map.

$$\varphi' : Z_M[x]^D \to Z, \qquad \varphi'_c(f(x)) = f(x = c) \tag{2.15}$$

where c and x are integer and indeterminate vectors, respectively. The integer vector D is the degree of the polynomial g(x) which generates the ideal and therefore all the polynomials in $Z_M[x]^D$ are of degree less than D. The coefficients of the polynomials are elements of the ring $Z_M$.

The map $\Phi$ and $\Phi^{-1}$
This map constitutes the residue number system implementation over the polynomial coefficients. A formal definition of this system is given by Eqn. (2.16),

$$R(M) = \{S_M, \oplus_M, \otimes_M\}, \qquad R(M) \cong \prod_i R(m_i) \tag{2.16}$$

The isomorphism given by Eqn. (2.16) exists iff (a) $M = \prod_i m_i$ and (b) the $\{m_i\}$ are relatively prime to one another. The forward map is carried out by forming an ordered tuple that consists of modulo reduction of the polynomial coefficients $c_k$ with the L moduli and is defined by Eqn. (2.17),

$$\Phi : R(M) \to \prod_{i=1}^{L} R(m_i) \quad \text{such that} \quad c_k \mapsto (|c_k|_{m_1}, |c_k|_{m_2}, \ldots, |c_k|_{m_L}) \tag{2.17}$$


Under certain conditions, this map is trivial and no hardware is required to implement it. The condition is that all the input data must be represented with coefficients smaller than the smallest modulus. The input polynomials are usually low in order and small in coefficients and the condition can be met easily. The inverse map $\Phi^{-1}$ is defined by the classical Chinese Remainder Theorem and given by Eqn. (2.18),

$$\Phi^{-1} : \prod_{i=1}^{L} R(m_i) \to R(M) \quad \text{such that} \quad c_k = \left| \sum_{i=1}^{L} \hat{m}_i \otimes_M \left[ (\hat{m}_i)^{-1} \right]_{m_i} \otimes_M c_{k,i} \right|_M$$

$$\text{where } \hat{m}_i = \frac{M}{m_i}, \quad c_k \in R(M), \quad c_{k,i} \in R(m_i) \tag{2.18}$$
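Eqn. (2.18) can be sketched directly in code. The function name `crt` and the moduli set {3, 5, 7} are illustrative assumptions; the inverse of each partial product is taken modulo its own modulus with Python's three-argument `pow`.

```python
from math import prod


def crt(residues, moduli):
    """Reconstruct c in R(M) from its residues, following the CRT sum."""
    big_m = prod(moduli)
    total = 0
    for c_i, m_i in zip(residues, moduli):
        m_hat = big_m // m_i                # product of all other moduli
        m_hat_inv = pow(m_hat, -1, m_i)     # inverse of m_hat modulo m_i
        total = (total + m_hat * m_hat_inv * c_i) % big_m   # modulo M accumulation
    return total


moduli = (3, 5, 7)
print(crt((2, 3, 2), moduli))
```

Every accumulation step is reduced modulo M, which illustrates the need for modulo M adders in this form of the inverse map.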

A close look at Eqn. (2.18) reveals that modulo M adders are needed to perform this map. This undesirable characteristic is alleviated by mapping the residue polynomial coefficients to the associated mixed radix representation [16], AMRR, and then to the binary representation. A polynomial coefficient $c_k$ can be represented with L mixed radix digits $(a_{k,L}, \ldots, a_{k,1})$ defined by

$$c_k = \sum_{l=2}^{L} \left[ a_{k,l} \left\{ \prod_{i=1}^{l-1} m_i \right\} \right] + a_{k,1}, \qquad \text{where } 0 \le a_{k,l} < m_l \tag{2.19}$$

The mapping from RNS to AMRR can be defined recursively, Eqn. (2.20), and implemented by small modulo operations.

$$a_{k,n} = c^{(n)}_{k,n}, \qquad c^{(n+1)}_{k,i} = \left| \frac{c^{(n)}_{k,i} - c^{(n)}_{k,n}}{m_n} \right|_{m_i} \;\; \forall i > n, \qquad \text{where } c^{(1)}_{k,i} = c_{k,i} \tag{2.20}$$

It is possible to choose a moduli set such that the multiplicative inverse of $m_n$ in all other residue rings will be ±1, and hence the modulo multipliers will be completely eliminated. However, such a restriction severely limits the number of moduli sets available. The conversion from AMRR to binary representation is straightforward with the use of fast binary adders. This map can be combined with the polynomial to binary map and will be discussed later.
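The recursive residue to mixed radix conversion can be sketched as follows. The division by a modulus is carried out as a multiplication by its modular inverse; the function names and the moduli set {3, 5, 7} are illustrative assumptions.

```python
def rns_to_mixed_radix(residues, moduli):
    """Return the mixed radix digits, least significant digit first."""
    c = list(residues)
    digits = []
    for n, m_n in enumerate(moduli):
        a_n = c[n]                           # current digit
        digits.append(a_n)
        for i in range(n + 1, len(moduli)):
            m_i = moduli[i]
            # subtract the digit, then "divide" by m_n inside the ring Z_{m_i}
            c[i] = ((c[i] - a_n) * pow(m_n, -1, m_i)) % m_i
    return digits


def mixed_radix_to_int(digits, moduli):
    """Weighted sum of the digits; the weights are products of the preceding moduli."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value


moduli = (3, 5, 7)
digits = rns_to_mixed_radix((2, 3, 2), moduli)    # residues of 23
print(digits, mixed_radix_to_int(digits, moduli))
```

All intermediate operations stay within the small rings; only the final weighted sum uses ordinary binary addition.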

The map q and q'
The forward map q takes each individual ring of $\prod_i Z_{m_i}[x]^D$ to the quotient ring $\prod_i \{Z_{m_i}[x]^D / (g_{m_i}(x))\}$. This map usually reduces polynomials by calculating remainders with respect to the ideal generated by $(g_{m_i}(x))$. We will restrict this map such that the elements of the individual rings will form the remainders and no actual polynomial reduction takes place. This is done by ensuring that all polynomials of $\prod_i Z_{m_i}[x]^D$ have degrees less than the degree of their respective ideal generators. The map q' takes the least elements of the polynomial residue classes of the quotient ring $\prod_i \{Z_{m_i}[x]^D / (g_{m_i}(x))\}$ and maps them back to the direct product ring $\prod_i Z_{m_i}[x]^D$.

The structure of the quotient ring of polynomials is just a mathematical vehicle to give rise to the existence of the isomorphic polynomial map which forms the final product rings. As is noted above, there is no actual hardware needed to implement the maps q and q'.

The map $\Psi$ and $\Psi^{-1}$
The map $\Psi$ is an isomorphism and is usually referred to as an evaluation map. It evaluates all polynomials of the quotient ring at all possible combinations of the roots of the polynomials $g_{m_i}(x)$ that generate the ideal. Note that the direct product ring, which is indicated as $\prod_{i,j} Z_{m_i,j}$ in the diagram, consists of many individual copies of each of the rings $Z_{m_i}$. $\Psi^{-1}$ is the reverse polynomial map in which the individual rings are combined to form the output polynomial coefficients. $\Psi^{-1}$ is referred to as the Chinese Remainder Theorem for polynomial rings [88]. The details of performing this map are discussed in "Polynomial Mapping" on page 38.

The DSP Algorithm Computation
Once the data have been converted to elements of the direct product ring, $\prod_{i,j} Z_{m_i,j}$, the DSP algorithmic computation is replicated in each individual ring. Large dynamic range multiplications and additions in the common binary representation can be carried out with many small wordlength modulo multipliers and adders in completely parallel fashion. Eqn. (2.21) illustrates this main concept.

(2.21)
Where L is the number of moduli and d is the polynomial replication factor.

Illustrative Numerical Example
Let us support the above review of the various PRE mappings with a simple numerical example. Suppose that we want to multiply two integers, 2 and 3, using the PRE. First we represent the numbers with polynomials as follows

$$2 \to p_1(x) = 0 + 1x, \qquad 3 \to p_2(x) = 1 + 1x \tag{2.22}$$

Note that x plays the role of the radix 2. We must select g(x) of third degree to allow one multiplication without polynomial overflow as follows:

$$g(x) = x(x - 1)(x + 1) \tag{2.23}$$

At this stage each polynomial representing the input numbers is evaluated at the roots of the polynomial g(x) generating the ideal, namely 0, 1 and -1 (which is 6 in $Z_7$). This map can be carried out using matrix notation and assuming that all coefficients belong to $Z_7$.

$$\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 6 & 1 \end{bmatrix} \otimes_7 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 6 \end{bmatrix}, \qquad \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 6 & 1 \end{bmatrix} \otimes_7 \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} \tag{2.24}$$

We perform the multiplication in a pair-wise fashion:

$$\begin{bmatrix} 0 \otimes_7 1 = 0 \\ 1 \otimes_7 2 = 2 \\ 6 \otimes_7 0 = 0 \end{bmatrix} \tag{2.25}$$

The result is then converted back to a polynomial representation by using the inverse mapping matrix:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 3 \\ 6 & 4 & 4 \end{bmatrix} \otimes_7 \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \tag{2.26}$$

Finally we convert the result to integer form by replacing the indeterminate, x, by its weight, 2.

$$0 + 1(2) + 1(2)^2 = 6 \tag{2.27}$$

The above example clearly shows the steps involved in the PRE mappings. By appropriately selecting the polynomial orders and the modulus for the coefficients, one only requires hardware to implement the maps $\varphi$, $\varphi'$, $\Psi$, and $\Psi^{-1}$. Note that the example does not implement the maps $\Phi$ and $\Phi^{-1}$, which constitute the classical RNS mapping.
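The numerical example can be replayed in a few lines. This sketch assumes the same setting as the example (coefficients in Z_7, roots 0, 1 and 6, radix 2); the function names are illustrative, and the inverse map is computed by Lagrange interpolation as one possible alternative to an explicit inverse matrix.

```python
M = 7
ROOTS = (0, 1, M - 1)    # roots of g(x) = x(x - 1)(x + 1) in Z_7


def evaluate_mod(coeffs, x, m=M):
    """Horner evaluation of a polynomial (lowest degree first) in Z_m."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % m
    return acc


def forward(coeffs):
    """Evaluation map: polynomial coefficients to values at the roots."""
    return [evaluate_mod(coeffs, r) for r in ROOTS]


def inverse(values, roots=ROOTS, m=M):
    """Interpolation map: values at the roots back to polynomial coefficients."""
    n = len(roots)
    result = [0] * n
    for i, (r_i, c_i) in enumerate(zip(roots, values)):
        basis, denom = [1], 1
        for j, r_j in enumerate(roots):
            if j == i:
                continue
            # multiply the basis polynomial by (x - r_j)
            basis = [((basis[k - 1] if k > 0 else 0)
                      - r_j * (basis[k] if k < len(basis) else 0)) % m
                     for k in range(len(basis) + 1)]
            denom = (denom * (r_i - r_j)) % m
        scale = c_i * pow(denom, -1, m) % m
        result = [(res + scale * b) % m for res, b in zip(result, basis)]
    return result


p1, p2 = [0, 1, 0], [1, 1, 0]          # 2 -> x and 3 -> 1 + x, with x = 2
pairwise = [(u * v) % M for u, v in zip(forward(p1), forward(p2))]
coeffs = inverse(pairwise)
decoded = sum(a * 2 ** k for k, a in enumerate(coeffs))
print(forward(p1), forward(p2), pairwise, coeffs, decoded)
```

The interpolation requires the differences between roots to be invertible in Z_7, which holds here because 7 is prime.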

2.3.4 Comparisons Among Polynomial Mappings
All the polynomial mappings discussed above share the following fundamental mathematical basis:

1. Mapping the quotient polynomial ring structure to a direct product ring structure.
2. Polynomial evaluations at the ideal polynomial roots are used to define the isomorphic mapping.

The following is a list of the main features and drawbacks of each polynomial mapping system:

1. Target Applications
QRNS: The map is specifically designed to process complex integers so that complex multiplications are reduced to component-wise multiplication in the direct product ring.
PRNS: This map is designed to implement circular convolution and correlation by the use of polynomial multiplications. N data samples are represented by a polynomial of degree N-1 and only N parallel modulo multiplications with no additions are needed in the direct product ring structure.
PRE: Each polynomial coefficient is decomposed into residue representation by the RNS and hence an integer represented by a polynomial has greater dynamic range than allowed by the moduli set. This fact makes the PRE useful in processing all applications that are traditionally targeted for RNS, with smaller residues. With multi-indeterminate polynomials, PRE mapping is also suitable for complex number and irrational FIR filter coefficient processing.

2. Ideal Polynomial, Roots, and Modulus
QRNS: $x^2 + 1 = 0 \bmod m$ is the polynomial that generates the ideal. This places restrictions on the moduli set in which real integer roots exist.
PRNS: $x^N \pm 1 = 0 \bmod m$ is the ideal polynomial and requires an extensive mathematical procedure for finding suitable moduli and roots.
PRE: g(x) may be any polynomial and hence the roots can be chosen at will for a given modulus, m. The only condition required is that the difference between any two roots should be invertible in $Z_m$.

3. Mapping Flexibility
QRNS: The mapping is well defined and can only be used for complex integer processing.
PRNS: The ideal polynomial is specific and the mapping can only be used for one purpose. Complex integer manipulations can only be handled by a separate mapping stage.
PRE: The mapping is very flexible and multiple computational requirements can be accomplished with one homogeneous map. For instance, with multi-indeterminate polynomials, one can increase the computational dynamic range and process complex integers with the same ideal polynomial roots.

4. Hardware Reusability
QRNS: The mapping is very efficient for complex integer processing but not necessarily practical for other uses.
PRNS: It is possible to reuse the mapping stages to increase the computational dynamic range as in the PRE. However, since it is targeted to represent sequences of samples, the polynomial will have a large order over a large modulus which makes its reuse impractical.
PRE: Once the modulus and polynomial order are chosen, the actual mapping stages become independent of the target application. With multi-indeterminate polynomials, for example, hardware built to process a large dynamic range integer system can be reused in a complex integer system by trading off some of the computational dynamic range.

2.4 Summary

Discrete mathematics provides the necessary tools to construct suitable number systems for specific algorithms. Mapping numbers from one system to another may result in a favorable computational environment which may offset the additional hardware overhead needed to implement the number conversion. We have discussed algebraic structures such as the quotient ring of polynomials and the direct product ring. We have also reviewed mathematical tools and concepts necessary for such number conversions. Finally, recent advances in polynomial mappings cited in the literature have been summarized and compared.


Chapter 3
Small Moduli Polynomial Ring Engine

3.1 Introduction

Number theoretic architectures have traditionally been based on the Residue Number System, RNS [6], but the disadvantages of RNS techniques (non-homogeneous data conversion architectures) outweigh the advantages of carry free computation. A recently introduced approach, based on a polynomial ring mapping strategy, eliminates many of the problems associated with conversion architectures from rings with disparate moduli [19]. Unlike the recently introduced algebraic integer [21] or PRNS [50] approaches, the MRRNS technique [18][19] allows simple, error-free mapping of incoming integer streams, and homogeneous conversion architectures at the output. The main body of the computation is performed in identically replicated linear bit-level pipelines; this has important ramifications in terms of fault tolerance and testability when implemented in dense technologies, such as WSI and ULSI.

This chapter is directly concerned with new implementations of conversion strategies associated with the polynomial ring mappings. We first present an overview of the mapping technique, discuss architectural details ranging from pure systolic arrays to semi-systolic array mapping architectures, and finally discuss a mixed mapping strategy, first introduced in [19], that embeds the polynomial mapping in the classical RNS to achieve large computational dynamic range while keeping the individual computational rings small for high speed data processing. The new architectures possess many of the important VLSI implementation features such as regularity, mainly one dimensional data propagation, and low silicon area compared to competing implementations.

3.2 Polynomial Number System

This system is based on the use of polynomials to represent integers or Gaussian integers. There are two main developments associated with this system: the first is the polynomial representation and the second is the finite polynomial ring mapping. The latter is used to transform polynomial multiplication into pair-wise operations without the need for grouping terms and summations. This is analogous to the residue number system where cross digit carry is completely removed in the arithmetic operations.

3.2.1 Finite Polynomial Ring Representation
A polynomial p(x) can be used to construct a number system if the indeterminate x assumes the value of a radix. The coefficients $a_i$ are chosen such that Eqn. (3.1) holds true.

$$Y = p(x) = AX \tag{3.1}$$

Here Y is the integer, A is a vector containing all the coefficients, $[a_0\ a_1\ \cdots\ a_{N-1}]$, and X is an indeterminate vector whose elements are powers of a single indeterminate or a product of powers of multiple indeterminates. At least one indeterminate serves as a radix and others may be used to represent algebraic integers (real or complex roots of polynomials that have integer coefficients). Examples are the complex operator and irrational numbers such as $\sqrt{3}$, which we will later use in a Wavelet processor (see "PRE Architecture for a Wavelet Transform" on page 92). A finite polynomial ring is formed by replacing the integer coefficients with their residues modulo M, as discussed in "Polynomial Ring Engine (PRE)" on page 25. We have to choose a large enough value of M so that there is no coefficient overflow when multiplying two such polynomials over the finite ring.

Definition (Binary Polynomial Number): A polynomial number is called binary if the coefficients are B bits wide binary numbers and the indeterminate assumes a radix of $2^B$.

If B is unity, then the coefficients of the binary polynomial number are the bits of the binary number representation. Binary polynomial numbers are special since no hardware is needed for the conversion process.
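The definition amounts to slicing the binary word into B-bit groups. The sketch below illustrates this for non-negative integers; the function names and parameters are illustrative and do not come from the dissertation.

```python
def to_binary_polynomial(value, bits_per_coeff, num_coeffs):
    """Split a non-negative integer into B-bit coefficients, lowest weight first."""
    if value < 0 or value >> (bits_per_coeff * num_coeffs):
        raise ValueError("value does not fit in the requested number of coefficients")
    mask = (1 << bits_per_coeff) - 1
    return [(value >> (bits_per_coeff * i)) & mask for i in range(num_coeffs)]


def evaluate_at_radix(coeffs, bits_per_coeff):
    """Replace the indeterminate by the radix 2^B to recover the integer."""
    radix = 1 << bits_per_coeff
    return sum(a * radix ** i for i, a in enumerate(coeffs))


coeffs = to_binary_polynomial(0b110110, bits_per_coeff=2, num_coeffs=3)
print(coeffs, evaluate_at_radix(coeffs, 2))
```

Since the coefficients are just bit fields of the input word, the conversion is a matter of wiring, which agrees with the remark that no hardware is needed.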

Theorem 3.1 The binary polynomial number representation is unique.

Proof 3.1 Any binary polynomial number N can be written as

(3.2)

where B is the number of bits per coefficient and O is the degree of the polynomial. If we expand Eqn. (3.2) and replace the indeterminate, x, by the radix, $2^B$, then we have:

$$N = (n_0 2^0 + n_1 2^1 + \cdots + n_{B-1} 2^{B-1})\,2^{0} + (n_0 2^0 + n_1 2^1 + \cdots + n_{B-1} 2^{B-1})\,2^{B} + \cdots \tag{3.3}$$

From Eqn. (3.3) it is clear that the number N can be represented by a binary number of O × B bits width. Note that the subscripts must be reassigned to reflect the bit weight and uniqueness. If we have two polynomial numbers $N_1$ and $N_2$ of the same magnitude and sign then B and O must be the same for both numbers and Eqn. (3.3) maps the two numbers to the same binary representation. It follows from the uniqueness of the binary representation that $N_1$ and $N_2$ must be equal. □

3.2.2 Polynomial Mapping
It was shown in Chapter 2 that the polynomial mapping relies on the existence of the
isomorphism between the direct product ring and the quotient ring of polynomials.

(3.4)

Here we use the vector x to indicate multi-indeterminate polynomials. The isomorphic map is defined by Eqn. (3.4) where the polynomial $p(x) \in Z_m[x]$ is evaluated at all combinations of the roots of g(x).

$$p(x) \mapsto (p(r_0), p(r_1), \ldots, p(r_{N-1})) = (C_0, C_1, \ldots, C_{N-1}) \tag{3.5}$$

Note that $r_i$ represents some unique combination of the roots of all indeterminates. Eqn. (3.5) can be rewritten in matrix form

$$C = AT \tag{3.6}$$

where A is given by Eqn. (3.1). The transformation matrix, T, is constructed as follows: the first row contains the indeterminate terms evaluated at the first root, $r_0$, such that the first element of C is equal to the value of the polynomial p(x) evaluated at the first root. The second row uses the second root, $r_1$, and so on.

Post-multiplying Eqn. (3.6) by $T^{-1}$, the inverse polynomial mapping can be defined as follows

$$A = CT^{-1} \tag{3.7}$$

The underlying assumption here is that $T^{-1}$ exists. It is appropriate at this point to list the necessary conditions for finding the inverse polynomial mapping given by Eqn. (3.7).

1. If the degree of p(x) in $x_i$ is $O_i$ then the degree of g(x) in $x_i$ must be greater than $O_i$.
2. $(r_i - r_j)$ is a unit in $Z_M$ for all $1 \le i < j \le d$.

The first condition is necessary for the isomorphism given by Eqn. (3.4) to exist. This is somewhat analogous to the condition placed on the computational dynamic range of the RNS where the computations do not overflow the modulus M; here, a result polynomial should have degree less than that of the ideal generator g(x). The second condition simply states that the difference of any two root combinations must be invertible in $Z_M$. This condition at first seems to be stringent but a careful look reveals that the simple roots -1, 0, and 1 are valid for any $Z_M$ with odd M > 2, since their pairwise differences are ±1 and ±2. At this point we will focus our attention on single-indeterminate polynomials for examining this second condition.

Single-Indeterminate Polynomial Mapping
Here we will restrict our attention to single indeterminate polynomials in representing the data throughout the mapping stages. Let us consider the following:

$$p_M(x) = \sum_{i=0}^{N-1} a_i x^i \tag{3.8}$$

$$g_M(x) = \prod_{i=1}^{N} (x - r_i) \tag{3.9}$$

Note that the subscript M indicates that the polynomial coefficients or roots are residues modulo M. Note also that the degree of $g_M(x)$ is greater than the degree of $p_M(x)$ by one, which satisfies condition one while keeping the hardware overhead low; this also facilitates the use of matrix algebra to obtain the inverse mapping. The forward polynomial mapping is given by Eqn. (3.10) and simplified to a vector form.

$$[c_0\ c_1\ \cdots\ c_{N-1}] = [a_0\ a_1\ \cdots\ a_{N-1}] \otimes_M \begin{bmatrix} 1 & 1 & \cdots & 1 \\ r_1 & r_2 & \cdots & r_N \\ \vdots & \vdots & & \vdots \\ r_1^{N-1} & r_2^{N-1} & \cdots & r_N^{N-1} \end{bmatrix}, \qquad C = A \otimes_M T \tag{3.10}$$

The matrix T is obtained by evaluating $p_M(x)$ at the N roots. This process always results in a square matrix which provides a systematic way of defining the reverse polynomial map. The vector C contains the elements of the final direct product ring where DSP algorithm operations are performed in a pair-wise fashion. The reverse polynomial mapping is obtained by simply post-multiplying Eqn. (3.10) with the inverse root matrix as shown in Eqn. (3.11).

$$[a_0\ a_1\ \cdots\ a_{N-1}] = [c_0\ c_1\ \cdots\ c_{N-1}] \otimes_M \begin{bmatrix} 1 & 1 & \cdots & 1 \\ r_1 & r_2 & \cdots & r_N \\ \vdots & \vdots & & \vdots \\ r_1^{N-1} & r_2^{N-1} & \cdots & r_N^{N-1} \end{bmatrix}^{-1}_{M} \tag{3.11}$$

Eqn. (3.11) shows that $T^{-1}$ must exist in the ring $Z_M$ for the inverse map to exist. Theorem 3.2 offers the necessary conditions.

Theorem 3.2 Given a matrix T as defined in Eqn. (3.10) with elements in $Z_M$, there exists a matrix $T^{-1}$ whose elements are also in $Z_M$ such that $T^{-1} \otimes_M T = I$ iff $(r_i - r_j)$ is a unit in $Z_M$ for all $1 \le i < j \le N$.

Proof 3.2 Every element of $T^{-1}$ can be written as in Eqn. (3.12) using the matrix of cofactors of T.

$$\hat{t}_{i,j} = \frac{\left[\mathrm{cof}(T)^{t}\right]_{i,j}}{|T|} \tag{3.12}$$

A close look at the above expression reveals that computation of the transpose of the matrix of cofactors $\mathrm{cof}(T)^{t}$ involves only modulo multiplications, additions, and subtractions which pose no obstacle in $Z_M$. The goal now is to find the condition(s) necessary for the multiplicative inverse of the determinant of T to exist in $Z_M$. The matrix T is a Vandermonde matrix [89] and its determinant is given by Eqn. (3.13).

$$|T| = \prod_{i<j} (r_i - r_j) \tag{3.13}$$

Therefore it is a necessary and sufficient condition that $(r_i - r_j)$ is a unit in $Z_M$ for all $1 \le i < j \le N$. □

Corollary: If p is a prime then any subset of the elements of $Z_p$ is a valid root set.

Corollary: Any root set is still valid under translation. If $R = \{r_0, r_1, \ldots, r_N\}$ is a valid root set then $R^* = \{r_0 + k, r_1 + k, \ldots, r_N + k\}$ is also a valid root set, where $r_i, k \in Z_M$.
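The cofactor argument in the proof can be followed in code for small matrices. The sketch below builds T with one column per root, as in Eqn. (3.10), and inverts it through the adjugate and the inverse of the determinant in Z_M. The function names are illustrative, and the recursive determinant is only practical for small N.

```python
from math import gcd


def vandermonde(roots, m):
    """T[i][j] = roots[j]^i mod m: row i holds the i-th powers of all roots."""
    return [[pow(r, i, m) for r in roots] for i in range(len(roots))]


def minor(mat, i, j):
    """Matrix with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(mat) if k != i]


def det_mod(mat, m):
    """Determinant by cofactor expansion along the first row, reduced modulo m."""
    if not mat:
        return 1
    return sum((-1) ** j * mat[0][j] * det_mod(minor(mat, 0, j), m)
               for j in range(len(mat))) % m


def inverse_mod(mat, m):
    """Inverse over Z_m via the transposed cofactor matrix divided by the determinant."""
    d = det_mod(mat, m)
    if gcd(d, m) != 1:
        raise ValueError("determinant is not a unit, so the inverse does not exist")
    d_inv = pow(d, -1, m)
    n = len(mat)
    return [[d_inv * (-1) ** (i + j) * det_mod(minor(mat, j, i), m) % m
             for j in range(n)] for i in range(n)]


def vec_mat(vec, mat, m):
    """Row vector times matrix over Z_m."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) % m
            for j in range(len(mat[0]))]


M, ROOTS = 7, [0, 1, 6]
T = vandermonde(ROOTS, M)
T_inv = inverse_mod(T, M)
C = vec_mat([0, 1, 1], T, M)         # forward map of x + x^2
print(C, vec_mat(C, T_inv, M))       # the reverse map returns the coefficients
```

With a root set whose differences are not all units, for example [0, 3, 1] in Z_9, `inverse_mod` raises the error, which mirrors the condition of the theorem.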

Theorem 3.3 If $M = \prod_i p_i$ then the maximum number of roots in any given valid root set for $Z_M$ is equal to the smallest prime, $p_i$.

To reinforce the understanding of the above concepts, Table 3.1 lists all rings from $Z_1$ to $Z_{50}$ and their maximum root sets.
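The root set condition and Theorem 3.3 can be checked with a few helper functions. The names below are illustrative; the maximum root set is taken to be {0, 1, ..., p-1} for the smallest prime factor p of M, which is one valid choice of maximum size.

```python
from math import gcd


def is_valid_root_set(roots, m):
    """True if every pairwise difference of the roots is a unit in Z_m."""
    return all(gcd((r_i - r_j) % m, m) == 1
               for i, r_i in enumerate(roots) for r_j in roots[i + 1:])


def smallest_prime_factor(m):
    """Smallest prime dividing m (returns m itself when m is prime or 1)."""
    f = 2
    while f * f <= m:
        if m % f == 0:
            return f
        f += 1
    return m


def max_root_set(m):
    """A valid root set of maximum size according to Theorem 3.3."""
    return list(range(smallest_prime_factor(m)))


for m in (9, 30, 35, 49):
    print(m, max_root_set(m), is_valid_root_set(max_root_set(m), m))
```

Adding any further element to such a set forces two roots to agree modulo the smallest prime factor, so their difference can no longer be a unit.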

Multi-Indeterminate Polynomial Mapping
This approach allows the use of polynomials with more than one indeterminate. Let us
consider the following polynomials given by Eqn. (3.14) and Eqn. (3.15).

(3.14)

IT IT (r . .-x.)
L, j

L

(3.15)

i=lj=l

The generic polynomial PM(x) represents the data bits and the output result and the
fixed polynomial gM(x) generates the ideal for the quotient ring. Eqn. (3.16) expresses
the forward polynomial mapping with the use of vector algebra. For notational simplicity,
only second order polynomials in two indeterminates are considered. By evaluating the
two indeterminates at all possible combinations of roots, the rows of the matrix T can be
constructed. For instance, the first row is obtained by replacing the first and second
indeterminates by their first roots.


Table 3.1. Rings and Root sets

| M | Maximum Root Set | M | Maximum Root Set |
|---|---|---|---|
| 1 | 0 | 26 | 0,1 |
| 2 | 0,1 | 27 | 0,1,2 |
| 3 | All Elements | 28 | 0,1 |
| 4 | 0,1 | 29 | All Elements |
| 5 | All Elements | 30 | 0,1 |
| 6 | 0,1 | 31 | All Elements |
| 7 | All Elements | 32 | 0,1 |
| 8 | 0,1 | 33 | 0,1,2 |
| 9 | 0,1,2 | 34 | 0,1 |
| 10 | 0,1 | 35 | 0,1,2,3,4 |
| 11 | All Elements | 36 | 0,1 |
| 12 | 0,1 | 37 | All Elements |
| 13 | All Elements | 38 | 0,1 |
| 14 | 0,1 | 39 | 0,1,2 |
| 15 | 0,1,2 | 40 | 0,1 |
| 16 | 0,1 | 41 | All Elements |
| 17 | All Elements | 42 | 0,1 |
| 18 | 0,1 | 43 | All Elements |
| 19 | All Elements | 44 | 0,1 |
| 20 | 0,1 | 45 | 0,1,2 |
| 21 | 0,1,2 | 46 | 0,1 |
| 22 | 0,1 | 47 | All Elements |
| 23 | All Elements | 48 | 0,1 |
| 24 | 0,1 | 49 | 0,1,2,3,4,5,6 |
| 25 | 0,1,2,3,4 | 50 | 0,1 |


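$$
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \\ c_7 \\ c_8 \\ c_9 \end{bmatrix}
=
\begin{bmatrix}
1 & r_{11} & r_{11}^2 & r_{21} & r_{21} r_{11} & r_{21} r_{11}^2 & r_{21}^2 & r_{21}^2 r_{11} & r_{21}^2 r_{11}^2 \\
1 & r_{12} & r_{12}^2 & r_{21} & r_{21} r_{12} & r_{21} r_{12}^2 & r_{21}^2 & r_{21}^2 r_{12} & r_{21}^2 r_{12}^2 \\
1 & r_{13} & r_{13}^2 & r_{21} & r_{21} r_{13} & r_{21} r_{13}^2 & r_{21}^2 & r_{21}^2 r_{13} & r_{21}^2 r_{13}^2 \\
1 & r_{11} & r_{11}^2 & r_{22} & r_{22} r_{11} & r_{22} r_{11}^2 & r_{22}^2 & r_{22}^2 r_{11} & r_{22}^2 r_{11}^2 \\
1 & r_{12} & r_{12}^2 & r_{22} & r_{22} r_{12} & r_{22} r_{12}^2 & r_{22}^2 & r_{22}^2 r_{12} & r_{22}^2 r_{12}^2 \\
1 & r_{13} & r_{13}^2 & r_{22} & r_{22} r_{13} & r_{22} r_{13}^2 & r_{22}^2 & r_{22}^2 r_{13} & r_{22}^2 r_{13}^2 \\
1 & r_{11} & r_{11}^2 & r_{23} & r_{23} r_{11} & r_{23} r_{11}^2 & r_{23}^2 & r_{23}^2 r_{11} & r_{23}^2 r_{11}^2 \\
1 & r_{12} & r_{12}^2 & r_{23} & r_{23} r_{12} & r_{23} r_{12}^2 & r_{23}^2 & r_{23}^2 r_{12} & r_{23}^2 r_{12}^2 \\
1 & r_{13} & r_{13}^2 & r_{23} & r_{23} r_{13} & r_{23} r_{13}^2 & r_{23}^2 & r_{23}^2 r_{13} & r_{23}^2 r_{13}^2
\end{bmatrix}
\otimes_M
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \\ a_9 \end{bmatrix}
\qquad (3.16)
$$

or

$$C = T \otimes_M A \qquad (3.17)$$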

Note that T is a square matrix since the order of g(x) is one degree higher than that of p(x)
in all indeterminates. The reverse polynomial mapping is obtained by finding $T^{-1}$ with the
same conditions developed for the single-indeterminate case.
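As an illustration of how the rows of T arise from root combinations, the sketch below builds the mapping matrix for two indeterminates and applies the forward mapping. It assumes the column ordering 1, x1, x1^2, x2, x2 x1, ... and lets the roots of the first indeterminate vary fastest from row to row; the function names and the coefficient indexing from zero are choices made for this example only.

```python
from itertools import product


def mapping_matrix(roots1, roots2, order, M):
    """Build T row by row; x1 roots vary fastest, as in Eqn. (3.16)."""
    exps = range(order + 1)
    rows = []
    for r2, r1 in product(roots2, roots1):
        rows.append([(pow(r2, e2, M) * pow(r1, e1, M)) % M
                     for e2 in exps for e1 in exps])
    return rows


def forward_map(T, a, M):
    """Forward polynomial mapping: matrix-vector product reduced modulo M."""
    return [sum(t * x for t, x in zip(row, a)) % M for row in T]


def evaluate(a, x1, x2, order, M):
    """Direct evaluation; a[e2*(order+1)+e1] multiplies x2^e2 * x1^e1."""
    n = order + 1
    return sum(a[e2 * n + e1] * x2 ** e2 * x1 ** e1
               for e2 in range(n) for e1 in range(n)) % M
```

With the trivial root 0 used for both indeterminates, the first row of T is a one followed by zeros, so the first output equals the constant coefficient.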

3.2.3 Polynomial Mapping Implementation
Efficient VLSI implementation of the polynomial mapping is crucial to attract attention
from both the research and the industrial communities. The use of matrix notation has
been adopted to allow easy mapping to systolic array architectures. There is a plethora of
literature on systolic arrays and our objective here is to relate polynomial mapping design
parameters to the resulting mapping architecture rather than provide detailed mapping
theory for such arrays. We will also present alternative semi-systolic array architectures.


Systolic Array Approach
An intuitive implementation of matrix-vector multiplication, using a systolic array, is
shown in Figure 3.1. Each processing element, PE, performs a multiply and accumulate
operation, MAC, in which the input $a_i$ is multiplied by the appropriate matrix element $t_{ij}$
and added to the accumulator c. At the end of each PE row, the final result, $c_i$, will be
ready for independent channel DSP processing. Note that pipelining latches are not shown
in the figure.

Figure 3.1 Systolic Array for Matrix-Vector Multiplication

At this point it is constructive to define the notation used throughout this section; the
notation can be found in Table 3.2. To keep our argument simple, all indeterminates are
assumed to be of the same order throughout this section. Therefore the number of
processing channels is given by

$$N = (O_O + 1)^I \qquad (3.18)$$

Table 3.2 Notation Glossary

| Abbreviation | Meaning |
|---|---|
| $O_I$ | Input Polynomial Degree (for each indeterminate) |
| $O_O$ | Output Polynomial Degree (for each indeterminate) |
| $I$ | Number of Indeterminates |
| $N$ | Number of Processing Channels |
| $N_{FPE}$ | Number of Forward Processing Elements (PEs) |
| $N_{RPE}$ | Number of Reverse Processing Elements |
| $N_{TPE}$ | Number of Total Processing Elements |
| $N_{DPE}$ | Number of Distinct Processing Elements |

In order to facilitate a comparison among different PM architecture implementations, the
following cost function equations are provided:

$$N_{FPE} = [(O_I + 1)^I - 1]\,[(O_O + 1)^I - 1] \qquad (3.19a)$$

$$N_{RPE} = (O_O + 1)^{2I} \qquad (3.19b)$$

$$N_{TPE} = N_{DPE} = [(O_I + 1)^I - 1]\,[(O_O + 1)^I - 1] + (O_O + 1)^{2I} \qquad (3.19c)$$

Eqn. (3.19a) gives the total number of forward mapping blocks needed to construct the
forward PM. It assumes the use of the trivial root, r = 0, to reduce the mapping overhead;
in the case of multi-indeterminates, one of the root combinations should be 0 for all
indeterminates. Other roots such as 1 and -1 should also be used to marginally reduce the
silicon area of the PM. Eqn. (3.19a) also uses the fact that input polynomials are smaller in
order than the output polynomials when multiplication operations take place. The
elements of the inverse matrix have no obvious regularity and there is no reduction, in
general, to the computational requirement of the inverse PM (Eqn. (3.19b)). This approach
yields an architecture with all of its PEs distinct, as shown in Eqn. (3.19c).


Homogeneous Systolic Array Approach
Although the PEs are performing simple operations, namely modulo multiplication and
addition which can be implemented in silicon by the use of ROM look-up tables, their
contents are different, which makes VLSI implementation cumbersome. A novel approach
is presented here to remove the dependency of the PEs on their spatial coordinates in the
array. Recognizing that the rows of the mapping matrix T are obtained by some powers of
the roots, one can generate the powers of the roots in the PEs. Consider the generic PE
given in Figure 3.2.

Figure 3.2 Generic Processing Element (inputs $a_i^k$, $r_i^k$, $c_i^k$; outputs $a_i^{k+1}$, $r_i^{k+1}$, $c_i^{k+1}$)

Note that the superscript indicates the timing sequence. Root data is also fed to the PE and
the governing recursive equations are given by Eqn. (3.20).

$$
\begin{aligned}
a_i^{k+1} &= a_i^k \\
c_i^{k+1} &= c_i^k + a_i^k \cdot r_i^k \\
r_i^{k+1} &= r_i^k \cdot r_i
\end{aligned}
\qquad (3.20)
$$

The additional circuitry needed for generating the root data is small compared to the total
PE structure mainly for two reasons: (a) the roots are usually trivial or small; (b) the output
is usually smaller than the modulus and a highly compact binary logic circuit can be used.
The systolic array architecture given in Figure 3.1 can be simplified to accommodate the
generic PEs as shown in Figure 3.3. The first column is replaced by delay elements since
the corresponding elements in the mapping matrix, T, are always unity. The values of the
roots are fed to the PEs and processed along with the data stream. This forward scheme
offers the following advantages:

1. Only one cell has to be designed, which is ideal for VLSI implementation.
2. The layout is expandable and reusable. To increase the polynomial order,
we simply append more PEs and feed more roots. Note that if the original
root processing circuitry cannot handle the extra number growth, we need
only redesign that part in the appended PEs.
3. The architecture can be reduced for trivial roots such as 0 and 1. For
r = 0, a complete row is eliminated and $c_0 = a_0$. For r = 1, simple
modulo adders are used instead of the PEs for the respective row.
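The behaviour of one row of identical PEs can be sketched in software. The model below is an illustration under stated assumptions rather than a description of the actual circuit: each PE applies the recursion of Eqn. (3.20), the first column is treated as a delay element, and the names `generic_pe` and `evaluate_row` are hypothetical.

```python
def generic_pe(a, c, r_pow, r, M):
    """One PE step per Eqn. (3.20): update the accumulator and the root power."""
    c_next = (c + a * r_pow) % M
    r_pow_next = (r_pow * r) % M
    return c_next, r_pow_next


def evaluate_row(coeffs, r, M):
    """Chain identical PEs along one row to get sum(a_k * r^k) mod M."""
    c = coeffs[0] % M      # first column: delay element, matrix entry is 1
    r_pow = r % M          # root power fed to the first PE
    for a in coeffs[1:]:   # each coefficient passes through its PE unchanged
        c, r_pow = generic_pe(a, c, r_pow, r, M)
    return c
```

For r = 0 the row output reduces to the first coefficient, and for r = 1 it reduces to the modular sum of the coefficients, which matches the simplifications listed in item 3 above.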
Figure 3.3 Homogeneous Systolic Array for Forward Polynomial Mapping

The above systolic array applies directly to a single-indeterminate polynomial mapping
and can be easily adapted for the multi-indeterminate case. Polynomial terms can be
ordered in such a manner as to facilitate the evaluation of each indeterminate root
separately. Consider, for simplicity, the following second order polynomial in two
indeterminates $x_1$ and $x_2$:

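$$p(x) = (a_0 + a_1 x_1 + a_2 x_1^2) + (a_3 + a_4 x_1 + a_5 x_1^2)\,x_2 + (a_6 + a_7 x_1 + a_8 x_1^2)\,x_2^2 \qquad (3.21)$$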


The i-th element of the vector C can be obtained by first processing the indeterminate $x_1$,
all the terms inside the parentheses in Eqn. (3.21), and then feeding the results to another
set of PEs to evaluate the root of the indeterminate, $x_2$. Figure 3.4 shows the i-th row of the
systolic array. Note that all PEs are identical; missing arrows indicate that the output is
ignored. As expected, the multi-indeterminate systolic array results in the same number of
PEs to process one row as in the single-indeterminate case, although longer runs of
interconnecting wires are evident. The use of trivial roots will result in a further reduction
in PEs or in the elimination of a complete row.

Figure 3.4 Homogeneous Systolic Array for Multi-Indeterminate Forward PM

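The cost functions for this new architecture are provided below:

$$N_{FPE} = \underbrace{[(O_O + 1)^I - 1]\; O_I \sum_{s=1}^{I} (O_I + 1)^{I-s}}_{\text{Identical Blocks}} \qquad (3.22a)$$

$$N_{RPE} = (O_O + 1)^{2I} \qquad (3.22b)$$

$$N_{TPE} = (O_O + 1)^{2I} + [(O_O + 1)^I - 1]\; O_I \sum_{s=1}^{I} (O_I + 1)^{I-s} \qquad (3.22c)$$

$$N_{DPE} = (O_O + 1)^{2I} + 1 \qquad (3.22d)$$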

Eqn. (3.22a) provides the number of identical forward mapping blocks needed where the
first root is 0. The number of reverse mapping blocks is the same as in Eqn. (3.19b) since
we use the same architecture. Eqn. (3.22c) and Eqn. (3.22d) provide the total number and
the number of distinct mapping blocks, respectively. For a VLSI implementation, the lower
the number of distinct blocks, the faster the design turnaround cycle will be, owing to
standard cell replication and verification. In a highly competitive industry the time
window for producing a successful chip is limited and design cycle turnaround can be
critical.

Staged Approach
This approach is designed to treat each indeterminate in separate stages. We start with a
coefficient multidimensional vector in which each dimension represents an indeterminate.
We will use the same polynomial as in Eqn. (3.21); here it is rewritten in matrix form:

(3.23)

p(x)

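where t denotes the matrix transposition operator. Now, the forward polynomial mapping
is given by Eqn. (3.24)

$$
C = \begin{bmatrix} c_0 & c_3 & c_6 \\ c_1 & c_4 & c_7 \\ c_2 & c_5 & c_8 \end{bmatrix}
= \left( \begin{bmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \\ a_6 & a_7 & a_8 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 \\ r_{1,1} & r_{1,2} & r_{1,3} \\ r_{1,1}^2 & r_{1,2}^2 & r_{1,3}^2 \end{bmatrix} \right)^{t}
\begin{bmatrix} 1 & 1 & 1 \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{2,1}^2 & r_{2,2}^2 & r_{2,3}^2 \end{bmatrix}
= (A T_1)^t\, T_2 \qquad (3.24)
$$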

where $r_{i,j}$ is the j-th root of the i-th indeterminate. Note that all matrix products are computed
modulo M. If the same roots are used for both indeterminates, then $T = T_1 = T_2$ and the
polynomial mapping can be greatly simplified by using the same hardware for both stages.
Usually the order of the matrix T is small and its roots are simple; hence a custom
implementation approach as a unit cell is more appropriate to achieve high silicon
utilization. Figure 3.5 shows the construction of the staged approach.
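A small numerical check of the staged mapping may help. The sketch below computes the two stages in the form C = (A T_1)^t T_2 of Eqn. (3.24) and compares the result with direct evaluation of the polynomial. It assumes that A[m][n] is the coefficient of x2^m x1^n and that column j of each root matrix is [1, r_j, r_j^2]^t; all function names are hypothetical.

```python
def matmul_mod(X, Y, M):
    """Matrix product with every entry reduced modulo M."""
    return [[sum(x * y for x, y in zip(row, col)) % M for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def root_matrix(roots, order, M):
    """Column j is [1, r_j, r_j^2, ...]^t (assumed layout of T_1 and T_2)."""
    return [[pow(r, e, M) for r in roots] for e in range(order + 1)]


def staged_forward(A, roots1, roots2, M):
    """Stage 1 evaluates x1 at its roots, stage 2 evaluates x2."""
    order = len(A) - 1
    T1 = root_matrix(roots1, order, M)
    T2 = root_matrix(roots2, order, M)
    stage1 = matmul_mod(A, T1, M)
    return matmul_mod(transpose(stage1), T2, M)


def direct(A, x1, x2, M):
    """Reference: evaluate p(x1, x2) term by term."""
    return sum(A[m][n] * x2 ** m * x1 ** n
               for m in range(len(A)) for n in range(len(A[0]))) % M
```

Entry C[j][l] of the result is the polynomial evaluated at the j-th root of the first indeterminate and the l-th root of the second, and when both indeterminates share a root set the same root matrix serves both stages.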

Figure 3.5 Staged Approach Forward PM Implementation

The basic building block processing the root, $r_{i,j}$, for a second order polynomial is
characterized by Eqn. (3.25).

$$v = u_0 + u_1 r_{i,j} + u_2 r_{i,j}^2 \qquad (3.25)$$

Here, v and u denote the output and the input respectively. This block can be built using a
single three-input ROM or two 2-input ROMs and a latch. Generic processing elements
can be used to implement this block as shown in Figure 3.6.

Figure 3.6 Root Processing Block

In general, if A has I dimensions, then the polynomial mapping can be obtained
as follows:

C

(3.26)

where $t_{1,k}$ indicates matrix transposition with respect to the first and the k-th dimension. In
the following cost functions, we assume that all indeterminates are of the same order and
that they use the same set of roots.

$$N_{FPE} = O_O\, O_I \sum_{s=1}^{I} (O_I + 1)^{I-s} (O_O + 1)^{s-1} \qquad (3.27a)$$

$$N_{RPE} = I\,(O_O + 1)^{I+1} \qquad (3.27b)$$

$$N_{TPE} = N_{FPE} + N_{RPE} \qquad (3.27c)$$

$$N_{DPE} = 1 + (O_O + 1)^2 \qquad (3.27d)$$

A closer look at these cost functions reveals that this approach is far superior to the
previous ones for VLSI implementation. In this case, the number of distinct mapping
blocks reduces to that of a single-indeterminate mapping. To reinforce the concepts
presented above and to show the derivation of the above cost functions, Figure 3.7
is provided.

Note that this mapping arrangement is for three indeterminates, second order output
polynomials, and first order input polynomials. Processing elements with the same fill
pattern are identical and the square blocks are simply timing latches. The interconnection
area progressively widens through the stages, but is limited in between the stages.
Interconnection delay should not pose any problem for the following reasons:

1. Current technologies produce less than 10 picoseconds of delay for a typical
1 millimeter metal layer wire run. The expected pipeline processing
speed of such architectures is less than 1 GHz, so the delay will
amount to less than 1% of the clock period.
2. In the case of a large percentage delay, the critical interconnection wiring
can be pipelined to completely remove this obstacle with an insignificant
increase in area.

There are many factors which affect the selection of a particular PM implementation. This
section attempts to ease the selection process by comparing and capturing the main
advantages of each architecture in terms of design complexity, silicon area, and flexibility.
The following tables are produced using the cost functions given by Eqn. (3.19a)-
Eqn. (3.19c), Eqn. (3.22a)-Eqn. (3.22d), and Eqn. (3.27a)-Eqn. (3.27d). Approximately
the same number of channels is selected in each case, which results in approximately the
same computational dynamic range, and one cascaded multiplication for the DSP
algorithm is allowed. This also enables us to compare single-indeterminate with
multi-indeterminate architectures, assuming a single indeterminate is sufficient for the DSP
algorithm.

Figure 3.7 Complete PM Architecture for I=3, $O_O$=2, $O_I$=1 (labelled elements: Input Data Streams, DSP Processing Channels, Output Data Streams)

Comparisons
Table 3.3, Table 3.4, and Table 3.5 show various architectural costs for the systolic array
(SAA), homogeneous systolic array (HSA), and staged approach (SA), respectively. In all
cases the HSA is similar to the SAA except for the number of distinct PEs. The slight
reduction in the distinct PEs is offset by an increase in the PE block area and complexity
and also some increase in global interconnects.

Table 3.3 Mapping Overhead for I=1, $O_O$=24, and $O_I$=12

| | Systolic Array Approach | Homogeneous Systolic Array Approach | Staged Approach |
|---|---|---|---|
| $N$ | 25 | 25 | 25 |
| $N_{FPE}$ | 288 | 288 | 288 |
| $N_{RPE}$ | 625 | 625 | 625 |
| $N_{TPE}$ | 913 | 913 | 913 |
| $N_{DPE}$ | 913 | 626 | 626 |

Table 3.4 Mapping Overhead for I=2, $O_O$=4, and $O_I$=2

| | Systolic Array Approach | Homogeneous Systolic Array Approach | Staged Approach |
|---|---|---|---|
| $N$ | 25 | 25 | 25 |
| $N_{FPE}$ | 192 | 192 | 64 |
| $N_{RPE}$ | 625 | 625 | 250 |
| $N_{TPE}$ | 817 | 817 | 314 |
| $N_{DPE}$ | 817 | 626 | 26 |


Table 3.5 Mapping Overhead for I=3, $O_O$=2, and $O_I$=1

| | Systolic Array Approach | Homogeneous Systolic Array Approach | Staged Approach |
|---|---|---|---|
| $N$ | 27 | 27 | 27 |
| $N_{FPE}$ | 182 | 182 | 38 |
| $N_{RPE}$ | 729 | 729 | 243 |
| $N_{TPE}$ | 911 | 911 | 281 |
| $N_{DPE}$ | 911 | 730 | 10 |

All mappings essentially converge to the same results for a single indeterminate, as
expected. As the number of indeterminates increases, the staged approach becomes the
superior choice with less mapping overhead because the basic mapping matrix,
T, is repeatedly used in each stage. Another striking result is the low number of distinct
PEs, which is mainly due to the reuse of $T^{-1}$ in the reverse mapping. In fact, only two
mapping blocks are needed in the SA: one is the generic processor element and the other is
required to implement $T^{-1}$. The entire SA architecture can be built with a highly regular
interconnection of only two blocks, which lends itself to efficient VLSI implementation
using place and route software packages and requires little engineering assistance. The
above tables also show that an SA implementation with three indeterminates yields less
mapping overhead than two indeterminates due to lower output polynomial order. SA data
flow propagates through the PEs one dimensionally, and therefore timing latches are not
needed to synchronize the MAC result with the polynomial coefficients; this provides
extra silicon area savings as expected.
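The entries of Tables 3.3 to 3.5 follow directly from the cost functions, and the sketch below evaluates Eqn. (3.18), Eqn. (3.19a)-(3.19c), Eqn. (3.22a)-(3.22d), and Eqn. (3.27a)-(3.27d) for a given I, output order, and input order. The function names and the dictionary layout are choices made for this example; the formulas are taken as they read in this section, with the total for the staged approach assumed to be the sum of the forward and reverse counts.

```python
def channels(I, Oo):
    """Number of processing channels, Eqn. (3.18)."""
    return (Oo + 1) ** I


def saa_costs(I, Oo, Oi):
    """Systolic array approach, Eqn. (3.19a)-(3.19c)."""
    fpe = ((Oi + 1) ** I - 1) * ((Oo + 1) ** I - 1)
    rpe = (Oo + 1) ** (2 * I)
    return {"N": channels(I, Oo), "NFPE": fpe, "NRPE": rpe,
            "NTPE": fpe + rpe, "NDPE": fpe + rpe}


def hsa_costs(I, Oo, Oi):
    """Homogeneous systolic array approach, Eqn. (3.22a)-(3.22d)."""
    fpe = ((Oo + 1) ** I - 1) * Oi * sum((Oi + 1) ** (I - s)
                                         for s in range(1, I + 1))
    rpe = (Oo + 1) ** (2 * I)
    return {"N": channels(I, Oo), "NFPE": fpe, "NRPE": rpe,
            "NTPE": fpe + rpe, "NDPE": rpe + 1}


def sa_costs(I, Oo, Oi):
    """Staged approach, Eqn. (3.27a)-(3.27d)."""
    fpe = Oo * Oi * sum((Oi + 1) ** (I - s) * (Oo + 1) ** (s - 1)
                        for s in range(1, I + 1))
    rpe = I * (Oo + 1) ** (I + 1)
    return {"N": channels(I, Oo), "NFPE": fpe, "NRPE": rpe,
            "NTPE": fpe + rpe, "NDPE": 1 + (Oo + 1) ** 2}
```

Calling the three functions with (I, Oo, Oi) set to (1, 24, 12), (2, 4, 2), and (3, 2, 1) gives the columns of Table 3.3, Table 3.4, and Table 3.5 respectively.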

Figure 3.8 shows NTPE versus the number of processing channels available for 2 to 4
indeterminates. The figure clearly demonstrates the fact that mapping overhead can be
reduced substantially by keeping the polynomial order low and increasing the
indeterminate count.

Figure 3.8 Staged Approach Overhead Plots ($N_{TPE}$, from 0 to 60,000, versus Number of Channels, from 0 to 800, for I=2, I=3, and I=4)

However, the available choices for N become limited and the distance between successive
available N values increases as I increases (for I=4, only two choices are available in the
range $N \in [25, 1000]$, namely N=81 and N=625, as shown in the figure). This problem can be
alleviated by the following techniques:

1. Mixing Polynomial Orders: Different polynomial orders for various indeterminates can result in more closely spaced choices for N.
2. Using Additional Mapping: Polynomial mapping can be embedded in other
number systems such as the residue number system to increase the total
computational dynamic range without increasing the number of indeterminates while keeping the size of the individual computational rings small.

Note that mixing polynomial orders renders the mapping stages asymmetrical and the
mapping matrix T will be different for different stages. The subject of the next half of this
chapter indeed deals with the second solution; namely, using the PM within the RNS to
provide more flexibility in the design parameters of such DSP processing architectures.
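The sparsity of the available channel counts can be reproduced with a few lines. This sketch assumes that the output order is twice the input order (one cascaded multiplication, as in the configurations of Tables 3.3 to 3.5), so that N = (2 O_I + 1)^I; that restriction is an inference made for this example, and the function name is hypothetical.

```python
def channel_choices(I, lo, hi):
    """Channel counts N = (O_O + 1)^I in [lo, hi], assuming O_O = 2 * O_I."""
    choices = []
    Oi = 1
    while (2 * Oi + 1) ** I <= hi:
        n = (2 * Oi + 1) ** I
        if n >= lo:
            choices.append(n)
        Oi += 1
    return choices
```

Under this assumption, `channel_choices(4, 25, 1000)` returns only 81 and 625, in agreement with the two choices quoted for I=4.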

3.3 Polynomial Ring Engine Implementation

The PRE mapping strategy allows computations on Gaussian integers to be implemented
in direct product rings with simple mapping procedures between signed digit binary
representations and a direct product of identical modulus rings. The theoretical mapping
procedure requires several intermediate rings to be defined, and these are described in this
section.

3.3.1 Mapping Order
The order in which the polynomial representation, polynomial and residue mappings
occur gives rise to three possibilities to arrange PRE mappings. Figure 3.9 shows these
mappings. Note that the polynomial representation mapping must precede the polynomial
mapping.

Mapping I performs the binary to residue modulo reduction operation first for the set of
moduli under consideration. Out of each residue digit, a polynomial representation is
constructed and the number of polynomial coefficients will depend on the computational
requirement of the architecture. The third mapping is the polynomial mapping. After the
computational algorithms (DSP algorithms), the inverse mappings are performed in the
reverse order to obtain the result in binary representation. Mapping II performs the
polynomial representation mapping first and then reduces the polynomial coefficients into
residue digits. In mapping III, polynomial representation mapping and then polynomial
mapping are performed prior to the residue mapping.

The following argument assumes all the mappings in Figure 3.9 yield the same number of
DSP channels and output computational dynamic range with the same number of cascaded
multiplications in the DSP algorithm. Mapping II results in the smallest silicon area due to
the following four reasons:

Figure 3.9 PRE mapping possibilities (Mapping I, Mapping II, and Mapping III, each running from the Binary Input Data Stream through the forward mappings and the inverse mappings to the Binary Output Data Stream)

(a) Binary data bits are decomposed (grouped) into polynomial coefficients first, thus
increasing the computational dynamic range without resorting to a larger moduli
set (as in mapping I) and also rendering the residue mapping simple or trivial (trivial in the case where each coefficient is less than the smallest modulus).
(b) Polynomial mapping involves evaluation of a set of roots and cannot be made trivial; hence performing the residue mapping prior to the polynomial mapping will
result in the use of smaller modulo computational blocks that support VLSI implementation.
(c) Inverse polynomial representation mapping is performed last, which facilitates the
use of fast binary adders rather than modulo adders (the truth tables of binary
adders are highly decomposable and result in smaller layouts).
(d) The ability to join the binary additions of mixed radix conversion in the inverse
residue mapping with the binary additions required for the inverse polynomial representation mapping.

Considering all the factors above, it becomes clear that
mapping II yields the best architecture in terms of silicon area and implementation
complexity and hereafter will be referred to as PRE mappings.

3.3.2 Constructing PRE Architectures
The overall mapping and computation architecture is shown in Figure 3.10 for a complex
input data stream. This figure shows the resulting DSP processing stage is indeed a
systolic array structure with only one dimensional interconnections. For this architecture,
it is assumed that the DSP computations can be implemented with linear pipelines; this is
certainly the case with inner product algorithms (e.g. [54]). A brief description of the
blocks, and their computational requirements, is given in Table 3.6.

Figure 3.10 Overall PRE Architecture (labelled elements: Complex Binary Number Stream input, Binary to Polynomial Representation, Polynomial Mapping, systolic data processing stage over small finite rings, Complex Binary Number Stream output)

Table 3.6 Architectural requirements for forward and reverse mapping

| Block | Description | Computing Blocks |
|---|---|---|
| Binary → Polynomial | Map weight of redundant binary digits to coefficients of polynomial indeterminates | Simple interconnection |
| Modulo Reduction | Reduce coefficients by $\{m_i\}$ | No action since coefficients are chosen to be less than $\{m_i\}$ |
| Polynomial Mapping | Stage approach forward mapping on roots $\{-1, 0, 1\}_{m_i}$ | Pipelined 3-bit weighted modulus addition (6-bit input 3-bit output blocks) |
| Data Processing | DSP algorithmic computation | Systolic array based on general function 3-bit modulo computations (6-bit input 3-bit output blocks) |
| Reverse Polynomial | Stage approach reverse mapping | Pipelined 3-bit weighted modulus addition |
| CRT | 3-modulus CRT (simple mixed radix system) | Pipelined 6-bit input 3-bit output blocks |
| Polynomial → Binary | Mapping to weighted binary (shifts) and the complex operator (separation) | Pipelined shifted binary adders and subtractors |

Figure 3.11 shows an example PRE architecture which we first presented in [20]. The
simulation uses two indeterminates; one corresponds to a binary weight of 2 and the other
represents the complex operator j. The RNS representation is based on the residue set
{3, 5, 7} which keeps the data representation to 3 bits throughout the architecture. The
dotted line indicates the placement of the DSP algorithmic computation (not included in
this simulation) and the boxes contain the two modular structures for the forward and
reverse mapping portions used in the stage approach implementation.

Figure 3.11 PRE Example Implementation (labelled elements: Input Blocks, Pipeline Latches, Weighted modulo adders, CRT Block)

The input blocks are used to generate the real and imaginary binary input data for the
simulation. The 3-bit 3-modulus CRT behavioral model block can be implemented by
pipelined 6-bit input 3-bit output blocks using a simple mixed radix system [6]. The output
converters are represented with a behavioral model simulating fast binary adders [86].
After the CRT block, we use a stage of binary subtractors to separate the coefficients of
$j^2 = -1$ from the real part polynomial coefficients. It is at this stage where we introduce
the sign bit for the real part of the complex number output.
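For readers unfamiliar with mixed radix conversion, the following is a generic textbook sketch of how three residues modulo {3, 5, 7} are turned back into an integer. It is not the behavioral model used in the simulation, and the function names are hypothetical. Each inner update combines two values of at most 3 bits into a 3-bit result, which is the shape of a 6-bit input, 3-bit output block, although the actual partitioning into blocks is not specified here.

```python
def mixed_radix_digits(residues, moduli):
    """Digits d_k with X = d_0 + d_1*m_0 + d_2*m_0*m_1 + ..."""
    x = list(residues)
    digits = []
    for k, mk in enumerate(moduli):
        d = x[k] % mk
        digits.append(d)
        for j in range(k + 1, len(moduli)):
            mj = moduli[j]
            # subtract the digit, then divide by m_k inside the ring Z_mj
            x[j] = ((x[j] - d) * pow(mk, -1, mj)) % mj
    return digits


def from_residues(residues, moduli):
    """Recombine the mixed radix digits with ordinary binary arithmetic."""
    value, weight = 0, 1
    for d, m in zip(mixed_radix_digits(residues, moduli), moduli):
        value += d * weight
        weight *= m
    return value


def to_signed(value, modulus_product):
    """Map [0, M) onto a range symmetric about zero."""
    return value - modulus_product if value > modulus_product // 2 else value
```

The final recombination uses only shifts, binary multiplications by constants, and binary additions, which is why it can be merged with later binary adder stages.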

There are a total of 9 replications of each of the 3 residue rings (27 channels in total). Note
that each of the three moduli computational channels is completely independent until the
CRT stage is reached. The number of stages in both the forward and reverse mapping is
equal to the number of indeterminates (in this example two). The reverse mapping
illustrates the need for a full 2nd order polynomial inversion whereas the forward mapping
is simplified because the input polynomial has order one. The basic mapping block
structures for both the forward and inverse maps are contained in the shaded areas on
Figure 3.11. The weighted modulo adder blocks will actually take no more hardware than
the binary adders since we implement the blocks as minimized look-up tables. The
simulator includes a weight option within each modulo adder block (not shown in
Figure 3.11). The DSP computation is carried over each of the 27 channels with complete
independence between the channels. This offers unique opportunities for fault detection
and tolerance, simple testing [53], and clocking strategies.

This example architecture has an input dynamic range of 3 (for both real and imaginary
parts of the Gaussian number). The maximum output dynamic range is 468 (from -234 to
+234) for the real part and 468 for the imaginary part with one cascaded multiplication
and 26 additions. We may increase the input dynamic range, output dynamic range, and/or
the block length of the inner product (number of additions) by allowing the occasional
overflow of the output polynomial coefficients and/or grouping the input data bits. In the
next section we discuss the potential and provide examples for optimization of such an
architecture.

3.3.3 Probability of Overflow and Computational Accuracy
Optimization and trade-offs of various design parameters can be carried out based on the
statistical distribution of the input data streams to the output polynomial coefficients. The
following list contains typical design parameters for the PRE architectures:

1. Product of moduli
2. Input data distribution
3. Input bit representation
4. Dynamic range of the input data
5. Block length
6. Probability of overflow (POF)

With simple manipulations, it is found that the output dynamic range of the above
architecture is increased to 490 by grouping 2 bits to the input polynomial coefficients $x_1$
and $x_1 x_2$ (where $x_1 = 2$ and $x_2 = j$). The input dynamic range also increases to 7 for both real
and imaginary parts. However, the block length is decreased to 5 with zero probability of
overflow.

If we allow an occasional overflow, Figure 3.12 shows the block length versus the POF for
the original problem (real and imaginary input dynamic range of 3).

Figure 3.12 POF versus block length (horizontal axis: block length, 70 to 150; vertical axis: POF, 0 to 1.80E-04)

It is important to mention that for a block length of 26 the POF is zero, whereas for a block
length of 27 the POF is less than 4.4E-65 and the output polynomial coefficient that
overflows corresponds to $x_1 x_2$. At a block length of 53 other coefficients will start to
overflow. This means that up to a block length of 52 the imaginary part will have a
maximum error of 44.4% of the maximum imaginary value possible. If we scale the
output, for instance, to half the number of bits, the overflow error will vanish. It is clear
that building a system with POF=0 is extremely pessimistic; however, the target
application might restrict our choices.

Prior work by others has focused on this topic and resulted in a software package,
MODULUS [87]. The software allows fast and efficient techniques for considering various
PRE design parameters for trade-offs based on a comprehensive statistical analysis of the
input data.

3.4 Summary

In this chapter we have introduced different implementations of the polynomial mapping
for the Polynomial Ring Engine. Comparing architectures based on standard cell count
shows that a novel staged approach is the most efficient implementation both in terms of
silicon area and design complexity. Systematic procedures for constructing the
architectures suitable for VLSI design tools have been introduced to generate complete
polynomial mapping layouts with little on-line engineering assistance. Design parameter
trade-offs such as the number of indeterminates, polynomial orders, probability of
overflow, and computational block length are discussed. Finally, a complete
implementation of the Polynomial Ring Engine is presented and simulated. The
architecture uses the best possible arrangement to reduce silicon area and ease the design
process.

Chapter 4
Transistor Level PRE Synthesis

4.1 Introduction

All of the architectures discussed so far in this dissertation require
circuit synthesis tools for generating the basic building blocks. The
majority of the existing tools utilize standard cell libraries that
contain a variety of logic gates (e.g. NAND and NOR gates) based
on static logic circuit techniques. To overcome speed restrictions
and fan-out requirements, standard libraries often contain various
sizes of each cell. While existing tools can be used to implement
finite arithmetic architectures, alternatives are available. In
particular we consider dynamic logic as a natural implementation
of high throughput rate pipelined DSP computations and we
specifically examine a recent alternative approach to logic gates by
using direct implementations of minimized truth tables.

The Switching Tree circuit synthesis approach [26] is based on fast
CMOS dynamic logic and offers a suitable alternative to generating
the necessary functional blocks for constructing the PRE
architectures. The synthesizer starts with full binary graph
transistor implementation of the truth table and then applies
reduction techniques to minimize the transistor count and maximize
logic height (number of serially connected transistors). The
minimized trees are then embedded in dynamic True Single-Phase Clocked (TSPC)
latches [78]. Extensive research [25] has been conducted to efficiently realize logic
functions with large truth table entries. Using current mode latches, it is possible to show
that very high trees (height > 10) can be successfully pipelined [83]. The throughput rate
of a switching tree is dictated by the logic height, transistor size profiles, tree topology,
process parameters, etc. Transistor sizing is instrumental in increasing the tree height
while maintaining required throughput rate, and results in the ability to meet both large
computational dynamic range and high throughput rate at the same time.

This chapter focuses on the development of transistor sizing techniques targeted to size
CMOS dynamic logic for application to PRE architectures. The technique relies on a
simple delay model and novel analytical formulation [55] to size single nFET chains. We
introduce new iterative techniques to size the more complex nFET trees that are required
in the PRE architecture. This chapter also develops circuit modules suitable for PRE
architecture construction using a standard bit-sliced modulo computation approach.

4.2 Analytical Approach to nFET Chain Sizing

This section presents a simple mathematical analysis to reinforce previously found
experimental results for the analytical sizing of single nFET chains [55]. We first consider
the established sizing formulae with the detailed derivation and definitions of all variables
given in Appendix B, for completeness. Then, a new mathematical justification of the
results is presented.

4.2.1 Single nFET Chain Sizing
For N+1 serially connected nFETs forming a dynamic logic chain, similar to the circuit
shown in Figure 4.1, the individual transistor sizes can be computed using Eqn. (4.1) and
(4.2) [55]. The implicit assumption (based on many observations) is that the Elmore delay
terms are approximately equal at the minimum discharge delay.

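[Eqn. (4.1) and Eqn. (4.2), the sizing formulae for the individual transistor widths, are not recoverable from the source; see Appendix B and [55].]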

where SPICE simulations and/or MOSFET transistor equations can be used to generate
parameters PURS, PURL, $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ (see Appendix B).

4.2.2 Validity of the Analytical Approach Assumption
The objective of this section is to show that the delay terms given by Elmore's formula are
approximately equal at optimum transistor sizes for typical circuits. The mathematical
analysis can be simplified greatly by considering a simple circuit where important
elements and mechanisms of the sizing problem are isolated. Let us start with a discharge
path consisting of two nFETs, as shown in Figure 4.1.

Figure 4.1 Simple nFET Chain
[Schematic with labels: VDD, clock φ, evaluation node E, input IN, $R_1$, $C_1$, $R_0$, $C_0$, $C_L$, GND.]

The discharge delay of the evaluation node, E, has been given by Elmore [60] and can be
expressed as a sum of individual delay terms

$T_D = \sum_{i=0}^{N} R_i \sum_{j=i}^{N} C_j = \sum_{i=0}^{N} T_{Di}$    (4.3)
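As a minimal illustration of the Elmore sum in Eqn. (4.3), the sketch below evaluates the individual delay terms for a chain described by a list of resistances and a list of node capacitances. The function names, and the convention that the load capacitance is already folded into the capacitance of the evaluation node, are assumptions made for this example and are not taken from the sizing software.

```python
def elmore_delay_terms(resistances, capacitances):
    """Return the delay terms T_Di = R_i * sum(C_j for j >= i).

    Index 0 is the transistor next to ground. The last capacitance belongs
    to the evaluation node and is assumed to already include the load C_L.
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one node capacitance per resistance")
    return [r * sum(capacitances[i:]) for i, r in enumerate(resistances)]


def elmore_delay(resistances, capacitances):
    """Total discharge delay: the sum of the individual delay terms."""
    return sum(elmore_delay_terms(resistances, capacitances))


# Two-transistor chain with arbitrary illustrative values.
R = [2.0, 1.0]          # R0, R1
C = [3.0, 4.0 + 5.0]    # C0, C1 + CL
print(elmore_delay_terms(R, C))  # [24.0, 9.0]
print(elmore_delay(R, C))        # 33.0
```

For two transistors the first term is R0(C0 + C1 + CL) and the second is R1(C1 + CL), which is the two-term expansion used in the analysis that follows.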

Expanding the above equation for two terms only we get

$T_D = T_{D0} + T_{D1} = R_0 (C_0 + C_1 + C_L) + R_1 (C_1 + C_L)$    (4.4)

All definitions for the capacitances and resistances are consistent with those given in
Appendix B. Let us make the following further simplification

$R_i = K_R / W_i$
$C_{DS_i} = K_C W_i$    (4.5)
$K = K_C K_R$

where $C_{DS_i}$ is the drain/source capacitance contribution. Then the node capacitances are
given by

$C_0 = K_C W_0 + K_C W_1$
$C_1 = K_C W_1$    (4.6)
$C_L = K_C W_P$

where $W_P$ is the size of the precharge pFET. Substituting Eqn. (4.6) and Eqn. (4.5) into
Eqn. (4.4) we obtain

$T_D = 2K + \frac{2 K W_1}{W_0} + \frac{K W_P}{W_0} + \frac{K W_P}{W_1}$    (4.7)

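For constant $W_0$ the optimum value for $W_1$ can be obtained as follows

$\frac{\partial T_D}{\partial W_1} = K \left( \frac{2}{W_0} - \frac{W_P}{W_1^2} \right) = 0$

Then

$W_1 = \sqrt{\frac{W_P W_0}{2}}$    (4.8)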
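The closed-form optimum can be checked numerically. The following sketch evaluates the two-transistor delay $T_D = 2K + 2KW_1/W_0 + KW_P/W_0 + KW_P/W_1$ on a fine grid of $W_1$ values and compares the grid minimum with $\sqrt{W_P W_0 / 2}$. The normalization K = 1, the chosen widths, and the function names are illustrative assumptions.

```python
import math


def two_fet_delay(w0, w1, wp, k=1.0):
    """Two-transistor delay: 2K + 2K*W1/W0 + K*Wp/W0 + K*Wp/W1."""
    return 2 * k + 2 * k * w1 / w0 + k * wp / w0 + k * wp / w1


def optimum_w1(w0, wp):
    """Closed-form optimum for a fixed W0: W1 = sqrt(Wp * W0 / 2)."""
    return math.sqrt(wp * w0 / 2.0)


w0, wp = 4.8, 1.2  # widths in microns, illustrative values only
w1_opt = optimum_w1(w0, wp)

# Brute-force search over a fine grid as an independent check.
grid = [0.2 + 0.001 * n for n in range(10000)]
w1_grid = min(grid, key=lambda w1: two_fet_delay(w0, w1, wp))

print(w1_opt, w1_grid)
```

The two printed values agree to within the grid spacing, which is what one would expect if the stationary point is the minimum of the delay for fixed $W_0$.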

Substituting the optimum value of $W_1$ into the delay terms, we get

[Eqn. (4.9) is not recoverable from the source.]    (4.9)
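For reference, the substitution can be carried out directly from the two-transistor Elmore expansion, the simplified resistance and capacitance model, and the optimum $W_1 = \sqrt{W_P W_0 / 2}$. The following is a worked sketch derived here from those relations; the grouping of terms may differ from the form printed in the original equation.

```latex
\begin{align*}
T_{D0} &= R_0 (C_0 + C_1 + C_L)
        = K\left(1 + \frac{2W_1}{W_0} + \frac{W_P}{W_0}\right)
        = K\left(1 + \sqrt{\frac{2W_P}{W_0}} + \frac{W_P}{W_0}\right),\\
T_{D1} &= R_1 (C_1 + C_L)
        = K\left(1 + \frac{W_P}{W_1}\right)
        = K\left(1 + \sqrt{\frac{2W_P}{W_0}}\right),\\
T_{D0} - T_{D1} &= K\,\frac{W_P}{W_0}.
\end{align*}
```

Under these assumptions the two delay terms differ only by $K W_P / W_0$, so the mismatch shrinks as $W_0$ grows relative to $W_P$.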

The experimental evidence suggests that $T_{D0} \approx T_{D1}$ for the sizing distribution that
minimizes delay. If we set the equality $T_{D0} = T_{D1}$ at the optimum sizing distribution,
then the percentage error can be defined as

$E_p = |\ldots| \times 100$    (4.10)
[The ratio inside the absolute value is not recoverable from the source.]

Eqn. (4.10) is plotted for $W_P$ = 1.2µ and $W_0$ from 1.2µ to 12µ in Figure 4.1.

Figure 4.1 Percentage Error Versus $W_0$
[Plot: percentage error $E_p$ (0 to 45) on the vertical axis against $W_0$ [µ] (1.2 to 12) on the horizontal axis.]

The following observations can be made:

1. $E_p$ decreases rapidly for smaller $W_P/W_0$ ratios. This is expected
since transistor sizing will be more effective when the capacitance
contribution of the nFET is larger than that of the pre-charge pFET.
2. SPICE simulations of sized nFET chains show that sizing is more
effective as the number of nFETs in the discharge path increases (see
Figure 4.12). This leads us to conclude that for more
nFETs, $E_p$ becomes negligible. This is a natural result since for
more nFETs in the chain, the capacitance of the internal nodes
becomes as significant as the capacitance on the evaluation node, E,
and sizing tends to reduce these capacitances while maintaining
sufficient driving capabilities.
The above analysis establishes that for typical circuit parameters and topologies (logic
heights) the percentage error resulting from unequal delay terms at the optimum sizing
profile is very small and hence the approximation made by making the delay terms equal
is valid under these conditions. While there is no guarantee that the sizing formulae will
produce a globally optimum sizing profile, the results can be checked easily by circuit
simulation to raise the confidence level.

4.3 Complex nFET Logic Sizing

As the logic block complexity increases, logic height (fan-in) increases, and transistor
sizing becomes essential to reduce pull-down delay. Sizing, in some cases, is the critical
factor in selecting one circuit topology over the other. In [55] analytical formulae are
introduced to size single nFET chains. This section introduces a new hybrid approach in
which the analytical formulation is extended by the use of an iterative algorithm to size
more complex nFET logic structures. This section starts with discussing the electrical
characteristics of logic blocks built using Switching Tree techniques [25] in order to use
them as an example to generate sizing results.

4.3.1 Switching Tree nFET Logic Structure
Figure 4.2 shows a full switching tree structure with three inputs built around nFET
dynamic logic. The clock φ is fed to the pre-charge pFET and the ground switch nFET at
the top and bottom of the tree respectively. Each internal node is connected downward to
two branches driven by an input signal and its complement. In the evaluation cycle only
one path to ground is formed, and therefore the analytical formulae can be used for sizing
the resulting nFET chain.

Figure 4.2 Binary switching tree nFET dynamic logic

Since all paths in the full tree structure are identical in terms of node loading and height,
sizing a single path is sufficient for sizing the whole tree. While full switching trees yield
desirable electrical discharge characteristics, transistor count and silicon area become a
primary concern for typical logic heights and hence tree minimization is essential.
Extensive research has resulted in the development of rules that reduce transistor count
and collapse tree height where, in some cases, the height is decreased from 17 to 10 [26].
After minimization, the tree becomes wide in the middle and narrow near the top and
bottom and no longer forms identical paths during the evaluation cycle. Sizing a path in
the tree affects other paths sharing some internal node(s) and therefore some type of
iterative technique must be used to ensure all paths conform to the delay requirement.

Expatiating on synthesizing and minimizing logic blocks based on switching trees is
beyond the scope of this dissertation. A general and representative switching tree path is
considered to generate results that will allow a summary of the delay/area characteristics
of general dynamic logic blocks. Figure 4.3 shows a typical switching tree path suitable
for the sizing software. Each internal node to the path is loaded by parallel nFET
connections.

Figure 4.3 General switching tree path

The loading nFETs are connected to ground through an appropriate number of serially
connected nFETs such that when sized it will be of the same size as the main path nFET.
This path structure is an approximation that enables us to perform optimization and to
extract results that will be used to optimize the overall architecture of the Polynomial Ring
Engine presented in Chapter 3. However, more accurate delay and silicon area results can
be obtained by the use of actual synthesized circuits.

4.3.2 Sizing Algorithm
The flow chart in Figure 4.4 shows the basic structure of the sizing algorithm. It starts with
transistor size setup based on load requirement and initial conditions. All possible
discharge paths are identified and their associated pull-down delays are computed to
facilitate the selection of the worst case discharge path. Then the selected path delay is
tested against the desired delay and the path is sized if its delay does not satisfy the delay
requirement. The process is repeated until all paths conform to the delay requirement.

A discussion of the convergence properties of the new algorithm is presented in
"Convergence Analysis".

Figure 4.4 Sizing Algorithm Flow Chart

Choose desired delay, select PFET and load sizes
Set all NFETs to minimum sizes

Compute delays for all paths

Are all delays less than the desired delay?

Select the worst path and size using the
analytical formula

The following is a detailed description of the new sizing algorithm:

1. Prompt the user to enter all values required for the sizing software
such as desired delay, load transistor sizes, and fabrication process
parameters.

2. Prompt the user to enter the circuit description of the pull-down
nFET logic.
3. Preprocess some input data to evaluate intermediate parameters
used by the analytical formulae such as capacitance factors, capacitance constants, PURS, and PURL.
4. Set all nFETs available for sizing to minimum size.
5. Search and list all possible paths for the given nFET logic.
6. Use the simple Elmore's delay formula to compute the discharge
path delay for all possible listed paths.
7. Select the largest path delay and test against the desired one.
8. Terminate the program if the selected path delay is less than the
desired delay and output the sizing results.
9. Size the nFET chain associated with the path delay in question if
the path delay is larger than the desired value. Use the set of analytical formulae given in Eqn. (4.1) and (4.2).
10. Repeat steps 6 to 9 until all discharge paths satisfy the delay
requirement or no improvement can be achieved.
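The iterative part of the procedure (steps 4 and 6 to 10) can be sketched as follows. This is only an outline under stated assumptions: the delay model is a simplified Elmore sum with normalized factors `K_R` and `K_C`, off-path loading is ignored, and the sizing step is a stand-in that widens every transistor on the worst path by a fixed factor instead of applying the analytical formulae of Eqn. (4.1) and (4.2). The "no improvement" exit is approximated by an iteration limit. All names are hypothetical.

```python
K_R, K_C = 1.0, 1.0  # normalized resistance and capacitance factors (assumed)


def path_delay(path, widths, c_load):
    """Simplified Elmore delay of one discharge path; path[0] is next to ground."""
    total = 0.0
    for i, name in enumerate(path):
        c_above = sum(K_C * widths[t] for t in path[i:]) + c_load
        total += (K_R / widths[name]) * c_above
    return total


def size_paths(paths, c_load, target, w_min=1.0, grow=1.1, max_iter=1000):
    """Widen the worst path repeatedly until every path meets the target delay."""
    widths = {t: w_min for p in paths for t in p}                  # step 4
    for _ in range(max_iter):
        delays = [path_delay(p, widths, c_load) for p in paths]    # step 6
        worst = max(range(len(paths)), key=delays.__getitem__)     # step 7
        if delays[worst] <= target:                                # step 8
            return widths, delays
        for t in paths[worst]:                                     # step 9 (stand-in)
            widths[t] *= grow
    raise RuntimeError("no convergence within max_iter iterations")


# Two paths that share a ground switch, as happens after tree minimization.
paths = [["gnd_sw", "a1", "a2"], ["gnd_sw", "b1", "b2"]]
widths, delays = size_paths(paths, c_load=10.0, target=20.0)
print(max(delays) <= 20.0)  # True
```

Because the ground switch belongs to both paths, sizing one path changes the delay of the other, which is the interaction that makes an iterative loop necessary.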
The above algorithm can be used to size not only Switching Tree dynamic logic but also
any dynamic logic such as Domino and Nora. In fact the first version of the coded
software has been used to size EMODL (Enhanced Multiple Output Domino Logic) for
implementing a 32-bit binary adder [86]. For feed-forward architectures, such as the CLA
(Carry Look Ahead Adder), the loading effect can be removed if transistor sizing
originates from the output stage blocks and progresses to the input stage blocks. For sizing
multiple output domino logic, and during the search for the worst discharge path, the top
evaluation node should be considered first and then the lower nodes. If all discharge paths
of the lower evaluation node are connected to the upper one, then the lower evaluation
node may be ignored during the search process. This is possible due to the fact that, for
optimally sized nFET chains, the potential profile of the nodes increases monotonically
with node height during the discharge process [69].

4.3.3 Sizing Software
To add flexibility and ease of use, the code has been divided into two parts: technology and
topology components. Figure 4.5 shows the two parts as they appear in the Extend™
simulation software, listed in Appendix C.

Figure 4.5 Sizing Software Components

The technology component allows the user to enter all related parameters for the
fabrication process under consideration. It also computes intermediate parameters needed
in the sizing formulae. The user is also allowed to override computed parameters in order
to compare different fabrication processes. The code also computes parameters for
the pFET device to facilitate future implementation of pFET sizing. Figure 4.6 shows a
screen shot of the front interface of this component.

Figure 4.6 Technology Component Interface
[Screen shot of the technology component dialog for the 1.2µ process, with Compute, Clear, Cancel, and OK buttons. Legible parameter rows are listed below; rows 9 to 13 and the rows after 17 are not legible in the source.]

Row   Name     Value (NFET)            Value (PFET)            Units
0     lmin     1.2e-06                 1.2e-06                 meter
1     ld       7e-08                   7e-08                   meter
2     vgs      5.0000                  5.0000                  volt
3     vbs      0.0000                  0.0000                  volt
4     vto      0.7900                  0.8400                  volt
5     eppsox   3.4515e-11              3.4515e-11
6     tox      2.5e-08                 2.5e-08                 meter
7     cox      0.0013806 (computed)    0.0013806 (computed)    farad/meter2
8     u        577.00                  205.00
14    wmin     1.2e-06                 1.2e-06                 meter
15    adsa     0.0000                  0.0000                  meter2
16    bdsa     2.4e-06                 2.4e-06                 meter
17    rratio   1.5000                  1.5000                  -

The topology component dialog box shown in Figure 4.7, Figure 4.8, and Figure 4.9
allows the user to describe the circuit interconnection by filling in a topology matrix. The
user identifies the output node(s) and its (their) loadings in terms of device widths in the
output node table. The loads may be the precharge pFET, and any static inverter FETs, in
the case of Domino logic. The software lists all possible paths from the evaluation node to
ground and their computed RC delays. Finally the software outputs a size matrix that
contains all of the computed sizes corresponding to the topology matrix.
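The path listing step can be pictured as a depth-first search from the evaluation node to ground. The sketch below uses a hypothetical adjacency-list circuit description rather than the topology matrix of the sizing software, whose encoding is not detailed in this section; the node and transistor names are invented for the example.

```python
def list_paths(graph, start, ground="GND"):
    """Enumerate all simple paths from start to ground as lists of transistors.

    graph maps a node name to a list of (transistor_name, next_node) pairs.
    """
    paths = []

    def walk(node, visited, trail):
        if node == ground:
            paths.append(list(trail))
            return
        for fet, nxt in graph.get(node, []):
            if nxt in visited:
                continue  # do not loop back into the current path
            trail.append(fet)
            walk(nxt, visited | {nxt}, trail)
            trail.pop()

    walk(start, {start}, [])
    return paths


# Hypothetical circuit: node E fans out to n1 and n2, which share a ground switch.
graph = {
    "E": [("m1", "n1"), ("m2", "n2")],
    "n1": [("m3", "n3")],
    "n2": [("m4", "n3")],
    "n3": [("m_gnd", "GND")],
}
print(list_paths(graph, "E"))
# [['m1', 'm3', 'm_gnd'], ['m2', 'm4', 'm_gnd']]
```

Each returned list is one candidate discharge path whose RC delay can then be evaluated and compared with the desired pull-down delay.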

Figure 4.7 Topology Component Interface (Circuit Description)
[Screen shot of the "Full Tree Path Sizing" dialog with Size, Delay, Clear all, Cancel, and OK buttons, fields for the desired pull-down delay [nano], the percentage error, an adjustable factor, and the matrix order [row] [col], and a topology matrix whose cells hold the entries w, b, or nfet. The first six rows are legible; rows 6 to 11 are only partly legible in the source.]

Row   Col 0   Col 1   Col 2   Col 3   Col 4   Col 5
0     w       b       b       b       b       b
1     nfet    nfet    b       b       b       b
2     b       w       b       b       b       b
3     nfet    nfet    nfet    b       b       b
4     b       b       w       b       b       b
5     nfet    nfet    nfet    nfet    b       b

Figure 4.8 Topology Component Interface (Output Node Definition)

Output Node Table
Row   Row Index   Col Index   W Precharge      W NFET           W PFET
0     0           0           1.2000000e-05    1.2000000e-06    1.2000000e-06

Figure 4.9 Topology Component Interface (Path listing and their delays)
[Screen shot of the Path Table with columns Row, Output Node, Path No., Total NFETs, and Delay. Six paths (path numbers 0 to 5) are listed for output node 0, each with 6 nFETs. The legible delay entries are 1.0000644562, 1.0000349538, 1.0000288561, and 1.0000215724.]


4.3.4 Convergence
Although the iterative algorithm is quite simple in concept, it does not guarantee
convergence.

In this section we perform a basic mathematical analysis to show that convergence,
although not guaranteed, is highly probable for the algorithm.

Figure 4.10 Topology Component Interface (Computed Sizes)

Size Matrix (even-numbered rows 0 to 10 are empty)
Row   Col 0           Col 1           Col 2           Col 3           Col 4           Col 5
1     3.8078763e-06   3.8079247e-06
3     3.3825984e-06   3.8789663e-06   3.8794120e-06
5     4.2709735e-06   4.8725119e-06   5.5460453e-06   5.5464351e-06
7     5.3475775e-06   6.0765699e-06   6.8928116e-06   7.8054851e-06   7.8060323e-06
9     6.6522925e-06   7.5357438e-06   8.5249308e-06   9.6309816e-06   1.0868843e-05   1.0868068e-05
11    8.2334505e-06   9.3040876e-06   1.0502863e-05   1.1843264e-05   1.3343404e-05   1.3342465e-05

Total Gate Width: 0.0001935721
Number of Sized Transistors: 26
Average Transistor Width: 7.44506e-06

Convergence Analysis
In this analysis, the relationship between the capacitance/resistance contribution and the
nFET width is simplified as follows

$R_i = K_R / W_i, \quad C_i = K_C W_i$    (4.11)

Let us consider a typical circuit section as given in Figure 4.11.

Figure 4.11 A typical circuit section

There are two paths to ground, namely PL and PR. The purpose of this analysis is to show
that for any given path, the delay sensitivity due to non-path nFETs is less than that of the
path nFETs. By symmetry only the left path is considered and for the given circuit the
above condition can be written as shown in Eqn. (4.12).

$\left| \frac{\partial T_{DL}}{\partial W_{1R}} \right| < \left| \frac{\partial T_{DL}}{\partial W_{1L}} \right|$    (4.12)

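The pull-down delay for the left path is given by Elmore's formula

[Eqn. (4.13) is not recoverable from the source.]    (4.13)

The partial derivatives are given by

$\frac{\partial T_{DL}}{\partial W_{1R}} = \frac{K_R K_C}{W_{1L}}$    (4.14)

$\frac{\partial T_{DL}}{\partial W_{1L}} = -\frac{K_R K_C W_{1R}}{W_{1L}^2} - \frac{2 K_R K_C W_2}{W_{1L}^2} - \frac{K_R C_L}{W_{1L}^2}$    (4.15)
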

It is clear from Eqn. (4.14) and (4.15) that the condition given by Eqn. (4.12) is satisfied
for $W_{1R} \approx W_{1L}$, which is the case when optimum nFET sizes are achieved by symmetry.
In order to examine the algorithm convergence property for a typical silicon technology
we use values from the technology component, shown in Figure 4.6, for our target 1.2µ
process. Assuming that $W_{1R} = W_{1L} = 2 W_2$ = 1.2µ and $C_L = K_C \times$ 12µ, we obtain
Eqn. (4.16).

$\frac{\partial T_{DL}}{\partial W_{1R}} = 6.8 \times 10^{-7}$
    (4.16)
$\frac{\partial T_{DL}}{\partial W_{1L}} = -6.8 \times 10^{-7} - 6.8 \times 10^{-7} - 6.5 \times 10^{-6}$

Eqn. (4.16) shows that for typical nFET sizes, the absolute value of $\partial T_{DL} / \partial W_{1L}$ is
actually one order of magnitude larger than the magnitude of $\partial T_{DL} / \partial W_{1R}$ and hence
convergence is guaranteed for such typical circuits.
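The sensitivity comparison can be reproduced in normalized units. The sketch below assumes a particular Elmore expression for the left path, chosen here because its partial derivatives match Eqn. (4.14) and (4.15): $T_{DL} = R_{1L}(C_1 + C_2 + C_L) + R_2(C_2 + C_L)$ with $C_1 = K_C(W_{1L} + W_{1R} + W_2)$ and $C_2 = K_C W_2$. That expression, the normalization $K_R = K_C = 1$, and the function names are assumptions of this example, not the original Eqn. (4.13).

```python
def t_dl(w1l, w1r, w2, c_load, k_r=1.0, k_c=1.0):
    """Assumed Elmore delay of the left path (see the lead-in)."""
    c1 = k_c * (w1l + w1r + w2)   # node shared by the two branches
    c2 = k_c * w2 + c_load        # evaluation node, load included
    return (k_r / w1l) * (c1 + c2) + (k_r / w2) * c2


def sensitivities(w1l, w1r, w2, c_load, k_r=1.0, k_c=1.0):
    """Closed-form partial derivatives with respect to W1R and W1L."""
    d_w1r = k_r * k_c / w1l
    d_w1l = -(k_r * k_c * w1r + 2 * k_r * k_c * w2 + k_r * c_load) / w1l ** 2
    return d_w1r, d_w1l


# Widths in microns, with W1R = W1L = 2*W2 = 1.2 and C_L = K_C * 12.
w1l = w1r = 1.2
w2 = 0.6
d_w1r, d_w1l = sensitivities(w1l, w1r, w2, c_load=12.0)
print(abs(d_w1l) / abs(d_w1r))
```

With these values the ratio of the two magnitudes is close to 12, in line with the order-of-magnitude difference discussed above. The ratio does not depend on the product $K_R K_C$, so the normalization does not affect the conclusion.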

Recommendations for convergence
Even though we have not faced non-convergence problems with the sizing software so far,
the possibility exists. The following list contains some basic techniques that can be
followed or added to the sizing software to further reduce the remote chance of non-convergence.

1. Using a percentage error parameter: This parameter allows some
delay tolerance among the paths to speed up convergence. Note that
zero percentage error may cause the software to iterate endlessly
through the individual paths due to round-off error.

2. Ramping delay requirement: The delay requirement can be reached
in steps by starting with a large delay and then progressively reducing the delay after the software converges at each step. This technique can be incorporated into the software code.
3. Partitioning: The circuit may be divided into smaller sub-circuits
and sized separately. Care should be taken in placing capacitance
loading at each sub-circuit boundary to accurately represent the
loading.

4.4 Sizing Results and Discussions

A full tree path for various logic heights was simulated using IsSpice3, as shown in
Figure 4.12.

Figure 4.12 Full tree path pull-down delay
[Plot: pull-down delay (vertical axis, 0 to 7) against logic height (horizontal axis). Legend: W=1.2µ, W=2.4µ, Average W=2.4µ, W=1.4µ, W=2.8µ, Average W=2.8µ.]


The solid curves are for the CMOS4 1.2µ process offered by Nortel [96] while the dotted
curves are for the BATMOS 0.8µ process also offered by Nortel [95]. The circuit used for
the simulation and the sizing software is given in Figure 4.3 where the evaluation node is
connected to a precharge pFET of 10 times the minimum width given by the process and a
minimum size balanced inverter. There are three curves for each process. The first curve
(marked +) shows the pull-down delay versus logic height of the full tree path for
minimum width nFETs. Note that the actual number of serially connected nFETs is
greater than the logic height by one. The second curve (marked Δ) shows delay versus
height for nFETs of double the minimum width. Using the sizing software described in
Section 4.3.3, the nFETs are sized while keeping the average nFET width at double the
minimum width. The third curve (marked o) shows the delay versus height of the sized
nFETs.

Examining the curves reveals the following results:
1. The delay increases almost linearly with logic height for the
unsized tree paths due to increased loading on the internal nodes.
2. The sized tree path delays increase monotonically with logic
height but exhibit smaller slopes than those of the unsized ones.
3. Due to differences in device driving capabilities, the sized 1.2µ process tree paths provided smaller delays than those of the 0.8µ process.
Table 4.1 and Table 4.2 give the widths of the nFETs obtained by the sizing software for
the 0.8µ and the 1.2µ processes respectively. Note that the nFET widths are given in µm
and the last width of each column is for the ground switch.

Table 4.1 Full Tree Path Sizes, WAVERAGE=2.8µ, 0.8µ Process

NFET                               Logic Height
Index     6     7     8     9    10    11    12    13    14    15    16
  1     1.82  1.68  1.55  1.45  1.40  1.40  1.40  1.40  1.40  1.40  1.40
  2     1.61  1.46  1.40  1.40  1.40  1.40  1.40  1.40  1.40  1.40  1.40
  3     2.03  1.83  1.67  1.54  1.43  1.40  1.40  1.40  1.40  1.40  1.40
  4     2.55  2.26  2.05  1.88  1.73  1.62  1.56  1.42  1.40  1.40  1.40
  5     3.16  2.78  2.49  2.26  2.07  1.92  1.83  1.67  1.57  1.49  1.41
  6     3.89  3.39  3.01  2.71  2.46  2.27  2.12  1.94  1.82  1.71  1.62
  7     4.44  4.11  3.60  3.22  2.90  2.65  2.46  2.24  2.09  1.96  1.84
  8           4.64  4.25  3.81  3.41  3.10  2.83  2.58  2.40  2.23  2.09
  9                 4.77  4.47  3.99  3.60  3.25  2.96  2.73  2.54  2.37
 10                       4.98  4.64  4.18  3.72  3.39  3.11  2.87  2.67
 11                             5.12  4.81  4.26  3.87  3.53  3.24  3.00
 12                                   5.28  4.83  4.41  4.00  3.66  3.36
 13                                         5.26  4.99  4.53  4.11  3.77
 14                                               5.42  5.09  4.62  4.22
 15                                                     5.51  5.16  4.71
 16                                                           5.56  5.23
 17                                                                 5.62

Table 4.2 Full Tree Path Sizes, WAVERAGE=2.4µ, 1.2µ Process

NFET                               Logic Height
Index     6     7     8     9    10    11    12    13    14    15    16
  1     1.72  1.56  1.45  1.35  1.27  1.20  1.20  1.20  1.20  1.20  1.20
  2     1.48  1.33  1.22  1.20  1.20  1.20  1.20  1.20  1.20  1.20  1.20
  3     1.85  1.64  1.50  1.38  1.30  1.22  1.20  1.20  1.20  1.20  1.20
  4     2.27  1.99  1.81  1.66  1.55  1.45  1.36  1.29  1.22  1.20  1.20
  5     2.75  2.40  2.16  1.96  1.83  1.70  1.59  1.50  1.41  1.33  1.27
  6     3.32  2.86  2.56  2.31  2.14  1.99  1.85  1.73  1.62  1.53  1.45
  7     3.73  3.39  3.01  2.70  2.49  2.30  2.13  1.98  1.85  1.74  1.64
  8           3.78  3.52  3.14  2.89  2.65  2.44  2.26  2.10  1.96  1.85
  9                 3.89  3.61  3.33  3.03  2.78  2.56  2.37  2.21  2.07
 10                       3.96  3.82  3.47  3.16  2.90  2.67  2.48  2.32
 11                             3.80  3.93  3.58  3.27  3.00  2.77  2.58
 12                                   4.27  4.03  3.68  3.36  3.10  2.87
 13                                         4.36  4.11  3.75  3.45  3.19
 14                                               4.43  4.17  3.83  3.53
 15                                                     4.47  4.23  3.90
 16                                                           4.52  4.29
 17                                                                 4.57

4.5 Circuit Modules

This section shows the construction of circuit modules suitable for building the PRE
architecture. The circuit techniques used in this section appeared in scattered literature and
here we attempt to consolidate them into a single systematic circuit synthesis method.

4.5.1 Circuit Structure
Most DSP algorithms expected to be implemented with our architectures can be computed
using fixed coefficient multiplication and accumulation. Mapping stages also require the
same type of operation and hence they share the same type of circuits. Let us start by
looking at a typical mapping matrix, T, and its inverse in the ring $Z_7$.

\[
T = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 6 & 1 \end{bmatrix}, \qquad
T^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 4 & 3 \\ 6 & 4 & 4 \end{bmatrix}
\tag{4.17}
\]

The above matrix maps a second order polynomial with the roots -1, 0, 1 into the direct
product ring. Let us select the inverse mapping and more specifically the third coefficient
as an example. $C_3$ can be obtained by the following equations:

\[
\begin{aligned}
A^{(1)} &= 0 \oplus_7 (6 \times C'_1) \\
A^{(2)} &= A^{(1)} \oplus_7 (4 \times C'_2) \\
A^{(3)} &= A^{(2)} \oplus_7 (4 \times C'_3)
\end{aligned}
\tag{4.18}
\]

where the value of the accumulator A at the third cycle will be $C_3$ and $C'_i$ is a mapped
polynomial coefficient (computational channel result). We further decompose the
computation of the above equations by slicing the bits of the binary representation of $C'_i$:

\[
\begin{aligned}
A^{(3)} &= A^{(2)} \oplus_7 (4 \times C'_3) \\
        &= A^{(2)} \oplus_7 (4 \times 2^0 \times C'_3[0]) \oplus_7 (4 \times 2^1 \times C'_3[1]) \oplus_7 (4 \times 2^2 \times C'_3[2])
\end{aligned}
\tag{4.19}
\]

The above equation can be implemented using bit steered ROMs and is referred to as a
BIPSPm cell [54]. Table 4.3 provides contents for the three ROMs needed to evaluate
Eqn. (4.19).
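
The mapping of Eqn. (4.17) and the bit sliced accumulation of Eqn. (4.19) can be checked numerically. The following is a minimal Python sketch, not the circuit itself. It assumes the rows of T correspond to the roots 0, 1, -1 in that order (inferred from the weights 6, 4, 4 of Eqn. (4.18)); the function names are illustrative only.

```python
# Sketch only: T and T_INV are the matrices of Eqn. (4.17) over Z_7.
M = 7
T = [[1, 0, 0], [1, 1, 1], [1, 6, 1]]        # assumed row order: roots 0, 1, -1 (6 mod 7)
T_INV = [[1, 0, 0], [0, 4, 3], [6, 4, 4]]    # third row holds the weights 6, 4, 4


def matmul_mod(a, b, m=M):
    """Matrix product with every entry reduced modulo m."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) % m
             for j in range(len(b[0]))] for i in range(len(a))]


def forward_map(coeffs):
    """Map polynomial coefficients to the direct product ring using T."""
    return [sum(t * c for t, c in zip(row, coeffs)) % M for row in T]


def mac_bit_sliced(acc, weight, c, bits=3):
    """One step of Eqn. (4.19): add weight * 2**i for every set bit i of c."""
    for i in range(bits):
        if (c >> i) & 1:
            acc = (acc + weight * (1 << i)) % M
    return acc


def third_coefficient(mapped):
    """Three accumulate cycles of Eqn. (4.18), each one bit sliced."""
    acc = 0
    for weight, c in zip(T_INV[2], mapped):
        acc = mac_bit_sliced(acc, weight, c)
    return acc


mapped = forward_map([2, 5, 3])
print(mapped, third_coefficient(mapped))
```

Running the accumulation on any mapped triple should return the original third coefficient, which is the property the inverse mapping relies on.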


Table 4.3 BIPSPm ROM Contents

               ROM 0               ROM 1               ROM 2
Address   Decimal  Binary     Decimal  Binary     Decimal  Binary
   0         0      000          0      000          0      000
   1         4      100          1      001          2      010
   2         1      001          2      010          4      100
   3         5      101          3      011          6      110
   4         2      010          4      100          1      001
   5         6      110          5      101          3      011
   6         3      011          6      110          5      101

A careful look at the above table reveals the following:
1. ROM 1 is redundant and can be replaced by latches since its outputs
are identical to its inputs (address).
2. The output bits of ROM 2 can be obtained by rotating the output bits
of ROM 0 and hence only one ROM needs to be synthesized.
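
The two observations above can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the closed form (4 x 2^i x address) mod 7 for the contents of ROM i is read off the tabulated values of Table 4.3 and is not a formula given in the text. The helper names are hypothetical.

```python
M = 7
BITS = 3


def rom_contents(i):
    """Contents of ROM i for addresses 0..6, assuming (4 * 2**i * address) mod 7."""
    return [(4 * (1 << i) * address) % M for address in range(M)]


def rotate_right(value, bits=BITS):
    """Rotate a bits-wide word right by one position."""
    return (value >> 1) | ((value & 1) << (bits - 1))


ROM0, ROM1, ROM2 = (rom_contents(i) for i in range(3))

# Print the table: address, then decimal and binary contents of each ROM.
for address in range(M):
    cells = (f"{rom[address]} {rom[address]:03b}" for rom in (ROM0, ROM1, ROM2))
    print(address, *cells)

# Observation 1: ROM 1 returns its own address.
rom1_is_identity = ROM1 == list(range(M))
# Observation 2: ROM 2 is ROM 0 with the output bits rotated by one position.
rom2_is_rotation = ROM2 == [rotate_right(v) for v in ROM0]
print(rom1_is_identity, rom2_is_rotation)
```

The rotation works because multiplying by 2 modulo 7 is a one-bit rotation of a 3-bit word, so only ROM 0 carries non-trivial logic.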
Figure 4.13 shows a switching tree logic implementation of ROM 0. x is the input and E0 is
the evaluation node corresponding to the output bit. For a given input the tree provides a
path to ground for the evaluation node only if the output bit should be driven to logic high,
"1". Note that solid lines represent nFETs whose gates are driven by the true logic input
and the dashed lines represent nFETs whose gates are driven by the complement of the
logic input. The vertical gray lines simply represent shorts (wires). Switching tree logic
function synthesis is well documented in the literature where techniques for minimizing
the tree height and transistor count by re-ordering the input bits and merging sub-trees are
presented [25]. Figure 4.14 gives the circuit implementation of the switching tree logic
(only bit 0 of the output is shown). The ROM logic is embedded in an n-type true single
phase latch followed by a p-type latch. We have incorporated the bit steering switches
needed for bit slice computations in the p-type latch to optimize the performance and
silicon area utilization since the output will no longer see the switch finite impedance in
series with the output capacitive load. We also suggest buffering the input address while
generating the complement to reduce the loading effect for high throughput applications. True
single phase latches provide a high performance circuit implementation for dynamic logic
and are well documented in the literature [78]. Figure 4.15 gives the structure for the modulo
operation under consideration, namely $A^{(3)} = A^{(2)} \oplus_7 (4 \times C'_3)$, where $A^{(3)}$ is the output,
$A^{(2)}$ is the input address to the first ROM, x, and $C'_3$ provides the ROM steering control bits, y.
Note that ROM 2 is replaced by latches and the control bits are latched and rotated in every
ROM evaluation step.

Figure 4.13 ROM 0 Switching Tree Graph
[Switching tree graph: evaluation nodes at the top, tree levels controlled by the inputs x[0], x[1], x[2], reference nodes (ground) at the bottom.]

Figure 4.14 Circuit for Generating Output Bit 0 of ROM 0
[Circuit schematic between VDD and GND, clocked by Φ, with inputs x[2], x[1], x[0]. Two stages are labeled: "ROM with n-type latch" and "Bit steering switch with p-type latch".]


Figure 4.15 Bit Slice Implementation of Fixed Multiplication and Accumulation
[Block diagram: the address bits x[0], x[1], x[2] feed a chain of two blocks, each labeled "ROM with latches and switches"; the steering control bits y[0], y[1], y[2] are applied from the control input y.]

4.5.2 ROM Sizing and Simulation
Switching tree based ROMs lend themselves to sizing since they form single evaluation
nodes connected to multiple discharge paths of nFETs that can be optimized by the
techniques presented earlier in this chapter. The circuit shown in Figure 4.14 has been
simulated with minimum nFET widths using 1.2µ process SPICE models. Figure 4.17
shows SPICE simulation plots of the evaluation node of the ROM logic Switching Tree for
both the unsized and sized tree for a clock period of 2 ns. Note that our assumption is that
the clock will be symmetrical. The throughput rate may be increased by using
asymmetrical precharge and evaluate cycles of the clock [26], but symmetry is normally
required for safe and flexible clocking. By running many simulations we find that
performance starts to degrade for clocking periods less than 2 ns and ultimately fails
below 1 ns. This failure can be attributed to the slow evaluation cycle of the ROM logic
(E0 = 1.9 V after 0.5 ns) and also to a slow precharge cycle (E0 = 2.5 V after 0.5 ns). We have
used our sizing software to optimally determine the circuit nFET widths given in
Figure 4.16. Simulation shows that the performance has been increased by 22% over a
minimum size circuit (E0 = 0.8 V and E0 = 3.5 V for the evaluation cycle and precharge cycle
respectively, measured after 0.5 ns). Note that the pFET sizes for the p-type logic (steering
switches) and latch are estimated from the nFET sizes since the current software does not
handle pFET sizing.


Figure 4.16 Circuit nFET Sizes and Simulation Conditions
[Circuit schematic annotated with nFET sizes and simulation conditions.]

Figure 4.17 Transient Response of the ROM logic
[Plot: voltage (volts) versus time (ns, 2.2 to 3.8); traces: Vout (min), Vout (Sized), Clock.]

4.6 Summary

We have presented transistor techniques suitable for defining circuit modules needed to
construct the PRE architectures. The chapter has presented a mathematical analysis that
supports previously found experimental results in which nFET sizing formulae are
derived. The formulae are used to develop sizing software which enables the optimization
of complex nFET chains of sizes such as those found in domino logic structures and
Switching Tree blocks. The software breaks down the complex nFET structure into
separate discharge paths which can be sized using the analytical formulae. The loading
effects are updated and the paths are resized using an iterative algorithm to satisfy the
design conditions. Convergence is shown to be highly probable for typical nFET
structures with available silicon technology. Various steps, however, are suggested to
increase the probability of convergence. Extensive results are provided for sized general
switching tree logic using 0.8µ and 1.2µ processes.

We have also shown an example circuit module suitable for building PRE architectures.
The module uses Switching Tree logic pipelined with true single phase clocked latches
and the input data is steered with CMOS switches to enable bit-slice modulo computation.
An example layout of a modulo 7 fixed multiplier with accumulate is shown using our
target 1.2µ CMOS process. We note that this example represents the most complex
computational block used in the small modulus PRE architecture, discussed in Chapter 3.


Chapter 5
Some Specific PRE
Architectures

The work presented in this chapter demonstrates some specific
implementations of the PRE architectures. A Wavelet Transform,
WT, algorithm is targeted for its stringent computational power
requirement and its repeated use of FIR filtering typical of special
DSP hardware solutions. The use of algebraic integers for a
Daubechies mother wavelet also stresses the importance of number
theoretic architectures for DSP implementation. The second part of
the chapter looks at new fault tolerant PRE architectures in which a
combination of architectural and circuit level fault detection is
employed.

5.1 PRE Architecture for a Wavelet
Transform
In this section, the wavelet transform for image compression
conforming to the proposed high definition TV standard
(throughput rate, resolution, etc.) is implemented using a PRE
architecture. Wavelet transform processing of images promises high
data compression with negligible deterioration when assessed by
the human visual system. Our new implementation is obtained
through the design methodology proposed in Figure 1.1, "Systolic
Array Design Cycle," on page 7 of Chapter 1. The design process
allows circuit performance parameters to be considered along with system level design
parameters and hence more efficient designs can be achieved.

5.1.1 Wavelet Transform
The continuous wavelet transform, CWT, is defined by Eqn. (5.1) and is based on the concept
of scale and translation [91]. f(t) is the original signal and $h\left(\frac{t-\tau}{a}\right)$ is the mother wavelet
under scale factor a and translation $\tau$. The quantity $\frac{1}{\sqrt{|a|}}$ keeps the energy of the scaled
mother wavelet equal to the energy of the original mother wavelet.

\[
CWT_f = \frac{1}{\sqrt{|a|}} \int f(t)\, h\!\left(\frac{t - \tau}{a}\right) dt
\tag{5.1}
\]

a

Figure 5.1 shows the discrete wavelet transform implementation of two dimensional
signals [90].

Figure 5.1 Single Stage Forward WT for Images
[Block diagram: a row filtering stage using the filters G and H, followed by a column filtering stage using G and H on each branch.]

This implementation is adopted from the well established theory of sub-band coding
which relies on quadrature mirror filters for the signal separation [92]. For the Discrete
Wavelet Transform, DWT, the filter coefficients are replaced by mother wavelet
coefficients and the procedure is referred to as multi-resolution analysis and
reconstruction. The 2-D input signal $A^d_{2^{j+1}} f$ is first filtered with the quadrature mirror
filters $\tilde{G}$ and $\tilde{H}$ with respect to rows and down sampled by two. The resulting signals are
then filtered again with respect to columns. A multi-stage WT can be achieved by
repeating the whole process for the low resolution signal $A^d_{2^j} f$. It has been shown that a
three stage wavelet transform is sufficient for image analysis and compression [90]. Data
compression is obtained by the use of coding schemes such as run-length coding coupled
with thresholding of those parts of the data that are insignificant to the human visual
perception system. Figure 5.2 shows the reconstruction process of the 2-D signal
$A^d_{2^{j+1}} f$ given the detail signals $D^1_{2^j} f$, $D^2_{2^j} f$, $D^3_{2^j} f$, and the lower resolution signal $A^d_{2^j} f$.

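Figure 5.2 Single Stage Inverse WT for Images
[Block diagram: column filtering stage followed by row filtering stage, using the filters G and H to recombine the detail signals and the lower resolution signal.]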

The relationships among the various filters are given in Eqn. (5.2) and Eqn. (5.3) below,
given the filter H which contains the mother wavelet coefficients.

\[ \tilde{h}(n) = h(-n) \tag{5.2} \]

\[ g(n) = (-1)^{1-n}\, h(1 - n) \tag{5.3} \]

5.1.2 Considerations for HDTV
The following is a list of features offered by current proposals for HDTV specifications:
1. 8 bits per pixel per primary color.
2. Three primary colors.
3. 1024 by 768 pixels per frame.
4. 30 frames per second.

The above numbers translate to an output throughput rate requirement of 70.77888 MHz
where each data sample consists of 8 bits. Implementing reasonable compression
algorithms with current general computing devices is not feasible at the current time.
Since the target application is for mass production, hardware/software implementation
using dedicated architectures on silicon is reasonably justified in terms of both
manufacturing and design cost.

5.1.3 PRE Mapping Parameters
In this section the above requirements of the WT for HDTV are translated into mapping
requirements for the PRE architecture. While there exist many ways for carrying out the
mapping and integer coding, only the most efficient mapping, in terms of area, speed, and
computational accuracy, is considered.

Number of Indeterminates
In order to encode 8 bits, while keeping one bit per coefficient and using only first order
polynomials, 3 indeterminates are needed for the input data stream. An input data sample
can be written as:

\[ p(x, y, z) = a_0 + a_1 x + a_2 y + a_3 z + a_4 xy + a_5 xz + a_6 yz + a_7 xyz \tag{5.4} \]

Using the Daubechies (N=4) filter [93], the coefficients are given by Eqn. (5.5).

\[
h(0) = \frac{1 + \sqrt{3}}{8}, \quad
h(1) = \frac{1 - \sqrt{3}}{8}, \quad
h(2) = \frac{3 + \sqrt{3}}{8}, \quad
h(3) = \frac{3 - \sqrt{3}}{8}
\tag{5.5}
\]

Since $\sqrt{3}$ is an algebraic integer, this suggests that a new indeterminate can be introduced
to represent this irrational number [94], hence removing round-off error in encoding the
filter coefficients. Using the indeterminate, t, the filter coefficients become:

\[
h(0) = 1 + t, \quad h(1) = 1 - t, \quad h(2) = 3 + t, \quad h(3) = 3 - t
\tag{5.6}
\]

Note that the common factor 1/8 is removed from the coefficients and can be processed
in the polynomial to binary conversion stage.

Eqn. (5.7) provides all indeterminates and their assumed values:

\[ x = 2, \quad y = 4, \quad z = 16, \quad t = \sqrt{3} \tag{5.7} \]
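
To see why three indeterminates with the values of Eqn. (5.7) suffice for 8-bit samples, the sketch below enumerates the eight monomials of a first order polynomial in x, y, z and shows that they evaluate to the eight binary weights. It is illustrative only: the monomial ordering a0..a7 follows Eqn. (5.8), and the function names are hypothetical.

```python
X, Y, Z = 2, 4, 16   # assumed values of the indeterminates, Eqn. (5.7)

# Monomials 1, x, y, z, xy, xz, yz, xyz in the order a0..a7 of Eqn. (5.8).
MONOMIAL_WEIGHTS = [1, X, Y, Z, X * Y, X * Z, Y * Z, X * Y * Z]


def encode(sample):
    """Return the one-bit coefficients (a0..a7) of an 8-bit sample."""
    if not 0 <= sample < 256:
        raise ValueError("sample must fit in 8 bits")
    # Every weight is a distinct power of two, so each coefficient is one bit.
    return [(sample // w) % 2 for w in MONOMIAL_WEIGHTS]


def evaluate(coeffs):
    """Evaluate the polynomial at x = 2, y = 4, z = 16."""
    return sum(a * w for a, w in zip(coeffs, MONOMIAL_WEIGHTS))


print(MONOMIAL_WEIGHTS)
print(encode(150), evaluate(encode(150)))
```

Because the weights are exactly the powers of two from 1 to 128, the encoding is pure wiring: each input bit becomes one polynomial coefficient.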

Order Growth
A three stage WT requires six filtering passes. The filter coefficients contain only the
indeterminate, t, and therefore the order of the indeterminates x, y, and z will never grow.
Using non-quadratic residue rings (NQRRs), the order of the indeterminate t will grow with each filter
pass and the output polynomial will be

\[
\begin{aligned}
p(x, y, z, t) &= (a_0 + a_1 x + a_2 y + a_3 z + a_4 xy + a_5 xz + a_6 yz + a_7 xyz) \\
              &\quad \times (b_0 + b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5 + b_6 t^6) \\
              &= p(x, y, z) \times p(t)
\end{aligned}
\tag{5.8}
\]


Using quadratic residue rings (QRRs), the order of the polynomial p(t) will be kept to
first order throughout the computations.

Number Growth
The coefficients of the input polynomial p(x, y, z) are either 1 or 0 and the coefficients of
the filter polynomial p(t) can be set to the worst case condition where all the filter
coefficients are added; hence the output polynomial will be:

\[
\begin{aligned}
p(x, y, z, t) &= (1 + x + y + z + xy + xz + yz + xyz)(8 + 4t)^6 \\
(8 + 4t)^6 &= 262144 + 786432t + 983040t^2 + 655360t^3 + 245760t^4 + 49152t^5 + 4096t^6
\end{aligned}
\tag{5.9}
\]

The above equation indicates that for NQRRs, M should be greater than
983040+983040=1966080. Note that the coefficient of the term $t^2$ is doubled to reserve the
upper half of M for negative numbers. For QRRs the polynomial $(8 + 4t)^6$ can be reduced
noting that $t^2 = 3$, $t^3 = 3t$, $t^4 = 9$, $t^5 = 9t$, $t^6 = 27$.

\[ p(t) = 5533696 + 3194880t \tag{5.10} \]

Hence M > 5533696+5533696=11067392.
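
The numbers in Eqn. (5.9) and Eqn. (5.10) follow from a binomial expansion and from folding powers of t with t^2 = 3. A short Python sketch of that arithmetic is given below; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def worst_case_coefficients(passes=6, const=8, lin=4):
    """Coefficients of (const + lin*t)**passes, lowest order first."""
    return [comb(passes, k) * const ** (passes - k) * lin ** k
            for k in range(passes + 1)]


def fold_quadratic(coeffs, t_squared=3):
    """Reduce a polynomial in t using t**2 = t_squared.

    Returns (constant term, coefficient of t).
    """
    const = sum(c * t_squared ** (k // 2) for k, c in enumerate(coeffs) if k % 2 == 0)
    lin = sum(c * t_squared ** (k // 2) for k, c in enumerate(coeffs) if k % 2 == 1)
    return const, lin


coeffs = worst_case_coefficients()
m_nqrr = 2 * max(coeffs)            # largest coefficient doubled for the sign range
const, lin = fold_quadratic(coeffs)
m_qrr = 2 * max(const, lin)

print(coeffs)
print(m_nqrr, (const, lin), m_qrr)
```

The folding step is what keeps p(t) first order in the QRR case, at the price of a larger per-coefficient dynamic range.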

Moduli Set for NQRRs
To illustrate the selection process of the moduli set, Table 5.1 lists all primes in the range
1-63. The set cannot be formed with moduli all less than 5 bits (<16) while meeting the
required computational dynamic range. 5-bit and 6-bit sets are therefore required and are
constructed as follows:

\[ S_1 = \{31, 29, 23, 19, 7\}, \qquad S_2 = \{61, 59, 53, 11\} \tag{5.11} \]


Table 5.1. Primes in the Range 1-63

 1    3    5    7   11   13
17   19   23   29   31   37
41   43   47   53   59   61

In selecting the moduli sets the following are observed:
1. Start by choosing the largest modulus within the targeted number of bits to
reduce the number of residues as needed.
2. The last modulus can be chosen just large enough to meet the computational dynamic range requirement, M.
3. Keeping all prime moduli provides the flexibility of choosing any element
as a root for the PM.
4. Any modulus in the set must be greater than the largest indeterminate order
of the output polynomial.
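
The selection rules above can be checked against the moduli sets used in this section. The sketch below is a hypothetical helper, not part of the design flow: it verifies that each set is pairwise relatively prime, that its product exceeds the required dynamic range, and that the last (smallest) modulus is actually needed. The required ranges are the values derived under "Number Growth".

```python
from itertools import combinations
from math import gcd, prod

M_NQRR = 1966080     # required range for NQRRs
M_QRR = 11067392     # required range for QRRs

S1 = (31, 29, 23, 19, 7)
S2 = (61, 59, 53, 11)
S3 = (61, 59, 47, 46, 11)


def pairwise_coprime(moduli):
    """True if every pair of moduli is relatively prime."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def check(moduli, required):
    """Return (product, meets requirement, last modulus is needed)."""
    total = prod(moduli)
    without_last = prod(moduli[:-1])
    return total, total > required, without_last <= required


for name, moduli, required in (("S1", S1, M_NQRR), ("S2", S2, M_NQRR), ("S3", S3, M_QRR)):
    print(name, pairwise_coprime(moduli), check(moduli, required))
```

In each set the final small modulus only tops up the product, which matches rule 2 of the list.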

Moduli Set for QRRs
Table 5.2 shows the available moduli in the range 2-63 that support the solution of the
quadratic equation $x^2 - 3 = 0 \bmod m$. The first and second solutions are also shown in
the table; they constitute the roots and are used to form the polynomial generating the
ideal for the PM. Prime moduli are shown in bold. The moduli set must contain 6-bit
moduli to meet the dynamic range requirement of at least 11067392; hence, the proposed
set is as follows:

\[ S_3 = \{61, 59, 47, 46, 11\} \tag{5.12} \]

Note that 46 is not prime but it is relatively prime to all other moduli in the set. Also, since
only two roots are needed for the PM, it is easy to find roots in the ring $Z_{46}$ where the
difference is unity.


Table 5.2 Quadratic Residues Available in the Range 2-63

Modulus   First Root   Second Root
   2*          1
   6           3
  11*          5            6
  13*          4            9
  22           5           17
  23*          7           16
  26           9           17
  33           6           27
  37*         15           22
  39           9           30
  46           7           39
  47*         12           35
  59*         11           48
  61*          8           53

* prime modulus (set in bold in the original table)
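
The entries of Table 5.2 can be regenerated by a brute force search for the solutions of x^2 = 3 mod m. The sketch below is illustrative; the function name is hypothetical. The modulus 3 is filtered out explicitly because its only solution is the zero root, which the table does not list.

```python
def square_roots_of(value, modulus):
    """All x in [0, modulus) with x*x congruent to value modulo modulus."""
    return [x for x in range(modulus) if (x * x - value) % modulus == 0]


# Search the range 2-63, keeping only moduli with at least one nonzero root.
roots = {}
for m in range(2, 64):
    found = square_roots_of(3, m)
    if found and found != [0]:
        roots[m] = found

for m, found in roots.items():
    print(m, *found)
```

For a prime modulus the two roots always sum to the modulus, which is a quick sanity check on the tabulated pairs.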

5.1.4 Hardware Requirement
From Chapter 3, the best implementation for the polynomial mapping is the staged
approach. Utilizing the QRRs, the staged approach may be implemented directly, as
explained on page 50, since all the roots may be the same and all polynomial orders are
the same. Using the NQRRs, the staged approach can be implemented in the forward PM
since the input data coding does not require polynomial terms in the t indeterminate.
Table 5.3 lists design parameters and their values according to the cost functions
developed in Chapter 3 for the three moduli sets of Eqn. (5.11) and (5.12).


Table 5.3 Wavelet Transform Implementation Comparison

                             NQRRs                                      QRRs
Parameter             S1                     S2                   S3
Moduli Set            {31, 29, 23, 19, 7}    {61, 59, 53, 11}     {61, 59, 47, 46, 11}
Logic Height, H       10                     12                   12
N_IC                  8                      8                    8
N_OC                  56                     56                   16
N_FPE                 12                     12                   12
N_RPE                 560                    560                  128
N_TPE                 572                    572                  140
N_MPE                 2860                   2288                 700
K_A = 2^H x N_MPE     2,928,640              9,371,648            2,867,200

Examining the results shown in the above table, we may observe the following:
1. Choosing larger residues reduces the number needed and hence reduces the
number of modulo blocks, N_MPE, needed for the total mapping. However,
due to the increase in block complexity, the silicon area may actually
increase (see the area factor, K_A, for the S1 and S2 moduli sets).
2. The use of QRRs results in the smallest area factor since the number of output
polynomial coefficients, N_OC, is kept small due to polynomial order
folding. This is similar to what happens in the QRNS.
Despite the increase in the modulus, M, for the polynomial coefficients, the reduced
number of coefficients resulting from the use of QRRs has a greater impact on the overall
architecture silicon area. Chapter 3 discusses the architectural details for implementing the
staged approach in silicon, which is highly modular and compact. In the DSP processing
channels, we place the WT architecture shown above in each channel and use the
appropriate mapped filter coefficients and modulo blocks. During filtering with
Daubechies coefficients, there is no need to divide by 4 or multiply by 4 and this operation
can be embedded in the polynomial to binary conversion stage at no extra cost.
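
The last three rows of Table 5.3 can be recomputed from the rows above them. The sketch below does so under two assumptions that are inferred from the tabulated numbers rather than taken from the cost functions of Chapter 3: N_TPE is the sum of N_FPE and N_RPE, and N_MPE is N_TPE multiplied by the number of moduli. Only the relation K_A = 2^H x N_MPE is stated in the table itself.

```python
# Inputs copied from Table 5.3; the dictionary layout is illustrative only.
DESIGNS = {
    "S1": {"moduli": (31, 29, 23, 19, 7), "height": 10, "n_fpe": 12, "n_rpe": 560},
    "S2": {"moduli": (61, 59, 53, 11), "height": 12, "n_fpe": 12, "n_rpe": 560},
    "S3": {"moduli": (61, 59, 47, 46, 11), "height": 12, "n_fpe": 12, "n_rpe": 128},
}


def area_factor(moduli, height, n_fpe, n_rpe):
    """Return (N_TPE, N_MPE, K_A) for one moduli set."""
    n_tpe = n_fpe + n_rpe              # assumed: total = forward + reverse elements
    n_mpe = n_tpe * len(moduli)        # assumed: one copy per modulus
    k_a = (2 ** height) * n_mpe        # stated in Table 5.3
    return n_tpe, n_mpe, k_a


results = {name: area_factor(**d) for name, d in DESIGNS.items()}
for name, (n_tpe, n_mpe, k_a) in results.items():
    print(f"{name}: N_TPE={n_tpe} N_MPE={n_mpe} K_A={k_a:,}")
```

The comparison between S1 and S2 shows the effect described in observation 1: fewer blocks, but each block costs 2^12 instead of 2^10.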


Table 5.4 shows the pull-down delay of 12-bit switching tree logic intended to be used as
modulo blocks for building the WT architecture in silicon. The pull-down delay is
obtained from "Sizing Results and Discussions" on page 83 and given for minimum,
equal, and optimum nFET sizes for the target 1.2µ process [96]. The estimated delay is
calculated as follows:
1. Adding 1 ns to the pull-down delay to account for the rise/fall time of the
clock and the delay needed to generate the input inversion.
2. Adding 1 ns to the pull-down delay to account for the evaluation node
inversion and the time needed to latch the output.
3. Doubling the result assuming 50% duty cycle on the master clock.
4. Adding 20% to account for process variation. Note that the above pull-down
delays are for nominal process parameters.
Table 5.4 Throughput Rates

Parameter                 1.2µ Minimum Sizes   2.4µ Equal Sizes   2.4µ Optimum Sizes
Pull Down Delay [ns]            4.90                 3.40               3.15
Estimated Period [ns]          16.56                12.96              12.36
Throughput Rate [MHz]          60.38                77.16              80.90

It is clear that using minimum transistor sizes will not meet the throughput requirement of
the WT video compression algorithm. We naturally suggest using an optimum sizing
technique, such as the one described in Chapter 4, and set the modulo block delay to just
meet the throughput requirement of the WT; this will provide a more area efficient silicon
design.
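
The four estimation steps listed before Table 5.4 amount to a one-line formula. The Python sketch below applies it to the tabulated pull-down delays and compares the result with the HDTV sample rate of section 5.1.2. Function and parameter names are illustrative, not taken from the sizing software.

```python
def estimated_period_ns(pull_down_ns, clock_ns=1.0, latch_ns=1.0,
                        duty_factor=2.0, process_margin=0.20):
    """Steps 1-4: add clock and latch overheads, double, then add 20%."""
    return (pull_down_ns + clock_ns + latch_ns) * duty_factor * (1.0 + process_margin)


def throughput_mhz(period_ns):
    """Clock rate in MHz for a period given in nanoseconds."""
    return 1000.0 / period_ns


# Requirement from section 5.1.2: pixels per frame x frames/s x primary colors.
HDTV_RATE_MHZ = 1024 * 768 * 30 * 3 / 1e6

PULL_DOWN_NS = {"minimum": 4.90, "equal": 3.40, "optimum": 3.15}

for sizing, delay in PULL_DOWN_NS.items():
    period = estimated_period_ns(delay)
    rate = throughput_mhz(period)
    print(f"{sizing}: period {period:.2f} ns, rate {rate:.2f} MHz, "
          f"meets HDTV: {rate >= HDTV_RATE_MHZ}")
```

Note that the tabulated throughput figures appear to be truncated rather than rounded to two decimals, so a recomputed value may differ in the last digit.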

To further reduce the silicon area, we may use the bit slicing technique proposed in [54]
and explored, at the circuit level, in Chapter 4. This technique is based on the use of
BIPSPm cells that implement modulo m operations (addition or fixed multiplication and
accumulation) by the use of B bit steered ROMs of B input bits. This replaces the use of
one 2B input ROM. The area savings is evident since the ROM area exhibits exponential
growth with the number of input bits. In fact, it is reported that as much as 66% saving in
area based on transistor count can be achieved by this technique [32]. In section 4.5 on
page 85 we have shown sized circuit details contained in the BIPSP cell for MOD 7 fixed
multiplication and accumulation (example construction) along with simulation results.
Circuit performance enhancement due to nFET sizing of the BIPSP cell enables the PRE
architecture to implement compute intensive DSP algorithms with the more stringent
requirements needed for the WT implementation. The added advantage of using the
BIPSP cell technique is its ability to incorporate parity data check for detecting errors that
can be used to build fault tolerant architectures. It is intriguing to consider this
combination of PRE architectures and fault tolerance and, indeed, the remainder of this
chapter is dedicated to this topic.

5.2 Fault Tolerance

Redundant Residue Number Systems, RRNS, have been proposed as suitable candidates
for fault tolerance in compute intensive applications. The redundancy is based on
computing multiple projections to moduli sub-sets and conducting a search for results that
lie in a so-called illegitimate range [6]. This section presents RRNS fault tolerant
procedures for the polynomial ring mapping procedure. The PRE mapping technique
dispenses with the need for many relatively prime ring moduli, which is a major drawback with conventional RRNS systems. Double, triple, and quadruple modular
redundancy can be implemented in the polynomial mapping structure, polynomial
coefficient circuitry, or the independent direct product ring computational channels, for
error detection and/or correction. In this chapter we limit our discussion to new
implementations of redundant rings which are generated by (1) redundant residues, (2)
spare general computational channels, or (3) a combination of the two. The first
architecture is suitable for RNS embedding in the PRE, and the second for single modulus
mappings. This combination of architectures allows a trade-off between the two extremes.
The application area is in fault tolerant compute intensive DSP arrays.


The approach taken by traditional RRNS fault tolerant systems is to effectively generate
overlapping sets of moduli, each set capable of handling the computational dynamic range
requirements. The overlapping sets are produced by increasing a base set of moduli,
$(m_1 \ldots m_L)$, by a number of redundant moduli, $(m_{L+1} \ldots m_{L+R})$, such that
$m_1 < m_2 < \cdots < m_{L+R}$ and $\prod_{i \in (\mu_1 \ldots \mu_L)} m_i \ge M$, where M is the required computational
dynamic range, and the $\{\mu_i\}$ are sets of L modulus indices where the total number of
such sets is $\frac{(L+R)!}{L!\,R!}$. The results of computations, over each of the sets of moduli, are
referred to as projections. Clearly if each of the projections yields identical results, then no
fault has occurred. If some of the projections differ, then there are well established
theorems to a) locate the channel in error and b) determine which is the correct projection
(result). For example, defining a legitimate range of $\prod_{i=1}^{L} m_i$, it can be shown that a
channel error will produce a result over the extended moduli set, $(m_1 \ldots m_{L+R})$, outside of
this legitimate range. Theorem 5.1 [79] establishes the detection/correction properties of
such a system.

Theorem 5.1   The redundant residue number system, as described above, will
detect R errors and correct $\lfloor R/2 \rfloor$ errors.

Proof 5.1   See reference [79]. □
A variety of channel correction schemes are available based on this concept. The major
drawback to such schemes lies in the need to increase sets of relatively prime moduli,
which do not lend themselves to exact hardware replication based on the variety of ring
moduli chosen, and so increase design (and any replacement hardware) overhead.
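
The legitimate range test and the use of projections can be illustrated with a toy RRNS. The sketch below is not the scheme of [79]; it is a minimal Python illustration with hypothetical moduli (base set 3, 5, 7 and redundant set 11, 13, so L = 3 and R = 2), showing single error detection and correction by dropping one channel at a time. `pow(x, -1, m)` requires Python 3.8 or later.

```python
from math import prod

BASE = (3, 5, 7)            # hypothetical base moduli
REDUNDANT = (11, 13)        # hypothetical redundant moduli
MODULI = BASE + REDUNDANT
LEGIT = prod(BASE)          # legitimate range: product of the base moduli


def to_residues(value, moduli=MODULI):
    """Residue representation of value over the given moduli."""
    return [value % m for m in moduli]


def crt(residues, moduli):
    """Chinese remainder reconstruction; the result lies in [0, prod(moduli))."""
    total = prod(moduli)
    result = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        result += r * partial * pow(partial, -1, m)
    return result % total


def has_fault(residues):
    """A result outside the legitimate range signals a channel error."""
    return crt(residues, MODULI) >= LEGIT


def correct_single_fault(residues):
    """Return (value, faulty channel index or None) using reduced projections."""
    if not has_fault(residues):
        return crt(residues, MODULI), None
    for skip in range(len(MODULI)):
        kept_r = [r for i, r in enumerate(residues) if i != skip]
        kept_m = [m for i, m in enumerate(MODULI) if i != skip]
        value = crt(kept_r, kept_m)
        if value < LEGIT:               # only the projection without the fault is legitimate
            return value, skip
    raise ValueError("more than one faulty channel")


word = to_residues(52)
word[1] = (word[1] + 3) % MODULI[1]     # inject an error into channel 1
print(has_fault(word), correct_single_fault(word))
```

With two redundant moduli this toy system corrects one faulty channel, in line with the floor(R/2) bound of Theorem 5.1.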


The underlying theme of multiple projections can also be extended to the PRE [18]. Since
the mapping allows large dynamic range computations to be performed over many
replications of the same modulus, we may capitalize on this feature.

5.3 PRE Fault Tolerant Architectures

Redundant rings can be utilized to construct fault tolerant architectures for the PRE
mapping procedure. In general, the number of faults that can be detected and/or corrected
is proportional to the number of redundant rings formed. There are two distinct
approaches investigated in this section that produce redundant rings for fault tolerance.
The first approach is based on redundant residues [80] which accompany a limited RNS
embedding within the PRE mapping. The second approach is based on using a spare
general computational ring, and is most useful when a single replicated modulus is used. A
combination of these two approaches is possible where a wide variety of faults can be
detected and recovered without interruption to the data flow [85]. The following
subsections discuss each of the architectures in some detail.

5.3.1 Architecture I
In this first architecture, redundant residues are used to detect and correct faulty residues.
This architecture is useful where there is a certain amount of limited RNS embedding (i.e.
where it is inefficient to use a single modulus PRE). The polynomial mapping is
performed within the residue channel and thus any faults occurring in the mapping will be
detected and corrected. A considerable advantage of the RRNS is its ability to recover
faults with the redundancy depending only on the number of redundant residues. The
overhead, by definition, is the ratio of redundant residues to non-redundant residues; this
differs from, for example, triple modular redundancy [81], which always possesses a
200% overhead. In light of this, the efficiency of our new architecture will depend on the
number of factors of the polynomial ring modulus that are used for the RNS embedding.

Figure 5.3 shows the general structure of this architecture.


Figure 5.3 Architecture I
[Block diagram, top to bottom: Binary Input Data A; Binary to Polynomial; Polynomial Coefficients: Binary to Residue (including the redundant residues); PM blocks; computational channels; RPM blocks; Base Extension and Residue to Binary; Syndromes; Correction ROMs; Polynomial Coefficients; Polynomial to Binary; Binary Output Data B. Legend: PM = Polynomial Mapping, RPM = Reverse Polynomial Mapping, plus symbols for the binary subtracter and binary adder.]

The first block at the top (binary to polynomial), converts the data from binary
representation to polynomial representation where the equality given by Eqn. (5.13) must
hold.


\[
\sum_{k=0}^{O_i} C_k X^k = \sum_{i=0}^{N-1} a[i]\, 2^i
\tag{5.13}
\]

Here, $C_k$ is the polynomial coefficient, X is the polynomial indeterminate, $O_i$ is the input
polynomial order, a[i] is the ith bit of the input sample, A, and N is the number of bits in
the input sample. It is possible to align the polynomial representation with the binary
representation so that the conversion process reduces to simple wiring (e.g. X = 2 and
$C_k$ = 0 or 1) [20]. The second block of Figure 5.3 performs binary to residue conversion
on the polynomial coefficients, as given by Eqn. (5.14).

\[ C_k \bmod m_i, \qquad \forall k, i \tag{5.14} \]

If we choose a moduli set such that the input polynomial coefficients are always smaller
than the least modulus, this block also reduces to simple wiring. The blocks labeled PM
perform the polynomial map within the residue channel. They utilize the isomorphism that
exists between the quotient ring and the direct product rings given by Eqn. (5.15).

\[
Z_{m_i}[X] / (g_i(X)) \cong \prod_j Z_{m_i}
\tag{5.15}
\]

$Z_{m_i}[X]$ is the mapped polynomial, $(g_i(X))$ is the ideal whose degree must be greater than
the degree of the output polynomial, and $\prod_j Z_{m_i}$ is the direct product ring (j copies).
The polynomial map is an evaluation map where the polynomials are evaluated at the roots
of the ideal polynomial. This completes the forward mapping stage and the DSP
algorithmic computations take place here over small finite rings. The reverse mapping
stage starts with the reverse polynomial mapping block, RPM, which simply reverses the
polynomial mapping using the isomorphism of Eqn. (5.15). We now employ mixed radix
conversion, MRC, given by Eqn. (5.16) (residue to mixed radix) and Eqn. (5.17) (mixed
radix to binary), with base extension performed on the polynomial coefficients to produce
binary representations [82].

(5.16)

L-1

ck=

I
}=0

N-1

21

I

di,iPi

(5.17)

i=O

di, i is the j th bit of the i th mixed radix digit for the k th polynomial coefficient, p 0

1,

i- 1

and pi =

IT m

1•

The base extension part will provide the necessary polynomial

I= 0

coefficients modulo the redundant residues, which will be compared to the actual results
generated by the redundant residue channels to obtain the syndromes. A set of ROMs store
all possible correction values to the output polynomial coefficients based on the syndrome
values as address inputs. The contents of these ROMs can be found by simulating all
possible errors and calculating correction values to maintain valid outputs (with two
redundant residues, each possible single error has a unique set of syndromes). Figure 5.4
shows the arrangement of the correction ROMs. The number and size of these ROMs are
dictated by the number of redundant moduli, R, the number of output polynomial
coefficients, the number of bits of the moduli set, and the number of bits of the
computational dynamic range. For the system described in section 5.4, for instance, each
ROM shown in Figure 5.4 will have a 10 bit address input and a maximum of 23 bits per
output word.
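
The mixed radix conversion, base extension, and syndrome steps can be illustrated with a short numerical sketch. It is an illustration only, not the hardware datapath: the function names are hypothetical, the digit-level routine is the standard sequential mixed radix conversion (an assumption, since the form of Eqn. (5.16) is not reproduced here), and the moduli are the example set quoted in section 5.4.

```python
from math import prod

BASE = (17, 19, 23, 25, 27)   # example non-redundant moduli (section 5.4)
REDUNDANT = (29, 31)          # example redundant moduli (section 5.4)


def to_mixed_radix(residues, moduli):
    """Residues -> mixed radix digits a_i, with x = sum(a_i * p_i)."""
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = r[i] % m_i
        digits.append(a)
        # Subtract the digit and divide by m_i in every remaining channel.
        for k in range(i + 1, len(moduli)):
            r[k] = (r[k] - a) * pow(m_i, -1, moduli[k]) % moduli[k]
    return digits


def mixed_radix_to_binary(digits, moduli, bits=5):
    """Bit-level weighted sum in the form of Eqn. (5.17)."""
    weights = [prod(moduli[:i]) for i in range(len(moduli))]  # p_0 = 1
    return sum(
        (1 << j) * sum(((d >> j) & 1) * p for d, p in zip(digits, weights))
        for j in range(bits)
    )


def base_extend(digits, moduli, new_modulus):
    """Evaluate the mixed radix form modulo a redundant modulus."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value = (value + a * weight) % new_modulus
        weight = weight * m % new_modulus
    return value


def syndromes(base_residues, redundant_residues):
    """Base-extended values minus the redundant channel outputs."""
    digits = to_mixed_radix(base_residues, BASE)
    return tuple(
        (base_extend(digits, BASE, m) - r) % m
        for r, m in zip(redundant_residues, REDUNDANT)
    )


coefficient = 1234567                       # arbitrary value below prod(BASE)
base = [coefficient % m for m in BASE]
red = [coefficient % m for m in REDUNDANT]
digits = to_mixed_radix(base, BASE)
```

With error-free residues, `mixed_radix_to_binary(digits, BASE)` returns `coefficient` and `syndromes(base, red)` is all zero; corrupting any one entry of `base` makes both syndromes nonzero, and the resulting syndrome pair is what would address the correction ROM.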

Since ROM size grows exponentially with the number of address inputs, this arrangement
is a reasonable hardware solution for the correction stage: the input address width is
based on the dynamic range of each residue channel, while the output word width is based
on the complete computational dynamic range. The ROMs will be built using Switching
Trees with the transistor sizing profile performed as discussed in Chapter 4.

Figure 5.4 Correction ROMs
[Figure: the syndromes $S_{K,R}, S_{K,1}, \ldots, S_{1,R}, S_{1,1}, S_{0,R}, S_{0,1}$ address the correction ROMs, which output the errors.]

The last block converts the data from polynomial representation to binary representation
by simply adding the polynomial coefficients with the proper weights, based on the
polynomial indeterminate given by Eqn. (5.18).

B

(5.18)

Here $O_O$ is the order of the output polynomial. If the indeterminate is a multiple of 2 then
binary adders can be used to obtain the final binary output.
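
As a small illustration of this last block, the sketch below forms the weighted sum of the output coefficients. It assumes that Eqn. (5.18) has the form $B = \sum_{k=0}^{O_O} c_k x^k$, where $x$ stands for the value of the indeterminate; the symbol $x$, the function names, and the sample numbers are hypothetical. The second function covers the special case in which the indeterminate is a power of two, so that each weight reduces to a bit shift followed by binary addition.

```python
def polynomial_to_binary(coeffs, x):
    """Horner evaluation of sum(coeffs[k] * x**k)."""
    total = 0
    for c in reversed(coeffs):
        total = total * x + c
    return total


def polynomial_to_binary_shift(coeffs, log2_x):
    """Same sum when x = 2**log2_x: every weight is a left shift."""
    return sum(c << (k * log2_x) for k, c in enumerate(coeffs))


coeffs = [3, 1, 2]   # c_0, c_1, c_2 (hypothetical coefficients)
```

For `x = 16` both routines give the same value, which is the point of choosing such an indeterminate: no multipliers are needed in the final stage.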

This architecture detects/corrects errors as follows: if an error occurs within a residue
channel, the polynomial coefficients generated by the base extension will not match those
output from the redundant residues, and hence non zero syndromes will result. Based on
the value of the syndromes, the ROM looks up stored correction values. The polynomial
coefficients are then corrected by adding these correction values, with no interruption to
the data flow. There are other techniques that utilize redundant residues for fault tolerance
[84], but the one adopted in this paper seems to be the most efficient to detect and correct
one faulty residue. It is possible to correct more than one faulty residue with more than
two redundant residues; however, in this chapter, we restrict ourselves to two redundant
residues.
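
The way the ROM contents can be obtained by simulating all possible single errors can be sketched as follows. This is a toy model under stated assumptions, not the design of section 5.4: it uses the small hypothetical moduli {3, 5, 7} with redundant moduli {11, 13}, treats one integer rather than a set of polynomial coefficients, defines the correction modulo the dynamic range, and uses a dictionary in place of a ROM. The assertion inside the table builder checks that no syndrome pair ever requires two different corrections for this moduli set.

```python
from math import prod

BASE = (3, 5, 7)         # hypothetical non-redundant moduli
REDUNDANT = (11, 13)     # hypothetical redundant moduli
M = prod(BASE)           # legitimate dynamic range


def crt(residues, moduli):
    """Chinese remainder reconstruction over pairwise coprime moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    return x % total


def syndromes(residues):
    """Base-extend the non-redundant part and compare with the redundant part."""
    n = len(BASE)
    x = crt(residues[:n], BASE)
    return x, tuple((x - r) % m for r, m in zip(residues[n:], REDUNDANT))


def build_correction_rom():
    """Simulate every single-residue error and record the needed correction."""
    moduli = BASE + REDUNDANT
    rom = {}
    for value in range(M):
        clean = [value % m for m in moduli]
        for pos, m in enumerate(moduli):
            for err in range(1, m):
                bad = list(clean)
                bad[pos] = (bad[pos] + err) % m
                x_bad, syn = syndromes(bad)
                fix = (value - x_bad) % M
                # setdefault returns the stored entry when the key already exists.
                assert rom.setdefault(syn, fix) == fix, "ambiguous syndrome"
    return rom


ROM = build_correction_rom()


def correct(residues):
    """Add the looked-up correction whenever the syndromes are nonzero."""
    x, syn = syndromes(residues)
    if not any(syn):
        return x
    return (x + ROM[syn]) % M
```

An error in a redundant residue yields exactly one nonzero syndrome and a zero correction, whereas an error in a non-redundant residue makes both syndromes nonzero. Whether a given moduli set passes the uniqueness assertion depends on the relative sizes of the redundant and non-redundant moduli, so the check should be rerun for any other set.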

5.3.2 Architecture II
The architecture, shown in Figure 5.5, is based on building a spare general computational
channel that can replace any given faulty computational channel.

Figure 5.5 Architecture II
[Figure: block diagram with the blocks Binary Input Data, Binary to Polynomial, Polynomial Mapping, Computational Channel, General Computational Channel, Transmission Gate, Polynomial to Binary, and Binary Output Data.]

This is most useful for a single modulus MRRNS mapping. If the non-redundant channels
use general computations (i.e. no fixed coefficient multipliers), then the spare channel is
identical to any of the non-redundant channels. If the non-redundant channels take
advantage of fixed coefficient multiplications, then the spare channel will be larger to
accommodate general multiplication. The spare channel is activated using a set of
transmission gates and uses a ROM to extract the appropriate multiplicative and additive
coefficients that correspond to the faulty channel.
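
The substitution of a faulty channel by the spare can be sketched behaviourally as below. All names, the modulus, the coefficient values, and the stuck-at-zero fault model are hypothetical; each channel is reduced to a single multiply-add so that only the routing idea (coefficient ROM lookup plus selection of the spare output) is shown.

```python
MODULUS = 7  # hypothetical single modulus shared by all channels in the zone

# Hypothetical (multiplicative, additive) coefficients, one pair per channel.
COEFF_ROM = [(3, 1), (5, 0), (2, 4), (6, 2), (4, 3)]


def fixed_channel(index, x):
    """Channel with hard-wired coefficients."""
    a, b = COEFF_ROM[index]
    return (a * x + b) % MODULUS


def general_channel(a, b, x):
    """Spare channel: coefficients are inputs rather than hard-wired."""
    return (a * x + b) % MODULUS


def zone_outputs(inputs, broken=frozenset(), use_spare=True):
    """Outputs of one zone; `broken` holds indices of channels with hard faults."""
    if use_spare and len(broken) > 1:
        raise ValueError("one spare can replace only one channel per zone")
    outputs = []
    for i, x in enumerate(inputs):
        if i in broken and use_spare:
            a, b = COEFF_ROM[i]                 # ROM addressed by the faulty channel
            outputs.append(general_channel(a, b, x))
        elif i in broken:
            outputs.append(0)                   # stuck-at-zero fault model
        else:
            outputs.append(fixed_channel(i, x))
    return outputs
```

With `broken={2}` the zone output equals the fault-free output when the spare is enabled and differs from it when the spare is disabled.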

The error detection scheme is implemented in each channel, at the circuit level, by
utilizing a parity bit [54]. Figure 5.6 shows the general structure of a bit slice
computational cell used to construct the DSP algorithm in the computational channels
[32].

Figure 5.6 Bit slice cell with fault detection [32]
[Figure: bit slice cell with signals Fault In, $Y^{(i-1)}$, $Y^{(i)}$, $P_{con}^{(i-1)}$, Steer Bit, $X^{(i-1)}$, and $X^{(i)}$.]

For clarity, Figure 5.6 shows only four bits for inputs X and Y. The cell uses two parity bits,
a content parity check, $P_{con}$, and an address parity check, $P_{add}$, to check the parity of the
look-up cell (or steered) output. Any discrepancy in the two parity bits will result in a fault
flag being generated and travelling with the erroneous data. The pipeline latches are not
included in the diagram; they may be placed at the input or the output data lines.

The values of the parity bits are obtained from Eqn. (5.19) and Eqn. (5.20).

$$P_{con}^{(i)} = \bigoplus_{j=0}^{N-1} y^{[j](i)} \qquad (5.19)$$

$$P_{add}^{(i)} = \bigoplus_{j=0}^{N-1} y^{[j](i-1)} \qquad (5.20)$$

N is the number of bits of the input Y. A fault is held and propagated through the fault flag
bit by comparing the parity check bits of each processing cell and ORing the result with
the flag, as in Eqn. (5.21).

$$FaultOut = FaultIn \vee \{ P_{add}^{(i)} \oplus P_{con}^{(i-1)} \} \qquad (5.21)$$

If the fault flag is true at the end of the computational channel, then a fault has occurred in
at least one of the processing cells. The flag is then used to switch the transmission gates
such that the data of the faulty channel is directed to the spare channel so the output of the
spare channel replaces the output of the faulty stage.
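
The parity comparison and flag propagation can be sketched in a few lines. This is a behavioural model under stated assumptions: each cell is modelled as a look-up table that stores its content together with the content parity, the cells perform hypothetical fixed coefficient multiplications modulo 31, and a fault is injected as a single bit flip on the data lines between two cells. None of these names or values come from the original design.

```python
NBITS = 5      # bits per residue word (section 5.4 assumes 5-bit moduli)
MODULUS = 31   # hypothetical 5-bit modulus


def parity(word):
    """Modulo-2 sum of the NBITS bits of word, as in Eqns. (5.19) and (5.20)."""
    return bin(word & ((1 << NBITS) - 1)).count("1") & 1


def make_cell(coefficient):
    """Look-up table: address -> (content, content parity P_con)."""
    table = []
    for address in range(1 << NBITS):
        content = address * coefficient % MODULUS
        table.append((content, parity(content)))
    return table


def run_channel(y, cells, flip_before=None):
    """Pass y through the cells; optionally flip one bit before cell flip_before."""
    p_con = parity(y)      # parity accompanying the input word
    fault = False          # FaultIn of the first cell
    for i, cell in enumerate(cells, start=1):
        if flip_before == i:
            y ^= 1                           # single-bit error on the data lines
        p_add = parity(y)                    # address parity seen by cell i
        fault = fault or (p_add != p_con)    # Eqn. (5.21)
        y, p_con = cell[y]
    return y, fault


cells = [make_cell(c) for c in (3, 5, 7)]
```

A fault-free pass leaves the flag low, while a single flipped bit raises the flag at the next cell and keeps it raised to the end of the channel. A single parity bit only catches an odd number of flipped bits, which is a limitation of this simplified model.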

The overhead associated with the fault detection circuitry depends on the number of bits
required to represent the channel residues. For a 5-bit residue, the overhead, for a
conventional 2-phase dynamic ROM circuit, is about 25% [32].

Note that we can assign several spare channels to replace distributed faulty channels. One
of the drawbacks of this architecture is that the polynomial mapping stage has to be error
free. This problem is alleviated by Architecture III.

5.3.3 Architecture III
This architecture combines both techniques and is shown, at a reduced detail level, in
Figure 5.7.

Figure 5.7 Architecture III
[Figure: block diagram with the blocks Input, Binary to Polynomial, Polynomial Coefficients: Binary to Residue (including Redundant residues), PM, Spared Computational Channels, RPM, Polynomial Coefficients: Base Extension and Residue to Binary, Syndromes, Correction, Polynomial to Binary, and Output.]

Redundant residues are utilized to detect and correct faulty residues while spare general
computational channels are distributed to replace faulty channels within the polynomial
mapping. This architecture is of importance because (1) its implementation complexity
and hardware overhead are approximately the sum of those given by Architectures I & II,
and (2) its fault tolerance is greater than the sum of that offered by
Architectures I & II. This added fault tolerance is based on the fact that, in practice, hard
faults are usually separated both in the spatial domain and the temporal domain.

Let us assume that one of the computational channels fails due to a hard fault. The general
computational channel will replace the faulty channel and the faulty channel is ignored.
Once the replacement channel has filled its pipeline, the architecture has recovered its
level of fault tolerance, and the output data will only depend on the non-redundant
residues. Multiple hard faults can be corrected successfully without interruption to the
data flow if they are separated by a full computational channel pipeline temporal distance
and a sufficient spatial distance such that they lie in different general computational
channel zones.

5.4 Results and Comparisons

It is a cumbersome task, if not an impossible one, to produce a fair comparison among
architectures with different trade-offs. Nevertheless, we have attempted below to put forward a
comparison based on practical worst case assumptions. Table 5.5 summarizes the silicon
area overhead, the fault tolerant aspects, and the VLSI implementation aspects of the three
architectures. The table shows that Architecture III is the most immune to hard and soft
faults that are spaced spatially or temporally with a modest silicon area overhead
(compared to triple modular redundancy).

The following is a list of the assumptions made to produce Table 5.5.

Assumptions for architecture I & III:

1. 5-bit moduli set.
2. Five non redundant residues (e.g. 17, 19, 23, 25, and 27) and two redundant
residues (e.g. 29 and 31).
3. The residue channel is very large such that the silicon area of the architecture is proportional to the number of residues.

Assumptions for architecture II & III:

1. One general computational channel for every five channels.
2. Five data bits per word (per residue digit).
3. Two redundant bits per word within the computational channel for error
detection.
4. The computational channels are assumed to be very large compared to the
transmission gates and control interconnections.
5. The general computational channel is 2.5 times larger than the computational channels if a fixed coefficient multiplication, FCM, is employed (e.g.
fixed impulse response FIR filters) and is identical to the computational
channels if a general multiplication, GM, is employed (e.g. general convolver). The channel silicon area is proportional to word size.

Table 5.5 Comparisons among the fault tolerant architectures

Architecture I
  Silicon area overhead: 40% (% bits utilized = 70%).
  Fault tolerant aspects: Can detect and correct any single residue without interruption to the data flow. Can detect and correct soft errors, provided there is not more than one faulty residue per pipeline period, without interruption to the data flow.
  VLSI implementation aspects: The expected VLSI implementation advantages of using the MRRNS are preserved, such as cell replication and local interconnections.

Architecture II
  Silicon area overhead: 90% if employing FCM; 60% if employing GM.
  Fault tolerant aspects: Can detect and correct one faulty channel in every five channel zones. A full computational channel of pipeline data is lost during fault correction.
  VLSI implementation aspects: If GM is employed, the computational and the spare channels within the residue channel are completely identical and VLSI replication is possible. If FCM is employed, replication of the computational channels within the residue channel is still possible with minor adjustments. The spare channels are substantially different.

Architecture III
  Silicon area overhead: 130% if employing FCM; 100% if employing GM.
  Fault tolerant aspects: Can detect and correct faulty computational channels without interruption to the data flow. Assumptions: the faulty channels are more than five channels apart spatially within the same residue channel and the faulty channels are separated by more than a full computational channel pipeline data temporally. If the entire residue channel is in error then this architecture has similar properties to Architecture II. If all the general computational channels are utilized then this architecture has similar properties to Architecture I.
  VLSI implementation aspects: All VLSI advantages of the above two architectures are imported to this architecture.
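
The area overhead figures in Table 5.5 can be reproduced from the listed assumptions with a few lines of arithmetic. The text does not spell out how the individual contributions combine; the additive reading used below is an assumption, chosen because it matches the tabulated values. All variable names are hypothetical.

```python
from fractions import Fraction

REDUNDANT_RESIDUES, NON_REDUNDANT_RESIDUES = 2, 5
CHECK_BITS, DATA_BITS = 2, 5
CHANNELS_PER_SPARE = 5
# Spare channel area relative to one ordinary computational channel.
SPARE_RELATIVE_AREA = {"FCM": Fraction(5, 2), "GM": Fraction(1)}


def percent(x):
    """Convert an exact fraction to a whole-number percentage."""
    return int(x * 100)


# Architecture I: area proportional to the number of residues.
arch_1 = Fraction(REDUNDANT_RESIDUES, NON_REDUNDANT_RESIDUES)

# Architecture II: extra check bits per word plus one spare per zone.
arch_2 = {
    kind: Fraction(CHECK_BITS, DATA_BITS) + size / CHANNELS_PER_SPARE
    for kind, size in SPARE_RELATIVE_AREA.items()
}

# Architecture III: approximately the sum of Architectures I and II.
arch_3 = {kind: arch_1 + value for kind, value in arch_2.items()}
```

Under this reading, Architecture I comes to 40%, Architecture II to 90% (FCM) and 60% (GM), and Architecture III to 130% (FCM) and 100% (GM).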

5.5 Summary

To demonstrate the use of PRE architectures for implementing DSP applications, we have
considered the implementation of a wavelet transform algorithm targeted for digital
HDTV video data compression. System requirements such as speed and dynamic range
are identified first then used to define mapping parameters such as number of
indeterminates, input and output polynomial orders, set of moduli, and mapping approach.
The implementation also shows the use of an indeterminate as a 'place holder' for the
irrational number, $\sqrt{3}$, without the need for approximation during the DSP computations.

Quadratic residue rings generate more compact and modular mapping architectures than
non-quadratic rings, which parallels the results found in using the QRNS to process
complex numbers. Circuit techniques are used to define the basic building blocks for the
architectures and to estimate area and throughput rate. Transistor sizing is used to meet the
speed requirement for some specific larger set of moduli. A bit slice technique can be used
to considerably reduce the area requirement and increase the throughput rate of the
resultant architecture.

In the second part of this chapter we have discussed three architectures for applying fault
tolerance to the Polynomial Ring Engine. We usually embed residue number system
coding within the polynomial ring structure, and thus fault tolerant techniques developed
for RNS can be easily adopted for the new mapping. Redundant residues are implemented
to detect and correct faulty residues with no interruption to the data flow. The regularity of
the computational channels, generated by the polynomial mapping (operating on the same
modulus), enables the use of a single spare general channel that can replace any faulty
channel that lies within its spatial zone. Different fault tolerant techniques can be
combined in the same architecture to detect and correct various types of faults. The
techniques can compensate for each other and can add an extra dimension to the fault
tolerant ability of the structure.

Chapter 6
Conclusions and Future Work

6.1 Conclusions

The main objective of this dissertation is to implement
the Polynomial Ring Engine, using small finite rings, in silicon.
Besides developing pipelined architectures to achieve this goal,
the dissertation also includes various techniques and approaches to
enable transistor level realization of these architectures. The
following is a list of the major contributions of this dissertation:

1. Review: The dissertation contains a detailed mathematical review of the newly
developed polynomial ring mapping technique along with a comprehensive
review of traditional RNS techniques. Other techniques and concepts that rely on the
polynomial mapping isomorphism/homomorphism are outlined
and compared, such as QRNS and PRNS.
2. PNS: The formalism of the polynomial number system representation is introduced;
this uses polynomials to represent integers over finite rings.
Polynomial mapping therefore can be applied to
produce parallel computational channels independent of the classical RNS.

3. Pipelined Structures: Chapter 3 contains detailed implementations of the
polynomial mapping where several approaches have been developed.
These approaches are systematic and produce pipelined structures constructed using basic generic processing elements which makes them ideal
for circuit synthesis tools and for VLSI implementation. Cost functions for
each approach are provided for silicon requirement per mapping stage
based on the number of processing elements.
4. PRE Architecture: A new detailed construction of the general PRE architecture is developed using the polynomial and residue mappings. These
mappings are arranged in an order that reduces silicon area by allowing the
possibility of merging and removing some processing stages.
5. Transistor Sizing: A target implementation technique for the PRE is
Switching Tree blocks. A new, and essential, contribution to this technique
is an iterative transistor sizing profiler for minimizing switching delay in
the trees. A software package is developed to size general complex nFET
circuits which allows the designer to control area/delay trade-offs. The
iterative algorithm employed in the software utilizes an analytical sizing
technique, recently developed by the author, to provide a very fast overall
design cycle.
6. BIPSP Cell Implementation: An example silicon layout of a basic PRE building
block has been developed. The cell implements the most complex logic
block in a {3,5,7} PRE, namely a MOD 7 fixed coefficient multiplier. The
logic block uses three 3-input switching trees.
7. An Illustrative Example: A Wavelet Transform for HDTV image compression has been implemented as an example of using the PRE architecture. The polynomial mapping exploits the integer properties of the
Daubechies coefficients to reduce the computational error associated with
representing irrational numbers. The use of quadratic residue rings reduces
the polynomial order growth and produces a more efficient design despite
the increased computational dynamic range requirement needed for the
polynomial coefficients.

8. Fault Tolerance: New multi-level fault tolerant architectures have been
developed for the PRE mappings. The architectures combine both system
and circuit level fault detection and correction to provide superior fault
coverage with no data interruption.

6.2 Future Work

Research that does not stimulate inquiring minds is like a garden that does not blossom.
No matter how extensive it is, research is never complete. Certainly, there are many more
avenues to pursue and there are greater depths to plumb than those explored in this
research work. We have presented, discussed, and manipulated techniques and approaches
that probably require many more years to fully explore. Here we provide a list of ideas and
topics to carry the results obtained in this work to the next milestones.

1. Comparison: Presenting a comprehensive and fair comparison between
the many binary systems and PRE implementation for high performance
DSP algorithms. This would be a difficult task since binary systems, being
ubiquitous, enjoy having extensive design and synthesis tools available that
are highly optimized for silicon implementation.
2. PRE design tools: Developing a set of tools that assist the designer in optimizing the PRE architecture based on the specific DSP algorithm under
consideration.
3. Silicon Synthesis: Developing automatic silicon synthesis tools for the
PRE architectures through the use of the systematic approaches presented
in Chapter 3.
4. PRE Fault Tolerance: Investigating the use of redundant roots or indeterminates for fault detection and correction. This should be analogous to the
redundant residue number system.
5. Sizing Software: Improving the input and output format to be compatible
with SPICE and synthesis tools developed above. Also adding features
such as automatic area/ delay curve generation.

REFERENCES

[1] Dillinger, Thomas E. "VLSI Engineering" 1988, Prentice Hall.
[2] Lee, Lisa "MacWeek Upgrading and Repairing Your Mac" 1995, Hayden Books.
[3] Hwang, Kai and Briggs, Fay A. "Computer Architecture and Parallel Processing"
McGraw Hill, New York, 1984.
[4] Kung, H. T. and Padua, D. A., IEEE Computer Society Tutorial on Parallel Processing, 1981.
[5] Zakharov, Vasilii "Parallelism and Array Processing" IEEE Trans. on Computers,
Vol. C-33, No. 1, pp. 45-76, January 1984.
[6] Soderstrand, M.A., Jenkins, W. K., Jullien, G. A., and Taylor, F. J. "Residue Number
System Arithmetic: Modern Applications in Digital Signal Processing" IEEE Press,
New York, 1986.
[7] Jenkins, W. K. and Leon, B. J. "The Use of Residue Number Systems in the Design
of Finite Impulse Response Digital Filters" IEEE Trans. on Circuits and Systems,
Vol. CAS-24, pp. 191-201, April 1977.
[8] Soderstrand, M. A. "A High Speed Low-Cost Recursive Digital Filter Using Residue
Number Arithmetic" IEEE Proceedings, Vol. 65, pp. 1065-1067, July 1977.
[9] Jullien, G. A. "Residue Number Scaling and Other Operations Using ROM Arrays"
IEEE Trans. on Computers, Vol. C-27, pp. 325-337, April 1978.
[10] Jullien, G. A., Miller, W. C., Tseng, B. D. "Hardware Realization of Digital Signal
Processing Elements Using The Residue Number System" IEEE Int. Conf. on
Acoustics, Speech, and Signal Processing, Hartford, CT. May 9-11, 1977.
[11] Kato, Yasuo "Application-Oriented high speed processors: Experiences and Perspectives" IEEE Proceedings on Application Specific Array Processors, pp. 1-3,
1990.
[12] Kung, H. T. and Leiserson, C. E. "Systolic Arrays (for VLSI)" Sparse Matrix Symposium, pp. 256-282, SIAM, 1978.

[13] Fortes, J. A. B., Fu, K. S., and Wah, B. W. "Systematic Approaches to the design of
algorithmically Specified Systolic Arrays" Proc. of Int. Conf. on Acoustics, Speech,
and Signal Processing, ICASSP'85, pp. 8.9.1-8.9.5, April, 1987.
[14] Fisher, A. L. and Kung, H. T. "Synchronizing Large VLSI Processor Array" IEEE
Trans. on Computers, Vol. C-34, No. 8, pp. 734-740, August 1985.
[15] Svoboda, A. and Valach, M. "Operational Circuits" Stroje Na Zpracovani Informaci,
Vol. 3 Nakl. CSAV, Praha, 1955.
[16] Szabo, N. S. and Tanaka, R. I. "Residue Arithmetic and its Application to Computer
Technology" McGraw Hill, New York, 1967.
[17] Guffin, R. M. "A Computer for Solving Linear Simultaneous Equations Using the
Residue Number System" IRE Trans. on Electron Computers, Vol. EC-11, pp. 164-173, April 1962.
[18] Wigley, N. M. and Jullien, G. A. "On Modulus Replication for Residue Arithmetic
Computations of Complex Inner Products" IEEE Trans. on Computers, Vol. 39, No.
8, pp. 1065-1076, August 1990.
[19] Wigley, N. M. and Jullien, G. A. "Large Dynamic Range Computations Over Small
Finite Rings" IEEE Transactions on Computers, In print 1992.
[20] Bizzan, S.S., Jullien, G. A, Wigley, N. M., Miller, W. C. "Integer Mapping Architectures for the Polynomial Ring Engine" IEEE Proceedings on Computer Arithmetic,
pp. 44-51, 1993.
[21] Games, R. A. "An Algorithm for Complex Approximations in $Z[e^{2\pi i/8}]$" IEEE
Trans. on Information Theory, IT-32, pp. 603-607, 1986.
[22] Jenkins, W. K. and Leon, B. J. "The Use of Residue Number Systems in the Design
of Finite Impulse Response Filters" IEEE Trans. on Circuits and Systems, Vol.
CAS-24, pp. 191-201, April 1977.
[23] Nagpal, H. K., Jullien, G. A., and Miller, W. C. "Processor Architectures for Two-Dimensional Convolvers Using a Single Multiplexed Element with Finite Field
Arithmetic" in "Residue Number System Arithmetic: Modern Applications in Digital Signal Processing" Edited by Soderstrand, M.A., Jenkins, W. K., Jullien, G. A.,
and Taylor, F. J., IEEE Press, New York, 1986.
[24] Miller, D. D. and Polky, J. N. "An Implementation of the LMS Algorithm in the Residue Number System" in "Residue Number System Arithmetic: Modern Applications in Digital Signal Processing" Edited by Soderstrand, M.A., Jenkins, W. K.,
Jullien, G. A., and Taylor, F. J., IEEE Press, New York, 1986.

[25] Jullien, G. A., Miller, W. C., Grondin, R., Wang, Z., Zhang, D., Del Pup, L., and Bizzan, S. "Woodchuck: A Low-Level Synthesizer for Dynamic Pipelined DSP Arithmetic Logic Blocks" Proceedings of the International Symposium on Circuits and
Systems, pp. 176-179, 1992.
[26] Jullien, G. A., Miller, W. C., Grondin, R., Del Pup, L., and Bizzan, S., Zhang, D.
"Dynamic Computational Blocks for Bit-Level Systolic Arrays" IEEE J. on Solid-State Circuits, Vol. 29, No. 1, pp. 14-22, 1994.
[27] Foster, M. J. and Kung, H. T. "The Design of Special-Purpose Chips" IEEE Computer Magazine, pp. 26-40, January 1980.
[28] Barraclough, S.R., Southeran, M., Burgin, K., Wise, A.P., Vadher, A., Robbins, W.P.,
and Forsythe, R.M. "The Design and Implementation of the IMS A 110 Image and
Signal Processor," IEEE Custom Integrated Circuits Conference, pp. 24.5.1-24.5.4,
1989.
[29] Watson, R. W. and Hastings, C. W. "Self-Checked Computation Using Residue
Arithmetic" in "Residue Number System Arithmetic: Modern Applications in Digital Signal Processing" Edited by Soderstrand, M.A., Jenkins, W. K., Jullien, G. A.,
and Taylor, F. J., IEEE Press, New York, 1986.
[30] Mandelbaum, D. "Error Correction in Residue Arithmetic" in "Residue Number
System Arithmetic: Modern Applications in Digital Signal Processing" Edited by
Soderstrand, M.A., Jenkins, W. K., Jullien, G. A., and Taylor, F. J., IEEE Press, New
York, 1986.
[31] Barsi, F. and Maestrini, P. "Error Correcting Properties of Redundant Residue Number Systems" in "Residue Number System Arithmetic: Modern Applications in Digital Signal Processing" Edited by Soderstrand, M.A., Jenkins, W. K., Jullien, G. A.,
and Taylor, F. J., IEEE Press, New York, 1986.
[32] Taheri, M. "VLSI Fault-Tolerant Systolic Architectures" Ph.D. Dissertation, University of Windsor, 1988.
[33] Fraleigh, J.B. "A First Course in Abstract Algebra" Addison-Wesley Publishing
Company, 1989.
[34] McClellan, J. H. and Rader, C. M. "Number Theory in Digital Signal Processing"
Prentice-Hall Inc., Englewood Cliffs, 1979.
[35] Jullien, G. A. "Number Theoretic Techniques in Digital Signal Processing"
Advances in Electronics and Electron Physics, Vol. 80, Academic Press Inc., 1991.
[36] Baraniecka, A.Z., Jullien, G. A. "Residue Number System Implementation of Number Theoretic Transforms in Complex Residue Rings" IEEE Trans. Acoustic,
Speech, and Signal Processing. ASSP-28, pp. 258-291, June 1980.
[37] Jenkins, W. K. "Complex Residue Number Arithmetic for High Speed Signal Processing" Electronic Letters, Vol. 16, pp. 660-661, August 1980.
[38] Nagell, T. "Introduction to Number Theory." 1981 New York.
[39] Jenkins, W. K. and Krogmier J. V. "The Design of Dual-Complex Signal Processors
Based on Quadratic Modular Number Codes" IEEE Trans. on Circuits and Systems,
Vol. CAS-34, April 1987.
[40] Jenkins, W. K, "A Technique for the Efficient Generation of Projections for Error
Correcting Residue Codes", IEEE Trans. Comput., C-22, pp. 762-767, August 1973.
[41] W. K. Jenkins and J. J. Krogmier, "Error Detection and Correction in Quadratic Residue Number Systems", 26th Midwest Symposium on Circuits and Systems, pp.
408-411, 1983
[42] Leung, S. H. "Application of Residue Number Systems to Complex Digital Filters"
Fifteenth Asilomar Conference on Circuits, Systems, and Computers. pp. 70-74,
1981.
[43] Jullien, G. A., Krishnan, R., Miller, W.C. "Complex Digital Signal Processing Over
Finite Rings" IEEE Trans. on Circuits and Systems. Vol. CAS-34, pp. 365-377,
April 1987.
[44] Soderstrand, M.A. and Poe, G.D. "Applications of Quadratic-like Complex Residue
Number Systems to Ultrasonics" International Conference on ASSP. pp. 28A.5.1-28A.5.4, 1984.
[45] Krishnan, R., Jullien, G.A., Miller, W. C. "The Modified Quadratic Residue Number
System (MQRNS) for Complex High-Speed Signal Processing" IEEE Trans. on
Circuits and Systems. Vol. CAS-33 No. 3, pp. 325-327, March 1986.
[46] Krishnan, R., Jullien, G.A., Miller, W. C. "Implementation of Complex Number
Theoretic Transforms Using Quadratic Residue Number Systems" IEEE Trans. on
Circuits and Systems. Vol. CAS-33 No. 8, pp. 759-766, August 1986.
[47] Krishnan, R., Jullien, G. A., Miller, W. C. "Computations of Generalized FIR Filter
Structure Using the Modified Quadratic Residue Number System" IEEE Trans. on
Circuits and Systems-II. Vol. 39 No. 1, pp. 58-62, January 1992.
[48] Krishnan, R., Jullien, G. A., Miller, W. C. "Complex Digital Signal Processing Using
Quadratic Residue Number System" IEEE Trans. Acoustic, Speech, and Signal Processing. ASSP-34 No. 1, February 1986.
[49] Krishnan, R., Jullien, G. A., Miller, W. C. "VLSI Modular Architectures for Complex Digital Signal Processing" Proc. ICASSP. 3, pp. 33.6.1-33.6.4, 1987.

[50] Skavantzos, A. and Taylor, F. J. "On the Polynomial Residue Number System" IEEE
Trans. on Signal Processing, Vol. 39, No. 2, pp. 376-382, February 1991.
[51] Skavantzos, A. and Mitash, N. "Theory and Implementation Issues of the 2-Dimensional Polynomial Residue Number System" IEEE Southeastcon, 0-7803-0494-2/92,
pp. 226-33, 1992.
[52] Skavantzos, A. and Stouraitis, T. "Polynomial Residue Complex Signal Processing"
IEEE Trans. on Circuit and Systems-II: Analog and Digital Signal Processing, Vol.
40, No. 5, pp. 342-344, May 1993.
[53] Jullien, G. A., Taheri, M., Bandyopadhyay, S., and Miller, W. C. "A Low-Overhead
Scheme for Testing a Bit Level Finite Ring Systolic Array" Journal of VLSI Signal
Processing, Vol. 2.3, pp. 131-138, 1990.
[54] Taheri, M., Jullien, G. A., and Miller, W. C. "High Speed Processing Using Systolic
Arrays Over Finite Rings" IEEE Trans. on Selected Areas in Comm., VLSI in Communications III, 6(3), pp. 504-512, 1988.
[55] Bizzan, S. S., Jullien, G. A., Miller, W. C. "Analytical Approach to Sizing nFET
Chains" IEE Electronic Letters, Vol. 28, No. 14, pp. 1334-1335, July 1992.
[56] Bizzan, S.: "High Performance VLSI Circuit Techniques", M.A.Sc. Thesis, Faculty
of Graduate Studies and Research, University of Windsor, 1991, pp. 30-39
[57] Antognetti, P. and G. Massobrio. "Semiconductor Device Modeling with SPICE."
1988 McGraw-Hill Book Company.
[58] Chawla, B. R., H.K. Gummel and P. Kozak. "MOTIS-An MOS Timing Simulator."
IEEE Trans. of circuits and systems. CAS-22, 901-910, 1975.
[59] Corporation, C. M. Guide to the Integrated Circuit Implementation Services of the
Canadian Microelectronics Corporation. 1986.
[60] Elmore, W. C. "The transient response of damped linear networks with particular
regard to wideband amplifiers." J. Appl. Phys. 19, No. 1, 55-63, 1948.
[61] Fishburn, J.P. and A. E. Dunlop. TILOS: A posynomial programing approach to
transistor sizing. Int. Conf. Computer Aided Design. 326-328, 1985.
[62] Lin, T. and C. A. Mead. "Signal delay in general RC networks." IEEE Trans. computer-aided design. CAD-3, 331-349, 1984.
[63] Matson, M. D. and L. A. Glasser. "Macromodelling and optimization of digital MOS
VLSI circuits." IEEE Trans. on computer aided design. CAD-5, No. 4, 659-678,
1986.

[64] McCanny, J.V. and J.G. McWhirter. "Optimised Bit-Level Systolic Array for Convolution." IEE Proceedings, Pt. G. Vol. 131, 6, pp. 632-637, 1984.
[65] Nagel, L. W. SPICE2: A Computer program to simulate semiconductor circuits.
1975.
[66] Ousterhout, J. K. "A Switch-Level Timing Verifier for Digital MOS VLSI." IEEE
tran. on computer-aided design. CAD-4, No. 3, 336-349, 1985.
[67] Rubinstein, J., P. Penfield and M.A. Horowitz. "Signal Delay in RC Networks."
IEEE Trans. computer-aided design. CAD-2, 202-211, 1983.
[68] Sakurai, T. and A. R. Newton. "Delay Analysis of Series-Connected MOSFET Circuits." IEEE Journal of Solid-State Circuits. 26, NO. 2, 122-131, 1991.
[69] Shoji, M. "FET Scaling in Domino CMOS Gates." IEEE Journal of solid-state circuits. SC-20, No. 5, pp. 1067-71, 1985.
[70] Shoji, M. "CMOS Digital Circuit Technology." 1988 Prentice Hall Inc.
[71] Sundblad, R. and C. Svensson. "Fully dynamic switch level simulation of CMOS
circuits." IEEE Trans. on computer-aided design. CAD-6, No. 2, pp. 282-89, 1987.
[72] Vladimirescu, A. and S. Liu. The Simulation of MOS Integrated Circuits using
SPICE2. 1980.
[73] Weeks, W. T., A. J. Jiminez, G. W. Mahoney, D. Mehta, H. Qassamzadeh and T. R.
Scott. "Algorithms for ASTAP-A network analysis program." IEEE Trans: of circuit
theory. vol 20, pp. 628-634, 1973.
[74] Wu, C. Y., J. S. Hwang, C. Chang and C. C. Chang. "An efficient timing model for
CMOS combinatorial logic gates." IEEE Trans. on Computer-Aided Design. CAD-4,
636-650, 1985.
[75] Yuan, J. and C. Svensson. CMOS Circuit Speed Optimization Based on Switch
Level Simulation. IEEE International Symposium on Circuits and Systems. 3: 2109-2112, 1988.
[76] Brocco, L. M., S. P. McCormick and J. Allen. "Macromodeling CMOS Circuits for
Timing Simulation." IEEE Tran. on Computer-aided Design. vol. 7. No. 12, 1237-1249, 1988.
[77] Cherkauer, B. S. and Friedman, E. G. "Channel Width Tapering of Serially Connected MOSFETs with Emphasis on Power Dissipation" IEEE Trans. on VLSI, Vol.
2, No. 1, pp. 100-114, March 1994.

[78] Afghahi, M. and Svensson, C. "A Unified Single-Phase Clocking Scheme for VLSI
Systems" IEEE J. Solid-State Circuits, pp.225-233, 1990.
[79] D. Mandelbaum, "Error correction in residue arithmetic," IEEE Trans. Computers,
Vol. 21, pp. 538-545, June 1972.
[80] R. J. Cosentino, "Fault tolerance in a systolic arithmetic processor array," IEEE
Trans. Comput. Vol. 37, No. 7, pp. 886-890, July 1988.
[81] W. K. Jenkins, B. A. Schnaufer, and A. J. Mansen, "Combined system-level redundancy and modular arithmetic for fault tolerant digital signal processing," Proceedings of the 11th IEEE Int. Symp. on Computer Arithmetic, Windsor, Ont., pp. 28-35,
July 1993.
[82] A. Baraniecka, G.A. Jullien, "On decoding techniques for residue number system
realizations of digital signal processing hardware," IEEE Trans. Circuits and Systems, Vol. CAS-25, pp. 935-936, 1978.
[83] J.C. Czilli, P. Zhou, G.A. Jullien and W.C. Miller, 1994, "BiCMOS Current Steering
Pipeline Circuit Technique.", IEE Electronics Letters, Vol. 30, No. 12, pp. 943-945.
[84] J. D. Sun, H. Krishna, and K. Y. Lin, "A superfast algorithm for single-error correction in RRNS and hardware implementation," Journal of VLSI Signal Processing, 6,
pp. 259-269, 1993.
[85] G.A. Jullien, S. Bizzan, N.M. Wigley, 1994, "Using Redundant Finite Rings for
Fault Tolerant Signal Processors", Invited paper to SPIE 1994, (Invitation to session
on Algorithmic Fault Tolerance)
[86] Wang, Z., Jullien, G. A., Miller, W. C., Wang, J., and Bizzan, S. S., "Fast Adders
Using Enhanced Multiple-Output Domino Logic" IEEE Journal of Solid State Circuits and Systems, Vol. 32, No. 2, pp. 206-214, Feb. 1997.
[87] Reaume, D. J., Wigley, N. M., and Jullien, G. A., "Statistical Methods for the Optimal Design of MRRNS Algorithms" Internal VLSI Research Group Report.
[88] Blahut, R. E. "Fast Algorithms for Digital Signal Processing" Addison-Wesley,
1985.
[89] R. Godement, and Hermann "Algebra" Paris, 1968.
[90] Mallat, S. G. "A Theory for Multiresolution Signal Decomposition: The Wavelet
Representation" IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11,
No. 7, pp. 674-693, July 1989.
[91] Young, R. K. "Wavelet Theory and its Applications" Kluwer Academic Publishers,
1993.

[92] Rioul, O. and Vetterli, M. "Wavelets and Signal Processing" IEEE SP Magazine, pp.
14-38, Oct. 1991.
[93] Daubechies, I. "Orthonormal bases of compactly supported wavelets" Commun.
Pure Appl. Math, Vol. XLI, pp. 909-996, 1988.
[94] Private discussions with Dr. N. M. Wigley, Emeritus Professor, Dept. Mathematics
and Statistics, University of Windsor, 1994.
[95] Nortel 0.8µ BATMOS process available through Canadian Microelectronics Corporation, 210A Carruthers Hall, Queens University, Kingston, Ontario, Canada
K7L 3N6.
[96] Nortel 1.2µ CMOS4 process available through Canadian Microelectronics Corporation, 210A Carruthers Hall, Queens University, Kingston, Ontario, Canada
K7L 3N6.


Appendix A
The Residue Number System and its Extensions

A formal mathematical description of the Residue Number System, RNS, is included here
for completeness. Theorems are stated without proofs and concepts are summarized for
brevity. Proofs and more elaborate presentations of RNS concepts can be found in the
book of Szabo and Tanaka, "Residue Arithmetic and its Applications to Computer
Technology" [16], and also in the IEEE Press Book, "Residue Number System Arithmetic:
Modern Applications in Digital Signal Processing" [6].

This appendix starts by discussing the main properties of number systems, followed by a
formal definition of the Residue Number System. Arithmetic operations of multiplication
and addition in RNS are presented. Conversion techniques from binary to residue and vice
versa are also described along with architectures suitable for VLSI implementation.
Finally, the construction of other number representation systems based on the RNS is
discussed.

A.1 Properties of Number Systems

The main properties common to most integer number systems are characterized in the
following definitions. These properties provide means to assess the suitability and viability
of a number system as related to the problem of DSP algorithm implementation.

Definition (Dynamic Range): The dynamic range of a number system is defined as the
interval over which every integer can be represented by the system.


Definition (Uniqueness): A number representation is said to be unique if each number in
the system has only one representation.

Definition (Redundancy): A number system is redundant if there are fewer numbers than
there are combinations of digits. Therefore, for some combinations of the digits, a defined
number may not exist. Alternatively, different combinations may correspond to the same
number. Nonuniqueness obviously implies redundancy.

Definition (Weighted Number System): A number system is said to be weighted if there
exists a set of weights $w_i$ such that any number X can be expressed as:

$$X = \sum_{i=1}^{n} a_i w_i \tag{A.1}$$

where the $a_i$ are a set of permissible digits. If the values of $w_i$ are successive powers of the
same number, the number system has a fixed base or a fixed radix, e.g. the decimal system
with base ten. Number systems in which the weights are not powers of the same number
are called mixed-radix systems. A clear advantage of weighted number systems
over non-weighted systems is the ease with which magnitude comparison, sign detection
and overflow detection can be performed.

A.2 Residue Number System

A.2.1 General Characteristics
The residue number system provides a way of partitioning large dynamic range operations
into completely independent and parallel smaller dynamic range operations. This promises
faster computations. Compared to a weighted number system, RNS computations exhibit
no carry propagation among the residue digits. The residue number system is an integer
number system. Two important consequences from this property are: ( 1) quotients must
generally be rounded to the closest integer and (2) in most cases the absolute value of the
result will be larger than the input values, hence rescaling is necessary. This implies
division, which is a slow process in this number system. Addition and multiplication
operations can be performed very quickly and, in the case of the latter, the need for partial
products is eliminated. Division is not, in general, a closed operation in an integer number
system and this includes the residue system. The multiplicative inverse of a number does
exist in a residue system, however, under certain conditions. The residue number system is
not a weighted number system, hence it does not have many of the advantageous
properties listed for weighted number systems, such as magnitude comparison, sign
detection and overflow detection.

A.2.2 Residue Representation
A residue number system can be completely represented by specifying its base. However,
unlike a fixed-radix number system, the base for residue numbers is not a single radix, but
a set of integers, $m_1, m_2, \ldots, m_N$, where each member is called a modulus. For any given
base, the integer X will have a residue representation given by an N-tuple $\{x_1, x_2, \ldots, x_N\}$
where the $\{x_i\}$ are defined by a set of N equations:

$$X = q_i m_i + x_i, \qquad i = 1, 2, \ldots, N \tag{A.2}$$

and $q_i$ is an integer chosen such that $0 \le x_i < m_i$. $q_i$ can be thought of as the integer value
of $\frac{X}{m_i}$, denoted as $\left\lfloor \frac{X}{m_i} \right\rfloor$. The quantity $x_i$ is the least nonnegative integer remainder of the
division of X by $m_i$, designated as the residue of X modulo $m_i$ or $|X|_{m_i}$. The integer $x_i$ is
the $i$th residue digit of X. Note here that X can have any sign, positive or negative, and
$\left\lfloor \frac{X}{m_i} \right\rfloor$ will have the same sign as X, but by definition $|X|_{m_i}$ must be nonnegative.
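
As a small illustration of Eqn. (A.2), the sketch below computes the residue digits of an integer. The function name `to_residues` and the moduli (3, 5, 7) are illustrative assumptions, not taken from the text. For a positive modulus, Python's `%` operator returns the least nonnegative remainder and `//` returns the floor quotient, which matches the definitions of $x_i$ and $q_i$ above, including for negative X.

```python
def to_residues(x, moduli):
    """Return the residue digits (x mod m_i) of x for each modulus m_i."""
    # For a positive modulus, Python's % yields a value in [0, m_i - 1],
    # i.e. the least nonnegative remainder, even when x is negative.
    return tuple(x % m for m in moduli)

moduli = (3, 5, 7)                 # example moduli (assumed for illustration)
print(to_residues(23, moduli))     # (2, 3, 2)
print(to_residues(-23, moduli))    # (1, 2, 5)
print(23 // 7, -23 // 7)           # floor quotients q_i: 3 -4
```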


Theorem A.1 (Periodicity of residue representations): Two integers X and X'
have the same residue representation for moduli $m_1, m_2, \ldots, m_N$ if and only if X - X' is an
integer multiple of the least common multiple of the moduli, denoted by M.

The above theorem reveals the following:

1. The residue representation is periodic. This means that if the representation
is to be unambiguous in computation, there will be the restriction of using a
single period, called the interval of definition.
2. The residue representation does not lend itself to magnitude comparison
within any interval of definition or for sign identification of any number.

A.2.3 Representation of Negative Numbers
Similar to the binary number system, the representation of negative numbers in residue
arithmetic is somewhat arbitrary. One method is to represent the absolute magnitude of a
number in residue code and use an external sign bit to represent the sign. Alternatively the
sign of the number can be included within the residue code, similar to the complement
representation in binary. It is common, for a dynamic range of M, to consider residue
numbers in the range $[0, \frac{M}{2} - 1]$ as positive and residue numbers in the range
$[\frac{M}{2}, M - 1]$ as negative. Therefore, if X is represented as $\{x_1, x_2, \ldots, x_N\}$, $-X$ is
represented as $\{|m_1 - x_1|_{m_1}, |m_2 - x_2|_{m_2}, \ldots, |m_N - x_N|_{m_N}\}$.

A.2.4 Residue Operation Identities
A number of arithmetic relationships are presented here as a foundation for work
discussed later in this chapter. Proof for the following identities can be found in [16].

Identity 1  Residues of multiples of the modulus m

$$|Km|_m = 0 \quad \text{where } K \in \mathbb{Z} \tag{A.3}$$


If $K, X, a, m \in \mathbb{Z}$, then from Eqn. (A.3) we can conclude the following observations:

1. $|X|_1 = 0$

2. $\left\lfloor \frac{a}{m} \right\rfloor = 0$ iff $0 \le a < m$, and $|a|_m = a$ iff $0 \le a < m$

Identity 2  Addition of multiples of the modulus m

$$|X + Km|_m = |X|_m \quad \text{where } K, X, m \in \mathbb{Z} \tag{A.4}$$

Again the following observations can be made from this identity:

1.

Identity 3  Additive inverse modulo m

The following identities are derived from Eqn. (A.4):

$$|-X|_m = |(m - 1)X|_m = |m - X|_m \tag{A.5}$$

where $|m - X|_m$ is called the additive inverse of X modulo m. Every number has a unique
additive inverse.

Identity 4  Addition and subtraction modulo m

The following identities form the basis for addition and subtraction modulo m:

$$|X \pm Y|_m = \big|\,|X|_m \pm |Y|_m\,\big|_m = \big|\,|X|_m \pm Y\,\big|_m \tag{A.6}$$

where $|X \pm Y|_m$ is referred to as the sum or the difference of X and Y modulo m. This
identity can be generalized to contain any number of terms:

$$\left|\sum_{i=1}^{N} X_i\right|_m = \left|\sum_{i=1}^{N} |X_i|_m\right|_m \tag{A.7}$$

Identity 5  Multiplication modulo m

As with addition and subtraction, the following identities form the basis for multiplication
modulo m:

$$X \otimes_m Y = |XY|_m = \big|\,|X|_m\,|Y|_m\,\big|_m \tag{A.8}$$

A generalization of the above identity to include an arbitrary number of terms is:

$$\left|\prod_{i=1}^{N} X_i\right|_m = \left|\prod_{i=1}^{N} |X_i|_m\right|_m \tag{A.9}$$

This identity holds for both negative and positive numbers. In the case of negative
numbers, the additive inverse identity is used, which preserves the laws of normal
arithmetic.

Identity 6  Cancellation law for multiplication:
if $\gcd(K, m) = 1$, and $K \otimes_m a = K \otimes_m b$, then $|a|_m = |b|_m$

Identity 7  Existence of the Multiplicative Inverse

Definition: If $0 \le a < m$ and $a \otimes_m b = 1$, then a is called the multiplicative inverse of b
mod m, and is denoted by $a = |b^{-1}|_m$.


Theorem A.2  The quantity $|b^{-1}|_m$ exists if and only if $\gcd(b, m) = 1$ and $|b|_m \ne 0$. In this case
$|b^{-1}|_m$ is unique.
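
The existence condition of Theorem A.2 can be checked numerically. The sketch below is illustrative only: the helper name `mod_inverse` and the example values are assumptions, and it relies on Python 3.8 or later, where the built-in three-argument `pow` accepts an exponent of -1. For a prime modulus it also compares the result with the power $b^{p-2}$ that follows from Fermat's Theorem (Theorem A.4).

```python
from math import gcd

def mod_inverse(b, m):
    """Return the multiplicative inverse of b modulo m, or None if none exists."""
    if gcd(b, m) != 1:
        return None                  # no inverse unless b and m are relatively prime
    # Three-argument pow with exponent -1 (Python 3.8+) computes the inverse.
    return pow(b, -1, m)

print(mod_inverse(3, 7))    # 5, since 3 * 5 = 15 = 1 (mod 7)
print(mod_inverse(4, 6))    # None, since gcd(4, 6) = 2
# For a prime modulus p the inverse can also be found as b^(p-2) mod p.
print(pow(3, 7 - 2, 7))     # 5
```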

Theorem A.3  If the multiplicative inverse of b modulo m, $|b^{-1}|_m$, is $|a|_m$, then

Theorem A.4  Fermat's Theorem

If p is a prime¹, then

$$|a^p|_p = |a|_p \tag{A.10}$$

The importance of Fermat's theorem lies in the ease of finding the multiplicative inverse
of any element $|a|_p$, recalling that p is prime. The multiplicative inverse of $|a|_p \ne 0$ is
$|a^{p-2}|_p$, since $a^{p-2} \otimes_p a = 1$ by Fermat's Theorem. Hence an equation of the form
$a \otimes_p X = |b|_p$, for $|a|_p \ne 0$, may be solved uniquely and the solution is:

$$X = a^{-1} \otimes_p b = a^{p-2} \otimes_p b \tag{A.11}$$

A.3 Residue Arithmetic Operations

Residue operations are the essential building blocks with which a good VLSI DSP
architecture can be constructed. It turns out that all residue operations can be realized by
modulo operators which in turn may be implemented by look-up tables, ROMs [9]. This
leads to the fact that residue multiplication implementation is as fast as residue addition
with similar hardware requirements. In the following discussion, it is assumed that the
residue representation of the integer X is $\{x_1, x_2, \ldots, x_N\}$, defined with the moduli set
$\{m_i\}_{i=1}^{N}$, and likewise for the integer Y.

1. The Fermat-Euler Theorem is a generalization of Fermat's Theorem (Theorem A.4) and requires only that a and p be relatively
prime.

A.3.1 Residue Addition and Subtraction
The addition and subtraction of two numbers X and Y are given by Theorem A.5.

Theorem A.5 (Residue Addition Theorem): For a residue number system consisting
of moduli $m_1, m_2, \ldots, m_N$, let X and Y be residue numbers. The residue representation
of $|X \pm Y|_M$ is given by

$$\{|x_1 \pm y_1|_{m_1}, |x_2 \pm y_2|_{m_2}, \ldots, |x_N \pm y_N|_{m_N}\} \tag{A.12}$$

This theorem is a direct consequence of Identity 4, but it has fundamental importance in
residue arithmetic. First, it shows that the operations on the residue digits are completely
independent and there are no intermodular carries (or borrows). Second, the sum is obtained
modulo M, hence if the number exceeds M, an ambiguity arises. Numbers of the form
$|A|_M$ and $|A + kM|_M$, where $k \in \mathbb{Z}$, have the same residue representation, hence M must
be chosen large enough to guarantee that computed results lie within the dynamic range,
thus avoiding overflow.
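
A minimal sketch of the digit-wise addition in Theorem A.5, and of the overflow ambiguity it describes, is given below. The names `rns_add` and `to_res` and the moduli (3, 5, 7) are assumptions made for illustration. `math.prod` requires Python 3.8 or later.

```python
from math import prod

def rns_add(xs, ys, moduli):
    """Add two residue representations digit by digit (no carries between digits)."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, moduli))

moduli = (3, 5, 7)                    # example moduli (assumed)
M = prod(moduli)                      # dynamic range M = 105

def to_res(v):
    """Residue representation of the integer v."""
    return tuple(v % m for m in moduli)

# Within the dynamic range the digit-wise sum is the representation of the sum.
print(rns_add(to_res(20), to_res(30), moduli) == to_res(50))    # True
# Overflow: 60 + 50 = 110 exceeds M, so the result aliases to 110 - 105 = 5.
print(rns_add(to_res(60), to_res(50), moduli) == to_res(5))     # True
```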

A.3.2 Residue Multiplication
For the residue system consisting of the moduli $m_1, m_2, \ldots, m_N$, let X and Y be represented
by residue digits. Then the residue representation of $|X \otimes Y|_M$ is

$$\{|x_1 y_1|_{m_1}, |x_2 y_2|_{m_2}, \ldots, |x_N y_N|_{m_N}\} \tag{A.13}$$

Within the interval $[0, M - 1]$, only one integer, $|XY|_M$, has this residue representation.
Multiplication, as with addition, is carry free and if XY exceeds M, an ambiguity results
from the periodic nature of the residue representation. An important aspect of
multiplication is that it lends itself to table lookup. In conventional binary systems, table
lookup for an n-bit wordlength would require $2^{2n}$ entries, which yields unreasonably large
ROMs for practical wordlengths. In the RNS, however, for a comparable dynamic range,
each modulus $m_i$ requires $m_i^2$ entries in the table and hence a total of $\sum_{i=1}^{N} m_i^2$ entries are
needed for all moduli. This results in the use of many small ROMs accessed
simultaneously for each modulo operation and hence fast arithmetic evaluation [9].
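
The table-lookup argument can be made concrete with a small software model. The sketch below is an illustration under assumed names (`build_mul_tables`, `rns_mul`) and an assumed moduli set (3, 5, 7); nested Python lists stand in for the ROMs.

```python
def build_mul_tables(moduli):
    """Precompute one m_i x m_i multiplication table per modulus (a ROM model)."""
    return [[[(a * b) % m for b in range(m)] for a in range(m)] for m in moduli]

def rns_mul(xs, ys, tables):
    """Multiply two residue representations using only table lookups."""
    return tuple(t[x][y] for x, y, t in zip(xs, ys, tables))

moduli = (3, 5, 7)                               # example moduli (assumed)
tables = build_mul_tables(moduli)
total_entries = sum(m * m for m in moduli)       # 9 + 25 + 49

def to_res(v):
    """Residue representation of the integer v."""
    return tuple(v % m for m in moduli)

print(total_entries)                                         # 83
print(rns_mul(to_res(6), to_res(9), tables) == to_res(54))   # True
```

For comparison, the dynamic range M = 105 of this example needs 7 bits, so a single binary lookup table for two 7-bit operands would hold $2^{14} = 16384$ entries.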

A.3.3 Residue Division
Division in residue arithmetic can be classified into three categories:

1. Division remainder zero: Division where the dividend is known to be an
integer multiple of the divisor and the divisor is known to be relatively
prime to M.

2. Scaling: Division of an arbitrary dividend by any factor of M which is a
product of $m_i$.

3. General Division: General division of an arbitrary integer by an arbitrary
integer divisor.

The first type of division is limited to special numbers and its applicability is restricted.
Scaling is analogous to power of two division in a binary system, where this is
accomplished by simply shifting the binary bits. However, in the RNS it is not so simple,
but nevertheless is much faster than division by an arbitrary number. A more detailed
description of category 3 can be found in [16].

Division Remainder Zero:
Theorem A.6  Division Remainder Zero

$$\left|\frac{b}{a}\right|_{m_i} = \left|\,b \otimes |a^{-1}|_{m_i}\right|_{m_i} \tag{A.14}$$

for all $m_i$, if and only if a divides b (without remainder) and $(a, m_i) = 1$.

Scaling:
Fixed radix systems offer fast evaluation when determining the outcome of a division by
some power of the base. Binary systems require only shift operations to carry out this type
of division. Scaling in the residue system is defined to be a division operation where the
divisor is a product of some of the moduli. The implementation of such an operation is not
as simple as in the binary system case; however, it offers simplicity and speed advantages
over general residue division.

The method described here is generalized to include numbers of both positive and negative
signs. In the RNS, it is usual to represent a negative number of magnitude X as M-X.
Division of any number can be represented as:

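$$X = \left\lfloor \frac{X}{Y} \right\rfloor Y + |X|_Y \tag{A.15}$$

where X is the dividend and Y is the divisor. The purpose of scaling is to find $\left\lfloor \frac{X}{Y} \right\rfloor$ for the
restricted values of Y. From Eqn. (A.15), $\left\lfloor \frac{X}{Y} \right\rfloor$ can be written as:

$$\left\lfloor \frac{X}{Y} \right\rfloor = \frac{X - |X|_Y}{Y} \tag{A.16}$$

Therefore the residue representation of $\left\lfloor \frac{X}{Y} \right\rfloor$ will be:

$$\left\{ \left|\frac{X - |X|_Y}{Y}\right|_{m_1}, \left|\frac{X - |X|_Y}{Y}\right|_{m_2}, \ldots, \left|\frac{X - |X|_Y}{Y}\right|_{m_N} \right\} \tag{A.17}$$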

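If Y is a product of any of the moduli, then from Theorem A.6, for all $(m_i, Y) = 1$ one
obtains:

$$\left|\left\lfloor \frac{X}{Y} \right\rfloor\right|_{m_i} = \left|\frac{X - |X|_Y}{Y}\right|_{m_i} = \left|\,\big|X - |X|_Y\big|_{m_i} \otimes \left|\frac{1}{Y}\right|_{m_i}\right|_{m_i} \tag{A.18}$$

Eqn. (A.18) expresses all the residue digits of $\left\lfloor \frac{X}{Y} \right\rfloor$ for which $(m_i, Y) = 1$. The rest of the
digits can be obtained through base extension (see Section A.5).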
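
The scaling step can be sketched for the simplest case, in which the divisor Y is the single modulus $m_1$, so that $|X|_Y$ is just the first residue digit. The function name and the example moduli are assumptions for illustration; the digit for $m_1$ itself is not produced here, since it would require base extension.

```python
def scale_by_first_modulus(xs, moduli):
    """Residue digits of floor(X / m_1) for the moduli m_2..m_N (coprime to m_1)."""
    y, r = moduli[0], xs[0]          # divisor Y = m_1 and remainder |X|_Y = x_1
    digits = []
    for x_i, m_i in zip(xs[1:], moduli[1:]):
        y_inv = pow(y, -1, m_i)      # |1/Y| mod m_i, exists since gcd(Y, m_i) = 1
        digits.append(((x_i - r) * y_inv) % m_i)
    return tuple(digits)

moduli = (3, 5, 7)                   # example moduli (assumed)
x = 47
xs = tuple(x % m for m in moduli)
print(scale_by_first_modulus(xs, moduli))    # (0, 1)
print((15 % 5, 15 % 7))                      # floor(47 / 3) = 15 gives (0, 1)
```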

If it is known that the number is negative, then one can easily obtain X from M+X, scale by
Y and represent the result as $M + \left\lfloor \frac{X}{Y} \right\rfloor$. But if the sign of the number is not known, then if
we scale the number as if it were positive, according to the above algorithm, the result will
be $\frac{M}{Y} + \left\lfloor \frac{X}{Y} \right\rfloor$. This can be avoided if we consider that division by Y maps all numbers in
the interval $[0, \frac{M}{2} - 1]$ into $[0, \frac{M}{2Y} - 1]$ and all numbers in the range
$[\frac{M}{2}, M]$ into $[\frac{M}{2Y}, \frac{M}{Y}]$. Hence, it is possible to divide the number by Y first, then, by noting the interval
in which $\left|\left\lfloor \frac{X}{Y} \right\rfloor\right|_M$ lies, determine the sign of X. If X is negative, add $\left|-\frac{M}{Y}\right|_M$ to the result to
obtain $M + \left\lfloor \frac{X}{Y} \right\rfloor$. In this method of scaling, the result is rounded to the closest integer
not exceeding the actual answer. Methods that perform rounding to the closest
integer are discussed in [16].


A.4 Conversion Techniques

The Chinese remainder theorem is a classical theorem from number theory which enables
the conversion from the residue number system to a weighted number system. Given the
residue representation $\{x_1, x_2, \ldots, x_N\}$ of X, the Chinese remainder theorem makes it
possible to determine $|X|_M$ provided the greatest common divisor of any pair of moduli
is 1. Such moduli are called pairwise relatively prime.

Theorem A.7  The Chinese Remainder Theorem

$$|X|_M = \left|\sum_{i=1}^{N} \hat{m}_i \left|x_i \otimes_{m_i} (\hat{m}_i)^{-1}\right|_{m_i}\right|_M \tag{A.19}$$

where $\hat{m}_i = \frac{M}{m_i}$, $\gcd(m_j, m_k) = 1$ for $j \ne k$, and $M = \prod_{i=1}^{N} m_i$. From the Chinese
Remainder Theorem $|X|_M$ is obtained, not X itself. If X lies in the range $[0, M - 1]$, then
it can be written as:

$$X = \left|\sum_{i=1}^{N} \hat{m}_i \left|x_i \otimes_{m_i} (\hat{m}_i)^{-1}\right|_{m_i}\right|_M \tag{A.20}$$

since the modulo M operator on the left is not needed. Alternatively the Chinese
Remainder Theorem can be written so that the sum appears without the modulo M
operator. This can be done with the use of an auxiliary function A(X) shown below:

$$X = \sum_{i=1}^{N} \hat{m}_i \left|x_i \otimes_{m_i} (\hat{m}_i)^{-1}\right|_{m_i} - M A(X) \tag{A.21}$$


(A.22)


where A(X) is a function of X defined for any integer X. From Eqn. (A.22) it can be seen
that A(X) is always an integer, and it can be shown that if $0 \le X < M$ then:

0 ~ A(X)

i= I

~

(A.23)

M
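
The Chinese Remainder Theorem reconstruction can be sketched as follows. This is an illustrative example under assumed names (`crt`) and assumed moduli (3, 5, 7). The value returned as A(X) is obtained here simply by rearranging Eqn. (A.21), that is, as the integer quotient of the unreduced sum by M; it is not a reproduction of Eqns. (A.22) and (A.23).

```python
from math import prod

def crt(xs, moduli):
    """Reconstruct X in [0, M-1] from its residues; also return A(X) of Eqn. (A.21)."""
    M = prod(moduli)
    total = 0
    for x_i, m_i in zip(xs, moduli):
        m_hat = M // m_i                               # m_hat_i = M / m_i
        total += m_hat * ((x_i * pow(m_hat, -1, m_i)) % m_i)
    # total = X + M * A(X), so reducing modulo M recovers X.
    return total % M, total // M

moduli = (3, 5, 7)                  # example moduli (assumed)
x, a = crt((2, 3, 2), moduli)       # residues of 23
print(x, a)                         # 23 1
```

In this example the unreduced sum is 35 + 63 + 30 = 128 = 23 + 105, so one multiple of M has to be removed.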

A.4.1 Binary to Residue Conversion
Eqn. (A.2) defines the residue of a number modulo $m_i$. In conventional computers, this
calculation is performed by dividing X by $m_i$ and determining the remainder. In a residue
computer, which is capable of residue addition, multiplication, etc., a more efficient
method can be used to determine the residue representation. A number is represented in
the binary system as:

$$X = \sum_{j=0}^{n-1} b_j 2^j \tag{A.24}$$

where $b_j$ are the binary digits of the integer X. Taking the residue, modulo $m_i$, on both
sides of Eqn. (A.24) yields:

$$|X|_{m_i} = \left|\sum_{j=0}^{n-1} b_j \left|2^j\right|_{m_i}\right|_{m_i} \tag{A.25}$$

If powers of 2 modulo $m_i$ are directly available, $|X|_{m_i}$ may be computed by merely adding
(modulo $m_i$) those powers of 2 for which $b_j = 1$.
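
A short sketch of this conversion is given below. The function name `binary_to_residue` and its parameters are assumptions for illustration; the list `powers` plays the role of the stored powers of 2 modulo the chosen modulus.

```python
def binary_to_residue(x, m, n_bits):
    """Compute x mod m by adding (mod m) the stored powers of 2 whose bit is set."""
    powers = [pow(2, j, m) for j in range(n_bits)]   # table of 2^j mod m
    acc = 0
    for j in range(n_bits):
        if (x >> j) & 1:                             # binary digit b_j of x
            acc = (acc + powers[j]) % m
    return acc

print(binary_to_residue(0b1011011, 7, 7))   # 91 mod 7 = 0
print(binary_to_residue(0b1011011, 5, 7))   # 91 mod 5 = 1
```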

A.4.2 Mixed Radix Conversion
A direct implementation of the Chinese Remainder Theorem is one method of converting
residue numbers. The disadvantage of this method is the implementation of the mod M
operator. The mixed radix conversion presented here, on the other hand, can be
implemented by only taking residues modulo $\{m_i\}$.

The mixed radix representation is of great importance in residue computation for two
reasons:

1. The mixed radix system is a weighted system and hence can be used in
magnitude comparison.
2. Conversion from residue to certain mixed-radix systems is relatively fast in
residue computers.
Before explaining the conversion procedure, it is necessary to explain the system itself.

Mixed Radix System:
A number X may be expressed in mixed radix form as:

$$X = a_N \prod_{i=1}^{N-1} R_i + \ldots + a_3 R_1 R_2 + a_2 R_1 + a_1 \tag{A.26}$$

where $R_i$ are the radices, $a_i$ are the mixed radix digits, and $0 \le a_i < R_i$. For a given set of
radices, the mixed radix representation of X is denoted by $(a_N, a_{N-1}, \ldots, a_1)$ where the
digits are in decreasing significance. It is obvious that a positive number in the interval
$[0, \prod_{i=1}^{N} R_i - 1]$ may be represented uniquely in this manner. The multipliers of the mixed
radix digits are the weights. For the special case of the decimal number system the weights
of the digits are consecutive powers of ten.

Conversion to the Mixed-Radix System

If, for a set of moduli $m_1, m_2, \ldots, m_N$, a set of radices is chosen such that $m_i = R_i$, the
mixed radix system and the residue system are said to be associated and the two systems
have the same range of values, i.e. $\prod_{i=1}^{N} m_i$. If $m_i = R_i$, the mixed radix expression will be
of the form:

$$X = a_N \prod_{i=1}^{N-1} m_i + \ldots + a_3 m_1 m_2 + a_2 m_1 + a_1 \tag{A.27}$$

where the $a_i$ are mixed radix coefficients. The coefficients are determined sequentially
starting with $a_1$. Taking mod $m_1$ of Eqn. (A.27) will determine $a_1$, since all terms
except the last are multiples of $m_1$, therefore:

$$|X|_{m_1} = x_1 = a_1 \tag{A.28}$$

Hence $a_1$ is simply the first residue digit. To obtain $a_2$, first the residue code of $X - a_1$ is
formed. This quantity is divisible by $m_1$, and since $m_1$ is relatively prime to all other
moduli, the division remainder zero procedure (Theorem A.6) can be used to find the residue
digits of order 2 to N of $\frac{X - a_1}{m_1}$. From Eqn. (A.27), it can be deduced that
$a_2 = \left|\frac{X - a_1}{m_1}\right|_{m_2}$. In the same manner all the other mixed radix digits can be obtained. In
general the mixed radix digits can be found for $i > 1$ by:

$$a_i = \left|\left\lfloor \frac{X}{m_1 m_2 \cdots m_{i-1}} \right\rfloor\right|_{m_i} \tag{A.29}$$
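
The sequential procedure described above can be sketched as follows. The function names and the example moduli are assumptions for illustration. Each step takes the current residue digit as the next mixed radix digit, then subtracts it and divides by the corresponding modulus in every remaining residue channel, using only operations modulo the individual $m_i$.

```python
def mixed_radix_digits(xs, moduli):
    """Return mixed radix digits [a_1, ..., a_N] computed from the residue digits."""
    xs = list(xs)
    digits = []
    for k, m_k in enumerate(moduli):
        a_k = xs[k]                      # the current residue digit is the next digit
        digits.append(a_k)
        # Subtract a_k and divide by m_k in every remaining residue channel.
        for i in range(k + 1, len(moduli)):
            m_i = moduli[i]
            xs[i] = ((xs[i] - a_k) * pow(m_k, -1, m_i)) % m_i
    return digits

def from_mixed_radix(digits, moduli):
    """Evaluate a_1 + a_2*m_1 + a_3*m_1*m_2 + ... as in Eqn. (A.27)."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value

moduli = (3, 5, 7)                                  # example moduli (assumed)
digits = mixed_radix_digits((2, 3, 2), moduli)      # residues of 23
print(digits, from_mixed_radix(digits, moduli))     # [2, 2, 1] 23
```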


A.5 Base Extension

Frequently it is necessary to find the residue representation of a number in one base using
its representation in another base. In most cases, the new base will be an extension of the
original base, with one or more extra moduli added to the original base. The procedure,
termed base extension, is a mixed radix conversion with an additional final step. Consider
a residue system consisting of moduli $m_1, m_2, \ldots, m_N$, and with the interval of definition
$[0, \prod_{i=1}^{N} m_i - 1]$. If another modulus, $m_{N+1}$, is added to the base, then the interval of
definition will become $[0, \prod_{i=1}^{N+1} m_i - 1]$ and the mixed radix expression will be of the
form:

$$X = a_{N+1} \prod_{i=1}^{N} m_i + a_N \prod_{i=1}^{N-1} m_i + \ldots + a_3 m_1 m_2 + a_2 m_1 + a_1 \tag{A.30}$$

For any number in the original interval $a_{N+1}$ will be zero, and, in performing the mixed
radix conversion, this fact can be used to determine $|X|_{m_{N+1}}$.

A.6 Redundant Residue Number System (RRNS)

A redundant residue number system is defined as a residue system $m_1, m_2, \ldots, m_N$ with R
additional moduli. All R+N moduli must be pairwise relatively prime to ensure a unique number
representation. The moduli $m_1, m_2, \ldots, m_N$ are called the nonredundant moduli and the
moduli $m_{N+1}, m_{N+2}, \ldots, m_{N+R}$ are the redundant moduli. A number in this system will
be represented by N+R digits, N of which are nonredundant, and the rest are redundant
digits. The total range, the set of states represented by the RRNS, will be $[0, M_T - 1]$
where $M_T = \prod_{i=1}^{N+R} m_i$. The interval $[0, M - 1]$ is termed the legitimate range, where
$M = \prod_{i=1}^{N} m_i$, and the range $[M, M_T - 1]$ is termed the illegitimate range. To make proper
use of the redundancy, all operands and results must be restricted to the legitimate range.
This constraint defines the dynamic range of the system to be $[-\frac{M-1}{2}, \frac{M-1}{2}]$ if M is
odd and $[-\frac{M}{2}, \frac{M}{2} - 1]$ if M is even. There exists a one-to-one correspondence between the
integers in the dynamic range and the states of the legitimate range in nonredundant RNS.
The mixed radix representations associated with the residue number states are used in both
overflow detection and correction. Extensive discussions on the subject of error and
overflow detection can be found in [6].
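
One way to see how the mixed radix representation supports this kind of checking is sketched below. It is an illustration only, with assumed function names and an assumed moduli set (3, 5, 7) plus one redundant modulus 11: a word lies in the legitimate range exactly when its mixed radix digits associated with the redundant moduli are all zero. It is not a reproduction of the detection and correction methods of [6].

```python
def mixed_radix_digits(xs, moduli):
    """Mixed radix digits [a_1, ..., a_n] computed from residue digits."""
    xs = list(xs)
    digits = []
    for k, m_k in enumerate(moduli):
        digits.append(xs[k])
        for i in range(k + 1, len(moduli)):
            xs[i] = ((xs[i] - digits[k]) * pow(m_k, -1, moduli[i])) % moduli[i]
    return digits

def is_legitimate(xs, moduli, n_nonredundant):
    """True when every redundant mixed radix digit is zero, i.e. X < M."""
    digits = mixed_radix_digits(xs, moduli)
    return all(d == 0 for d in digits[n_nonredundant:])

moduli = (3, 5, 7, 11)            # N = 3 nonredundant moduli, R = 1 redundant (assumed)
word = tuple(23 % m for m in moduli)          # (2, 3, 2, 1)
print(is_legitimate(word, moduli, 3))         # True: 23 < M = 105
corrupted = (2, 4, 2, 1)                      # second digit altered
print(is_legitimate(corrupted, moduli, 3))    # False: falls in the illegitimate range
```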

A.7 Complex Residue Number System

So far the discussion on residue number systems has focused on definitions and properties
defined over a finite ring or field, R(m) or F(p) if m is a prime, in which the elements of
this field/ring, M = {0, ..., m-1}, are real integers. In this section, computations involving
complex numbers in modular arithmetic will be discussed. A general description of
complex residue number systems (CRNS) will be given, followed by special cases of the
CRNS, which allow simplification in complex number computations [39].

Ordinary complex number systems are based on the fact that the integer polynomial $x^2 = -1$
has no solution in the set of real numbers. In order to permit solutions to this polynomial,
the set of complex numbers is introduced, where j, the imaginary unit, is equal to the
square root of -1. Analogous to this, in order to form a complex modular structure, it is
necessary to first determine the solution of:

$$x^2 = -1 \bmod m \tag{A.31}$$

If a solution to Eqn. (A.31) exists, then $j \in R(m)$, and the equation is said to be solvable.
In this case -1 is a quadratic residue mod m. If it is otherwise, the equation is unsolvable,
and -1 is termed a quadratic nonresidue modulo m. The following theorem helps in
determining whether the equation is solvable or not, for an arbitrary m.

Theorem A.8  The number -1 is a quadratic residue of all primes of the form
p=4k+1 and a quadratic nonresidue of all primes of the form p=4k+3 [38].

If m is not a prime, then it is sufficient that -1 be a quadratic residue of all the primes that
divide m for there to be a solution, j, of $x^2 = -1 \pmod{m}$. This is the first step in building
a complex modular structure. The next step is to construct the complex extension field (for
m prime) or complex extension ring (for m non prime).

Complex Extension Fields/Rings
For the case $m = p = 4k + 3$, Eqn. (A.31) has no solution in F(p), and
$j = \sqrt{-1} \notin F(p)$. A complex modular structure, with $p^2$ elements, isomorphic to the
second degree Galois extension field, $F(p^2)$, can be formed by the ordered pairs
$(x_r, x_i) \equiv x_r + j x_i$ with $x_r, x_i \in F(p)$. The binary modular operations of addition and
multiplication are defined as:

$$(x_r, x_i) \oplus (y_r, y_i) = (u_r, u_i)$$
$$(x_r, x_i) \otimes (y_r, y_i) = (z_r, z_i) \tag{A.32}$$

where

$$u_r = (x_r \oplus_p y_r)$$
$$u_i = (x_i \oplus_p y_i) \tag{A.33}$$
$$z_r = (x_r y_r \oplus_p (-x_i y_i))$$
$$z_i = (x_i y_r \oplus_p x_r y_i)$$
It can be seen that complex modular operations emulate ordinary complex arithmetic, and
similarly utilize four real multiplications and two real additions to perform complex
multiplication.
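
The component formulas for multiplication can be exercised with a small example. The sketch below uses the assumed prime p = 7, which is of the form 4k + 3, and an assumed function name `cmul_mod`; it follows ordinary complex multiplication with every result reduced modulo p.

```python
def cmul_mod(x, y, p):
    """Multiply (x_r, x_i) and (y_r, y_i) in the complex extension of F(p)."""
    xr, xi = x
    yr, yi = y
    zr = (xr * yr - xi * yi) % p      # real part
    zi = (xr * yi + xi * yr) % p      # imaginary part
    return zr, zi                     # four real multiplications in total

p = 7                                  # example prime of the form 4k + 3 (assumed)
# No element of F(7) squares to -1, so j must be adjoined.
print(any((x * x + 1) % p == 0 for x in range(p)))    # False
print(cmul_mod((2, 3), (4, 5), p))     # (2 + 3j)(4 + 5j) = -7 + 22j -> (0, 1)
```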

If m is not a prime, then a complex modular ring $R(m^2)$ can be formed by the ordered
pairs defined above, following the same arithmetic rules. Work on this theme is presented
in [36] and [39].

If $m = p = 4k + 1$, then Eqn. (A.31) is solvable, $j \in R(m)$ and -1 is a quadratic residue
mod m. This case leads to the definition of a mapping from $R(m^2)$ to a quadratic ring
$QR(m^2)$ which is isomorphic to $R(m^2)$. This is described in more detail in the following
section.

A.7.1 Quadratic Residue Number System
A method of handling complex data, so that the two channels for real and imaginary data
are processed independently, is to use the Quadratic Residue Number System [39] [42].
This method maps the real and imaginary data to two channels that compute over finite
fields. The rings are built using Theorem A.8.

For moduli of the form $4k + 1$, -1 is a quadratic residue, therefore the monic quadratic
$x^2 + 1 = 0$ has a solution in base ring $QR(m) = \{S: \oplus, \otimes\}$. If j is a solution to the
monic quadratic then j and its multiplicative inverse will belong to QR(m). Although
an extension field cannot be built based on a solution of the monic quadratic, an extension
ring can be generated. This ring is referred to as a quadratic ring. The extension element
can be written as $A_i^{Q} = (A_i^{0}, A_i^{*})$ where $A_i^{0} = a_i \oplus j \otimes b_i$ (normal) and
$A_i^{*} = a_i \ominus j \otimes b_i$ (conjugate), with $a_i$ and $b_i$ denoting the real and imaginary parts.


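The two binary operations of addition and multiplication, over the quadratic ring, are
computed as:

Addition:

$$A_i^{Q} \oplus B_i^{Q} = (A_i^{0} \oplus B_i^{0},\; A_i^{*} \oplus B_i^{*}) \tag{A.34}$$

Multiplication:

$$A_i^{Q} \otimes B_i^{Q} = (A_i^{0} \otimes B_i^{0},\; A_i^{*} \otimes B_i^{*}) \tag{A.35}$$

where the real and imaginary part of the product can be formed from the normal and
conjugate parts of the result, Q and Q* respectively, as:

$$\mathrm{Re} = 2^{-1} \otimes (Q \oplus Q^{*}), \qquad \mathrm{Im} = (2j)^{-1} \otimes (Q \ominus Q^{*}) \tag{A.36}$$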

This forms a commutative ring with identity. It should be noted that the ring is isomorphic
to the finite ring of Gaussian integers, which will be denoted as $R(m_i^2)$, and that both
arithmetic operations only involve two base field operations ($\oplus$, $\otimes$).

The concept can be extended to both special composite moduli and a system of quadratic
rings using a direct sum mapping. The isomorphism given in Eqn. (A.37) is used to allow
computations to be carried out in L parallel and smaller rings:

(A.37)

$$M = \prod_{i=1}^{L} m_i$$

The proof of the above can be found in [43].
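
The two-channel arithmetic of the QRNS can be sketched for a single modulus. The values below are assumptions chosen for illustration: the modulus 13 is of the form 4k + 1 and j = 5 satisfies $5^2 = 25 \equiv -1 \pmod{13}$. The function names are likewise hypothetical.

```python
M_MOD = 13          # example modulus of the form 4k + 1 (assumed)
J = 5               # 5 * 5 = 25 = -1 (mod 13)

def to_qrns(a, b):
    """Map a + jb to its normal and conjugate components."""
    return (a + J * b) % M_MOD, (a - J * b) % M_MOD

def qrns_mul(u, v):
    """Componentwise product: two base-ring multiplications."""
    return (u[0] * v[0]) % M_MOD, (u[1] * v[1]) % M_MOD

def from_qrns(q):
    """Recover (real, imaginary) parts from the normal and conjugate components."""
    inv2 = pow(2, -1, M_MOD)
    inv2j = pow(2 * J, -1, M_MOD)
    return (inv2 * (q[0] + q[1])) % M_MOD, (inv2j * (q[0] - q[1])) % M_MOD

product = qrns_mul(to_qrns(2, 3), to_qrns(4, 5))
print(from_qrns(product))     # (2 + 3j)(4 + 5j) = -7 + 22j -> (6, 9) mod 13
```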

A.7.2 Quadratic Like Residue Number System
The advantages that make the QRNS an attractive system for performing complex
arithmetic are that the complexity of complex multiplication is reduced from four real
multiplications to two real multiplications, and that real and imaginary data are mapped
into two independent channels. The limitation of this system is the restriction placed on
the moduli to be of the form 4k+1. Soderstrand [44] proposes a system that relaxes this
restriction, at the cost of reduced resolution. The underlying concept here is to find a
number in the RNS system that when squared yields a negative number. This residue
number system is termed the Quadratic-Like Residue Number System (QLRNS), and
retains the computational properties of QRNS. In the QRNS $j = \sqrt{-1}$, whereas in the
QLRNS $\sqrt{-a} = j\sqrt{a}$; hence the resolution is reduced by the length of the j vector. This is
an acceptable compromise; for example, in a 4-bit moduli set (16, 15, 13, 11), the
resolution is reduced from 15 bits to 12 bits in the imaginary term.

Complex number representation in the QLRNS proceeds using the following steps:

Step 1. Find integers m and n such that $x + jy \approx m + n j\sqrt{a}$, where the approximation
represents truncation or rounding to integer. Hence $m = x$ and $n = y / \sqrt{a}$.

Step 2. $j\sqrt{a}$ is now the real number $\hat{j}\sqrt{a}$ in QLRNS. Hence the complex number is
represented by a pair of RNS numbers formed like complex conjugates:

$$z = m + n\hat{j}\sqrt{a}$$
$$z^{*} = m - n\hat{j}\sqrt{a} \tag{A.38}$$

Step 3. A mapping f from the complex ring (C(M)) of complex RNS integers defined by
$m + nj\sqrt{a}$ to the QLRNS can be defined by Eqn. (A.38). This mapping is invertible¹. The
inverse mapping $f^{-1}$ is defined by:

$$m = 2^{-1}(z + z^{*})$$
$$n = (2\hat{j}\sqrt{a})^{-1}(z - z^{*}) \tag{A.39}$$

Observing the inverse mapping, the QLRNS can be further categorized into two
subdivisions, one in which the multiplicative inverses of 2 and $2\hat{j}\sqrt{a}$ exist, and one in
which the inverses do not exist. For the latter case, the inverse mapping must be
performed, for example, by using standard mixed radix conversion scaling techniques.

Complex arithmetic in the QLRNS is defined by the following. Given two complex
QLRNS numbers $(z_1, z_1^{*})$ and $(z_2, z_2^{*})$, addition, subtraction and multiplication are
defined as:

$$(z_1, z_1^{*}) \pm (z_2, z_2^{*}) = (z_1 \pm z_2,\; z_1^{*} \pm z_2^{*})$$
$$(z_1, z_1^{*}) \cdot (z_2, z_2^{*}) = (z_1 z_2,\; z_1^{*} z_2^{*}) \tag{A.40}$$

Eqn. (A.40) implies that complex multiplication in the QLRNS can be performed with
only two real multiplications.
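
The forward mapping, the component-wise product and the inverse mapping can be tried out in a single small modulus. The following is only an illustrative sketch: the modulus 11, the constant $\alpha = 2$, the residue 3 standing for $j\sqrt{\alpha}$ (since $3^2 \equiv -2 \pmod{11}$) and all function names are assumptions chosen for the example, not values from the text. It requires Python 3.8 or later for modular inverses via `pow`.

```python
# Sketch of QLRNS arithmetic in one modulus. M, ALPHA, J_HAT and the helper
# names are illustrative assumptions, not taken from the dissertation.

M = 11        # modulus of the form 4k+3, so -1 has no square root mod M
ALPHA = 2     # chosen so that -ALPHA is a square mod M
J_HAT = 3     # residue standing for j*sqrt(alpha): 3*3 = 9 = -2 (mod 11)

assert (J_HAT * J_HAT + ALPHA) % M == 0


def to_qlrns(m, n):
    """Forward mapping in the style of Eqn. (A.38): (m, n) -> (z, z*)."""
    return ((m + n * J_HAT) % M, (m - n * J_HAT) % M)


def from_qlrns(z, z_conj):
    """Inverse mapping in the style of Eqn. (A.39).

    Needs the multiplicative inverses of 2 and 2*J_HAT modulo M.
    """
    inv2 = pow(2, -1, M)
    inv2j = pow(2 * J_HAT, -1, M)
    return ((inv2 * (z + z_conj)) % M, (inv2j * (z - z_conj)) % M)


def qlrns_mul(p, q):
    """Component-wise product of Eqn. (A.40): two real multiplications."""
    return ((p[0] * q[0]) % M, (p[1] * q[1]) % M)


def direct_mul(m1, n1, m2, n2):
    """Reference product of (m1 + n1*s)(m2 + n2*s) with s*s = -ALPHA."""
    return ((m1 * m2 - ALPHA * n1 * n2) % M, (m1 * n2 + n1 * m2) % M)


product = from_qlrns(*qlrns_mul(to_qlrns(1, 2), to_qlrns(3, 1)))
```

Here `product` should agree with `direct_mul(1, 2, 3, 1)`, which needs four multiplications instead of two.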

As in the case of the QRNS, this concept can be extended to both composite moduli and a
system of quadratic rings using the direct sum mapping.

1. The mapping from C(M) to QLRNS is an isomorphism. Proof of this can be found in [44]

A.7.3 Modified Quadratic Residue Number System
In order to relax the restriction that exists for QRNS, namely that moduli be of the form of
4k+ 1, a new number system is introduced. In this system the restriction on the form of the
modulus is removed at the cost of increasing the number of real multiplications involved
in a complex multiplication, from two to three. This method has been referred to as the
Modified Quadratic Residue Number System (MQRNS) [45][46][43][47]. This method,
unlike the QLRNS method, does not result in a reduction in the dynamic range.

For moduli other than of the form $m = 4k + 1$, the monic equation $x^2 + 1 = 0$ is irreducible in R(m). Therefore the QRNS method cannot be employed. In order to relax this restriction, the monic equation is generalized so that a solution other than $\sqrt{-1}$ exists over R(m). An extension ring, MQR(m), is defined as:

$MQR(m) = \{A^{(MQ)} : +, \cdot\}$    (A.41)

The elements of this ring are defined by:

$A^{(MQ)} = (A, A^*)$    (A.42)

where $A = |a + \hat{j}b|_m$ and $A^* = |a - \hat{j}b|_m$, with $a, b \in R(m)$ and $A, A^* \in R(m)$, and with $\hat{j}$ as a solution to the monic quadratic $x^2 - n = 0$.

The binary operations modulo m are calculated in the following manner:

Addition:

$A^{(MQ)} + B^{(MQ)} = (A + B,\ A^* + B^*)$    (A.43)

Multiplication:

$A^{(MQ)} \cdot B^{(MQ)} = \{(A \cdot B) - S,\ (A^* \cdot B^*) - S\}$    (A.44)

where $S = (\hat{j}^2 \oplus_m 1) \otimes_m b \otimes_m d$; b and d are the imaginary parts of the complex samples.
Since $|\hat{j}^2|_m \neq -1$, computation of the real and imaginary parts of the product cannot be formed from the normal and conjugate terms in the same manner as Eqn. (A.36). Thus an alteration of the real component of the complex multiplication is required. In order to correct this value, S has to be calculated; this results in an extra multiplication over the QRNS method and also involves a cross connection between the 'normal' and 'conjugate' computations.

The real and imaginary parts of the product can be formed as:

$Y_{R_i} = (2^{-1} \otimes_{m_i} (Q_i \oplus_{m_i} Q_i^*)) \oplus_{m_i} (-S_i)$
$Y_{I_i} = 2^{-1} \otimes_{m_i} \hat{j}^{-1} \otimes_{m_i} (Q_i \oplus_{m_i} (-Q_i^*))$    (A.45)

where $Q_i = |A_i \cdot B_i|$ and $Q_i^* = |A_i^* \cdot B_i^*|$. If $S_i$ is subtracted directly from $Q_i$ and $Q_i^*$, then the real and imaginary parts of the complex product can be computed from Eqn. (A.36).
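
A small numerical check makes the role of the correction term S concrete. The sketch below is illustrative only: the modulus 7, the root 3 of $x^2 - 2 = 0$ modulo 7, and the function names are assumptions made for this example. The three general multiplications are marked in the comments; scaling by the fixed constants is not counted.

```python
# Sketch of an MQRNS complex multiplication in one modulus. M, J_HAT and the
# helper names are illustrative assumptions, not taken from the dissertation.

M = 7        # modulus of the form 4k+3, where x^2 + 1 = 0 has no solution
J_HAT = 3    # assumed root of x^2 - n = 0 with n = 2, since 3*3 = 9 = 2 (mod 7)


def to_mqrns(a, b):
    """Map a + jb to the pair (A, A*) = (|a + J_HAT*b|, |a - J_HAT*b|)."""
    return ((a + J_HAT * b) % M, (a - J_HAT * b) % M)


def mqrns_mul(a, b, c, d):
    """Return (real, imag) of (a + jb)(c + jd) modulo M."""
    big_a, big_a_conj = to_mqrns(a, b)
    big_b, big_b_conj = to_mqrns(c, d)
    q = (big_a * big_b) % M                    # multiplication 1
    q_conj = (big_a_conj * big_b_conj) % M     # multiplication 2
    s = ((J_HAT * J_HAT + 1) * (b * d)) % M    # multiplication 3: b*d, then constant scaling
    inv2 = pow(2, -1, M)
    inv_j = pow(J_HAT, -1, M)
    real = (inv2 * (q + q_conj) - s) % M
    imag = (inv2 * inv_j * (q - q_conj)) % M
    return real, imag


def direct_mul(a, b, c, d):
    """Reference complex product with j*j = -1, reduced modulo M."""
    return ((a * c - b * d) % M, (a * d + b * c) % M)


result = mqrns_mul(2, 3, 4, 5)
```

For these inputs the ordinary complex product is $-7 + 22j$, so `result` should equal `direct_mul(2, 3, 4, 5)`.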

Once again, using the isomorphism in Eqn. (A.46), it can be shown that computations over R(M) can be performed in L parallel rings.

$M = \prod_{i=1}^{L} m_i$    (A.46)

Several works have been published which demonstrate the use of MQRNS and QRNS in
filter realizations [43][49].

A.8 Summary

Discrete mathematics provides the necessary tools to construct suitable number systems
for specific algorithms. Mapping numbers from one system to another may result in a
more favourable computational environment even though the mapping will add some
overhead to the total hardware construction. Traditional RNS and its variants have been
discussed in detail for their use in implementing computations for DSP applications.

Appendix B
NFET Chain Sizing

B.1 Introduction

Switching circuits can be analyzed using standard circuit simulators such as SPICE [65]
and ASTAP [73]; such circuit simulators require a great amount of CPU time and memory
storage particularly if they are used in the iteration loop of design algorithms. To reduce
the complexities of the analysis and the models, a number of techniques are available; for
example, using tables instead of transistor equations [58] or using macromodels [63].

One of the most successful approximate modeling techniques that has emerged in recent
years is the RC tree model [62][67] based on Elmore's delay formula [60]. This model
forms the computational basis of many switch level simulators such as TMODS [71] and
CRYSTAL [66].

Recently, RC model based delay simulators have been used in circuit speed optimization
for dynamic logic blocks; for example SLOP [75]. Circuit sizing techniques for such
families are powerful since it has been shown that 10% of the delay and 30% of the gate
area can be decreased simultaneously [69]. It appears that all sizing techniques, so far
published, rely on some type of iterative optimization procedure rather than an analytical
formulation to the sizing problem. It is clear that an analytical approach to circuit sizing
will both save substantial amounts of CPU time and, more importantly, provide an
algebraic foundation on which to base automated design procedures.


A rule of thumb that has evolved in the design of dynamic circuits is to limit the height of
the logic block in order to maintain a minimum performance level; however, as the feature
size enters the submicron region, this rule can be relaxed [68] and higher logic chains are
possible. This encourages more extensive use of NAND/NOR complex gates and makes
the transistor sizing more important and effective. In such circumstances, the use of
analytical techniques in transistor sizing can reduce the very large computational time that
is required for iterative techniques.

B.2 Delay Model

B.2.1 Discharge Delay of an nFET Chain

Figure B.1 Single CMOS Dynamic Gate Chain
[Figure: precharge pFET to VDD; series chain NFET 2, NFET 1 (inputs IN 2, IN 1) and ground switch NFET 0 (clock PHI) to GND]

Consider the dynamic gate structure shown in Figure B.1. The nFET chain consists of N + 1 serially connected transistors from the evaluation node to the ground. The bottom nFET is the ground switch while the top pFET is the precharge transistor. $C_L$ models the load capacitance connected to the evaluation node. $IN_1, IN_2, \ldots, IN_N$ is the input vector and CK is the gate clock (labelled PHI in the figures) connected to both the ground switch and the precharge pFET. This dynamic structure is general to most dynamic gates with only slight modifications. The problem becomes too complex if the dependence of the discharge delay on the rate of change of the input voltage is included in the analysis; therefore we assume that the gate input voltages switch instantly.

Figure B.2 Dynamic Gate Timing Diagram
[Figure: clock PHI waveform showing the precharge cycle and the evaluation cycle]

The clock has two distinct cycles in which the gate operates: the precharge cycle and the
evaluation cycle, as shown in Figure B.2.

Precharge cycle
In this cycle, the clock level is low and the ground switch nFET$_0$ is off, isolating the circuit from ground. The precharge pFET is on and the evaluation node is pulled up to VDD. Now, let us consider the worst case precharge condition where all inputs to the nFET chain are high (we assume a voltage of VDD in our analysis). If the precharge cycle is kept on for a relatively long period of time, the top transistor, nFET$_N$, in the chain will enter the cut-off region leaving $V_S = V_D - V_{THN} = VDD - V_{THN}$. Here, $V_{THN}$ is the threshold voltage of nFET$_N$ including the back-bias effect. Transistors nFET$_1$ to nFET$_{N-1}$ will continue to conduct until $V_{DS} = 0$ and all internal nodes of the chain will be charged up to $VDD - V_{THN}$. The state of the dynamic gate after this long precharge period is shown in Figure B.3.

Figure B.3 Long Precharge State
[Figure: PHI = GND; precharge pFET on; NFET 2 and NFET 1 on; ground switch NFET 0 off; evaluation node E = VDD; internal nodes at VDD - V_THN]

Evaluation cycle
After the input vector stabilizes, the clock level switches high and the precharge pFET is
turned off. The ground switch nFET$_0$ turns on and the evaluation node discharges low or
remains high depending on the state of the input vector. The worst case evaluation
condition is obtained when the evaluation node has to discharge low and the input vector is
high. In this case, the top transistor, nFETN, pulls out of the cutoff region to the
saturation region and then to the linear region. All other transistors in the chain stay in the
linear region throughout the discharge period including the ground switch. Figure B.4
shows the state of the dynamic gate after worst case evaluation condition.

Figure B.4 Worst Case Discharge State
[Figure: IN 1 = IN 2 = ... = IN N = VDD; PHI = VDD; evaluation node E = LOW; transistors marked S (saturation region) or L (linear region)]

It is worth mentioning that it is the discharge delay which limits the clocking speed of the
gate and any improvement to the discharge delay greatly enhances the logic gate
performance. There are two dominant factors, working against each other, that directly
affect the discharge delay. The first is the evaluation node capacitance, which consists of $C_L$ and the pFET drain capacitance. The second factor is the size of the nFET chain. As the size of an nFET transistor in the chain increases, the current driving capability also
increases which tends to decrease the delay; however, the parasitic capacitances associated
with the nFET increase and this tends to increase the delay. In the next section, we deal
with the problem of predicting the discharge delay for given sizes of the nFET chain [69],
using these dominant factors.

B.2.2 RC Model
In order to analyze the delay characteristics of an nFET chain, the chain must be

represented by a simple and manageable circuit model that contains the essential physical
mechanisms. In this work we use the established RC model presented in [67] [62] since it
predicts the discharge delay in terms of circuit parameters not device parameters [74].

Figure B.5 RC Model Construction
[Figure: (a) Dynamic Gate Chain, with precharge pFET, inputs IN 1 to IN N, clock PHI and load $C_L$ at node E; (b) RC Model, with series resistances $R_0$ to $R_N$ and node capacitances to GND]

The former method is useful in developing algorithms for gate-level delay simulators, but the present objective is to approximate the nFET chain pulldown delay. Each transistor in the nFET chain is replaced by a series resistance, $R_i$, and a parallel parasitic capacitance, $C_i$. The loading effect of the precharge pFET is added to $C_L$. Figure B.5 shows the process of constructing the RC model.

$C_i$ consists of the following contributions:

1. Drain-diffused island capacitance of nFET$_i$.
2. Source-diffused island capacitance of nFET$_{i+1}$.
3. 1/2 the gate-to-channel capacitance of nFET$_i$.
4. Gate-to-drain overlap capacitance of nFET$_i$.
5. 1/2 the channel-to-substrate capacitance of nFET$_i$.
6. 1/2 the gate-to-channel capacitance of nFET$_{i+1}$.
7. Gate-to-drain overlap capacitance of nFET$_{i+1}$.
8. 1/2 the channel-to-substrate capacitance of nFET$_{i+1}$.

Normally, the above parasitic capacitances are lumped to node $N_i$ [70] and thus overestimate the actual delay of the circuit [66]. $R_i$ models the average channel resistance during discharge and depends on the voltage of the nodes $N_{i-1}$, $N_i$, and $IN_i$. We assume that the inputs switch instantly and $V_{IN_i} = VDD$; therefore, the current, $I_i$, that flows through nFET$_i$, and $R_i$ are ideally related by Eqn. (B.1)

$R_i = \frac{1}{T_D} \int_0^{T_D} \frac{V_{N_i} - V_{N_{i-1}}}{I_i}\, dt$    (B.1)

Now, for a given technology, it is possible to calculate Ci and Ri, as will be shown
shortly.

B.2.3 Elmore Delay Formula
We define the discharge delay as the time required for the evaluation node to drop to 36% of its original value VDD; this is approximated by Elmore's delay formula [60]

$T_D = \sum_{i=0}^{N} R_i \sum_{j=i}^{N} C_j = \sum_{i=0}^{N} C_i \sum_{j=0}^{i} R_j$    (B.2)

In order to take account of the $C_L$ load, from Figure 3.1 (b), we re-write Eqn. (B.2) as:

(B.3)

It is worth noting that Eqn. (B.3) implicitly assumes that all internal node voltages of the RC chain are initially equal to the evaluation node voltage. This assumption, however, is not true for the nFET discharge chain and we expect some extra error due to this. Another source of error is in the approximation of $R_i$, where the channel resistance is not linear and in some cases is difficult to linearize, as in the case of the saturation region.

With appropriate linear definitions for the non-linear channel resistances and parasitic
capacitances, the approximate RC model produces satisfactory results for delay
approximation [62] and, more importantly, gives excellent results in sizing nFET chains
[69]. This will also be demonstrated later in this chapter.
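
The two equivalent forms of the Elmore sum in Eqn. (B.2) can be checked numerically. This is a minimal sketch, assuming that index 0 is the ground switch and index N the evaluation node; the resistor and capacitor values are arbitrary illustrative numbers, not technology data from this work.

```python
def elmore_by_resistor(r, c):
    """Sum over resistors: each R_i discharges every node capacitance at or above it."""
    return sum(r[i] * sum(c[i:]) for i in range(len(r)))


def elmore_by_node(r, c):
    """Sum over nodes: each C_i sees the total resistance between it and ground."""
    return sum(c[i] * sum(r[: i + 1]) for i in range(len(c)))


# Arbitrary illustrative values (ohms and farads), index 0 nearest to ground.
R = [1.0e3, 2.0e3, 3.0e3]
C = [10e-15, 20e-15, 50e-15]

delay_a = elmore_by_resistor(R, C)
delay_b = elmore_by_node(R, C)
```

Both forms give the same value (370 ps for these numbers), which is why either ordering of the double sum can be used when manipulating the delay expression algebraically.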

B.3 R and C Approximations

B.3.1 Parasitic Capacitance Approximation
The capacitances involved in the RC model are relatively easy to calculate by suppressing
the non-linearity with the assumption of fixed voltages. The errors due to these
assumptions can be taken care of by scaling the resulting delay in accordance with SPICE
simulations. Figure B.6 shows the capacitances considered in the RC model representation
of an nFET (likewise for a pFET).

There are two things worth mentioning concerning Figure B.6 (b). The first is that the channel capacitance, $C_{CHAN}$, which is due to the channel potential with respect to the bulk voltage, has little effect on the total delay of the model, and therefore can be neglected. The second is that the gate-to-channel capacitance, $C_G$, is distributed over the channel resistance and, for simplicity, we assume that it is lumped between $C_{D_i}$ and $C_{S_i}$.

Figure B.6 CMOS Transistor Capacitance Model
[Figure: (a) transistor symbol with terminals D, G, S, B; (b) transistor with capacitances $C_{GDO}$, $C_{GSO}$, $C_{DB}$, $C_{SB}$; (c) lumped model with resistance $R_i$ and capacitance $C_{S_i}$]

Furthermore, in the RC model, a step input is assumed and hence the gate voltage is
constant and equal to VDD. Therefore, all capacitances connected to the gate voltage,
VG, will be replaced by a ground terminal (virtual ground) without affecting the behavior

of the capacitance.

Figure B.7 Typical Transistor Layout
[Figure: transistor layout showing dimensions L and DH; legend: diffusion, gate, contact]

In order to appropriately define the capacitances in the model we use the transistor layout shown in Figure B.7. Note that DH is the diffusion height, which is not the same as HDIF commonly used in SPICE to automatically compute the drain/source diffusion area.

The drain and source island area and perimeter are given by:

$AD = AS = W \times DH$    (B.4)

$PD = PS = 2W + 2DH$    (B.5)

Other layout techniques, such as diffusion node sharing to reduce area, and other
technology rules can be easily incorporated. Now we consider the capacitances shown in
Figure B.6(b) and their evaluations according to SPICE models [57][72].

1. Gate-to-channel capacitance
$C_G = COX \times gatearea + CEDGE \times gateperimeter$    (B.6)

2. Gate-to-drain overlap capacitance
$C_{GDO} = CGDO \times W$    (B.7)

3. Drain-to-substrate capacitance
$C_{BD} = CJ \times AD \left[1 - \frac{V_{BD}}{PB}\right]^{-MJ} + CJSW \times PD \left[1 - \frac{V_{BD}}{PB}\right]^{-MJSW}$    (B.8)

Assuming $V_{BD} = 0$ to remove the capacitance non-linearity, we get

$C_{BD} = CJ \times AD + CJSW \times PD$    (B.9)

4. Gate-to-source overlap capacitance
$C_{GSO} = CGSO \times W$    (B.10)

5. Source-to-substrate capacitance as in Eqn. (B.9)
$C_{BS} = CJ \times AS + CJSW \times PS$    (B.11)

The capacitances in Figure B.6 (c) are computed as:

$C_{D_i} = C_{GDO} + C_{BD} + \frac{C_G}{2}$    (B.12)

$C_{S_i} = C_{GSO} + C_{BS} + \frac{C_G}{2}$    (B.13)
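
The capacitance bookkeeping of Eqns. (B.4) to (B.13) can be collected into a short routine. The sketch below is illustrative only: the function names are invented for this example, and taking the gate area as W x L and the gate perimeter as 2(W + L) is an assumption, since the text does not spell out those two quantities.

```python
def island_geometry(w, dh):
    """Drain/source island area and perimeter, Eqns. (B.4) and (B.5)."""
    return w * dh, 2 * w + 2 * dh


def lumped_caps(w, l, dh, cox, cedge, cgdo, cgso, cj, cjsw):
    """Return (C_D, C_S) for one transistor with zero junction bias assumed."""
    area, perimeter = island_geometry(w, dh)
    # Assumption: gate area = W*L and gate perimeter = 2*(W+L).
    c_g = cox * (w * l) + cedge * 2 * (w + l)      # Eqn. (B.6)
    c_gdo = cgdo * w                                # Eqn. (B.7)
    c_bd = cj * area + cjsw * perimeter             # Eqn. (B.9)
    c_gso = cgso * w                                # Eqn. (B.10)
    c_bs = cj * area + cjsw * perimeter             # Eqn. (B.11)
    c_d = c_gdo + c_bd + c_g / 2                    # Eqn. (B.12)
    c_s = c_gso + c_bs + c_g / 2                    # Eqn. (B.13)
    return c_d, c_s


# Unit-free illustrative numbers, chosen only to make the arithmetic easy to follow.
example_geometry = island_geometry(2, 3)
example_caps = lumped_caps(w=2, l=1, dh=3, cox=1, cedge=0,
                           cgdo=1, cgso=1, cj=1, cjsw=1)
```

Because every term is linear in W for a fixed L and DH, each lumped capacitance takes the form of a constant plus a per unit width coefficient times W, which is the form the sizing analysis relies on.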

B.3.2 Channel Resistance Approximation
We model the non-linear behavior of the MOSFET channel with a linear resistor. There are two distinct approaches to approximating the channel resistance cited in the literature. The first approach is to simulate a test circuit using SPICE and then extract the average discharge resistance from the data obtained [76]. The second approach is to use SPICE model equations to calculate the average channel resistance [69]. Here we combine both approaches, which results in a smaller area-delay product for the nFET chain.

As explained in Section B.2, the transistors in the discharge chain are either in the linear or
saturation region. The top transistor in the chain mainly operates in the saturation region
while the others mainly operate in the linear region. We will therefore consider
approximations for these two different operating regions.

Channel Resistance Approximation for Linear Region nFETs
We model transistors in the linear region with a minimum channel resistance using the SPICE MOSFET level-2 model, as shown in Eqn. (B.14);

where,

$\beta = KP \cdot \frac{W}{L - 2X_{jl}}$    (B.15)

then,

$R_{DS} = \frac{1}{\beta\,[V_{GS} - V_{BIN} - \eta V_{DS} - \gamma_s (2\phi_F + V_{DS} - V_{BS})^{1/2}]}$    (B.16)

The minimum channel resistance is found for $V_{DS} = 0$, and this value can be used for the RC model. Hence:

$R_{DS} = \frac{1}{\beta\,[V_{GS} - V_{BIN} - \gamma_s (2\phi_F - V_{BS})^{1/2}]}$    (B.17)

It is convenient to express the resistance with W as the single independent variable, thus:

$R_{DS} = \frac{PURL}{W}$    (B.18)

where:

$PURL = \frac{L - 2X_{jl}}{KP\,[V_{GS} - V_{BIN} - \gamma_s (2\phi_F - V_{BS})^{1/2}]}$    (B.19)
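
As a rough numerical illustration of the per unit width resistance, the sketch below evaluates PURL and the resulting channel resistance for a given width. It assumes that the effective channel length $L - 2X_{jl}$ from the definition of $\beta$ carries through to the numerator of PURL; the function names and the parameter values are hypothetical and are not taken from any technology file.

```python
import math


def purl_linear(kp, l, xjl, vgs, vbin, gamma_s, phi_f, vbs):
    """Per unit width linear-region resistance at V_DS = 0.

    Assumption: the effective length (l - 2*xjl) appears in the numerator,
    consistent with beta = KP * W / (L - 2*Xjl).
    """
    drive = vgs - vbin - gamma_s * math.sqrt(2 * phi_f - vbs)
    if drive <= 0:
        raise ValueError("transistor is not conducting for these voltages")
    return (l - 2 * xjl) / (kp * drive)


def channel_resistance(purl, w):
    """R_DS = PURL / W, Eqn. (B.18)."""
    return purl / w


# Hypothetical, unit-free numbers chosen so the result is easy to verify by hand.
example_purl = purl_linear(kp=1.0, l=3.0, xjl=0.5, vgs=5.0,
                           vbin=1.0, gamma_s=1.0, phi_f=0.5, vbs=0.0)
example_r = channel_resistance(example_purl, w=2.0)
```

With these numbers the drive term is 3 and the effective length is 2, so PURL is 2/3 and a transistor of width 2 has a resistance of 1/3.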

Channel Resistance Approximation for Saturation Region nFET
An improved approach to approximating PURS that eliminates the use of an optimization
algorithm [55] is presented. This approach guarantees close to optimum sizing results.

1. Using a test transistor chain as shown in Figure B.8, consisting of
the top nFET and the ground switch, we set the latter to, for example, ten times the minimum width for the technology. This value is
somewhat arbitrary; we simply need a value that reflects approximate widths for typical optimum sizing results.

Figure B.8 Test Circuit
[Figure: test chain between VDD and GND, clocked by PHI]

2. Using SPICE simulations, select the width of the top nFET, $W_{opt}$, that provides minimum delay. Typical SPICE results using the above test circuit are shown in Figure B.9.
3. From Section B.2.2 and Section B.3.1, $C_i$ and $C_N$ can be written as follows,
(B.20)
(B.21)
where $K_2$, $K_4$, and $K_5$ are per unit capacitance coefficients and $K_1$ and $K_3$ are the constant capacitance contributions. Note that $K_1$ includes the load capacitance $C_L$. Substituting Eqn. (B.20) and Eqn. (B.21) into the Elmore delay formula for the test circuit, Eqn. (B.3), we get

(B.22)

where PURS is the per unit resistance of the top nFET. Setting $\partial T_D / \partial W_1 = 0$, we equate the resulting expression for $W_1$ to $W_{opt}$ and obtain the ratio:

$R = \frac{PURS}{PURL}$    (B.23)

As an example, for a 3µ-DLM technology from Northern Telecom [59], $R(W_0 = 30µ) = 1.57$ and $R(W_0 = 60µ) = 1.94$; we show, however, in the following section, that the delay is a weak function of this ratio.

Figure B.9 Delay VS Width for the Test Circuit
[Figure: discharge delay in nanoseconds (0.42 to 0.54) against gate width in microns (0 to 50), with the minimum delay point marked]

B.4 Simple nFET Chain Sizing

B.4.1 Typical Optimization Approach
In this section we will discuss optimization techniques normally used to size nFET
discharge chains. We will use the results of this technique in a comparison study with the
analytical approach discussed in the next section. These comparison results are presented
in Section B.5.

The RC model is used to define the cost function, discharge delay, which is to be minimized. The delay is given by Eqn. (B.3). Usually nFET sizes are only given by the width of the channel; the length of the channel is kept constant at the minimum feature size offered by the given technology. The parasitic capacitance and resistance approximations are calculated according to Section B.3.1 and Section B.3.2.

In this approach, an area constraint is used and assumed to be proportional to the sum of the nFET chain widths. The algorithm given below is similar to several cited in the literature, e.g. SLOP [75], TILOS and ADVICE [61]:

1. Select transistor width step size, $\Delta W$, and set all nFETs to minimum widths.
2. Calculate the delay sensitivity, S, of all transistors
(B.24)
3. Replace $W_i$ by $W_i + \Delta W$ only for the transistor with the largest sensitivity.
4. Repeat 2 and 3 until the user specified area limit is achieved.

A major drawback of the above algorithm is the requirement of an initial width vector of minimum transistor sizes, and thus the algorithm tends to require many iterations, even if only moderate size logic gates are considered. In fact, for the purpose of comparisons, any high level programming language with built in optimization algorithms can be used, such as Matlab and Extend.
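
The four steps above can be sketched in a few lines. This is a hedged illustration rather than the procedure of any of the cited tools: the sensitivity is taken here as the finite-difference reduction in Elmore delay for one width step, the node capacitances are assumed to be linear in the adjacent widths (top node K1 + K2·W_N, inner nodes K3 + K4·W_i + K5·W_{i+1}), and all names and numbers are hypothetical.

```python
def chain_delay(w, purl, purs, k):
    """Elmore delay of a chain; w[0] is the ground switch, w[-1] the top nFET."""
    k1, k2, k3, k4, k5 = k
    n = len(w) - 1
    # Assumed linear capacitance model for inner nodes and for the top node.
    caps = [k3 + k4 * w[i] + k5 * w[i + 1] for i in range(n)] + [k1 + k2 * w[n]]
    res = [purl / w[i] for i in range(n)] + [purs / w[n]]
    return sum(res[i] * sum(caps[i:]) for i in range(n + 1))


def greedy_size(n_fets, w_min, dw, area_limit, purl, purs, k):
    """Widen one transistor per iteration, always the one that helps most."""
    w = [w_min] * n_fets                        # step 1: minimum widths
    while sum(w) + dw <= area_limit:            # step 4: stop at the area limit
        base = chain_delay(w, purl, purs, k)
        gains = []
        for i in range(n_fets):                 # step 2: sensitivity of each FET
            trial = list(w)
            trial[i] += dw
            gains.append(base - chain_delay(trial, purl, purs, k))
        best = max(range(n_fets), key=gains.__getitem__)
        if gains[best] <= 0:                    # no further improvement possible
            break
        w[best] += dw                           # step 3: widen the most sensitive FET
    return w


# Hypothetical, unit-free parameters: (K1, K2, K3, K4, K5), PURL = 1.0, PURS = 1.5.
K = (10.0, 0.5, 1.0, 0.3, 0.3)
initial = [3.0] * 4
sized = greedy_size(4, 3.0, 1.0, 60.0, 1.0, 1.5, K)
```

Each pass evaluates the delay once per transistor, so the cost grows with both the chain height and the number of width steps needed to reach the area limit, which is the iteration burden noted above.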

B.4.2 Analytical Approach to nFET Chain Sizing
The discharge delay of the evaluation node E is given by Elmore [60] and can be expressed as a sum of individual delay terms

$T_D = \sum_{i=0}^{N} R_i \sum_{j=i}^{N} C_j = \sum_{i=0}^{N} T_{D_i}$    (B.25)

From extensive simulations for a variety of technologies, we have found that a close to optimum transistor size distribution is achieved when all of the delay terms are set equal. This appears to be a new observation and is the cornerstone of our ability to produce an approximate analytical solution to the optimum sizing problem.

To achieve a close to optimum transistor sizing profile of an nFET chain, the delay terms are set equal and given by Eqn. (B.26)

(B.26)
Let the number of transistors in the chain be N+1 and the discharge delay be $T_D$. From Eqn. (B.25) and Eqn. (B.26), we obtain:

$R_i \sum_{j=i}^{N} C_j = \frac{T_D}{N+1}$    (B.27)

Any node capacitance in the equivalent circuit for the chain is dependent on the two transistors connected to that node. In the case of the top node, however, the capacitance is only dependent on the width of the top transistor (we assume the precharge transistor is a known fixed size and the load capacitance, $C_L$, is known). We can therefore analytically solve the problem by back substitution, starting from the top node. Using the capacitance model in SPICE we obtain Eqn. (B.28), where $W_N$ is the width of the top transistor, $K_1$ is the constant capacitance contribution, and $K_2$ is the per unit width drain/gate capacitance of the top transistor.

(B.28)
The equivalent channel resistors are defined, in the usual manner, by per unit parameters.
Since the top nFET mainly discharges in the saturation region, we define its channel
resistance, RN, as shown in Eqn. (B.29).

$R_N = \frac{PURS}{W_N}$    (B.29)

Combining Eqn. (B.28), Eqn. (B.29) and solving for WN, we obtain:

(B.30)

The lower transistors in the chain mainly discharge in the linear region; the channel resistance, $R_i$, is given in Eqn. (B.31).

$R_i = \frac{PURL}{W_i}$    (B.31)

Since any node capacitance in the chain has the form:

(B.32)
where $K_3$ is the constant capacitance contribution of $W_i$ and $W_{i+1}$, and $K_4$ and $K_5$ are the per unit width capacitance terms, we therefore obtain:

(B.33)

where $i = N-1, N-2, \ldots, 0$. SPICE simulations and/or MOSFET transistor equations can be used to generate parameters PURS, PURL, $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$ (see Section B.3).
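
The back substitution can be sketched numerically. The following is a hedged illustration, not the dissertation's code: it assumes the top node capacitance is $K_1 + K_2 W_N$ and each lower node capacitance is $K_3 + K_4 W_i + K_5 W_{i+1}$ (the pairing of $K_4$ and $K_5$ with the two widths is a guess), and it sets every delay term to $\tau = T_D/(N+1)$. Under those assumptions the top width follows from $(PURS/W_N)(K_1 + K_2 W_N) = \tau$ and each lower width from the same condition with the already known capacitances above it. All names and numbers are hypothetical.

```python
def analytical_size(n_fets, t_d, purl, purs, k1, k2, k3, k4, k5):
    """Widths w[0..N] (w[0] = ground switch) giving equal Elmore delay terms."""
    n = n_fets - 1
    tau = t_d / n_fets                        # target value of every delay term
    if tau <= purs * k2 or tau <= purl * k4:
        raise ValueError("target delay too small for these parameters")
    w = [0.0] * n_fets
    # Top transistor: (purs / W_N) * (k1 + k2 * W_N) = tau
    w[n] = purs * k1 / (tau - purs * k2)
    caps_above = k1 + k2 * w[n]               # sum of C_j for j > i
    for i in range(n - 1, -1, -1):
        # (purl / W_i) * (k3 + k4*W_i + k5*W_{i+1} + caps_above) = tau
        w[i] = purl * (k3 + k5 * w[i + 1] + caps_above) / (tau - purl * k4)
        caps_above += k3 + k4 * w[i] + k5 * w[i + 1]
    return w


def delay_terms(w, purl, purs, k1, k2, k3, k4, k5):
    """Individual Elmore terms R_i * sum_{j>=i} C_j for a given width profile."""
    n = len(w) - 1
    caps = [k3 + k4 * w[i] + k5 * w[i + 1] for i in range(n)] + [k1 + k2 * w[n]]
    res = [purl / w[i] for i in range(n)] + [purs / w[n]]
    return [res[i] * sum(caps[i:]) for i in range(n + 1)]


# Hypothetical, unit-free parameters for a chain of four transistors.
PARAMS = dict(purl=1.0, purs=1.5, k1=10.0, k2=0.5, k3=1.0, k4=0.3, k5=0.3)
widths = analytical_size(4, 8.0, **PARAMS)
terms = delay_terms(widths, **PARAMS)
```

For these numbers each entry of `terms` comes out at 2.0 and the terms sum to the requested delay of 8.0; the widths are obtained in a single pass from the top of the chain down, with no iteration.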

B.5 Results and Discussions

In order to evaluate the technique we show two different comparisons with a typical
numerical optimization procedure. The first result compares delay/area curves obtained by
minimizing discharge delay subject to an area constraint. The second result compares a
transistor width profile for a fixed discharge delay. Although the results presented here are
for our target 3µ technology, our technique will work for any technology where the RC
model is a good approximation.

Figure B.10 Delay VS Area for Single Chain
[Figure: discharge delay in nanoseconds (1.0 to 4.5) against chain area in microns (100 to 900); curves: Optimization R=1.57, Analytical R=1.57, SPICE R=1.57, SPICE R=1.94, SPICE R=1, SPICE +5 micron, SPICE -5 micron]

Our test circuit is shown in Figure B.8 with N=6, $C_L$ replaced by a minimum size static inverter, and the precharge pFET set to a width of 10µ. Figure B.10 shows delay/area curves for both our analytical and a standard numerical sizing technique. The lower plots show delays obtained directly from the RC model; the upper plots give delays generated by SPICE level 2 simulations. It is well known that the RC model underestimates the actual delay value; however, this does not affect the optimum size profiles obtained. The following points are important:
1. The results obtained from the analytical technique are almost identical to those from the numerical technique.
2. The delay/area curve is a weak function of the ratio, R; however, simply choosing a ratio of unity [75] produces a sub-optimal solution compared to R in the range 1.5 to 2.0.
3. When we perturb the size profile by only ±5µ (for each transistor), the delays obtained from SPICE simulations increase. Since this test is independent of the RC model used to generate the profile, this provides a reasonable confidence level that the technique produces a close to optimum sizing distribution.

Figure B.11 Single nFET Chain Sizing Profile
[Figure: width of each nFET in the chain, NFET width in microns (0 to 80); series: Iterative Optimization, Analytical Technique]

Figure B.11 shows nFET sizing profiles obtained by both our analytical and a standard
iterative optimization technique for a chain gate area requirement of 600µ².

B.6 Summary

This Appendix has discussed, in detail, an analytical technique for optimizing size profiles
of MOSFET transistor chains in terms of discharge delay. The technique uses an
empirically determined condition on the time constants of the classical RC delay model,
together with a technology parameter based on SPICE simulations. Results demonstrate
that the analytical technique is comparable to classical numerical optimization methods.
The ability to express the sizing problem algebraically has important ramifications in the
development of complex nFET logic block design tools.


Appendix C
Transistor Sizing Software

C.1 Introduction

This Appendix provides source code for the transistor sizing software used for sizing the
various computational blocks used in the main body of this dissertation. The code is
divided into two major parts named technology block and topology block. As the name
indicates the technology block processes the fabrication technology parameters under
consideration and prepares all intermediate parameters needed for the sizing algorithm.
The topology block reads in the circuit connectivity structure and user defined sizing
criteria and then forms all nFET paths. The sizing kernel uses the analytical formula to
size individual paths and satisfy all conditions set forth. The output sizes are given in the
same format as the input for convenient inspection and analysis. The code is written in the
MODL language (with similarities to both 'C' and Pascal) and is run in interactive mode
by the Extend™ simulation engine.

C.2 Code Listing

C.2.1 Technology Block
// Declare constants and static variables here.
real
lminn,ldn,vgsn,vbsn,vton,eppsoxn,toxn,coxn,un,kpn,cedgen,cgdson
,cjn,cjswn,wminn,adsan,bdsan,adspn,bdspn,rration,purln,pursn,acdsn
,bcdsn,acgn,bcgn,leffn,dbcn,rshn,rcn
,lminp,ldp,vgsp,vbsp,vtop,eppsoxp,toxp,coxp,up,kpp,cedgep,cgdsop
,cjp,cjswp,wminp,adsap,bdsap,adspp,bdspp,rratiop,purlp,pursp,acdsp
,bcdsp,acgp,bcgp,leffp,dbcp,rshp,rcp;
integer i,j,nodatan,nodatap;

Procedure treatdata()
{
    /////// Initialize dialog items ////////
    // Columns 0 and 2 hold NFET and PFET values; columns 1 and 3 hold their status.
    for(i=0;i<30;i++)
    {
        if(parameter[i][1]=="computed") parameter[i][0]=BLANK;
        if(parameter[i][3]=="computed") parameter[i][2]=BLANK;
        parameter[i][1]=BLANK;
        parameter[i][3]=BLANK;
    }

    // Column 4 holds the parameter name and column 5 its unit.
    parameter[0][4]="lmin";parameter[0][5]="meter";
    parameter[1][4]="ld";parameter[1][5]="meter";
    parameter[2][4]="vgs";parameter[2][5]="volt";
    parameter[3][4]="vbs";parameter[3][5]="volt";
    parameter[4][4]="vto";parameter[4][5]="volt";
    parameter[5][4]="eppsox";parameter[5][5]="eppsox";
    parameter[6][4]="tox";parameter[6][5]="meter";
    parameter[7][4]="cox";parameter[7][5]="farad/meter2";
    parameter[8][4]="u";parameter[8][5]="u";
    parameter[9][4]="kp";parameter[9][5]="kp";
    parameter[10][4]="cedge";parameter[10][5]="farad/meter";
    parameter[11][4]="cgdo/cgso";parameter[11][5]="farad/meter";
    parameter[12][4]="cj";parameter[12][5]="farad/meter2";
    parameter[13][4]="cjsw";parameter[13][5]="farad/meter";
    parameter[14][4]="wmin";parameter[14][5]="meter";
    parameter[15][4]="adsa";parameter[15][5]="meter2";
    parameter[16][4]="bdsa";parameter[16][5]="meter";
    parameter[17][4]="adsp";parameter[17][5]="meter";
    parameter[18][4]="bdsp";parameter[18][5]="1/meter";
    parameter[19][4]="rratio";parameter[19][5]="-";
    parameter[20][4]="purl";parameter[20][5]="ohm.meter";
    parameter[21][4]="purs";parameter[21][5]="ohm.meter";
    parameter[22][4]="acg";parameter[22][5]="farad";
    parameter[23][4]="bcg";parameter[23][5]="farad/meter";
    parameter[24][4]="acds";parameter[24][5]="farad";
    parameter[25][4]="bcds";parameter[25][5]="farad/meter";
    parameter[26][4]="dbc";parameter[26][5]="meter";
    parameter[27][4]="rsh";parameter[27][5]="ohm/square";
    parameter[28][4]="rc";parameter[28][5]="ohm";
    parameter[29][4]=BLANK;parameter[29][5]=BLANK;
    //NFET parameters (column 0)
    lminn  =StrToReal(parameter[0][0]);
    ldn    =StrToReal(parameter[1][0]);
    vgsn   =StrToReal(parameter[2][0]);
    vbsn   =StrToReal(parameter[3][0]);
    vton   =StrToReal(parameter[4][0]);
    eppsoxn=StrToReal(parameter[5][0]);
    toxn   =StrToReal(parameter[6][0]);
    coxn   =StrToReal(parameter[7][0]);
    un     =StrToReal(parameter[8][0]);
    kpn    =StrToReal(parameter[9][0]);
    cedgen =StrToReal(parameter[10][0]);
    cgdson =StrToReal(parameter[11][0]);
    cjn    =StrToReal(parameter[12][0]);
    cjswn  =StrToReal(parameter[13][0]);
    wminn  =StrToReal(parameter[14][0]);
    adsan  =StrToReal(parameter[15][0]);
    bdsan  =StrToReal(parameter[16][0]);
    adspn  =StrToReal(parameter[17][0]);
    bdspn  =StrToReal(parameter[18][0]);
    rration=StrToReal(parameter[19][0]);
    purln  =StrToReal(parameter[20][0]);
    pursn  =StrToReal(parameter[21][0]);
    acgn   =StrToReal(parameter[22][0]);
    bcgn   =StrToReal(parameter[23][0]);
    acdsn  =StrToReal(parameter[24][0]);
    bcdsn  =StrToReal(parameter[25][0]);
    dbcn   =StrToReal(parameter[26][0]);
    rshn   =StrToReal(parameter[27][0]);
    rcn    =StrToReal(parameter[28][0]);

//PFET parameters
lminp =StrToReal(parameter[O] [2]);
ldp
=StrToReal(parameter[ 1] [2]);
vgsp =StrToReal(parameter[2] [2]);
vbsp =StrToReal(parameter[3] [2]);
vtop =StrToReal(parameter[ 4] [2]);
eppsoxp=StrToReal(parameter[5] [2]);
toxp =StrToReal(parameter[6] [2]);
coxp =StrToReal(parameter[7] [2]);
up
=StrToReal(parameter[8] [2]);
kpp
=StrToReal(parameter[9] [2]);
cedgep =StrToReal(parameter[l O] [2]);
cgdsop =StrToReal(parameter[l 1][2]);
cjp
=StrToReal(parameter[12][2]);
cjswp =StrToReal(parameter[ 13] [2] );
wminp =StrToReal(parameter[14][2]);
adsap =StrToReal(parameter[ 15] [2]);
  bdsap  =StrToReal(parameter[16][2]);
  adspp  =StrToReal(parameter[17][2]);
  bdspp  =StrToReal(parameter[18][2]);
  rratiop=StrToReal(parameter[19][2]);
  purlp  =StrToReal(parameter[20][2]);
  pursp  =StrToReal(parameter[21][2]);
  acgp   =StrToReal(parameter[22][2]);
  bcgp   =StrToReal(parameter[23][2]);
  acdsp  =StrToReal(parameter[24][2]);
  bcdsp  =StrToReal(parameter[25][2]);
  dbcp   =StrToReal(parameter[26][2]);
  rshp   =StrToReal(parameter[27][2]);
  rcp    =StrToReal(parameter[28][2]);

  ////////////////// NFET data treatment//////////////////
  nodatan=0;//this flag is set if data is missing
  leffn=lminn-2*ldn;
  if(NoValue(leffn))
  {
    if(NoValue(lminn)) parameter[0][1]="must enter";
    if(NoValue(ldn)) parameter[1][1]="must enter";
    nodatan=1;
  }

  ///// Resistance calculations
  if(NoValue(purln))
  {
    if(NoValue(coxn) AND NoValue(kpn))
    {
      coxn=eppsoxn/toxn;
      if(NoValue(coxn))
      {
        if(NoValue(eppsoxn)) parameter[5][1]="must enter";
        if(NoValue(toxn)) parameter[6][1]="must enter";
        nodatan=1;
      }
      else parameter[7][1]="computed";
    }
    else parameter[7][1]="accepted";
    if(NoValue(kpn))
    {
      kpn=un*coxn;
      if(NoValue(kpn))
      {
        if(NoValue(un)) parameter[8][1]="must enter";
        nodatan=1;
      }
      else parameter[9][1]="computed";
    }
    else parameter[9][1]="accepted";
    purln=leffn/(kpn*(vgsn-vton))+2*rshn*bdsan+2*rcn*dbcn;
    if(NoValue(purln))
    {
      if(NoValue(vgsn)) parameter[2][1]="must enter";
      if(NoValue(vton)) parameter[4][1]="must enter";
      if(NoValue(rshn)) parameter[27][1]="must enter";
      if(NoValue(bdsan)) parameter[16][1]="must enter";
      if(NoValue(rcn)) parameter[28][1]="must enter";
      if(NoValue(dbcn)) parameter[26][1]="must enter";
      nodatan=1;
    }
    else parameter[20][1]="computed";
  }
  else parameter[20][1]="accepted";
  if(NoValue(pursn))
  {
    pursn=purln*rration;
    if(NoValue(pursn))
    {
      parameter[19][1]="must enter";
      nodatan=1;
    }
    else parameter[21][1]="computed";
  }
  else parameter[21][1]="accepted";

  //Capacitance calculations
  if(NoValue(acgn))
  {
    acgn=2*cedgen*leffn;
    if(NoValue(acgn))
    {
      if(NoValue(cedgen)) parameter[10][1]="must enter";
      nodatan=1;
    }
    else parameter[22][1]="computed";
  }
  else parameter[22][1]="accepted";
  if(NoValue(bcgn))
  {
    if(NoValue(coxn))
    {
      coxn=eppsoxn/toxn;
      if(NoValue(coxn))
      {
        if(NoValue(eppsoxn)) parameter[5][1]="must enter";
        if(NoValue(toxn)) parameter[6][1]="must enter";
        nodatan=1;
      }
      else parameter[7][1]="computed";
    }
    else
    {
      if(parameter[7][1]==BLANK) parameter[7][1]="accepted";
    }
    bcgn=coxn*leffn+2*cedgen;
    if(NoValue(bcgn))
    {
      if(NoValue(cedgen))
      {
        parameter[10][1]="must enter";
        nodatan=1;
      }
    }
    else parameter[23][1]="computed";
  }
  else parameter[23][1]="accepted";
  if(NoValue(acdsn))
  {
    acdsn=acgn/2+cjn*adsan+cjswn*adspn;
    if(NoValue(acdsn))
    {
      if(NoValue(cjn)) parameter[12][1]="must enter";
      if(NoValue(adsan)) parameter[15][1]="must enter";
      if(NoValue(cjswn)) parameter[13][1]="must enter";
      if(NoValue(adspn)) parameter[17][1]="must enter";
      nodatan=1;
    }
    else parameter[24][1]="computed";
  }
  else parameter[24][1]="accepted";

  if(NoValue(bcdsn))
  {
    bcdsn=bcgn/2+cgdson+cjn*bdsan+cjswn*bdspn;
    if(NoValue(bcdsn))
    {
      if(NoValue(cgdson)) parameter[11][1]="must enter";
      if(NoValue(cjn)) parameter[12][1]="must enter";
      if(NoValue(bdsan)) parameter[16][1]="must enter";
      if(NoValue(cjswn)) parameter[13][1]="must enter";
      if(NoValue(bdspn)) parameter[18][1]="must enter";
      nodatan=1;
    }
    else parameter[25][1]="computed";
  }
  else parameter[25][1]="accepted";

  ////////////////// PFET data treatment//////////////////
  nodatap=0;//this flag is set if data is missing
  leffp=lminp-2*ldp;
  if(NoValue(leffp))
  {
    if(NoValue(lminp)) parameter[0][3]="must enter";
    if(NoValue(ldp)) parameter[1][3]="must enter";
    nodatap=1;
  }

  ///// Resistance calculations
  if(NoValue(purlp))
  {
    if(NoValue(coxp) AND NoValue(kpp))
    {
      coxp=eppsoxp/toxp;
      if(NoValue(coxp))
      {
        if(NoValue(eppsoxp)) parameter[5][3]="must enter";
        if(NoValue(toxp)) parameter[6][3]="must enter";
        nodatap=1;
      }
      else parameter[7][3]="computed";
    }
    else parameter[7][3]="accepted";
    if(NoValue(kpp))
    {
      kpp=up*coxp;
      if(NoValue(kpp))
      {
        if(NoValue(up)) parameter[8][3]="must enter";
        nodatap=1;
      }
      else parameter[9][3]="computed";
    }
    else parameter[9][3]="accepted";
    purlp=leffp/(kpp*(vgsp-vtop))+2*rshp*bdsap+2*rcp*dbcp;
    if(NoValue(purlp))
    {
      if(NoValue(vgsp)) parameter[2][3]="must enter";
      if(NoValue(vtop)) parameter[4][3]="must enter";
      if(NoValue(rshp)) parameter[27][3]="must enter";
      if(NoValue(bdsap)) parameter[16][3]="must enter";
      if(NoValue(rcp)) parameter[28][3]="must enter";
      if(NoValue(dbcp)) parameter[26][3]="must enter";
      nodatap=1;
    }
    else parameter[20][3]="computed";
  }
  else parameter[20][3]="accepted";
  if(NoValue(pursp))
  {
    pursp=purlp*rratiop;
    if(NoValue(pursp))
    {
      parameter[19][3]="must enter";
      nodatap=1;
    }
    else parameter[21][3]="computed";
  }
  else parameter[21][3]="accepted";

  //Capacitance calculations
  if(NoValue(acgp))
  {
    acgp=2*cedgep*leffp;
    if(NoValue(acgp))
    {
      if(NoValue(cedgep)) parameter[10][3]="must enter";
      nodatap=1;
    }
    else parameter[22][3]="computed";
  }
  else parameter[22][3]="accepted";
  if(NoValue(bcgp))
  {
    if(NoValue(coxp))
    {
      coxp=eppsoxp/toxp;
      if(NoValue(coxp))
      {
        if(NoValue(eppsoxp)) parameter[5][3]="must enter";
        if(NoValue(toxp)) parameter[6][3]="must enter";
        nodatap=1;
      }
      else parameter[7][3]="computed";
    }
    else
    {
      if(parameter[7][3]==BLANK) parameter[7][3]="accepted";
    }
    bcgp=coxp*leffp+2*cedgep;
    if(NoValue(bcgp))
    {
      if(NoValue(cedgep))
      {
        parameter[10][3]="must enter";
        nodatap=1;
      }
    }
    else parameter[23][3]="computed";
  }
  else parameter[23][3]="accepted";
  if(NoValue(acdsp))
  {
    acdsp=acgp/2+cjp*adsap+cjswp*adspp;
    if(NoValue(acdsp))
    {
      if(NoValue(cjp)) parameter[12][3]="must enter";
      if(NoValue(adsap)) parameter[15][3]="must enter";
      if(NoValue(cjswp)) parameter[13][3]="must enter";
      if(NoValue(adspp)) parameter[17][3]="must enter";
      nodatap=1;
    }
    else parameter[24][3]="computed";
  }
  else parameter[24][3]="accepted";
  if(NoValue(bcdsp))
  {
    bcdsp=bcgp/2+cgdsop+cjp*bdsap+cjswp*bdspp;
    if(NoValue(bcdsp))
    {
      if(NoValue(cgdsop)) parameter[11][3]="must enter";
      if(NoValue(cjp)) parameter[12][3]="must enter";
      if(NoValue(bdsap)) parameter[16][3]="must enter";
      if(NoValue(cjswp)) parameter[13][3]="must enter";
      if(NoValue(bdspp)) parameter[18][3]="must enter";
      nodatap=1;
    }
    else parameter[25][3]="computed";
  }
  else parameter[25][3]="accepted";

  ////////// Copy data to the dialog items//////////
  ///NFET data
  parameter[0][0] =RealToStr(lminn, 5);
  parameter[1][0] =RealToStr(ldn, 5);
  parameter[2][0] =RealToStr(vgsn, 5);
  parameter[3][0] =RealToStr(vbsn, 5);
  parameter[4][0] =RealToStr(vton, 5);
  parameter[5][0] =RealToStr(eppsoxn, 5);
  parameter[6][0] =RealToStr(toxn, 5);
  parameter[7][0] =RealToStr(coxn, 5);
  parameter[8][0] =RealToStr(un, 5);
  parameter[9][0] =RealToStr(kpn, 5);
  parameter[10][0]=RealToStr(cedgen, 5);
  parameter[11][0]=RealToStr(cgdson, 5);
  parameter[12][0]=RealToStr(cjn, 5);
  parameter[13][0]=RealToStr(cjswn, 5);
  parameter[14][0]=RealToStr(wminn, 5);
  parameter[15][0]=RealToStr(adsan, 5);
  parameter[16][0]=RealToStr(bdsan, 5);
  parameter[17][0]=RealToStr(adspn, 5);
  parameter[18][0]=RealToStr(bdspn, 5);
  parameter[19][0]=RealToStr(rration, 5);
  parameter[20][0]=RealToStr(purln, 5);
  parameter[21][0]=RealToStr(pursn, 5);
  parameter[22][0]=RealToStr(acgn, 5);
  parameter[23][0]=RealToStr(bcgn, 5);
  parameter[24][0]=RealToStr(acdsn, 5);
  parameter[25][0]=RealToStr(bcdsn, 5);
  parameter[26][0]=RealToStr(dbcn, 5);
  parameter[27][0]=RealToStr(rshn, 5);
  parameter[28][0]=RealToStr(rcn, 5);
  ///PFET data
  parameter[0][2] =RealToStr(lminp, 5);
  parameter[1][2] =RealToStr(ldp, 5);
  parameter[2][2] =RealToStr(vgsp, 5);
  parameter[3][2] =RealToStr(vbsp, 5);
  parameter[4][2] =RealToStr(vtop, 5);
  parameter[5][2] =RealToStr(eppsoxp, 5);
  parameter[6][2] =RealToStr(toxp, 5);
  parameter[7][2] =RealToStr(coxp, 5);
  parameter[8][2] =RealToStr(up, 5);
  parameter[9][2] =RealToStr(kpp, 5);
  parameter[10][2]=RealToStr(cedgep, 5);
  parameter[11][2]=RealToStr(cgdsop, 5);
  parameter[12][2]=RealToStr(cjp, 5);
  parameter[13][2]=RealToStr(cjswp, 5);
  parameter[14][2]=RealToStr(wminp, 5);
  parameter[15][2]=RealToStr(adsap, 5);
  parameter[16][2]=RealToStr(bdsap, 5);
  parameter[17][2]=RealToStr(adspp, 5);
  parameter[18][2]=RealToStr(bdspp, 5);
  parameter[19][2]=RealToStr(rratiop, 5);
  parameter[20][2]=RealToStr(purlp, 5);
  parameter[21][2]=RealToStr(pursp, 5);
  parameter[22][2]=RealToStr(acgp, 5);
  parameter[23][2]=RealToStr(bcgp, 5);
  parameter[24][2]=RealToStr(acdsp, 5);
  parameter[25][2]=RealToStr(bcdsp, 5);
  parameter[26][2]=RealToStr(dbcp, 5);
  parameter[27][2]=RealToStr(rshp, 5);
  parameter[28][2]=RealToStr(rcp, 5);
  ////////// Broadcast values using global variables//////////
  GLOBAL5 =wminn;
  GLOBAL6 =purln;
  GLOBAL7 =pursn;
  GLOBAL8 =acgn;
  GLOBAL9 =bcgn;
  GLOBAL10=acdsn;
  GLOBAL11=bcdsn;
  GLOBAL12=wminp;
  GLOBAL13=purlp;
  GLOBAL14=pursp;
  GLOBAL15=acgp;
  GLOBAL16=bcgp;
  GLOBAL17=acdsp;
  GLOBAL18=bcdsp;
}

// This message occurs for each step in the simulation.
on simulate
{
}

// If the dialog data is inconsistent for simulation, abort.
on checkdata
{
}

on compute
{
  treatdata();
}

//clear all dialog items
on clear
{
  for(i=0;i<30;i++)
  {
    for(j=0;j<6;j++) parameter[i][j]=BLANK;
  }
  parameter[0][4] ="lmin";parameter[0][5] ="meter";
  parameter[1][4] ="ld";parameter[1][5] ="meter";
  parameter[2][4] ="vgs";parameter[2][5] ="volt";
  parameter[3][4] ="vbs";parameter[3][5] ="volt";
  parameter[4][4] ="vto";parameter[4][5] ="volt";
  parameter[5][4] ="eppsox";parameter[5][5] ="eppsox";
  parameter[6][4] ="tox";parameter[6][5] ="meter";
  parameter[7][4] ="cox";parameter[7][5] ="farad/meter2";
  parameter[8][4] ="u";parameter[8][5] ="u";
  parameter[9][4] ="kp";parameter[9][5] ="kp";
  parameter[10][4]="cedge";parameter[10][5]="farad/meter";
  parameter[11][4]="cgdo/cgso";parameter[11][5]="farad/meter";
  parameter[12][4]="cj";parameter[12][5]="farad/meter2";
  parameter[13][4]="cjsw";parameter[13][5]="farad/meter";
  parameter[14][4]="wmin";parameter[14][5]="meter";
  parameter[15][4]="adsa";parameter[15][5]="meter2";
  parameter[16][4]="bdsa";parameter[16][5]="meter";
  parameter[17][4]="adsp";parameter[17][5]="meter";
  parameter[18][4]="bdsp";parameter[18][5]="1/meter";
  parameter[19][4]="rratio";parameter[19][5]="-";
  parameter[20][4]="purl";parameter[20][5]="ohm.meter";
  parameter[21][4]="purs";parameter[21][5]="ohm.meter";
  parameter[22][4]="acg";parameter[22][5]="farad";
  parameter[23][4]="bcg";parameter[23][5]="farad/meter";
  parameter[24][4]="acds";parameter[24][5]="farad";
  parameter[25][4]="bcds";parameter[25][5]="farad/meter";
  parameter[26][4]="dbc";parameter[26][5]="meter";
  parameter[27][4]="rsh";parameter[27][5]="ohm/square";
  parameter[28][4]="rc";parameter[28][5]="ohm";
  parameter[29][4]=BLANK;parameter[29][5]=BLANK;
}

// Initialize any simulation variables.
on initsim
{
  treatdata();
}
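
The derivation rules applied by treatdata() in the listing above can be summarized compactly. The following Python sketch is illustrative only and is not part of the original Extend block: the function name derive_fet_parameters, the dictionary keys, and the use of an absent key in place of NoValue() are assumptions made for this example. It applies the same formulas to one transistor type and, like the listing, accepts a user-supplied value wherever one is present.

```python
def derive_fet_parameters(p):
    """Return a copy of p with the derived quantities filled in.

    p maps parameter names (hypothetical keys mirroring the dialog labels)
    to floats. A key that is absent plays the role of NoValue() in the
    listing. A KeyError names a base parameter that must be entered.
    """
    q = dict(p)
    leff = q["lmin"] - 2 * q["ld"]          # effective channel length

    # Resistance calculations
    if "purl" not in q:
        if "kp" not in q:
            if "cox" not in q:
                q["cox"] = q["eppsox"] / q["tox"]
            q["kp"] = q["u"] * q["cox"]
        q["purl"] = (leff / (q["kp"] * (q["vgs"] - q["vto"]))
                     + 2 * q["rsh"] * q["bdsa"]
                     + 2 * q["rc"] * q["dbc"])
    if "purs" not in q:
        q["purs"] = q["purl"] * q["rratio"]

    # Capacitance calculations
    if "acg" not in q:
        q["acg"] = 2 * q["cedge"] * leff
    if "bcg" not in q:
        if "cox" not in q:
            q["cox"] = q["eppsox"] / q["tox"]
        q["bcg"] = q["cox"] * leff + 2 * q["cedge"]
    if "acds" not in q:
        q["acds"] = q["acg"] / 2 + q["cj"] * q["adsa"] + q["cjsw"] * q["adsp"]
    if "bcds" not in q:
        q["bcds"] = (q["bcg"] / 2 + q["cgdo"]
                     + q["cj"] * q["bdsa"] + q["cjsw"] * q["bdsp"])
    return q
```

The sketch omits the "must enter", "computed" and "accepted" status strings that the listing writes back to the dialog; a missing base parameter surfaces as a KeyError instead.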

C.2.2 Topology Block
// Declare constants and static variables here.
real wminn,purln,pursn,acgn,bcgn,acdsn,bcdsn,wminp,purlp,pursp,acgp,bcgp,
     acdsp,bcdsp,c,r;
integer i,j,oni,ni,nj,tpc,npc,tc,nn,bti,ti,tj;

Procedure showtitle()
{
  AnimationText(1, title);
  AnimationShow(1);
}

Procedure readglobal()
{
  wminn=GLOBAL5;
  purln=GLOBAL6;
  pursn=GLOBAL7;
  acgn=GLOBAL8;
  bcgn=GLOBAL9;
  acdsn=GLOBAL10;
  bcdsn=GLOBAL11;
  wminp=GLOBAL12;
  purlp=GLOBAL13;
  pursp=GLOBAL14;
  acgp=GLOBAL15;
  bcgp=GLOBAL16;
  acdsp=GLOBAL17;
  bcdsp=GLOBAL18;
}

real cgn(real w)
{
  return(acgn+bcgn*w);
}

real cdsn(real w)
{
  return(acdsn+bcdsn*w);
}

real rsn(real w)
{
  return(pursn/w);
}

real rln(real w)
{
  return(purln/w);
}

real cgp(real w)
{
  return(acgp+bcgp*w);
}

real cdsp(real w)
{
  return(acdsp+bcdsp*w);
}

real rsp(real w)
{
  return(pursp/w);
}

real rlp(real w)
{
  return(purlp/w);
}

PROCEDURE firstpath()
{
label1:
  //UserError("At the start of firstpath loop ni="+ni+" nj="+nj);
  if(ni>=iorder)
  {
    //Bingo, first path done
    path[tpc][2]=(tc-3)/2;
    tpc++;
    return;
  }
  else if(nj>=jorder)
  {
    UserError("path is not complete (does not reach the bottom)");
    abort;
  }
  else if(NOT (tm[ni][nj]=="b" OR tm[ni][nj]=="w"))
  {
    UserError("Encountered invalid node located at row "+ni+" col "+nj);
    abort;
  }
  else if(nn)
  {
label2:
    if(nj-1>=0 AND tm[ni][nj-1]=="w")
    {
      nj--;
      GOTO label2;
    }
    nn=0;
    GOTO label1;
  }
  else if(tm[ni+1][nj]=="nfet")
  {
    path[tpc][tc]=ni+1;
    path[tpc][tc+1]=nj;
    tc+=2;
    ni+=2;
    nn=1;
    GOTO label1;
  }
  else if(tm[ni+1][nj]=="w")
  {
    ni+=2;
    nn=1;
    GOTO label1;
  }
  else if(nj+1<jorder AND tm[ni+1][nj+1]=="nfet")
  {
    path[tpc][tc]=ni+1;
    path[tpc][tc+1]=nj+1;
    tc+=2;
    ni+=2;
    nj++;
    nn=1;
    GOTO label1;
  }
  else if(nj+1<jorder AND tm[ni+1][nj+1]=="w")
  {
    ni+=2;
    nj++;
    nn=1;
    GOTO label1;
  }
  else if(nj+1<jorder AND tm[ni][nj+1]=="w")
  {
    nj++;
    GOTO label1;
  }
  else
  {
    UserError("Invalid topology in firstpath loop at row "+ni+" col "+nj);
    abort;
  }
}

Procedure findallpaths()
{
  tpc=0;
  oni=0;
  while(NOT (NoValue(ont[oni][0]) OR NoValue(ont[oni][0])))
  {
    ni=ont[oni][0];
    nj=ont[oni][1];
    npc=0;
    path[tpc][0]=oni;
    path[tpc][1]=npc;
    tc=3;
    nn=1;
    // Search for first path
    firstpath();
    //Search for other paths for the same output node
startsearch:
    nn=0;
    npc++;
    path[tpc][0]=oni;
    path[tpc][1]=npc;
    bti=path[tpc-1][2]*2+1;
    for(j=3;j<=bti+1;j++) path[tpc][j]=path[tpc-1][j];
search:
    //UserError("At the start of search loop bti="+bti);
    tc=bti;
    if(bti<3)
    {
      //no more paths for this output node
      GOTO done;
    }
    else
    {
      ni=path[tpc][bti]-1;
      nj=path[tpc][bti+1];
    }
search2:
    //UserError(" after search2 ni="+ni+" nj="+nj);
    if(tm[ni][nj]=="b" OR (tm[ni][nj+1]=="b" AND tm[ni+1][nj+1]=="b"))
    {
      //there is no other path for this node
      bti-=2;
      GOTO search;
    }
    else if(tm[ni][nj]=="w" AND (tm[ni+1][nj+1]=="nfet" OR tm[ni+1][nj+1]=="w"))
    {
      //bingo, another path for this node
      nj++;
      firstpath();
      GOTO startsearch;
    }
    else if(tm[ni][nj]=="w" AND tm[ni][nj+1]=="w")
    {
      // shift to the adjacent wire (same node)
      nj++;
      GOTO search2;
    }
    else
    {
      UserError("Invalid topology in search loop at row "+ni+" col "+nj);
      abort;
    }
done:
    oni++;
  }

  if(oni==0)
  {
    UserError("No output node is specified");
    abort;
  }

  //Clean up the path matrix
  for(i=tpc;i>0;i--)
  {
    if(NoValue(path[i][2]) AND NOT NoValue(path[i-1][2]))
    {
      for(j=0;j<path[i-1][2]*2+3;j++) path[i][j]=BLANK;
    }
    else if(NOT NoValue(path[i][2]))
    {
      for(j=path[i][2]*2+3;j<path[i-1][2]*2+3;j++) path[i][j]=BLANK;
    }
  }
}

PROCEDURE nodecap(integer wi, integer wj)
{
  integer temp,temp0,temp1;
label1:
  if(nj>0 AND tm[ni][nj-1]=="w")
  {
    nj--;
    GOTO label1;
  }
  oni=0;
label2:
  if(NOT NoValue(ont[oni][0]))
  {
    if(ont[oni][0]==ni AND ont[oni][1]==nj)
    {
      if(NOT NoValue(ont[oni][2])) c+=cdsp(ont[oni][2]);
      if(NOT NoValue(ont[oni][3])) c+=cgn(ont[oni][3]);
      if(NOT NoValue(ont[oni][4])) c+=cgp(ont[oni][4]);
    }
    oni++;
    GOTO label2;
  }

  //UserError("Processing node "+ni+", "+nj+" and wi="+wi+" wj="+wj);
  if(ni+1<iorder AND tm[ni+1][nj]=="nfet" AND NOT (ni+1==ti AND nj==tj))
  {
    c+=cdsn(sm[ni+1][nj]);
    temp=ni+1;
    //UserError("Load transistor at ni="+temp+" nj="+nj);
  }

  if(ni>1 AND tm[ni-1][nj]=="nfet" AND NOT (ni-1==ti AND nj==tj))
  {
    c+=cdsn(sm[ni-1][nj]);
    temp=ni-1;
    //UserError("Load transistor at ni="+temp+" nj="+nj);
  }

  if(ni+1<iorder AND tm[ni+1][nj]=="w" AND NOT (ni+1==wi AND nj==wj))
  {
    //UserError("Jump down");

    temp0=ni;
    temp1=nj;
    wi=ni+1;
    wj=nj;
    ni+=2;
    nodecap(wi,wj);
    ni=temp0;
    nj=temp1;
  }

  if(ni>1 AND tm[ni-1][nj]=="w" AND NOT (ni-1==wi AND nj==wj))
  {
    //UserError("Jump up");
    temp0=ni;
    temp1=nj;
    wi=ni-1;
    wj=nj;
    ni-=2;
    nodecap(wi,wj);
    ni=temp0;
    nj=temp1;
  }

  if(nj+1<jorder AND tm[ni][nj]=="w")
  {
    nj++;
    GOTO label2;
  }

  if(tm[ni][nj]=="b")
  {
    return;
  }

  UserError("Something is wrong in the nodecap search");
}

PROCEDURE computepathdelays()
{
  i=0;
  while(NOT NoValue(del[i][0]))
  {
    del[i][0]=BLANK;
    i++;
  }

  tpc=0;
  while(NOT NoValue(path[tpc][0]))
  {
    del[tpc][0]=0;
    tc=3;
    c=0;
    while(NOT NoValue(path[tpc][tc]))
    {
      ti=path[tpc][tc];
      tj=path[tpc][tc+1];
      ni=ti-1;
      nj=tj;
      c+=cdsn(sm[ti][tj]);
      nodecap(0,0);
      if(tc==3) del[tpc][0]+=rsn(sm[ti][tj])*c;
      else del[tpc][0]+=rln(sm[ti][tj])*c;
      tc+=2;
    }
    del[tpc][0]*=1e9*ffactor;
    tpc++;
  }
}

Procedure initializematrices()
{
  for(i=0;i<40;i++) for(j=0;j<20;j++) sm[i][j]=BLANK;
  for(i=0;i<20;i++) for(j=0;j<45;j++) path[i][j]=BLANK;
  for(i=0;i<10;i++) del[i][0]=BLANK;
  //Rename topology matrix and set size matrix to minimum sizes
  if(NoValue(iorder)) iorder=40;
  if(NoValue(jorder)) jorder=20;
  for(i=0;i<iorder;i++)
  {
    for(j=0;j<jorder;j++)
    {
      if(tm[i][j]=="B" OR tm[i][j]=="b") tm[i][j]="b";
      else if(tm[i][j]=="W" OR tm[i][j]=="w") tm[i][j]="w";
      else if(tm[i][j]=="G" OR tm[i][j]=="g") tm[i][j]="nfet";
      else if(tm[i][j]=="R" OR tm[i][j]=="r") tm[i][j]="nfet";
      else if(tm[i][j]=="NFET" OR tm[i][j]=="nfet") tm[i][j]="nfet";
      else if(tm[i][j]==BLANK) tm[i][j]="b";
      else
      {
        UserError("Topology matrix entry is not recognized located at row "+i+" col "+j);
        ABORT;
      }
      if(tm[i][j]=="nfet") sm[i][j]=wminn;
    }
  }
}

Procedure size()
{
  real k,temp,dt;
  integer done,pointer;
  if(NoValue(perror) OR perror<.01) perror=.01;
  k=1.0+perror/100;
Label:
  computepathdelays();
  i=0;
  done=1;
  temp=0;
  while(NOT NoValue(del[i][0]))
  {
    if(del[i][0]>temp)
    {
      temp=del[i][0];
      pointer=i;
    }
    if(del[i][0]>ddelay*k) done=0;
    i++;
  }
  if(done) return;
  dt=ddelay*1e-9/ffactor/path[pointer][2];
  ti=path[pointer][3];
  tj=path[pointer][4];
  ni=ti-1;
  nj=tj;
  c=0;
  nodecap(0,0);
  sm[ti][tj]=(pursn*(c+acdsn))/(dt-pursn*bcdsn);
  if(sm[ti][tj]<wminn) sm[ti][tj]=wminn;
  c+=cdsn(sm[ti][tj]);
  j=5;
  while(NOT NoValue(path[pointer][j]))
  {
    ti=path[pointer][j];
    tj=path[pointer][j+1];
    ni=ti-1;
    nj=tj;
    nodecap(0,0);
    sm[ti][tj]=(purln*(c+acdsn))/(dt-purln*bcdsn);
    if(sm[ti][tj]<wminn) sm[ti][tj]=wminn;
    c+=cdsn(sm[ti][tj]);
    j+=2;
  }
  GOTO Label;
}

Procedure addgates()
{
  tgw=0;
  nst=0;
  atw=0;
  for(i=0;i<iorder;i++)
  {
    for(j=0;j<jorder;j++)
    {
      if(NOT NoValue(sm[i][j]))
      {
        tgw+=sm[i][j];
        nst++;
      }
    }
  }
  atw=tgw/nst;
}

on computedelays
{
  showtitle();
  readglobal();
  initializematrices();
  findallpaths();
  computepathdelays();
}

on computesizes
{
  showtitle();
  readglobal();
  initializematrices();
  findallpaths();
  size();
  addgates();
}

on adjust
{
  showtitle();
  ffactor=1;
  readglobal();
  findallpaths();
  computepathdelays();
  ffactor=ddelay/del[0][0];
}

// This message occurs for each step in the simulation.
on simulate
{
  readglobal();
  initializematrices();
  findallpaths();
  size();
  addgates();
}

// If the dialog data is inconsistent for simulation, abort.
on checkdata
{
}

// Initialize any simulation variables.
on initsim
{
}

// Clear dialog items
on cleartm
{
  for(i=0;i<40;i++) for(j=0;j<20;j++) tm[i][j]=BLANK;
}

on clearsm
{
  for(i=0;i<40;i++) for(j=0;j<20;j++) sm[i][j]=BLANK;
}

on clearont
{
  for(i=0;i<10;i++) for(j=0;j<5;j++) ont[i][j]=BLANK;
}

on clearpt
{
  for(i=0;i<40;i++) for(j=0;j<25;j++) path[i][j]=BLANK;
}

on cleardel
{
  for(i=0;i<40;i++) del[i][0]=BLANK;
}
on clearall
{
  for(i=0;i<40;i++) for(j=0;j<20;j++) tm[i][j]=BLANK;
  for(i=0;i<40;i++) for(j=0;j<20;j++) sm[i][j]=BLANK;
  for(i=0;i<10;i++) for(j=0;j<5;j++) ont[i][j]=BLANK;
  for(i=0;i<40;i++) for(j=0;j<25;j++) path[i][j]=BLANK;
  for(i=0;i<40;i++) del[i][0]=BLANK;
}
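
The delay model and the width update used by computepathdelays() and size() in the Topology Block can be restated outside Extend. The Python sketch below is an illustration under stated assumptions, not the thesis software: the class NfetModel, the functions path_delay and width_for_budget, and the list node_caps (standing in for the capacitance that nodecap() collects at each node) are hypothetical names introduced here. It keeps the listing's device equations, R(w) = pur/w and Cds(w) = acds + bcds*w, and solves (pur/w)*(c + acds + bcds*w) = dt for w, which is the expression assigned to sm[ti][tj].

```python
from dataclasses import dataclass


@dataclass
class NfetModel:
    """Width-scaled NFET coefficients, named after the listing's globals."""
    purs: float  # resistance times width, first transistor on a path (ohm.meter)
    purl: float  # resistance times width, remaining transistors (ohm.meter)
    acds: float  # width-independent drain/source capacitance (farad)
    bcds: float  # drain/source capacitance per unit width (farad/meter)
    wmin: float  # minimum transistor width (meter)

    def cds(self, w):
        return self.acds + self.bcds * w

    def rs(self, w):
        return self.purs / w

    def rl(self, w):
        return self.purl / w


def path_delay(model, widths, node_caps):
    """RC delay of one path, accumulated as in computepathdelays().

    widths[k] is the width of the k-th transistor counted from the output
    node; node_caps[k] is the extra capacitance on the node above it
    (an assumed stand-in for what nodecap() adds to c).
    """
    c = 0.0
    delay = 0.0
    for k, (w, c_node) in enumerate(zip(widths, node_caps)):
        c += model.cds(w) + c_node
        r = model.rs(w) if k == 0 else model.rl(w)
        delay += r * c
    return delay


def width_for_budget(model, pur, c_load, dt):
    """Width giving this transistor a delay of dt, clamped to wmin.

    Solves (pur / w) * (c_load + acds + bcds * w) = dt for w.
    """
    denom = dt - pur * model.bcds
    if denom <= 0:
        raise ValueError("delay budget is below the self-loading limit")
    w = pur * (c_load + model.acds) / denom
    return max(w, model.wmin)
```

For a single-transistor path, sizing with width_for_budget and then evaluating path_delay with the same load returns the requested budget, which is the fixed point the iteration in size() works toward. The explicit check on the denominator is an addition of this sketch; the listing does not guard against dt <= pur*bcds.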

Vita Auctoris

Sami Bizzan was born on February 25, 1965 in Tripoli, Libya, and completed his high school education at El-Nusur High School in Tripoli. He obtained the Bachelor of Applied Science and Master of Applied Science in 1989 and 1991 respectively from the Electrical Engineering Department at the University of Windsor. He started working for ATI Technologies Inc. in September 1995 as an IC Design Engineer. He is currently a candidate for the Ph.D. degree at the University of Windsor.

Mr. Bizzan's research interests include high performance parallel computing architectures, RNS-based systems, fast dynamic logic, differential analog circuits, and low noise analog circuit design. He has published several papers on transistor sizing and parallel computing architectures and he is currently in the process of submitting several more papers on his thesis work. At ATI he has completed several memory designs that are embedded in video graphics chips currently used in millions of personal computers around the world. He has recently been working on a new product line that involves the processing of low noise analog audio signals and the design of signal converters. He is currently filing for a patent in the area of analog circuit design.


