VLSI signal processing through bit-serial architectures and silicon compilation by Renshaw, D.
VLSI SIGNAL PROCESSING 
THROUGH BIT-SERIAL ARCHITECTURES AND SILICON COMPILATION 
D. RENSHAW, B.Sc, M.Sc. 
A Thesis submitted to the Faculty of Science of the 
University of Edinburgh, for the degree of 
Doctor of Philosophy. 
DEPARTMENT OF ELECTRICAL ENGINEERING 	 JULY 1984. 
DECLARATION OF ORIGINALITY 
I hereby declare that this thesis has been composed by myself. 
The work described has been developed by a group of researchers, 
including P. B. Denyer, N. W. Bergmann, A. F. Murray and S. G. 
Smith in addition to myself. I declare that I have made a major 
contribution to the work, being the only researcher with full-time 
involvement throughout the duration of the project, from start to 
finish. My main area of responsibility has been the development of 
the methodology, design environment specification and the design 
and verification of the cell library. I have, however, also con-
tributed in all other areas including software, test methodology 




I would like to thank my supervisor Dr. P. B. Denyer for his 
guidance, support and encouragement throughout the period of the 
research and study for this thesis. I would also like to thank the 
many people in both the departments of Electrical Engineering and 
Computer Science who have assisted me during this time. Special 
thanks are due to N. W. Bergmann, S. G. Smith and J. G. Hughes.
I am grateful to the University of Edinburgh and to the Sci-
ence and Engineering Research Council for providing facilities and 
financial support for this research. I would like to acknowledge 
the publishers Addison-Wesley and thank them for permission to 
reproduce items of text and illustration which will appear in "VLSI 
Signal Processing: A Bit-Serial Approach" by Peter B. Denyer and 
David Renshaw. I would also like to thank Dr. A. F. Murray for 
taking chip photographs, S. C. Smith for preparing Figures, Dr. A. 
R. Dinnis and J. Goodall for use of the Scanning Electron Micro-
scope and Mrs C. A. Burns for secretarial assistance. 
Finally I want to thank my wife Ann for her support and 
encouragement. 
LIST OF CONTENTS:
Title Page
Abstract
Declaration of Originality
Acknowledgements
List of Contents
CHAPTER 1: INTRODUCTION
CHAPTER 2: SIGNAL PROCESSING SYSTEMS, VERY LARGE SCALE
INTEGRATION AND COMPUTER AIDED DESIGN
2.1. Introduction
2.2. Signals & Signal Processing Systems
2.3. Applications
2.4. System Implementation
2.5. Architectures
2.6. Architectures for Digital Signal Processing
2.7. IC Technology and System Implementation
2.8. VLSI Design Methodology &
Computer Aided Design
2.9. Summary
References for Chapter 2
CHAPTER 3: A DESIGN ENVIRONMENT FOR GENERATING CUSTOM ICs
TO IMPLEMENT DIGITAL SIGNAL PROCESSING SYSTEMS
IN BIT-SERIAL ARCHITECTURES
3.1. Introduction
3.2. Overview
3.3. Architectural Basis
3.4. Specification and System Design Capture
3.5. The FIRST HDL Compiler
3.6. Primitives and the Primitive Cell Library
3.7. Interfaces and Architectural Templates:
Foundations for a Physical Design Subsystem
3.8. The Physical Design Subsystem
3.9. The Behavioural Description Subsystem
3.10. Correctness and verification
3.11. Design For Test, ATPG and Self-Test
3.12. Summary
References for Chapter 3
CHAPTER 4: ARCHITECTURAL DESIGN TECHNIQUES
4.1. Introduction
4.2. An Example
4.3. From Function to Architecture via Algorithm
4.3.1. Some Terminology
4.3.2. Recurrences
4.3.3. Recurrence Architectures
4.3.4. Recurrence Types
4.3.5. Bandwidth Matching
4.4. Miscellaneous Related Topics
4.4.1. Processor-Time-Value Graphs, Hodographs &
Hardware Signal Flow Graphs
4.4.2. Evaluation Criteria
4.5. Examples
References for Chapter 4
CHAPTER 5: SYSTEM SYNTHESIS & SYSTEM MAPPING TECHNIQUES
USING FIRST
5.1. Introduction
5.2. Implementing Arithmetic Engines
5.2.1. Time Tagging
5.2.2. Further Conventions
5.3. Setting the System Word-Length
5.3.1. Factors
5.3.2. General Objectives
5.3.3. Lower Bounds on Word Length
5.3.4. Bit Growth
5.3.5. Format Restrictions
5.3.6. Summary
5.4. Implementing State Memory, Multiplexing
and Net Synthesis
5.5. Initialisation
5.6. Design-for-Test
5.7. Implementation of the Control Network
5.8. Partitioning
5.9. Completion
5.10. Summary
References for Chapter 5
CHAPTER 6: A SURVEY OF CASE STUDIES IN SYSTEM DESIGN
USING FIRST
6.1. Introduction
6.2. System Statistics
6.3. Pilot Studies
6.4. Complex to Magnitude Converters
6.5. Finite Impulse Response Filters
6.6. Adaptive Lattice Filter
6.7. Echo Canceller
6.8. Wave Digital Filters
6.9. Systolic Discrete Fourier Transform Engine
6.10. Lossless Discrete Integrator
Recursive Ladder Filters
6.11. Fourier Transform Engines
6.12. Miscellaneous Small Systems
6.13. Conclusion
References for Chapter 6
CHAPTER 7: HARDWARE FABRICATION AND TEST
7.1. Introduction
7.2. FIRST Hardware Assumptions
7.3. Verification of Primitives
7.4. Testing
7.5. Organisation of Primitive design and test
7.6. Phase 1
7.7. Phase 1 Designs
7.8. Phase 1 Testing
7.9. Phase 1 test results
7.10. Phase 1 Redesign
7.11. Implications
7.12. Implications for Cell Design
7.13. Implications for Test
7.14. Implications for FIRST
References for Chapter 7
CHAPTER 8: CONCLUSION
APPENDIX I    The FIRST Hardware Description Language
APPENDIX II   The FIRST Primitive Library
APPENDIX III  Case Studies In System Design with a Silicon
Compiler: An LMS Adaptive Transversal
Filter for Speech Echo Cancellation
APPENDIX IV   List of Publications
To my wife Ann. 
CHAPTER 1
INTRODUCTION 
This thesis is concerned with the problem of how to implement 
digital signal processing systems in silicon integrated circuit 
(IC) technology. Traditional methods for system design have typi-
cally relied on the following sequence of design actions. Initial 
system concepts are explored algorithmically by high level computer 
simulation. A prototype system is conceived in terms of existing 
standard components, typically families of TTL, etc. A prototype 
breadboard system is constructed. There then follows a period of 
testing, debugging and redesign, even perhaps extending to
modification of the system specification. Once the prototype breadboard is
working it is then mapped directly into silicon ICs. This may be 
done using custom or semi-custom methods, depending on the size, 
complexity and bandwidth required of the system under design. 
A number of problems attend the steps of this design sequence. 
Firstly breadboarding in standard parts imposes severe architec-
tural and partitioning strictures. Secondly, there is no inherent 
structuring of design. Thirdly, there are few if any debugging 
aids. Fourthly, a direct mapping of the breadboarded prototype 
into silicon creates gross inefficiency in terms of silicon circuit 
design. In addition to these disadvantages, traditional methods of 
custom IC design are time consuming and costly. 
By way of contrast, progress in IC fabrication technology has 
advanced by orders of magnitude, providing the technological capa-
bility of integrating hundreds of thousands of transistors onto a 
single chip. This throws down a challenge to systems designers: 
that of designing systems requiring this complexity. In other 
words the two significant problems are how to design and what to 
design. The 'what to design' divides into two: mapping known and
improved versions of known systems, and real innovation in the form
of new system invention. The challenge of the 'how to design' is
the discovery of methods for fast, efficient design generation. 
Added to this challenge of technology there are applications 
requirements for miniaturisation of systems for reasons of cost, 
weight and size, power dissipation and reliability; and there are 
economic pressures for fast cheap design. 
One approach to the problem of how to design is to design programs
and use existing computers for hardware. This corresponds to
a direct implementation of the high level simulation design. The 
advantage is that the design cycle is significantly reduced and 
that there are debugging aids and design can be inherently struc-
tured. There are also significant problems. Until very recently 
there were usually no fast enough, small enough, cheap enough com-
puters. Movement in the direction of providing such computers has 
occurred. However, for the foreseeable future these will not be 
capable of addressing high bandwidth, computationally intense 
real-time applications. Further, they always possess inefficiency 
with respect to dedicated, or specific applications. In order to 
fulfil these requirements, custom design is still needed as it pro-
vides the only solution. 
Typically the principal concerns of the system designer in 
addressing the custom implementation of a system are as follows. 
Firstly, he needs some method of design capture which can be used 
to communicate all the necessary details of the system specifica-
tion to the custom design process. Secondly he is interested in 
the time delay between submitting the design specification and 
receiving fabricated parts. Thirdly he is concerned about the 
cost. Fourthly he is concerned about testing the ICs returned to 
him. Here it should be remembered that control and observation of 
nodes on a breadboard of discrete parts is a totally different pro-
position to control and observation of nodes inside an IC. The 
details of silicon circuit design are not of interest to him except 
with respect to the testability of the circuit and with respect to 
power dissipation and the efficient use of silicon area. 
Thus a requirement for rapid, cheap custom prototyping in sil-
icon is identifiable. The objective of the research described here 
is to develop a design environment capable of providing this facil-
ity. The features it should offer will include the following. It 
must provide a simple but powerful language for system design cap-
ture. It must allow flexibility in system architecture and parti-
tioning. It must provide a method for rapidly evaluating alterna-
tive systems prior to fabrication. Generation of mask geometries 
must be automated, but still reasonably efficient, and include a 
guarantee of performance. Testability must be in-built, so that
the necessary test patterns are generated automatically and
give the full required test coverage.
The contents of this thesis are arranged as follows. Chapter 
2 reviews signal processing systems, very large scale
integration (VLSI) and computer aided design (CAD), in order to 
reveal the techniques adopted from these areas and so as to relate 
this research to that of other researchers. Chapter 3 describes 
the design environment which has been developed. Chapters 4 and 5 
give an outline of architectural design techniques, system syn-
thesis and system mapping. These lead to efficient methods for 
system design capture. Chapter 6 gives a brief summary of designs 
completed and uses these to evaluate design capture. Chapter 7 
describes the preliminary hardware fabrication and testing associ-
ated with the development of the design environment. Finally 
Chapter 8 is a short summary and conclusion. 
There are four appendices. The first contains details of a 
version of the hardware description language. The second contains 
descriptions of an initial library of primitive constructs. The 
third contains an example case study of a system design. The 





SIGNAL PROCESSING SYSTEMS, VERY LARGE SCALE INTEGRATION 
AND COMPUTER AIDED DESIGN 
2.1. Introduction 
The research described in this thesis centres on establishing 
methods and tools for the implementation in custom, very large 
scale integrated (VLSI) circuits of digital signal processing sys-
tems using bit-serial architectures. As such, this research is 
interdisciplinary, embracing areas of signal processing,
integrated circuit engineering, system testability, VLSI design 
methodology and computer science. In this chapter a review of the 
relevant, directly related research areas is given. This is done 
so as to set in context the choice of direction, as set out in the 
subsequent chapters. 
2.2. Signals & Signal Processing Systems 
The terms signal processing system and signal have three established
interpretations [1]. Firstly, systems can be defined to be
networks of electrical components (resistors, capacitors, inductors 
and independent sources). Signals are then defined to be the vol-
tages and currents on nodes of the network. This is the approach 
of classical circuit theory which is covered in many standard texts 
including, e.g., [2,3]. Secondly, instead of viewing the elements
of a network as electrical components they can be interpreted as 
mathematical functions, or transforms (multipliers, integrators, 
differentiators etc). Signals are then viewed as functions of 
time, satisfying the equations determined by the system. Such an 
approach is more general than that of circuit theory and has long 
been a topic of study in mathematics, physics, engineering and 
other fields. Again there is a well established body of knowledge 
in this domain as exemplified, e.g., in [4,5]. Thirdly, and more
recently, systems have come to be viewed as digital devices and
signals as sequences of numbers. This approach can be used to
approximately implement the previously known systems, through the theory
and techniques of digital signal processing. Again there is a well
understood body of knowledge in this area with a large literature
including the classical text [6]. Digital systems, however, extend
beyond the limited structures of classical signal processing; these 
also span the area of computing and programmed instruction 
machines. 
This thesis addresses aspects of digital signal processing. 
2.3. Applications 
Applications of signal processing cover an ever widening area, 
including communications, audio recording and broadcast, sonar and 
radar systems. More recently applications in geophysics, speech 
analysis, synthesis and recognition as well as image processing and 
recognition have become very active areas for research. A survey 
of applications can be found in [7].  The development of applica-
tions presupposes certain enabling bodies of knowledge. Firstly an 
experimental and theoretical understanding of the subject area is 
needed. Secondly, appropriate technology for system implementation 
is also needed. Both theoretical studies and the available tech-
nology impose direct constraints on implementations. Reciprocally, 




provide directional pressure on the development of technology. 
The research reported in this thesis is not application specific 
but is primarily concerned with developing a method for rapid sys-
tem implementation. Such a method will be relevant to all the 
above mentioned application areas. 
2.4. System Implementation 
Implementations of digital signal processing systems can be 
either analog or digital. Such classification rests on the method 
of representation of time and signal within the system. Time can 
be represented as a continuous variable, or as a discrete variable. 
In sampled data systems time is a discrete variable. The signal 
(voltage or current etc.) can also be represented as a continuous 
or discrete variable. Signals which are quantised are discrete signals,
referred to as digital signals. Analog systems operate on
continuous-valued signals which are either continuous in time or time-sampled.
Digital systems operate on quantised, time-sampled data. In the 
past, because of the real-time bandwidth requirements of signal 
processing applications, systems have been dominated by analog 
implementations. There are, however, accuracy and matching prob-
lems associated with constructing systems from analog components. 
Digital elements have fundamental advantages with respect to accu-
rate long term memory, elimination of the need for calibration-
adjustment for drift and the flexibility of reprogrammability for 
new requirements. Initially, real-time digital signal processing 
was not feasible. With the development of micro chip technology 
and progress in device scaling and the corresponding increase in 
component performance, digital components can now be designed to 
compete with analog devices in respect of complexity,
bandwidth and cost [8,9]. Further progress in these directions 
assures the superior performance of digital processing in ever 
increasing areas. Thus many analog systems are being and will 
increasingly be superseded by digital replacements. 
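
The distinction drawn above between discretising time (sampling) and discretising amplitude (quantisation) can be illustrated with a minimal sketch. The Python below is not taken from the thesis; the function names, the 1 kHz tone, the 8 kHz sample rate and the 8-bit two's complement word are all assumptions chosen only for illustration.

```python
import math


def sample(signal, sample_rate_hz, duration_s):
    """Discretise time: evaluate a continuous-time function at uniform instants."""
    n_samples = int(round(sample_rate_hz * duration_s))
    return [signal(k / sample_rate_hz) for k in range(n_samples)]


def quantise(samples, n_bits, full_scale=1.0):
    """Discretise amplitude: map each sample to a signed n_bits integer code."""
    levels = 2 ** (n_bits - 1)  # codes span -levels .. levels - 1
    codes = []
    for x in samples:
        code = int(round(x / full_scale * levels))
        codes.append(max(-levels, min(levels - 1, code)))  # saturate at the limits
    return codes


def tone(t):
    """Hypothetical continuous-time input: a 1 kHz sine at 0.9 of full scale."""
    return 0.9 * math.sin(2 * math.pi * 1000.0 * t)


# Time-sampled but still continuous-valued: what a sampled-data analog system handles.
sampled = sample(tone, sample_rate_hz=8000.0, duration_s=0.001)

# Time-sampled and quantised: the digital signal a digital system operates on.
digital = quantise(sampled, n_bits=8)
```

Here `sampled` corresponds to the time-sampled analog case and `digital` to the quantised, time-sampled case described in the text.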
Digital signal processing functions require computationally 
intensive algorithms. Many applications require high real-time 
bandwidths. It is these two demanding requirements which dominate 
implementation. For any given technology, the overriding require-
ment is to design efficient hardware capable of meeting bandwidth 
requirements. 
The important area of real-time digital implementation of signal 
processing systems is addressed in this thesis. 
Implementation is constrained "from above" by application 
area, the required bandwidth and algorithm development. "From 
below" implementation is constrained by the available technology. 
These externally imposed constraints dictate the limits on what 
systems can be implemented in hardware. Intrinsic to implementa-
tion are the problems of design, testability and cost. 
Broad implementation alternatives are: 
standard parts; 
programmable and micro-programmable parts; 
semi-custom design (fixed cell library and gate array); 
full custom design. 
Table 2.1 lists a number of criteria against which to assess any 
implementation method and Table 2.2 gives tentative evaluations 
against the four methods. A precise cost function will depend on 
the market rates for each resource at the given time and their 
estimated rate of change over the period of the implementation. 
SOME ASSESSMENT CRITERIA 
Design time 
Debug & redesign iterations 






Cost of testing 
Cost of production 
Cost of design 
Product security 
Table 2.1 
For digital signal processing, discrete component and full 
custom design have been the most widely used methods, partly 
because they have often been the only feasible methods. This 
situation is a result of the fact that real-time digital signal 
processing largely requires computationally intensive, fixed func-
tion, high speed processing, which cannot be achieved within the 
restrictions of low performance implementation techniques. The 
situation is gradually changing, as higher performance programmable 
digital signal processors are developed and as gate array technol-
ogy improves. However there still remain a large number of systems 
which can only be implemented in custom designs. 
COMPARISON BETWEEN IMPLEMENTATION METHODS
                                               GATE
                       TTL         µP/DSP      ARRAY    CUSTOM      COMPILED
ARCHITECTURE CHOICE    flexible    fixed       limited  flexible    constrained
DESIGN TIME            high/mod.   medium      low      very high   very low
DEBUG & REDESIGN       at design   at design   few      several     not necessary
COMPLEXITY MANAGEMENT  no aids     variety     some     few         comprehensive
POWER DISSIPATION      high        high        medium   low         low
SIZE                   large       med./large  medium   smallest    small
BANDWIDTH              wide range  low/med.    medium   wide range  wide range
TESTABILITY            DEPENDS ON DESIGN                            very good
TEST COST              high        high        medium   high        low




The research described in this thesis addresses the area of custom 
design. 
2.5. Architectures 
Implementation is primarily concerned with design. This 
involves the organisation of components (sometimes thought of as 
resources) to create hardware which carries out a required function 
with the specified performance. In a formal sense design is 
derived from a specification. At the highest level the components 
or resources available can be classified as arithmetic, logic, 
storage, control and communications. The system architecture is 
defined to be the particular organisation and structure of these 
components in the system. Two contrasting examples of a system 
architecture are the von Neumann architecture and the systolic 
array architecture. In a similar way, complex components have an 
internal architecture comprising their circuit elements. Examples 
of component architectures can be found in multipliers, adders, 
RAMs, ROMs etc. A chip architecture may be a system architecture
or a component architecture; the distinction is largely a question
of complexity and self-containedness, and may be reflected in the
degree of integration (MSI & LSI chips are component architectures, 
whereas LSI & VLSI chips may be system architectures). Examples of 
chip architectures can be found amongst the many micro-processors, 
controllers and other custom designed chips commercially available 
today. 
Architectures can be classified in a variety of ways. Clas-
sifications are based on some feature of the resource organisation. 
Thus there are synchronous architectures and asynchronous (or 
self-timed) architectures, a classification with respect to commun-
ication. Structuring of communication or processing gives rise to 
a classification into bit-serial or bit-parallel architectures. 
Classification with respect to control function distinguishes 
hard-wired and programmable architectures. As a further refinement 
there are SISD (single instruction single data stream), SIMD (sin-
gle instruction multiple data), MISD (multiple instruction single 
data) and MIMD (multiple instruction multiple data) architectures 
[10]. Further, classification according to arithmetic processing 
is also possible giving fixed-point, floating-point, twos comple-
ment, canonic sign digit and distributed arithmetic, etc., archi-
tectures. 
2.6. Architectures for Digital Signal Processing 
The importance of architecture to the implementation of signal 
processing systems is evidenced by the many papers which emphasise 
it [11,12,13,14,15,16,17,18,19]. 
For any specific system, an evaluation of the alternative 
architectural possibilities can be made. Conversely, the classes 
of system suitable for implementation in a particular architecture 
can be investigated. Attempts to develop general purpose implemen-
tation methods must address classes of systems specific to one or 
more architectures. 
The major concern of this research was to apply appropriate 
design automation techniques to the implementation of real-time 
signal processing systems. At the start of the project, the state 
of research in design automation recognised that short to medium 
term success was obtainable only through implementations which 
were targeted at constrained architectures [20,21]. Thus the correct
choice of constrained architecture becomes crucial. 
With respect to communication and processing, two fundamen-
tally different architectural styles have been developed in signal 
processing systems. These are bit-serial architectures and bit-
parallel architectures. In this section a brief comparison of the 
features of each is given. This is done to justify the choice of 
architecture. 
The main physical features of bit-serial architectures are 
single wire communication and small area processing elements
(order one or order n, where n is the number of bits per word).
Arithmetic, logic and communication all operate with a bit-wise
organisation, which is extended in time for each full word. By
contrast, the main physical features of bit-parallel architectures
are multi-wire communication buses and large area processing
elements (order n or order n squared). Arithmetic, logic and
communication all operate with a word-wise organisation which is extended
in space. The consequences of this are as follows. 
Bit-serial architectures have order one or order n area
compared with order n or order n squared area for bit-parallel
architectures. This relates to communication area, to pro-
cessing area and to control function area. Storage area is 
approximately equivalent for both. 
As a result of the reduced number of circuit elements per unit 
function in bit-serial architectures, power dissipation might 
be expected to be reduced by a factor n as compared with bit-
parallel architectures. However, there is usually additional
saving due to the absence of multiple bus drivers and organi-
sational overheads. 
Component operating speed, as measured by maximum clock rate 
for clocked schemes, is determined by the unit element maximum 
operating delay. In bit-serial parts this is usually that of 
a full adder. In any event it is independent of word-length. 
In bit-parallel parts the maximum operating delay is usually 
determined by carry propagation and depends on word-length 
(typically order n or order log n, at added area cost). This
unit element maximum operating delay forms the upper bound on 
system operating speed. Consequently, much tighter pipelining 
is possible in bit-serial architectures than is the case in 
bit-parallel architectures. Another factor reinforcing this 
effect is the nature of communication delays in each architec-
ture. 
Cost per unit function is approximately inversely proportional 
to yield, which is approximately inversely proportional to an 
exponential function of area. Thus bit-serial components pos-
sess a considerable cost advantage per unit function. 
Bit-serial function is best suited to fixed (dedicated or 
hard-wired) configuration, though general purpose architec-
tures are being pioneered [22]. Bit-parallel architectures 
are dominated by the instruction based ALU or data path 
models. Programmability is inherently easy to build in, with 
low overheads. The flexibility which this allows gives a con-
siderable advantage over the restrictions of a dedicated, 
non-reconfigurable architecture. Thus fixed function architectures
are less common and in general bit-parallel architectures
do have an advantage over bit-serial architectures in
this respect. 
With respect to overheads, bit-parallel architectures tend to 
require more overheads in the shape of communications busses, 
additional registers, and more complex control than is the 
case with bit-serial architectures. 
In system design the importance of partitioning is recognised, 
as for example in [23,11]. Ease of partitioning is jointly a 
function of the amount of communication that crosses a parti-
tion boundary and the size of objects to be allocated to each 
partition. In respect of both of these, as also in terms of 
pin count, the bit-serial architectures have a marked advan-
tage. 
The cost of functional parallelism is much less in bit-serial
architectures than is the case in bit-parallel, as is exemplified
in [24].
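
The area and speed points in the comparison above can be made concrete with a minimal behavioural sketch of a bit-serial adder. This is an illustration only, not code from the thesis or from the FIRST library: the function names are hypothetical, and the least-significant-bit-first word ordering and two's complement format are assumptions made so that the carry can ripple forward in time.

```python
def to_bits_lsb_first(value, n):
    """Two's complement encode `value` as n bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits_lsb_first(bits):
    """Decode an LSB-first two's complement bit list back to a signed integer."""
    n = len(bits)
    unsigned = sum(bit << i for i, bit in enumerate(bits))
    return unsigned - (1 << n) if bits[-1] else unsigned


def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def bit_serial_add(a_bits, b_bits):
    """Add two words one bit per clock cycle using a single full adder.

    The only state is a one-bit carry register, cleared at the start of a word.
    """
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):  # one loop iteration models one clock cycle
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out  # the final carry is discarded, so the sum wraps modulo 2**n


word_length = 8
result = bit_serial_add(to_bits_lsb_first(5, word_length),
                        to_bits_lsb_first(-3, word_length))
total = from_bits_lsb_first(result)
```

For any word length the loop body is the same single full adder and carry register; only the number of clock cycles grows with n. A ripple-carry bit-parallel adder would instead instantiate n full adders, and its carry would have to traverse all of them within one clock period.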
Other important criteria are modularity and extensibility. 
Both architectures can be organised to exhibit these. The cost is 
a function of the features outlined above and tends to be less for 
bit-serial architectures. Overall system performance will depend 
on the composite effects of the single feature cost functions out-
lined above. Simple metrics such as speed area product, and speed 
squared area product have been proposed [25] but these do not cover 
all aspects which need to be taken into consideration. 
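
The yield argument made in the comparison above can be sketched numerically. The text says only that yield falls off as an exponential function of area; the sketch below assumes the simple Poisson form, yield = exp(-area x defect density), which is one common choice and not necessarily the model the author had in mind. The areas and defect density are invented figures, and silicon area cost itself is deliberately left out so that only the yield effect is shown.

```python
import math


def poisson_yield(area_cm2, defect_density_per_cm2):
    """Assumed Poisson yield model: fraction of fault-free parts."""
    return math.exp(-area_cm2 * defect_density_per_cm2)


def relative_cost(area_cm2, defect_density_per_cm2):
    """Cost factor taken as 1 / yield, following the proportionality in the text."""
    return 1.0 / poisson_yield(area_cm2, defect_density_per_cm2)


def cost_ratio(area_a_cm2, area_b_cm2, defect_density_per_cm2):
    """How many times more costly part A is than part B under this model."""
    return (relative_cost(area_a_cm2, defect_density_per_cm2)
            / relative_cost(area_b_cm2, defect_density_per_cm2))


# Hypothetical figures: a bit-parallel part of 0.8 cm^2 against a bit-serial
# part of 0.1 cm^2 providing the same function, at 2 defects per cm^2.
ratio = cost_ratio(0.8, 0.1, 2.0)
```

Under these assumptions the ratio reduces to exp(defect density x area difference), so the penalty for the larger part grows exponentially with the extra area rather than linearly.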
The advantages of bit-serial architectures have long been 
recognised for dedicated applications [26] and many systems have 
been built using this architecture as evidenced by 
[27,28,29,30,31,32,33]. An architectural methodology for the 
design of bit-serial signal processors has been developed, as set 
out in [12]. This approach constituted a major influence on the
research reported in this thesis. The wide range of architectural
variants available within a bit-serial overall organisation is
well set out in this text.
Recently, bit-parallel architectures have become increasingly
popular for low bandwidth applications. The trend in this area is 
to design general purpose programmable signal processors and to 
customise applications by programming them [34,35,36,37,38,39,40]. 
The research of this thesis has chosen to restrict its investiga-
tions to bit-serial, dedicated architectures principally for appli-
cations beyond the capabilities of the programmable, general pur-
pose micro-processor. 
2.7. IC Technology and System Implementation 
System implementation is constrained by the hardware technol-
ogy available. Over the past twenty years integrated circuit tech-
nology has progressed from single devices through small scale 
integration (SSI), medium scale integration (MSI), large scale 
integration (LSI), to very large scale integration (VLSI) and 
beyond. Definitions of levels of integration tend to vary from 
author to author. A rough characterisation in terms of transistor 
count might give SSI in the range ten to one hundred, MSI in the 
range one hundred to one thousand, LSI in the range one thousand to 
ten thousand and VLSI as being above ten thousand. 
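
The rough characterisation above can be written down as a small lookup, purely as an illustration. The function name is hypothetical, and two details are assumptions that the text does not settle: each boundary value (100, 1000, 10000) is assigned to the higher level, and counts below ten are labelled "single devices".

```python
def integration_level(transistor_count):
    """Return the rough integration level for a given transistor count."""
    if transistor_count < 1:
        raise ValueError("transistor_count must be at least 1")
    if transistor_count < 10:
        return "single devices"  # assumed label for counts below the SSI range
    if transistor_count < 100:
        return "SSI"
    if transistor_count < 1_000:
        return "MSI"
    if transistor_count < 10_000:
        return "LSI"
    return "VLSI"


levels = [integration_level(n) for n in (50, 500, 5_000, 200_000)]
```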
Integrated circuit engineering can be viewed as having two 
main components: design and manufacture. Originally design was 
simple (in terms of the number of devices, their logic and
interconnection). It was dominated by circuit engineering and device
physics. Manufacturing technology formed the research and development
frontier. With progress in this technology the following changes
took place. Firstly, the number of devices that could be placed on
a chip increased by orders of magnitude. Secondly, the manufacturing
processes became well understood, replicable and characterised.
The impact of these developments on design was that the technology 
rapidly became commensurate with the requirements for complex sys-
tem implementation. The early design methods, appropriate to small 
scale circuit engineering, were not suited to dealing with systems 
comprising from tens or hundreds of thousands to millions of 
transistors. In this context, design has become a major research 
topic. There is a further fundamental reason for the importance of 
design research as expressed in [41],  where Mead states: 
...After an evolution of six or more orders of magni-
tude in the most important metrics of the underlying 
technology, we are still using the same conceptualisa-
tion of computing that was common in the era of vacuum 
tubes and core memories. A quantitative improvement of 
many orders of magnitude makes a qualitative difference 
in the way one must conceptualise a field. What 
started as an evolution in cost reduction must, sooner 
or later, lead to a true revolution in our basic thinking.
Gradually, as design becomes more complex, a division of labour
becomes necessary. Thus circuit design engineering, system design 
engineering and fabrication engineering separate out and there is a 
requirement for efficient interfaces between these areas. In the 
early history of integrated circuit design, circuit engineering 
dominated chip design. For VLSI technology system design 
engineering must dominate chip design, for reasons of economy, 
efficiency and testability. 
Potential cost escalation and communication crises were
foreseen in the area of chip design. To avoid these, and in
conjunction with engineering education, an initiative in VLSI systems
design was launched in the USA, with considerable impact on
subsequent trends in chip design. The objectives of this program are
set out in [42] and its success can be gauged from the results,
reviewed in [43]. The pedagogical program became accessible to a
wide circle of engineers through the text [44].
The major evolutions resulting from this movement were as
follows. Firstly, processing was "factored" out from design by
setting up a silicon foundry service [45] and defining "interfaces"
into this, so that engineers from small scale organisations as well
as universities (groups traditionally not directly involved with a
processing or design house) could gain access to fabrication.
Secondly, standards for data format were proposed [44,46,47] and
used; these are the simplified design rules and Caltech Intermediate
Form (CIF). Thirdly, computer programming and the methodology
of software design started to have a direct impact on chip design.
Further discussion of this last influence is presented in the next
section.
These developments opened up the area of VLSI systems on silicon. 
This thesis attempts to contribute to this area. 
Historically the developments leading to VLSI were associated 
with MOS technology. Further, the implementation technology made 
available for this research was nMOS technology. Thus an 
evaluation and choice of technology are not subjects addressed here 
but can be found adequately treated elsewhere, e.g. [48,49]. There 
are, however, issues of "technology binding" which are nonetheless 
of major importance. These concern the re-implementation of 
designed systems in a new technology as this becomes available and 
the cost and time involved in doing this. Any research into design 
must take account of this. 
The objective of the research reported in this thesis is to develop 
means for designing complex signal processing systems in silicon so 
as to enable system engineers to have easy access to integrated 
circuit technology, without the expense and delay of custom design. 
This program of research is based on the assumption that there is 
and will continue to be a trend towards rapid, cheap silicon
fabrication for small-volume prototyping and experimental system
development, as has been seen in the USA.
It is now appropriate to list the known VLSI problem areas. 
Firstly there are the technology and scaling problems [50]. 
Secondly there are the timing, clocking and communications problems 
[51,52]. Thirdly there are power distribution and dissipation 
problems associated with scaling, increased speed and increased
circuit density [50,53]. Finally, there are the design associated
problems [41,54,55,56,57]. These include: algorithm development,
complexity management, specification, partitioning, technology
independence and technology binding, physical design generation
(including floor planning, placement and routing), simulation,
verification, testing, documentation and, finally, design time and
cost.
The first three areas require solutions developed jointly in
technology and design, and are not areas which this research
addresses. The remaining chapters propose and evaluate one possible
solution to the design associated problems.
2.8. VLSI Design Methodology and Computer Aided Design 
Ideas developed in design methodology and computer aided
design have been applied in order to solve these design problems.
In this section a brief review of the key ideas is given.
Until the mid 1970s, CAD for IC design consisted of dissoci-
ated specific programs, mainly graphics editors, circuit simulators 
and data post-processing programs, including some layout rule 
checking [58]. As the level of integration increased, many prob-
lems in using these aids started to become evident, until it became 
clear that the currently available aids could not handle the com-
plexity of the circuits to be designed. Also, a cost escalation 
crisis for custom design precipitated pressure to develop new ways 
of designing integrated circuits. A major problem of traditional 
custom design was the cost required for the necessary number of man 
years per project. This comprised a large initial value with 
perhaps the same again for debugging and redesign before a successful
product could be released. It was imperative that new techniques
and tools be invented. This situation inspired research
into design methodologies and led to structured, hierarchical
design and a drive towards constructing methodology-based CAD
tools. Only in this way could complexity be handled, design cycles 
shortened and costs cut. Another key concept was that of bound 
structural, behavioural and physical descriptions [59,60]. As a 
result, anything other than an integrated, hierarchical design 
environment is now seen as both inadequate and unsatisfactory. 
This was the first significant trend to influence the present
decade of CAD, namely a movement towards structured, hierarchical,
methodology-based, integrated design environments.
The second trend occurred simultaneously and is, if anything,
of even greater importance. This trend originated from the area of
programming languages and artificial intelligence. Circuits are
composed of collections of comparatively small cells. These cells can
be viewed as families of related variants. Instead of generating
chip designs using graphics aids by placing fixed instances of
cells, a structure of parameterised cells can be generated by
computer program, through procedural definition. Implementation of
this idea leads to the creation of embedded languages for IC design.
Early examples of these can be found in [61,62]. Two main types of
embedded language have been developed: translation based languages
and data structure based languages (cf. chapter 3 of [63]). In the
former, graphic primitives are output as they are encountered. In
the latter, a data structure is built up for the entire design
before it is output.
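The distinction between the two types of embedded language can be
illustrated by a minimal Python sketch. It is not drawn from any of
the languages cited. The primitive format, the function names and the
bounding box calculation are all hypothetical, and serve only to show
that a global property of the design is available before output in
the second style but not in the first.

```python
def box(x, y, w, h):
    """A graphic primitive: an axis-aligned rectangle (hypothetical format)."""
    return ("box", x, y, w, h)


def row_of_cells(n, pitch):
    """Procedural description: yield one box per cell in a row."""
    for i in range(n):
        yield box(i * pitch, 0, pitch - 2, 10)


def translate(design, out):
    """Translation based style: emit each primitive as it is encountered."""
    for prim in design:
        out.append(prim)  # nothing is retained or revisited


def build_then_output(design, out):
    """Data structure based style: build the whole design, then output it."""
    layout = list(design)  # complete in-memory description
    width = max(x + w for _, x, _, w, _ in layout)
    out.append(("bbox", 0, 0, width, 10))  # global fact known before output
    out.extend(layout)


streamed, structured = [], []
translate(row_of_cells(3, 12), streamed)
build_then_output(row_of_cells(3, 12), structured)
print(streamed)
print(structured)
```

Because the second style holds the entire design before any output is
produced, whole-design properties can be computed or altered first.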
The concept that chip design can be done by writing computer 
programs which generate the mask geometries has had "revolutionary" 
importance. This can best be understood by looking at the process 
of composing a chip from small previously defined cells, usually 
called leaf cells. In this discussion a comparison is made between,
on the one hand, graphics based cell design and assembly and, on the
other hand, design and assembly based on programming languages.
Classes of complex structures, arrays in particular, can
be efficiently described in terms of loops and conditionals.
Further, if the definitions of the pre-defined cells are not fixed, 
but are parameterised, then evaluation of parameters can occur 
within such loop and conditional expressions. By way of contrast, 
the assembly of fixed cells by means of graphics editing requires a 
large collection of cells and a time consuming process of assembly. 
Further it results in only a single structure. Additionally, the 
size of the cells to be assembled has an upper bound imposed by the 
graphics system screens and operator limitations. By contrast, a 
greater range of cell size can be generated by parameterised pro-
cedural definition. More significantly, if cell definition 
requires instantiation of all its internals then it is fixed and, 
on assembly, no alteration is possible. Potentially much more 
efficient layouts can be generated if there is the flexibility to 
absorb global assembly constraints into cell definitions. With
parameterisation of definitions there is the possibility to do this
and the way is opened for computerised search for optimised
solutions, if suitable algorithms can be found. This approach,
however, is at present in its infancy. Developments along these lines
lead through silicon assembly (which is usually based on translator
embedded languages) to silicon compilation [63-80]. Some of these
have since reached commercial "maturity" [78,81,82,83]. Silicon
compilation, if based on a data structure embedded language, does have
the potential to search for optimisation. The success of this
approach to chip design was evidenced by the early designs
generated by this method and their successful fabrication [84,85,20].
Finally, this development has opened the way for the development of
the so called "intelligent, knowledge-based systems" for IC design
[86].
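The use of loops, conditionals and parameterised cell definitions to
describe an array can be sketched as follows. This is a minimal
Python illustration under stated assumptions: the cell record, the
"edge" and "core" variants and all names are hypothetical, and no
real mask geometry is produced.

```python
def cell(x, y, width, height, kind):
    """Parameterised leaf cell: its geometry is computed, not fixed."""
    return {"kind": kind, "x": x, "y": y, "w": width, "h": height}


def cell_array(rows, cols, cell_w, cell_h):
    """Generate a rows x cols array; edge columns get a different variant."""
    cells = []
    for r in range(rows):
        for c in range(cols):
            # A conditional selects the cell variant from its position.
            kind = "edge" if c in (0, cols - 1) else "core"
            cells.append(cell(c * cell_w, r * cell_h, cell_w, cell_h, kind))
    return cells


array = cell_array(rows=2, cols=4, cell_w=20, cell_h=30)
for placed in array:
    print(placed)
```

Changing the arguments of cell_array yields a different structure
from the same definition, whereas assembly of fixed cells in a
graphics editor results in only a single structure.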
The early silicon compilers have all been architecture
specific. Their success was possible because of this. The first
architecture to be investigated was the data path architecture
[63,21]. Subsequently the ideas of silicon compilation have been
applied to gate arrays [78] and to a variety of signal processing
architectures [20,87,88]. The research reported in this thesis
proceeded concurrently with these developments. Progress has been
reported elsewhere in [89,90,91,92,93,94].
Another important strand of research, suitable for incorpora-
tion into the framework of silicon compilation applied to system 
design, can be found in the high level algorithmic research 
reported in [56,95,96,97,98,99,100]. 
The main shortcoming of the early silicon compilers was
that they addressed, almost exclusively, only the physical design
problem. In signal processing, performance is a dominant
constraint. To date the conventional silicon compiler approach to
performance has been either to analyse it "post-design" or to ignore
the problem. The problems of binding to behavioural description,
verification, performance and testability are crucially important 
areas which were not adequately addressed. Further inadequacy can 
be seen in the way technology binding was achieved and the problems 
of translation to a new technology. 
The research reported in the remainder of this thesis has the 
objective of addressing with equal emphasis structural, physical, 
behavioural, performance and testability requirements. The tools 
and techniques for achieving this were assessed as being attainable 
in the three year time scale. The problem of technology binding 
remained only partially solved, mainly because the prerequisites 
for a satisfactory solution were not available at the start of the 
project. Progress in this direction has, however, been made and is
reported in [101,102]. 
2.9. Summary
A brief review of the background research areas of relevance 
to this thesis has been given. The objectives of the research have 
also been identified and set in context. These may be summarised 
as the development of methods and tools which permit the rapid, 
"first-time right" design, the evaluation and the implementation of 
complex digital processing systems in IC technology. The approach 
adopted incorporates the use of program language definition for 
specification and a method (the system mapping method) for 
transforming a class of general algorithms into this language. It 
utilises the techniques of embedded languages and parameterised 
procedural generation of physical design. It features bound func-
tional, behavioural and structural descriptions with an economic 
verification strategy, based on simulation and data base "proving". 
Performance is addressed from the outset through the use of custom, 
parameterised macro cell engineering, through the definition of 
communication standards and through the bit-serial, pipelined 
architecture chosen. Spectacularly economical system testability 
is achievable as a result of the architecture chosen and through 
the use of the system mapping method. For reasons of availability 
and widespread use, the implementation technology used is five 
micron nMOS polysilicon gate technology. The arithmetic format 
chosen is fixed point two's complement and communication follows
standard, two phase, non-overlapping clocked design. The realisation
of these objectives and an evaluation of the results are
presented in the remainder of this thesis.
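The bit-serial treatment of fixed point two's complement data can be
illustrated by a short sketch. The Python fragment below models a
serial adder as a single full adder with a stored carry, consuming
one pair of bits per clock period. It assumes that words are
transmitted least significant bit first and that both operands share
a common word length. These assumptions and all function names are
hypothetical, and the two phase clocking is not modelled.

```python
def to_bits(value, width):
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Interpret LSB-first bits as a signed two's complement integer."""
    raw = sum(b << i for i, b in enumerate(bits))
    return raw - (1 << len(bits)) if bits[-1] else raw


def serial_add(a_bits, b_bits):
    """One full adder plus a carry store, fed one bit pair per clock."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out  # final carry discarded: the result wraps to the word width


total = serial_add(to_bits(5, 8), to_bits(-3, 8))
print(from_bits(total))
```

The same adder serves any word length, since the word length
determines only the number of clock periods and not the hardware.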
It should be noted that there are two external assumptions upon
which the success of this project depends. These are,
firstly, that fast turn-around silicon processing is available for
testing and prototyping and, secondly, that the cost of such a
facility will be comparatively low, or at least competitive with
that of the alternatives.
References
1. A. Papoulis, Circuits and Systems: A Modern Approach,
Holt-Saunders (1980).
2. P. R. Adby, Applied Circuit Theory, Wiley (1980).
3. J. Millman, Microelectronics: Digital and Analog Circuits and
Systems, McGraw-Hill (1979).
4. T. Kailath, Linear Systems, Prentice Hall (1980).
5. M. Schwartz and L. Shaw, Signal Processing, McGraw-Hill
(1975).
6. L. R. Rabiner and B. Gold, Theory and Application of Digital
Signal Processing, Prentice Hall (1975).
7. A. V. Oppenheim (Ed.), Applications of Digital Signal
Processing, Prentice Hall (1978).
8. C. M. Rader, "On Digital Filtering," IEEE Transactions on
Audio and Electroacoustics Vol. AU-16(3) pp. 303-313 (September
1968).
9. H. J. Whitehouse and K. Bromley, "Can Analog Signal Processing
Survive the VHSIC Challenge?," Microwave Systems News, pp.
91-98 (April 1981).
10. M. J. Flynn, "Very High Speed Computing Systems," Proc. IEEE
Vol. 54 pp. 1901-1909 (December 1966).
11. E. E. Swartzlander Jr., "VLSI Architecture," pp. 178-221 in
Very Large Scale Integration (VLSI) Fundamentals and
Applications, ed. D. F. Barbe, Springer Verlag (1980).
12. R. F. Lyon, "A Bit-Serial VLSI Architectural Methodology for
Signal Processing," pp. 131-140 in VLSI 81: Very Large Scale
Integration, ed. J. P. Gray, Academic Press (1981).
13. J. Allen, "VLSI Architectures for Signal Processing," pp.
242-254 in VLSI Architecture, ed. B. Randell, P. C.
Treleaven, Prentice Hall (1983).
14. B. A. Bowen and W. R. Brown, VLSI Systems Design for Signal
Processing, Prentice Hall (1982).
15. P. B. Denyer, "An Introduction to Bit-Serial Architectures for
VLSI Signal Processing," pp. 225-241 in VLSI Architecture, ed.
B. Randell, P. C. Treleaven, Prentice Hall (1983).
16. P. R. Cappello and K. Steiglitz, "Completely-Pipelined
Architectures for Digital Signal Processing," IEEE Transactions on
Acoustics, Speech and Signal Processing Vol. ASSP-31(4) pp.
1016-1023 (August 1983).
17. P. C. Treleaven, "Decentralised Computer Architectures for
VLSI," pp. 348-380 in VLSI Architecture, ed. B. Randell, P. C.
Treleaven, Prentice Hall (1983).
18. D. J. Kinniment, "VLSI and Machine Architecture," pp. 24-33 in
VLSI Architecture, ed. B. Randell, P. C. Treleaven, Prentice
Hall (1983).
19. F. Anceau, "VLSI-Processor Architecture and Design," pp.
138-148 in VLSI Architecture, ed. B. Randell, P. C.
Treleaven, Prentice Hall (1983).
20. J. M. Siskind, J. R. Southard, and K. W. Crouch, "Generating
Custom High Performance VLSI Designs from Succinct Algorithmic
Descriptions," pp. 28-39 in Proceedings, Conference on
Advanced Research in VLSI, ed. P. Penfield Jr., Artech House
Inc. (1981).
21. H. E. Shrobe, "The Data Path Generator," pp. 175-181 in
Proceedings, Conference on Advanced Research in VLSI, ed. P.
Penfield Jr., Artech House Inc. (1981).
22. W. P. Burleson, "A Programmable Bit-Serial Processing Chip,"
M.Sc. Thesis, M.I.T. (1983).
23. M. Stefik, D. G. Bobrow, A. Bell, H. Brown, L. Conway, and C.
Tong, "The Partitioning of Concerns in Digital System Design,"
pp. 43-52 in Proceedings, Conference on Advanced Research in
VLSI, ed. P. Penfield Jr., Artech House Inc. (1981).
24. P. P. Reusens, R. W. Linderman, and W. H. Ku, "CUSP: A Custom
VLSI Processor for Digital Signal Processing Based on the Fast
Fourier Transform," Proceedings, IEEE International Conference
on Computer Design, (1983).
25. J. E. Savage, "Planar Circuit Complexity and the Performance
of VLSI Algorithms," pp. 61-68 in VLSI Systems and
Computations, ed. H. T. Kung, B. Sproull, G. Steele, Springer Verlag
(1981).
26. L. B. Jackson, J. F. Kaiser, and H. S. McDonald, "An Approach
to the Implementation of Digital Filters," IEEE Transactions
on Audio and Electroacoustics Vol. AU-16(3) pp. 413-421
(September 1968).
27. N. R. Powell and J. M. Irwin, "Signal Processing with
Bit-Serial Word-Parallel Architectures," SPIE Vol. 154 Real Time
Signal Processing pp. 98-104 (1978).
28. N. R. Powell, "Functional Parallelism in VLSI Systems and
Computations," pp. 41-49 in VLSI Systems and Computations, ed. H.
T. Kung, B. Sproull, G. Steele, Springer Verlag (1981).
29. A. P. Reeves and J. D. Bruner, "Efficient Function
Implementation for Bit-Serial, Parallel Processors," IEEE Transactions
on Computers Vol. C-29(9) pp. 841-844 (September 1980).
30. M. R. Buric and C. A. Mead, "Bit-Serial Inner Product
Processors in VLSI," Proceedings, Caltech Conference on VLSI,
(1981).
31. R. W. Linderman, P. P. Reusens, P. M. Chau, and W. H. Ku,
"Digital Signal Processing Capabilities of CUSP, a High
Performance Bit-Serial VLSI Processor," Proc. IEEE ICASSP84,
pp. 16.1.1 - 16.1.4 (San Diego, March 1984).
32. Norihiko Ohwada, Tadakatsu Kimura, and Masanobu Doken, "LSI's
for Digital Signal Processing," IEEE Transactions on Electron
Devices Vol. ED-26(4) pp. 292-298 (April 1979).
33. D. J. Myers and P. A. Ivey, "STAR -- A VLSI Architecture for
Signal Processing," pp. 179-183 in Proceedings, Conference on
Advanced Research in VLSI, ed. P. Penfield Jr., Artech House
Inc. (1984).
34. Akira Sawai, "Programmable LSI Digital Signal Processor
Development," pp. 29-40 in VLSI Systems and Computations, ed.
H. T. Kung, B. Sproull, G. Steele, Springer Verlag (1981).
35. B. Fette, D. Harrison, D. Olson, and S. P. Allen, "A Family of
Special Purpose Micro-Programmable Digital Signal Processor
IC's in an LPC Vocoder System," IEEE Journal of Solid State
Circuits Vol. SC-18(1) pp. 25-33 (February 1983).
36. S. P. Pope, B. Solberg, and R. W. Brodersen, A Single-Chip LPC
Vocoder, Department of Electrical Engineering and Computer
Science, University of California, Berkeley.
37. TMS32010 Data Manual, Texas Instruments (1982).
38. Intel 2920 Signal Processor, Intel (1983).
39. "AMD2900 Family CPU, Sequencers, Peripherals, Interface,"
in Advanced Micro Devices Data Book, Advanced Micro Devices
(1982).
40. K. Okada, T. Ehara, H. Suzuki, K. Yanagida, K. Saito, and N.
Ichiura, "A Digital Signal Processor Module Architecture and
its Implementation Using VLSI," Proc. IEEE ICASSP84, pp.
44.5.1 - 44.5.4 (San Diego, March 1984).
41. C. Mead, "VLSI and the Foundations of Computing," in
Information Processing 83, ed. R. E. A. Mason, Elsevier Science
Publishers (September 1983).
42. L. Conway, A. Bell, and M. E. Newell, "MPC79: The Large-Scale
Demonstration of a New Way to Create Systems in Silicon,"
Lambda Vol. 1(2) pp. 10-19 (2nd Quarter 1980).
43. C. Lewicki, D. Cohen, P. Losleben, and D. Trotter, "MOSIS
Present and Future," pp. 124-128 in Proceedings, Conference on
Advanced Research in VLSI, ed. P. Penfield Jr., Artech House
Inc. (1984).
44. C. Mead and L. Conway, Introduction to VLSI Systems,
Addison-Wesley (1980).
45. D. J. Fairbairn, "The Silicon Foundry: Concepts and Reality,"
Lambda Vol. 2(1) pp. 16-26 (1st Quarter 1981).
46. R. F. Lyon, "Simplified Design Rules for VLSI Layouts," Lambda
Vol. 2(1) pp. 54-59 (1st Quarter 1980).
47. C. H. Sequin, "Generalised IC Layout Rules and Layout
Representations," pp. 13-23 in VLSI 81: Very Large Scale
Integration, ed. J. P. Gray, Academic Press (1981).
48. A. Barna, VHSIC: Technologies and Tradeoffs, Wiley (1981).
49. A. B. Glaser and G. E. Subak-Sharpe, Integrated Circuit
Engineering, Addison-Wesley (1979).
50. J. L. Prince, "VLSI Device Fundamentals," pp. 4-41 in Very
Large Scale Integration (VLSI) Fundamentals and Applications,
ed. D. F. Barbe, Springer Verlag (1980).
51. C. Mead and L. Conway, Introduction to VLSI Systems,
Addison-Wesley (1980). Chapter 7 on System Timing by C. Seitz.
52. J. Alves Marques and A. Cunha, "Clocking of VLSI Circuits,"
pp. 165-178 in VLSI Architecture, ed. B. Randell, P. C.
Treleaven, Prentice Hall (1983).
53. W. S. Song and L. A. Glasser, "Power Distribution Techniques
for VLSI Circuits," pp. 45-52 in Proceedings, Conference on
Advanced Research in VLSI, ed. P. Penfield Jr., Artech House
Inc. (1984).
54. C. Seitz, "Ensemble Architectures for VLSI: A Survey and
Taxonomy," pp. 130-135 in Proceedings, Conference on Advanced
Research in VLSI, ed. P. Penfield Jr., Artech House Inc.
(1981).
55. M. Rem, "The VLSI Challenge: Complexity Bridling," pp. 65-74
in VLSI 81: Very Large Scale Integration, ed. J. P.
Gray, Academic Press (1981).
56. H. T. Kung, "Let's Design Algorithms for VLSI Systems,"
Department of Computer Science Report, Carnegie Mellon
University (1979).
57. P. Losleben, "Computer Aided Design for VLSI," pp. 89-127 in
Very Large Scale Integration (VLSI) Fundamentals and
Applications, ed. D. F. Barbe, Springer Verlag (1980).
58. A. R. Newton, "Computer Aided Design of VLSI Circuits,"
Proc. IEEE Vol. 69(10) pp. 1189-1199 (October 1981).
59. I. Buchanan, "Modelling and Verification in Structured
Integrated Circuit Design," PhD Thesis, University of
Edinburgh, Computer Science Department (July 1980).
60. C. A. Mead, "Structural and Behavioural Composition of VLSI,"
pp. 3-8 in Proceedings of the IFIP International Conference on
VLSI, ed. F. Anceau, E. J. Aas, North Holland (1983).
61. B. Locanthi, "LAP: A Simula Package for IC Layout," Caltech
SSP Report, California Institute of Technology (1978).
62. R. F. Ayres, "IC Design under ICL, Version 1," Caltech SSP
Report, California Institute of Technology (1978).
63. D. L. Johannsen, "Silicon Compilation," PhD Thesis,
California Institute of Technology (1981).
64. C. R. Rupp, "Components of a Silicon Compiler," pp. 227-236 in
VLSI 81: Very Large Scale Integration, ed. J. P. Gray, Academic
Press (1981).
65. R. F. Ayres, VLSI Silicon Compilation and the Art of Automatic
Chip Design, Prentice Hall (1983).
66. F. Anceau and J. P. Schoellkopf, "CAPRI: A Silicon Compiler
for VLSI Circuits Specified by Algorithms," pp. 149-154 in
VLSI Architecture, ed. B. Randell, P. C. Treleaven, Prentice
Hall (1983).
67. F. Anceau, "CAPRI: A Design Methodology and a Silicon Compiler
for VLSI Circuits Specified by Algorithms," pp. 15-31 in 3rd
Caltech Conference on VLSI, ed. R. Bryant, Computer Science
Press (1983).
68. A. A. Szepieniec, "SAGA: An Experimental Silicon Compiler,"
19th Design Automation Conference, pp. 365-370 (1982).
69. G. Brebner and D. Buchanan, "On Compiling Structural
Descriptions to Floorplans," Proceedings, IEEE International
Conference on Computer Aided Design, (1982).
70. M. R. Buric, C. Christensen, and T. G. Matheson, "The Plex
Project: VLSI Layouts of Microcomputers Generated by a
Computer Program," IEEE International Conference on Computer
Aided Design, (September 1983).
71. T. G. Matheson, M. R. Buric, and C. Christensen, "Embedding
Electrical and Geometric Constraints in Hierarchical Circuit
Layout Generators," IEEE International Conference on Computer
Aided Design, (September 1983).
72. A. M. Peskin, "Towards a Silicon Compiler," Proceedings,
Custom Integrated Circuits Conference, (1982).
73. T. J. Kowalski and D. E. Thomas, "The VLSI Design Automation
Assistant: Prototype System," Proceedings, 20th Design
Automation Conference, (1983).
74. D. E. Thomas, C. Y. Hitchcock, T. J. Kowalski, J. V. Rajan,
and R. A. Walker, "Automatic Data Path Synthesis," IEEE
Computer, pp. 59-69 (December 1983).
75. J. Werner, "The Silicon Compiler: Panacea, Wishful Thinking,
or Old Hat?," VLSI Design Vol. III(5) (September/October 1982).
76. J. Werner, "Progress Towards the 'Ideal' Silicon Compiler
Part 1: The Front End," VLSI Design Vol. IV(5) pp. 38-41
(September 1983).
77. J. Werner, "Progress Towards the 'Ideal' Silicon Compiler
Part 2: Automatic IC Layout," VLSI Design Vol. IV(6) pp. 78-81
(October 1983).
78. J. P. Gray, I. Buchanan, and P. S. Robertson, "Designing Gate
Arrays Using a Silicon Compiler," Proceedings, 19th Design
Automation Conference, pp. 377-383 (1982).
79. A. R. Deas, "The UNIT Silicon Compiler," MSc Thesis,
University of Edinburgh, Computer Science Department (September
1983).
80. A. R. Deas, "Silicon Compilation: A VLSI Complexity Management
Strategy," Proceedings, Euromicro 84 Conference, (August
1984, Copenhagen).
81. P. Wallich, "On the Horizon: Fast Chips Quickly," IEEE
Spectrum, pp. 28-34 (March 1984).
82. S. C. Johnson, "VLSI Circuit Design Reaches the Level of
Architectural Description," Electronics, pp. 121-128 (3rd May
1984).
83. J. R. Fox, "The MacPitts Silicon Compiler: A View from the
Telecommunications Industry," VLSI Design Vol. IV(3) pp.
30-37 (May/June 1983).
84. R. L. Rivest, "A Description of a Single Chip Implementation
of the RSA Cipher," Lambda Vol. 1(3) pp. 14-18 (4th Quarter
1980).
85. J. Batali, E. Goodhue, C. Hanson, H. E. Shrobe, R. M.
Stallman, and G. J. Sussman, "The SCHEME-81 Architecture -- System
and Chip," pp. 69-77 in Proceedings, Conference on Advanced
Research in VLSI, ed. P. Penfield Jr., Artech House Inc.
(1981).
86. H. E. Shrobe, "AI Meets CAD," pp. 387-399 in Proceedings of
the IFIP International Conference on VLSI, ed. F. Anceau, E.
J. Aas, North Holland (1983).
87. S. P. Pope and R. W. Brodersen, "Macrocell Design for
Concurrent Signal Processing," pp. 395-412 in 3rd Caltech
Conference on VLSI, ed. R. Bryant, Computer Science Press
(1983).
88. P. A. Ruetz, S. P. Ekpe, and R. W. Brodersen, Computer
Generation of Digital Filter Banks, University of California,
Berkeley (March 1984).
89. P. B. Denyer, D. Renshaw, and N. Bergmann, "A Silicon Compiler
for VLSI Signal Processors," Proc. ESSCIRC82, pp. 215 - 218
(Brussels, September 1982).
90. N. W. Bergmann, "A Case Study of the FIRST Silicon Compiler,"
pp. 413-430 in 3rd Caltech Conference on VLSI, ed. R.
Bryant, Computer Science Press (1983).
91. P. B. Denyer and D. Renshaw, "Case Studies in VLSI Signal
Processing Using a Silicon Compiler," Proc. IEEE ICASSP83, pp.
939 - 942 (Boston, April 1983).
92. A. F. Murray, P. B. Denyer, and D. Renshaw, "Self-Testing in
Bit-Serial Parts: High Coverage at Low Cost," Proc. IEEE
International Test Conf., pp. 260 - 268 (Philadelphia,
October 1983).
93. C. H. Allen, P. B. Denyer, and D. Renshaw, "A Bit-Serial
Linear Array DFT," Proc. IEEE ICASSP84, pp. 41A.1.1 -
41A.1.4 (San Diego, March 1984).
94. L. H. Turner, P. B. Denyer, and D. Renshaw, "A Bit Serial LDI
Recursive Digital Filter," Proc. IEEE ICASSP84, pp. 41A.3.1
- 41A.3.4 (San Diego, March 1984).
95. S. Y. Kung, "VLSI Array Processor for Signal Processing,"
Conference on Advanced Research in Integrated Circuits, MIT
(1980).
96. S. Y. Kung, "Wave Front Array Processor: Language,
Architecture and Applications," IEEE Transactions on Computers Vol.
C-31(11) pp. 1054-1065 (November 1982).
97. S. Y. Kung, "From Transversal Filter to VLSI Wavefront Array,"
pp. 247-261 in Proceedings of the IFIP International
Conference on VLSI, ed. F. Anceau, E. J. Aas, North Holland (1983).
98. D. Cohen, "Mathematical Approach to Computational Networks,"
Technical Report, University of Southern California,
Information Sciences Institute (November 1978).
99. L. Johnsson and D. Cohen, "A Mathematical Approach to
Modelling the Flow of Data and Control in Computational Networks,"
pp. 213-68 in VLSI Systems and Computations, ed. H. T. Kung,
B. Sproull, G. Steele, Springer Verlag (1981).
100. L. Johnsson, U. Weiser, D. Cohen, and A. L. Davis, "Towards a
Formal Treatment of VLSI Arrays," Technical Report,
California Institute of Technology, Computer Science Department
(January 1981).
101. N. W. Bergmann, "Idiomatic Integrated Circuit Design," PhD
Thesis, University of Edinburgh, Computer Science Department
(1984).
102. B. Ackland and N. Weste, "An Automatic Assembly Tool for
Virtual Grid Symbolic Layout," pp. 457-466 in Proceedings of the
IFIP International Conference on VLSI, ed. F. Anceau, E. J.
Aas, North Holland (1983).
CHAPTER 3 
A DESIGN ENVIRONMENT FOR GENERATING CUSTOM ICs 
TO IMPLEMENT DIGITAL SIGNAL PROCESSING SYSTEMS 
IN BIT-SERIAL ARCHITECTURES 
3.1. Introduction
This chapter describes a design environment for implementing 
digital signal processing systems. This design environment has 
been given the acronym FIRST, standing for Fast Implementation of 
Real-time Signal Transforms. It consists of a design methodology 
together with a unified collection of CAD tools. It has been 
created to fulfil the objectives set out in chapter 1 using tech-
niques identified in chapter 2. The underlying concepts and struc-
ture were originated by the author and P. B. Denyer. Software 
expertise was recruited through the Department of Computer Science 
at Edinburgh University, in the form of N. W. Bergmann, who spent
six to nine months writing the first and second versions of the
software. These versions were written in IMP [1]. From this
the author learned the necessary programming skills to maintain and 
extend the software thereafter. A third version of the software 
was translated from the original by J. Nash into the C language 
[2], with modifications to introduce dynamic storage and increased 
run time efficiency. Pilot studies of system design using FIRST 
were undertaken by the author and P. B. Denyer. System implementation
studies have also been done by a number of academic
and industrial visitors. Latterly, this area of system designs has
been greatly extended by S. G. Smith. Finally, the details of the
test architecture were worked out mainly by A. F. Murray in
consultation with P. B. Denyer and the author. The author has
contributed jointly in all areas of the design environment but
principally in the areas of the methodology, the underlying architectural
principles and the cell library. 
3.2. Overview 
Figure 3.1 shows the main components of the design environment,
which are as follows. Firstly, an underlying design methodology
has been developed. Secondly, a hardware description language 
(HDL) has been defined. This provides a mechanism for precise 
design specification. Thirdly, an intermediate description code 
(IDC) has been defined. Fourthly, there is a language compiler
which compiles HDL to IDC. Fifthly, there is a physical design 
subsystem (PDS), which generates mask geometries from a compiled 
IDC. Sixthly, there is a behavioural description subsystem (BDS), 
which generates simulated behaviour corresponding to a compiled 
system from files of input signals. As a part of the design
methodology, there is a strategy for built-in testability and self-test.
This leads to the seventh component of the design environment, 
namely facilities for automatic test pattern generation (ATPG). 
Additionally, a number of peripheral pre-processors and post-
processors have been developed. These take the form of short pro-
grams written by a variety of users to ease their task, and typi-
cally fulfil the function of specific signal generators, display 
packages and HDL generators. Finally, as a result of the unified 
automation environment, its HDL and the generated output files, 
documentation of system design is part of the actual design pro-
cess. 




















Figure 3.1 (components of the FIRST design environment; surviving
labels: Set of Primitives, Service Primitives (Pads etc.),
Floorplanner & Router, Behaviour, Final Artwork)
3.3. Architectural Basis 
The key to the structural organisation of the design environ-
ment has been the adoption of a bit-serial architectural design 
methodology. This was proposed in [3]. Reasons for this choice 
were given in chapter 2. Some details of the form adapted and
developed for this work have been set out in [4]. The main
impact of this approach has been on the implementation of the phy-
sical design subsystem (see section 3.8), on system partitioning, 
on the use of pipelining and functional parallelism and on communi-
cation. 
3.4. Specification and System Design Capture 
A system design specification, however formulated, consists of 
a functional component and a performance component. Design capture 
is the process of formulating this specification in a succession of 
forms which eventually translate into the actual system. 
At a comparatively high level, the functional component of the 
specification can be completely defined by using a flow graph of 
interconnected functional elements. Systems engineers are accus-
tomed to designing using this idiom. It translates very simply 
into a net list description, consisting of functional elements with 
inputs and outputs and nodes which define how the inputs and out-
puts are connected. The HDL defined in FIRST is simply a hierarch-
ical version of such a net description language which also reflects 
the key physical boundaries (chip boundaries) of the system when 
implemented. The hierarchy allows arbitrary grouping and conceal-
ment of design detail. 
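The net list idiom described above can be illustrated with a minimal Python sketch. It is not FIRST HDL and is not taken from the FIRST software; the element and node names are hypothetical, chosen only to show functional elements joined by named nodes.

```python
from collections import defaultdict

# Each functional element lists the nodes attached to its inputs and outputs.
# Names are hypothetical; this is not FIRST HDL syntax.
netlist = [
    {"element": "multiply1", "inputs": ["x", "coeff"], "outputs": ["prod"]},
    {"element": "delay1", "inputs": ["x"], "outputs": ["xd"]},
    {"element": "add1", "inputs": ["prod", "xd"], "outputs": ["y"]},
]

def connectivity(netlist):
    """Map each node to the elements that drive it and the elements that read it."""
    nodes = defaultdict(lambda: {"drivers": [], "readers": []})
    for item in netlist:
        for node in item["outputs"]:
            nodes[node]["drivers"].append(item["element"])
        for node in item["inputs"]:
            nodes[node]["readers"].append(item["element"])
    return dict(nodes)

nodes = connectivity(netlist)
print(nodes["x"]["readers"])   # ['multiply1', 'delay1']
print(nodes["prod"])           # {'drivers': ['multiply1'], 'readers': ['add1']}
```

The flow graph and the net list carry the same information: the graph edges are the nodes, and the graph vertices are the functional elements.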
Keywords and reserved symbols
OPERATOR     PIN          =
CHIP         POUT
SUBSYSTEM    PCIN
PADIN        TIMES        +
PADOUT       THROUGH      -

ABSOLUTE     ADD          BITDELAY
CBITDELAY    CONSTGEN     CWORDDELAY
DPMULTIPLY   DSHIFT       FFORMAT1T0I
FFORMAT2TO1  FFORMAT3TO1  FLIMIT
FORMAT1TO2   MSHIFT       MULTIPLEX
MULTIPLY     ORDER        SUBTRACT
WORDDELAY
Table 3.1
A FIRST HDL specification consists of a sequence of keywords,
identifiers and constants together with arithmetic operators and 
various separator characters. Table 3.1 gives a listing of the 
keywords. Keywords are used to define the classes of objects 
comprising a system. These are nodes and functional elements. 
Nodes can be of type SIGNAL or of type CONTROL. The only
predefined nodes are VDD, GND and NC, denoting the two supplies and
"not connected" respectively; all other nodes must be user defined.
Functional elements are of type primitive, OPERATOR, CHIP,
SUBSYSTEM or SYSTEM.
Elements of type primitive are all predefined; they are given by 
the primitive name keywords. Elements of the other types are all 
user defined. The elements of type primitive, CHIP and SYSTEM, 
have actual hardware counterparts. The elements of type OPERATOR 
and SUBSYSTEM reflect structuring within the specification. The 
general form for any functional element is: 
name [plist] (clist) slist 
where name is the user chosen, or predefined identifier, plist is a 
list of parameters, clist is a list of control inputs and outputs 
and slist is a list of signal inputs and outputs. If preceded by 
one of the keywords OPERATOR, CHIP, SUBSYSTEM or SYSTEM, the above
statement is a definition start. Definitions are terminated by the 
keyword END. Without one of these keyword prefixes the statement 
becomes a call or instantiation of the element. All user defined 
elements must be defined before they are called. The definitions 
of type CHIP and SYSTEM are both implicitly also calls. Constants 
and arithmetic expressions are used for parameterisation of func-
tional elements, thus giving the facility to define classes of 
functional element. The parameters are evaluated on instantiation 
of a particular instance. This mechanism allows rapid modification 
of specifications to investigate quantisation effects etc., by the 
minimum alteration of an HDL specification. The HDL also supports 
a mechanism for condensed expression of long lists of nodes and 
repeated calls of functional elements (cascading). 
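The idea of a parameterised class of functional elements, with parameters evaluated at instantiation, can be sketched in Python as follows. This is an illustration only: the definition structure, the element name SCALER, its formal parameters and the parameter lists given to MULTIPLY and WORDDELAY are all hypothetical and do not reflect the real primitive parameters given in appendix II.

```python
# A user-defined element class: formal parameters plus a body of calls whose
# actual parameters are arithmetic expressions in those formals.
# All names and parameter lists are hypothetical, not FIRST HDL.
SCALER = {
    "formals": ["wordlength", "shift"],
    "body": [
        ("MULTIPLY", lambda p: [p["wordlength"]]),
        ("WORDDELAY", lambda p: [p["wordlength"] + 2 * p["shift"]]),
    ],
}

def instantiate(definition, actuals):
    """Bind actual values to the formals and evaluate every body expression."""
    if len(actuals) != len(definition["formals"]):
        raise ValueError("wrong number of parameters")
    env = dict(zip(definition["formals"], actuals))
    return [(name, expr(env)) for name, expr in definition["body"]]

# Changing one number at the point of call re-derives every dependent value,
# which is how a quantisation study might be run with minimal editing.
print(instantiate(SCALER, [12, 1]))   # [('MULTIPLY', [12]), ('WORDDELAY', [14])]
print(instantiate(SCALER, [16, 1]))   # [('MULTIPLY', [16]), ('WORDDELAY', [18])]
```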
The language is much simpler and more easily learned than con-
ventional high level programming languages. It is, however, ade-
quate for specifying a wide range of simple to complex digital sig-
nal processing systems. A formal definition and description of the 
language (version 2) is given in appendix I; details concerning the 
primitives are given in appendix II. The syntactic form of the 
language owes much to the influence of [5,6], as does the overall 
software architecture. 
It will be observed that the definition of function is expli-
citly captured in an HDL specification. The question remains as to 
how the performance component of the specification can be 
expressed. Primitives form an important group of keywords, to 
which there corresponds a defined function. However the definition 
of each primitive comprises more than simply its syntax and func-
tion. To each primitive there also corresponds an actual hardware 
circuit. As a result of the way that this has been designed it 
also possesses a known performance. Further, because of the way in 
which systems are synthesised from primitives the performance of 
any arbitrary synthesis is also determinate. As a result of these 
two factors any HDL specification will possess an implicit perfor-
mance. Conversely, by adopting the design methodology and system 
mapping techniques presented in chapters 4 and 5, a specific per-
formance can be built into the HDL specification. 
In summary the HDL specification of a system captures its 
function explicitly as a net list of functional elements. The per-
formance specification is derived from the primitives and the func-
tional architecture. The architectural methodology and system map-
ping methods described in chapters 4 and 5 provide a means for 
translating a mathematical definition of function and a bandwidth 
definition of performance into an HDL specification. The HDL pro-
vides a level of description appropriate for compiler implementa-
tion. An example of how to describe a system using FIRST HDL is 
given in appendix III. 
3.5. The FIRST HDL Compiler 
The FIRST HDL provides a mechanism for exact design specifica-
tion. The next step in complete design capture is to generate an 
intermediate code (IDC) from which both the behaviour and the phy-
sical circuit design can be generated. This IDC is an expanded but 
coded version of the HDL with the hierarchy removed. In addition 
the significant portions are factored out. These are the chip 
definitions, for passing to the physical design subsystem, and the 
system definition, for passing to the bound behavioural simulation 
subsystem. Compiler technology for accomplishing this task is well 
known. For reasons of initial convenience, the implementation was 
based on SKIMP [7],  a simple compiler designed for teaching the 
principles of compiler writing. The principles involved can also 
be found in [8]. The structure of the compiler is shown in the 
outline pseudo-code given next. 
main {
    declarations
    decode options and set flags
    read in and reduce syntax definitions
    read in and store predefined primitives
    open files for IDC output and diagnostics
    open input HDL file
    while (!EOF) {
        process_statement from input
    }
}
This pseudo-code, if expanded, refined and translated into a pro-
gramming language can be made to implement the required function. 
The routine process_statement expands as shown next. 
process_statement {
    read statement, clean up and reconstruct
    perform lexical analysis
    perform recursive descent analysis of syntax
    generate IDC
}
The input to the first phase is HDL; the output is cleaned, abbre-
viated HDL. This is passed to the lexical analysis phase, which 
recognises keywords, names and constants and inserts a coded record 
of these into a lexical array and hash table. This record is taken 
by the syntax analyser, which performs an analysis of it with 
respect to the table of reduced syntax definitions. The analysis 
routine is called recursively, and attempts to find a match between 
the record and the definitions. If successful it generates a
condensed analysis record in the form of a linearised analysis tree.
The analysis record is interpreted by the code generator, which 
then outputs appropriate IDC for each item. 
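The four phases just described can be sketched in Python for a single statement of the general form name [plist] (clist) slist. This is a minimal sketch under stated assumptions, not the SKIMP-based implementation: the comma separators, the small expression grammar, the statement used as input and the flat output record (a stand-in for IDC, whose format is not given in this section) are all hypothetical.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|\d+|[\[\](),+\-])")

def clean(line):
    """Phase 1: strip surrounding blanks and collapse runs of whitespace."""
    return " ".join(line.split())

def lex(text):
    """Phase 2: split cleaned text into name, constant and separator tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """Phase 3: recursive descent over an assumed grammar."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.pos += 1
        return token

    def expr(self):
        # expr := term { ('+' | '-') term }
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := constant | '(' expr ')'   (the recursive case)
        if self.peek() == "(":
            self.take("(")
            value = self.expr()
            self.take(")")
            return value
        return int(self.take())

    def listing(self, item):
        # list := item { ',' item }
        found = [item()]
        while self.peek() == ",":
            self.take(",")
            found.append(item())
        return found

    def statement(self):
        # statement := name [ '[' plist ']' ] [ '(' clist ')' ] slist
        name, plist, clist = self.take(), [], []
        if self.peek() == "[":
            self.take("[")
            plist = self.listing(self.expr)
            self.take("]")
        if self.peek() == "(":
            self.take("(")
            clist = self.listing(self.take)
            self.take(")")
        slist = self.listing(self.take)
        if self.peek() is not None:
            raise SyntaxError(f"trailing token {self.peek()!r}")
        return name, plist, clist, slist

def generate(tree):
    """Phase 4: emit one flat record (a stand-in for IDC)."""
    name, plist, clist, slist = tree
    return " ".join([name, *map(str, plist), "|", *clist, "|", *slist])

def process_statement(line):
    return generate(Parser(lex(clean(line))).statement())

print(process_statement("  ADD  [8, (2+1) - 1] (c1)  a, b, sum "))
# ADD 8 2 | c1 | a b sum
```

Parameter expressions are folded to constants during parsing here for brevity; the text above states only that parameters are evaluated on instantiation.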
The run time performance of the compiler is satisfactory: typical
systems can be compiled almost instantaneously. For this reason
the simple approach has been adopted and sophisticated
optimisation schemes eschewed.
The entry of new primitives into the compiler is trivial, 
requiring only the addition of the corresponding keyword to the 
file of predefined primitives. The format requires also informa-
tion on the number of parameters, control inputs, control outputs, 
signal inputs and signal outputs. The entry of new syntax is not
so straightforward and requires a detailed knowledge of the
compiler implementation. The difficulty lies mainly in the code
generation phase; new syntax is easily incorporated into
line reconstruction, lexical analysis and syntax analysis.
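A record in the file of predefined primitives, as described above, carries a keyword and five counts. The Python sketch below shows one way such a table could be read and used to check a call. The record layout (whitespace-separated fields), the keywords NEWPRIM and OTHERPRIM and their counts are hypothetical; the actual file format is not given in this chapter.

```python
from typing import NamedTuple

class PrimitiveEntry(NamedTuple):
    keyword: str
    parameters: int
    control_inputs: int
    control_outputs: int
    signal_inputs: int
    signal_outputs: int

def read_predefined(lines):
    """Build a keyword -> entry table from whitespace-separated records."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        keyword, *counts = fields
        if len(counts) != 5:
            raise ValueError(f"bad record for {keyword}: expected 5 counts")
        table[keyword] = PrimitiveEntry(keyword, *map(int, counts))
    return table

def check_call(table, keyword, plist, cins, couts, sins, souts):
    """Return True when a call supplies exactly the declared number of items."""
    entry = table[keyword]
    supplied = (len(plist), len(cins), len(couts), len(sins), len(souts))
    return supplied == tuple(entry[1:])

# Hypothetical records; the counts are illustrative, not the real library's.
table = read_predefined(["NEWPRIM 2 1 0 2 1", "", "OTHERPRIM 0 0 0 1 1"])
print(check_call(table, "NEWPRIM", [8, 1], ["c1"], [], ["a", "b"], ["y"]))   # True
print(check_call(table, "OTHERPRIM", [], [], [], ["a", "b"], ["y"]))         # False
```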
3.6. Primitives and the Primitive Cell Library 
Primitives constitute the kernel of the design environment. 
This is so because the approach to system synthesis is based on 
assembling functional elements from primitives. As a preview of 
what follows it is worth summarising the definitions associated 
with each primitive at the various levels at which a primitive can 
incarnate within the design environment. In the HDL and IDC, each 
primitive is defined as a keyword and syntax. In the cell library, 
each primitive has a dissociated definition in the form of the com-
ponent leaf cells required to construct all instances of it. Prim-
itives also possess definition as a composition template (see 
below). This is formulated, in the physical design subsystem, as a 
parameterised procedural definition (cp., section 3.8). The intro-
duction of a new primitive into the design environment involves 
definition at each of these levels and in the correct format for 
incorporation. A major issue associated with the design of the 
primitives is that of verification. Verification is achieved by 
establishing the consistency of these definitions with respect to 
each other. This topic is addressed in section 3.10. 
The approach to defining primitives has been dominated by the 
requirement to meet a guaranteed operating performance, through 
which the system performance can be ensured. In order to achieve 
this, it was decided that primitives should be composed from custom 
designed component circuits, or leaf cells. Custom design can 
guarantee both performance and local optimisation for each such 
cell. 
The generation of primitives from such custom designed leaf 
cells is achieved by means of an architectural template. This is a 
general pattern for the assembly of a set of leaf cells to generate 
a class of similar circuits. The general pattern is expressed in 
terms of the particular leaf cells used, in terms of a set of one 
or more parameters and in terms of the arrangement of the leaf 
cells. Each such class of circuits constitutes a primitive. Each 
case of the pattern is a particular instance of the primitive. 
These general patterns will be termed composition templates. 
The assembly of the leaf cells according to such templates 
must be governed by an internal set of conventions and defined 
interfaces. These are local and internal to the primitive and need 
only be sufficient to ensure that, within the generality of the 
composition template, any assembly of cells will satisfy the 
required performance. Further each primitive must have an external 
interface. It must also satisfy a set of geometric and electrical 
conventions to ensure that communication between primitives satis-
fies the functional and performance requirements set down. The 
creation of these design conventions and interfaces ensures that, 
within any system, arbitrary assemblies of primitives will satisfy 
both function and performance. 
The design of primitives is also constrained by the need to 
fit a global assembly strategy. The latter is an architectural 
template for chip generation and constitutes a class of floor plans 
which the physical design subsystem can produce. The adoption of 
such an approach has advantages with respect to performance and 
local optimisation and also in simplifying the software implementa-
tion of a physical design subsystem. However, it also has disad-
vantages. Firstly, iteration on the design of leaf cells, in the 
light of global assembly requirements, is excluded. It was 
adjudged, however, that the sophistication required to achieve suc-
cessful implementation of such a technique for primitive generation 
was beyond the scope of this research. The second disadvantage is 
that custom designed leaf cells have to be technology specific. 
The significance of this can be understood in the light of the 
rapid development of new technologies and the need to re-implement 
existing leaf cell libraries in such technologies. The custom 
design effort required will be considerable. Again a solution to 
this problem lies outwith the scope of this research, but urgently 
requires attention. 
The technology chosen for leaf cell design was the so called 
standard five micron polysilicon gate nMOS process. This choice 
was dictated by the availability of such a process through the SERC 
supported Edinburgh Micro-fabrication Facility (EMF). The 
geometric design rules adopted were a set of generalised geometric 
design rules given in [9]. The electrical design rules were taken 
from the EMF design rules [10].  For this process an operating 
clock frequency of 8 MHz should be attainable. 
Circuit design techniques were based on two-phase,
non-overlapping-clocked, dynamic, ratioed logic design [11,12]. Ratio
conventions of 4:1 and 8:1 were adhered to for rectangular
transistors, and a ratio of 10:1 or 12:1 where snaked or 'hooked'
transistor misalignment could be critical. To avoid race hazards,
every stage in the pipeline should start with the appropriate
clocked pass transistor followed by an inverter or gate, and then
any further pass or gate transistor logic. Phi1 clocked stages
must alternate with Phi2 clocked stages. It is worth mentioning
that although pass transistor networks may appear preferable from
the point of view of area and power dissipation, they do reduce
test coverage.
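The alternation rule for clocked stages lends itself to a simple check, sketched here in Python. The representation of a pipeline as a list of phase names is hypothetical, and the latency function assumes that every clocked stage occupies one clock phase, that is, half a bit time.

```python
def alternates(phases):
    """True when consecutive pipeline stages use opposite clock phases."""
    allowed = {"phi1", "phi2"}
    if any(phase not in allowed for phase in phases):
        raise ValueError("stages must be clocked by phi1 or phi2")
    return all(a != b for a, b in zip(phases, phases[1:]))

def latency_in_bits(phases):
    """Latency assuming each clocked stage takes half a bit time."""
    return len(phases) / 2

print(alternates(["phi1", "phi2", "phi1", "phi2"]))        # True
print(alternates(["phi1", "phi1", "phi2"]))                # False
print(latency_in_bits(["phi1", "phi2", "phi1", "phi2"]))   # 2.0
```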
Using these circuit techniques, together with the design
conventions outlined below, leaf cells were iteratively designed using
circuit simulation to make up the cell library. To avoid confusion
it should be stated that this process of iteration occurs only dur-
ing leaf cell design; it does not constitute a part of the physical 
design subsystem. Leaf cells, once designed and verified are fixed 
objects. 
3.7. Interfaces and Architectural Templates: 
Foundations for a Physical Design Subsystem 
Interface definition is a mechanism for ensuring that the 
required performance of primitives, chips and systems is attained. 
There are three important classes of interface: those between leaf 
cells, those between primitives, and those between chips. In order 
to ensure performance the following conventions are used. 
Internal to primitives, each leaf cell is designed so that it 
is capable of driving (at the required fraction of the minimum 
operating frequency of 8 MHz) any of the possible outputs which may 
be connected to it as defined by the composition template. Usually 
this is done by worst case design and the task is much simplified 
by the features of modularity, local communications, and pipelined 
operation exhibited by bit-serial circuits. 
Between primitives, a common standard interface buffer is 
used. This is associated with primitive outputs. It has been 
designed to drive worst case loads of 10 pF (which is equivalent
to maximum wire lengths with a fanout of five). This buffer dissi-
pates approximately 3 mW and has an area of approximately 0.02 sq 
mm. Again the sparseness of bit-serial communication makes the use 
of such an interface a feasible proposition. The operation of this 
buffer uses one half bit cycle (or one clock phase) within the 
operational timing of the primitive. For convenience and effi-
ciency, there are inverting and non-inverting versions of this 
buffer. Signal and control are latched by phi2 at the input to a
primitive and are also valid through the output buffers during
phi2. The buffers are made available during primitive leaf cell
design, for incorporation into the circuits. In this way a
primitive's capacity to operate the buffers correctly can be
verified. Figure 3.2 shows the circuit schematic for a non-inverting
buffer. An actual buffer can be seen in Figure 3.3, which shows a
microphotograph of a section of a chip including a buffer in the
field of view.
Between chips the interface is pipelined to take one full bit 
time to transfer one bit. This is necessary because inter-chip 
loads can be in the range 20-30 pF due to pad, lead-to-lead and 
wire-to-wire capacitances. Output pads latch their signals using
phi2 and input pads latch their signals using phi1. Output pads
require a very large driver. Input pads use the standard on-chip
buffer common to all primitives. The public domain Xerox PARC
pad designs [13] were modified slightly to suit.
The overall consequence of these interface definitions is
that a mechanism exists for ensuring performance. It should be 




is that there is a one bit latency overhead associated with
partitioning.
Architectural templates constitute the basis for a mechanism 
whereby the physical design subsystem can carry out its function of 
generating chip mask geometries from a compiled IDC. They consist 
of a set of geometrical patterns and conventions which govern the 
design of primitives, the assembly of primitives and the synthesis 
of chips. Because the context in which a leaf cell or primitive is 
to be placed partly conditions the best design for that leaf cell 
or primitive it is necessary to outline architectural templates for 
chips, then primitives and finally leaf cells. Global context is
important in determining a local configuration, especially when the 
local configuration cannot be subsequently changed to suit its sur-
roundings. 
The concept of synthesising systems from primitives suggests a 
general composition strategy, in which primitives are maintained as 
tight physical clumps of layout, to be wired together according to 
the system design. This obvious approach is nevertheless a power-
ful one, and its advantages are worthy of examination. Most impor-
tantly, the high functional level of the primitives secures a low 
connection-to-layout ratio. All of the internal connections 
between gates and transistors are compact, and externally invisi-
ble. This is not the case for example with gate-array and 
standard-cell techniques, where the atoms are at the gate level 
only. In that case all functions must be reduced to the gate 
level, where inflexible positioning results in an inefficient
wiring scheme. By permitting custom primitive layout, advantage can
be taken of any implied regularity and structure in, say, a
multiplier or delay element. Also a degree of local optimisation can be
applied in designing such elements. This leaves only the rela-
tively sparse interconnection of operands and results between prim-
itives. Here the bit-serial architecture works to advantage, for 
these signals are generally single wires only. 
The architectural template for chips is a generalised floorplan.
The datapath (Mead and Conway, 1980; Sherburne et al., 1981;
Batali et al., 1981; Shrobe, 1981) is an excellent example of such
a generalised floorplan for bit-parallel architectures. A success-
ful alternative architecture, again with bit-parallel organisation, 
has been proposed by Siskind et al., (1981) where the need for con-
strained architectures is also recognised. Several generalised 
floorplans are feasible for bit-serial systems. The generalised 
floorplan described next is a simple but appropriate one for 4 to 6 
micron nMOS technologies, using the circuit techniques outlined
above. 
This floorplan style, shown in Figure 3.4, will be termed 
Manhattan Skyline for the similarity it bears to a cluster of 
skyscrapers and reflections in the water. It comprises a central 
communication channel, flanked by two regions for the placement of 
bit-serial primitives. Signal routing is implemented through the 
central channel only; there is no connection between neighbouring
primitives except, possibly, through the central wiring channel at 
their base. Thus primitives communicate by receiving and transmit-
ting data via the channel across the waterfront. Chip input and 
output signals are routed to peripheral pads via the ends of the 
channel. Note that, for bit-serial architectures, the communica-
tion core rarely dominates the chip area. 





Figure 3.4 (Manhattan Skyline floorplan; surviving label: lower
rank of primitive modules)
The simplicity of this floorplan style belies some important 
advantages. Signals in the communication channel are for the most 
part routed in metal, beginning and ending in short stubs of
diffusion (or polysilicon). This is a useful feature where the
technology offers only one low-impedance, low-capacitance
interconnection medium (in this case a single level of metal). The single
channel is easy to route, and primitive ordering, to minimise chip 
area, can also be computed quickly. 
The provision of common services (global power and clock
busses) throughout the chip must now be considered. There are two
well defined sub-tasks: servicing the primitives, and servicing the
pad ring. To do this, service ducts are provided along the
waterfront for the primitives, and a service ring around the pad
channel, as shown in Figure 3.5. This principle is easily extended to
arbitrary floorplan conventions if the primitive service duct is
allowed to become a flexible umbilical cord, routed around
arbitrary primitive placements.
Figure 3.5 (surviving label: pad service ring)
One problem with such an arrangement is the desirability of 
maintaining low impedance power and clock busses as far as
possible. Normally this means maintaining these busses in metal for the
most part. There is no problem when more than one metal layer is 
available, for then power and clock lines can be crossed without 
resorting to a relatively high impedance diffusion or polysilicon 
crossunder. However, when only one metal layer is available, as is 
the case in the process being used, it becomes impossible to supply 
power from the duct to an adjacent primitive without using at least 
one crossunder. The solution is to split one of the power rails
(VSS say) away from the duct and route it separately on the
opposite side of the primitives. As shown in Figure 3.6 it then
becomes possible to route power to all primitives, and cells within 
primitives, without crossunders. A similar solution exists for the 
pad channel, supplying VSS and VDD from opposite sides of the ring. 
Crossunders are needed for clock distribution, but their number and 
length are kept to the absolute minimum. 
Figure 3.6 shows the complete service routing schema for FIRST 
based on these principles. It includes fixed positions for each of 
the supply pads (at opposite ends of the communication core), and 
for the clock pads, phi1 and phi2. Without altering this scheme it
is possible to replace the clock pad pair by an on-chip clock gen-
erator if this is preferred. Long runs of parallel pairs of clock 
lines are separated by a power supply line as a precaution against 
noise and cross talk. Pad names in the diffusion layer can be out-
put below this ring, adjacent to each pad, for bonding identifica-
tion. 
In addition to its role as a power bus, VSS is also sometimes 
used as a source for signals of zero value (000 ... 00). Thus a
secondary VSS bus is also routed within the waterfront duct for 
this purpose. It has a higher impedance than the main power busses 
(because of cross-unders) and so may not generally be used as a 
power source. A convenient exception to this rule is the use of 
this bus to partially power input and output buffers to primitives
which span this service duct, as discussed below. The crossunders
are designed to cope with this function.
Figure 3.6 (panels: VDD distribution, VSS distribution, phi1
distribution, phi2 distribution)
An architectural template for primitives can now be defined in 
relation to the chip generalised floorplan. There are no absolute 
restrictions on primitive width or height. The factors which must 
be taken into consideration are as follows. An attempt is made to 
minimise the waterfront dimension. This is done so as to maximise 
the number of primitives that can be conveniently placed in one 
chip. (In practice, chips generated using the proposed floor plan 
tend to have aspect ratios (length to width) greater than one 
whereas fabrication and packaging conventions usually demand a 
close to square aspect ratio.) All data inputs and outputs are res-
tricted to occur along the waterfront. To aid channel routing, the 
allowable positions of i/o ports are quantised to be at even 
integer multiples of the wire pitch of the vertical wiring medium. 
(This will allow vertical wires from opposing rows of primitives to 
interdigitate.) In this case the standard adopted was a wire width 
of 4 lambda, with a space of 3 lambda, giving a total wire pitch of 
7 lambda in the diffusion layer. Thus primitive i/o ports are 
positioned on a permissible grid of twice this pitch (14 lambda), 
commencing from coordinate 0 lambda on the extreme left. This is 
illustrated in Figure 3.7. 
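The port grid arithmetic above can be written out as a short Python sketch. The wire width, spacing and the factor of two come from the text; the function names are hypothetical, and the reading that the odd-multiple tracks between adjacent port slots are what the opposing row's wires may use is an interpretation, not something this section states explicitly.

```python
WIRE_WIDTH = 4                          # lambda, from the text
WIRE_SPACE = 3                          # lambda, from the text
WIRE_PITCH = WIRE_WIDTH + WIRE_SPACE    # 7 lambda
PORT_GRID = 2 * WIRE_PITCH              # 14 lambda

def port_positions(count):
    """x-coordinates (in lambda) of the first `count` grid slots, starting at 0."""
    return [PORT_GRID * i for i in range(count)]

def snap_to_grid(x):
    """Smallest permissible port coordinate that is not less than x."""
    return -(-x // PORT_GRID) * PORT_GRID

def free_tracks(count):
    """Wire tracks lying midway between adjacent port slots (assumed free)."""
    return [x + WIRE_PITCH for x in port_positions(count)[:-1]]

print(port_positions(4))   # [0, 14, 28, 42]
print(free_tracks(4))      # [7, 21, 35]
print(snap_to_grid(15))    # 28
```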
Lambda is a unit of linear dimension and represents a convenient
smallest unit of length (cp. Mead and Conway, 1980). Its absolute
value depends upon the final choice of technology for
implementation.
Figure 3.7 (surviving labels: 14 lambda grid size, upper grid,
wiring channel, lower grid)
It is convenient to permit primitive layout to spread under
the service duct for the special cases of the input and output
buffers, so saving layout area. The use of these buffers, which
take a standard form, is illustrated in Figure 3.8. The output
buffer is the final drive stage attached to every primitive output.
The input buffer is optional on any primitive signal input port.
Its function is to provide an optional delay of one whole clock
cycle (bit time). This is an expedient facility for handling
inputs to primitives that have become misaligned by one bit in
time. The incorporation of a predelay cell at a primitive input
comes for free (in terms of layout area) in this way, instead of
requiring an extra primitive and wiring, for the sake of a single
bit of delay. The incorporation of predelay appears as an option
on signal input ports for all primitives, except bitdelay and
control bitdelay.
Figure 3.8 (surviving labels: service duct, VSS, 50 lambda)
The lower bound on primitive waterfront width is determined 
by the number of inputs and outputs. Associated with each input 
there is a width of 14 lambda, and with each output a width of 28 
lambda. Where possible, the primitive width should be made equal 
to this lower bound. The primitive can gain access to its VDD 
supply through any output buffer. Access to VSS is on the
skyline, where a bar of metal the full width of the primitive should
be placed. Each input buffer must have access to VSS from the 
primitive. Finally, primitives should be minimum area layouts, 
subject to the above constraints. 
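The lower bound on waterfront width follows directly from the figures just given. A minimal Python sketch, with a hypothetical function name and a hypothetical two-input, one-output example:

```python
INPUT_WIDTH = 14    # lambda per input, from the text
OUTPUT_WIDTH = 28   # lambda per output, from the text

def min_waterfront_width(inputs, outputs):
    """Lower bound on primitive waterfront width, in lambda."""
    if inputs < 0 or outputs < 0:
        raise ValueError("port counts must be non-negative")
    return INPUT_WIDTH * inputs + OUTPUT_WIDTH * outputs

# A hypothetical primitive with two inputs and one output.
print(min_waterfront_width(2, 1))   # 56
```

Both widths are whole multiples of the 14 lambda port grid, so a primitive built to this bound keeps all of its ports on permissible grid positions.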
An architectural template for leaf cells can now be defined. 
The only requirement is that it be compatible with the architec-
tural template for primitives. The VSS and VDD net for each primi-
tive should be interdigitated and should be constituted entirely on 
the metal layer having no crossunders. The requirement for I/O on 
the same edge, as well as for modular composition and local 
communication internal to each primitive requires that a linear 
flow configuration be bent back on itself. The general linear form 
is usually made up of input buffers, followed by functional logic 
blocks, followed by an optional cascade of added delay followed 
finally by output buffers. This can be bisected and bent back so 
that input and output attach to the same edge of the layout. The 
final layout of leaf cells has then to be tailored in relation to 
their sizes and positions relative to adjacent cells. In order to
minimise primitive area, such arrangements may have to be different for
differing configurations of a primitive (i.e., for the various 
parameter values). This leads to conditional selection of leaf 
cells. For entirely modular structures, different parameter values 
may only require a different number of repetitions of the same 
cells. The leaf cells should be designed so as to form simple 
tessellations to make up primitives, as well as meet the functional,
performance and minimum area criteria. Global and broadcast 
schemes are to be avoided, and modular pipelined communication used 
instead. Beyond these, leaf cell design is unrestricted so as to 
benefit from custom design optimisation. 
3.8. The Physical Design Subsystem 
The physical design subsystem (PDS) is a set of routines which 
take compiled IDC and generate chip mask geometries. These 
correspond to the function defined by the IDC, and have a perfor-
mance fixed by the primitives and process technology. The routines 
automate the architectural templates and interface conventions 
reviewed in the previous section and use them as a means to
assemble chip geometries. The programming basis for the PDS is an
embedded language (originally this was ILAP [14]).  This allows 
parameterised procedural definition of symbols (which represent 
groups of geometric shapes) within a high level programming 
language (cp. chapter 2).
The structure of the PDS is shown in the following piece of
pseudo-code.
main {
    init_layout()
    read_predef()
    read_idc()
    place()
    wire_up()
    layout()
    output_stats()
}
The function of init_layout() is to initialise ILAP and define glo-
bal symbols. The function of read_predef() is to read in the
primitive definitions and build up the corresponding data struc-
tures needed. The function of read_idc() is to read in the com-
piled IDC system definition, check its node list (this includes
checking for fan-out, contention, floating and redundant nodes) and 
construct the necessary data structures. This routine includes 
calls to the composition routines and outputs CIF definitions as 
well as computing the actual sizes of the primitives needed. The 
routine place() computes the positions in which to locate the prim-
itives so as to achieve a minimum value for the sum of the areas of 
the two regions on either side of the wiring channel. The routine 
wire_up() serves to generate the data necessary for making all the
connections between primitive input and output ports. The routine 
layout() uses the previously created data structures to generate 
the full chip geometry. This routine outputs the remaining CIF 
definitions. Finally the routine output_stats() prints out various 
chip statistics. 
Thus, the four principal routines are read_idc(), place(),
wire_up() and layout(). The routine read_idc() invokes primitive
generation routines, one call corresponding to each distinct primi-
tive required. As an example of a primitive composition routine, 
consider the assembly of the bit-serial multiplier primitive, which 
the author designed for the cell library. This particular multi-
plier consists of a cascade of cells, each of which handles the 
product associated with a pair of bits from the coefficient word. 
These cells are held as leaf cells of layout in CIF code. The 
beginning and end cells in the string are special; the intermediate 
cells are of one type which is repeated as necessary to build a 
multiplier of the required coefficient word length. In order to 
communicate legally with the central wiring channel, the otherwise 
linear cascade of cells is doubled over at (or close to) the half-
way point, and an additional interconnect cell is used as a link at 
the folding point. An assembly plan is shown in Figure 3.9 for the 
case of a multiplier with an odd number of main cells. 
Given that the sizes of the leaf cells are known (from lay-
out), the following outline for the composition routine, when 
expanded, may be used to assemble the multiplier as a function of 
















multiply() 
declarations 
retrieve parameter values for this instance 
calculate internal constants 
check parameter values 








for(k = 1;k <= limitb;++k)
    draw("MULTB")
for(k = 1;k <= limitdt;++k)
    draw("MULTD")
if((coeffbits-4) % 4 != 0)
    draw("MULTG")
else if(coeffbits > 4)
    draw("MULTD")
draw("MULTH")




evaluate and return height and width 
evaluate and return I/O port relative positions 
} 
where the following routines are examples of calls to the embedded
language. The routine draw() outputs the CIF shapes of the symbol
named as its argument, at the correct position and in the correct
orientation. Note that suitable values for position and orientation
must be passed to the draw() function and must subsequently be
updated for the next call; at this level these details are
suppressed for clarity. The function symbol() generates a CIF
definition start and endsymbol() generates a CIF definition end.
The function symbol_exists() interrogates the data structure
generated so far to see whether the symbol has already been defined
or not; it returns TRUE or FALSE. The pseudo-code given above can 
be expanded directly into an appropriate high level language func-
tion or routine to define the multiply primitive. An example of 
full layout generated by such a routine is shown in Figure 3.10. 
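The folding of a linear cascade described above can be illustrated by a minimal sketch. It is not the composition routine of the cell library: the cell names BEGIN, MAIN, END and LINK, the uniform cell width and the function name fold_cascade are all assumptions made for illustration, and the real leaf cells differ in size.

```python
def fold_cascade(n_main, cell_width, begin="BEGIN", main="MAIN",
                 end="END", link="LINK"):
    """Return (cell, x, row) placements for a cascade folded near its midpoint.

    Hypothetical model: every cell has the same width, and x is measured
    from the edge that carries the input and output ports.
    """
    chain = [begin] + [main] * n_main + [end]
    half = (len(chain) + 1) // 2          # outward row takes any odd cell
    placements = []
    x = 0
    for cell in chain[:half]:             # outward row, away from the I/O edge
        placements.append((cell, x, "outward"))
        x += cell_width
    placements.append((link, x, "fold"))  # interconnect cell at the fold
    for cell in chain[half:]:             # return row, back toward the I/O edge
        x -= cell_width
        placements.append((cell, x, "return"))
    return placements
```

With an even number of cells in the chain the END cell finishes level with the BEGIN cell; with an odd number it finishes one cell short, which is the case for which a real routine would select a differently tailored leaf cell.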
This block of layout, representing the full primitive module,
is then available to the subsequent routines of the Physical Design
Subsystem of the silicon compiler for placement and routing. There
is no restriction to the use of a single parameter; many of the
primitives have two or three parameters.
The next task is concerned with determining a suitable 
arrangement for the primitives thus generated. This is done by the 
place() routine. Placement of the primitives within the floor-plan 
proceeds so as to minimise the sum of the upper and lower regional 
areas. The factors which determine total area are as follows: 
the height of the tallest block on the top row 
the height of the tallest block on the bottom row 
the length of the longer row 
the thickness of the wiring channel 
The first three factors depend only on which modules are placed on 
the top and bottom rows respectively. To find the best arrange-
ment, the primitives are first arranged in order of descending 
height, are all allocated to the upper region and the resulting 
area is evaluated. One after another the primitives are moved from 
the upper region to the lower region in order of ascending height, 
to try to detect an arrangement with smaller total area. The prim-
itives allocated to the lower region are arranged in order of 
increasing width. Trial perturbations are then carried out; primi-
tives are moved, in order of ascending width, from the lower region 
to the upper region to even up the total lengths of each region. 
Further trial perturbations are made to determine whether the relo-
cation of any single primitive from the upper to the lower region 
will reduce the total area. When the arrangement with least area 
has been identified, its allocation of primitives to the upper and 
lower regions is recorded in the data structure. 
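The first phase of this search can be sketched as follows. This is an illustrative simplification, not the place() routine itself: it covers only the initial sweep that moves primitives to the lower region in order of ascending height, and it omits the later width-based perturbations. The representation of a primitive as a (width, height) pair and the function names are assumptions.

```python
def placement_area(upper, lower):
    """Area of the two rows, neglecting the wiring channel.

    Each row is a list of (width, height) pairs. Both rows span the
    length of the longer row, and each is as tall as its tallest block.
    """
    def tallest(row):
        return max((height for _, height in row), default=0)

    def length(row):
        return sum(width for width, _ in row)

    return (tallest(upper) + tallest(lower)) * max(length(upper), length(lower))


def place_first_phase(primitives):
    """Return (upper, lower, area) for the best split found by the sweep."""
    # Start with everything in the upper region, in descending height order.
    upper = sorted(primitives, key=lambda p: p[1], reverse=True)
    lower = []
    best = (list(upper), [], placement_area(upper, lower))
    # Move primitives down one at a time, shortest first.
    while len(upper) > 1:
        lower.append(upper.pop())
        area = placement_area(upper, lower)
        if area < best[2]:
            best = (list(upper), list(lower), area)
    return best
```

For four primitives of width 10 and heights 8, 8, 2 and 2, the sweep visits areas of 320, 300, 200 and 480 and keeps the split with the two tall blocks above and the two short blocks below.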
It is recognised that this algorithm may not converge on the 
global minimum value for chip area. This is because the area
occupied by the wiring channel has been neglected in evaluating
the total placement area. The trade-off is between fast, simple
implementation and increased area. It was estimated that the loss
of area was likely to amount to only approximately ten percent.
Further, a general minimum area solution to the place
and route problem is not known and the time that would be required 
to implement a more sophisticated algorithm was unlikely to be 
offset by the gain. 
The actual arrangement of primitives, within their gross allo-
cation to upper or lower regions, is done according to their rela-
tive groupings in the compiled IDC. This reflects the ordering 
given by the designer in his HDL description. Designers tend to 
write this description in a manner that follows the general flow of 
information in the system. Such a strategy then leads to closely 
coupled modules being close together on the chip, with a resultant 
minimisation of the total wiring net. Again, more sophisticated 
placement algorithms based jointly on the area and interconnect 
could be developed. 
The upper and lower rows of primitives will be located in 
opposition, to either side of the designated wiring channel, and 
one row is shifted by a single wire pitch (7 lambda). This allows 
I/O ports from the two rows to interdigitate across the entire
width of the channel without having to keep account of clashes 
(coincident wires of different nodes). A simple channel router can 
be used, with metal wires running horizontally and diffusion con-
nections running vertically. Clearly this scheme would be well 
suited to any two layer process, for example double-layer metal. 
The top and bottom row offset ensures that connections can be made 
between any two ports with a single horizontal wire and two verti-
cal wires, that is without doglegs. Figure 3.11 shows a typical 
wiring channel section. 
Figure 3.11 
Given the ordering of primitives, as determined by the place() 
routine, the routine wire_up() computes the necessary arrangement 
of wiring within the channel so as to connect up ports which have 
the same node number. This is done by sorting the ports according 
to two keys, firstly their node number and secondly their position. 
Ports with the same node number are to be connected. For each node 
an allocation to slots (the parallel locations in the channel where 
the wires are laid out) within the channel is made. A running 
account of occupied and vacant locations in each slot is kept so 
that clashes can be avoided, connections to the same node merged
and so that the wiring channel can be compressed as much as possi-
ble. The termination positions for metal and diffusion tracks and 
the positions of contacts are recorded into the data structure, for 
use by the layout routine. 
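The slot allocation just described can be sketched as below. This is a simplified illustration under stated assumptions, not the wire_up() routine: ports are given as hypothetical (node, x) pairs, each node is given one horizontal wire spanning its leftmost and rightmost ports, and a first-fit rule stands in for whatever compression strategy the real routine uses.

```python
from itertools import groupby


def allocate_slots(ports):
    """Assign each node a slot in the channel.

    ports: iterable of (node, x) pairs.
    Returns {node: (slot, x_min, x_max)}. Nodes with a single port are
    skipped, since there is nothing to connect.
    """
    ordered = sorted(ports)        # first key: node number, second: position
    slots = []                     # slots[i] holds the (x_min, x_max) spans in use
    routes = {}
    for node, group in groupby(ordered, key=lambda port: port[0]):
        xs = [x for _, x in group]
        if len(xs) < 2:
            continue
        lo, hi = xs[0], xs[-1]
        for index, used in enumerate(slots):
            # The span fits if it lies wholly to one side of every span in use.
            if all(hi < a or lo > b for a, b in used):
                used.append((lo, hi))
                routes[node] = (index, lo, hi)
                break
        else:
            slots.append([(lo, hi)])
            routes[node] = (len(slots) - 1, lo, hi)
    return routes
```

Two nodes whose spans do not overlap share a slot, which is what allows the channel to be compressed; overlapping spans are pushed to a new slot, which is what avoids clashes.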
On completion of place() and wire_up(), the required positional
arrangements for primitives and wires have been generated and
the process of producing the actual geometric shapes can commence.
This is the function of the layout() routine, which
sequentially generates the parts of the chip as indicated in the 
following piece of pseudo-code. 



















=====================================================================
The template for pad location and routing is as follows. The 
pads are all either signal or control input or output pads, to be 
routed from the ends of the central wiring channel. Having fixed 
the positions of the supply and clock pads (along with an area for 
the substrate bias generator, if used) three segments are parti-
tioned off within the pad channel. To allow some control over the 
eventual device pinout, the user must assign groups of pads to each
segment, and further specify their order within these segments.
The relevant pad primitives (input or output) are then assembled 
and located with approximately even spacing within each segment. 
The pad routing is then implemented by a simple bus arrangement, 
from the lower and right hand pad channels to one end of the com-
munication core, and from the opposing edges to the other end of 
the core. The junction-box shown in Figure 3.12 completes the con-
nections between the communication core and the pad bus. Note that 
it is necessary for the central channel router to have brought any 
input and output signals to the appropriate end of the channel. 
More detailed ordering of these signals is not necessary because a
good interchange facility is available through the junction box. 
An improvement is possible if, after this arrangement has been
calculated, it is compacted. Once again, the approach taken
trades off simplicity against the degree of optimisation possible. 
Figure 3.12 
Finally chip statistics are evaluated and output. Currently 
these include only chip size. However, on completion of primitive 
characterisation, accurate estimates of power dissipation could 
also be added. 
3.9. The Behavioural Description Subsystem 
The behavioural description subsystem consists of a set of 
(verified) computational models for all primitives together with a 
simulator kernel. The purpose it serves within the design environ-
ment is to give the designer a bit-level emulation of any system 
which he specifies in FIRST HDL. In this way, firstly, coding 
errors in the HDL can be eliminated. Secondly, the system 
behaviour for any set of inputs can be examined, both at the out-
puts and at any internal node. This allows a designer to obtain 
debugging information for modifying the system HDL specification, 
if this is needed. Additionally, it allows the designer to evaluate
the system with respect to such features as system word-length,
quantisation noise etc. Finally the BDS provides a basis
from which test pattern outputs can be generated. 
It is argued that this addition to a structured design metho-
dology overcomes the shortcomings of a purely structural design 
method as advocated by Mead & Conway in [11] which, on its own, 
does not in fact offer effective tools for the complete elimination 
of design bugs, or for testing and automatic test pattern genera-
tion. 
In order for the behavioural description subsystem to capture the
necessary aspects of the design it must be capable of representing
all idioms expressed in the HDL. Thus it is driven by a compiled
IDC description, which contains the system design. Further, in
terms of detailed fidelity, the BDS must preserve accuracy right
down to the bit level.
One approach to building such a BDS is to work entirely at the 
bit-level. Primitives are modelled by the boolean processes they 
perform, and the state of the entire system network is computed 
from these at every clock cycle. This can be thought of as being 
roughly equivalent to a switch level simulation of the entire sys-
tem. In practice primitive models at this level are undesirably 
complex. Also the simulation of detailed processes within every 
primitive leads to prohibitive simulation times, and storage 
requirements. Furthermore, these internal processes are of no 
direct relevance to the system designer, who is only concerned with 
activity, particularly word-level behaviour at the network level. 
Words correspond to signal samples and this is the finest level of 
resolution of interest. However, a guarantee of isomorphism 
between switch-level, bit-level and word-level is needed. This 
suggests the need for hierarchical simulation with automatic trans-
lation between levels and proven isomorphism. In particular, it 
becomes apparent that an appropriate level at which to implement 
the BDS is that of word level models. Preliminary experiments with 
simulation run times at bit and word level [15] confirmed this.
The problem of isomorphism between levels is then relegated to the 
task of verification, which is discussed in a later section. 
In order to outline the organisation, structure and function-
ing of the BDS it is necessary to describe how the signal values 
associated with a node may be represented at the word level. In 
the hardware, each node of a system specified using FIRST has a bit 
serial stream of values, 0 or 1, appearing on it. Each primi-
tive processes the serial inputs together with the primitive's
internal state to give the resulting delayed output bit stream(s). 
Values on nodes and within primitives are clock driven. This pipe-
lined, bit serial operation of primitives, occurs concurrently in 
the hardware realisation. In order to simulate it efficiently 
using a conventional, operation-serial von-Neumann machine, word 
level models are extracted as follows. Each bit stream which 
appears on a signal node of the system has associated with it a 
control (Cl or CLSB) bit stream which marks the occurrence of the 
least significant bit of each new word (cp. [4]). CLSB is high
for one bit, during the occurrence of the LSB of the associated 
signal(s), and is low at all other times. Thus there is a natural 
grouping of bits into words. These words represent the signal 
values on the node at the interval during which they occur. Define 
the word on a node to be the word composed from the bit on the node 
at CLSB high together with all subsequent bits on the node until 
the next CLSB high, but not including the bit then on the node. 
The bits are packed LSB first, in other words right justified. 
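The definition of a word on a node can be made concrete with a short sketch. The function name and the representation of the two bit streams as Python lists are assumptions made for illustration; the word is returned as an unsigned bit pattern, and any two's complement interpretation is left to the caller.

```python
def words_from_stream(data_bits, clsb_bits):
    """Group a bit-serial stream into (start_time, word) pairs.

    A word starts at each CLSB-high clock cycle and runs up to, but not
    including, the next CLSB-high cycle. Bits are packed LSB first.
    Bits before the first CLSB pulse, and a trailing word not yet closed
    by a further CLSB pulse, are discarded.
    """
    words = []
    start = None
    value = 0
    position = 0
    for time, (bit, clsb) in enumerate(zip(data_bits, clsb_bits)):
        if clsb:
            if start is not None:
                words.append((start, value))   # previous word is now complete
            start, value, position = time, 0, 0
        if start is not None:
            value |= bit << position           # first bit lands in bit 0
            position += 1
    return words
```

For example, the stream 1,0,1,0 followed by 0,1,1,0, with CLSB high on the first bit of each group and once more afterwards, yields the words 5 and 6 at times 0 and 4.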
This definition implies that words are only to be associated 
with a node at discrete time intervals and in between there is no 
word on the node. This notion of a word on a node allows a natural
definition of events associated with nodes, and hence the
construction of an event driven simulator model in place of the
previously required clock driven model. In turn this leads to a
reduction of the computation load required of the simulator. (The 
ambiguity between the use of the term event here (and in the rest 
of this section) and in the context of the HDL controlgenerator 
should not cause any problems.) 
An event has associated with it a time, a node, and a value. 
An event may be said to occur on a node only at the time when the 
associated CLSB pulse is high, and the value associated with the 
event will be the word, as defined above, on the node at that time. 
The bitwise interpretation of words relates the higher level 
description to the actual bit streams observable on nodes as fol-
lows. At CLSB high the LSB of the word appears on the node; this 
is followed by each subsequent bit of the word, equally separated 
throughout the interval until the next CLSB high. 
Events on the input nodes of primitives must be synchronised, 
i.e., they must occur simultaneously. The simulator checks 
throughout operation for this and warns if it detects a synchronisation
failure. Unsynchronised inputs indicate a mismatch between the
latencies of the paths leading to those inputs, which is
caused by a design fault. This allows the
designer to check and detect such faults easily. (Note that when 
such warnings are given the behaviour of the simulator is not 
necessarily the behaviour of the hardware if it were fabricated, 
but both would be faulty.) 
When an event occurs at the inputs to a primitive there are 
associated words, or signal values at those inputs. This means 
that, from a word level model of the primitive, output values from 
the primitive can be computed. The output time, node and value can 
then be stored, in association, as another event. 
Thus a simple event driven simulator can be structured around 
a set of utilities for reading in IDC code, for constructing the 
necessary data arrays and hashing indices, for setting input and 
output, for checking and warning etc., and for managing event 
scheduling, together with a set of word level models for all the 
primitives. The utilities constitute the simulator kernel and the 
primitive models form a library of descriptions, or definitions 
with which the kernel works. The simulator is event driven and 
simulates the operation of the system and its components on a word 
by word basis. 
An event on any node invokes, in turn, models of all primi-
tives having that node as an input. New values are then computed 
for each of the output nodes to occur at the appropriate time 
(namely the time of the input event plus the primitive latency). 
The scheduling of events is handled by storing all pending events 
in a queue. All events due at a given time are removed from the 
queue, and the new output words are then computed and inserted into 
the queue at appropriate times. Thus further events (with associ-
ated time, node and value) can be scheduled. 
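The scheduling scheme can be sketched in a few lines. This is an illustrative kernel under stated assumptions, not the BDS itself: a primitive is represented by a hypothetical tuple (input nodes, output node, latency, word-level model), every latency is assumed to be at least one time unit, and synchronisation failures are collected in a list rather than printed.

```python
import heapq
from collections import defaultdict


def simulate(primitives, input_events, end_time):
    """Run an event driven, word-level simulation.

    primitives: list of (inputs, output, latency, model) tuples.
    input_events: list of (time, node, value) tuples.
    Returns (log, warnings): every event processed, in time order, and a
    list of (time, primitive_index) synchronisation failures.
    """
    fanout = defaultdict(list)
    for index, (inputs, _output, _latency, _model) in enumerate(primitives):
        for node in inputs:
            fanout[node].append(index)

    queue = list(input_events)
    heapq.heapify(queue)                  # pending events, earliest first
    log, warnings = [], []

    while queue and queue[0][0] <= end_time:
        now = queue[0][0]
        words = {}
        while queue and queue[0][0] == now:   # remove all events due now
            _, node, value = heapq.heappop(queue)
            words[node] = value
            log.append((now, node, value))
        triggered = sorted({i for node in words for i in fanout[node]})
        for i in triggered:
            inputs, output, latency, model = primitives[i]
            if all(node in words for node in inputs):
                result = model(*(words[node] for node in inputs))
                heapq.heappush(queue, (now + latency, output, result))
            else:
                warnings.append((now, i))     # inputs not synchronised
    return log, warnings
```

An adder of latency 2 feeding a doubler of latency 1 turns input words 5 and 7 at time 0 into the word 12 on the adder output at time 2 and the word 24 on the doubler output at time 3.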
Inputs and outputs to the system are via external files. 
These data files are essentially a list of events associated with 
the system input or output nodes. The format requires time and 
datum to be given in a file for each input node. Output data is
produced in a time, node, datum format. It is not difficult to
write programs which accept this format and produce alternative 
representations from it thus creating user-friendly interfaces. 
For example, input data may be generated from a set of software 
signal generators that produce commonly-used waveforms (weighted 
sums of sinusoids, square, triangular and saw-tooth waves, step 
functions, chirps etc) under user controlled selection and parame-
ter value choice. Equally, output data may be interpreted in a 
graphic form, producing oscilloscope-like traces of signal activity 
on nodes. In this way it is possible to construct a productive 
work-bench emulation environment for the systems designer. 
Figure 3.13 shows simulation output of this type for a 
complex-to-magnitude chip. The inputs are two orthogonal sinusoids 
of equal but decaying amplitude, which should produce a decay 
envelope at the output. The simulator verifies this behaviour for
a run of 32 data samples over several cycles of the sinusoids.
Note some initially invalid activity at the output as the pipeline
is cleared. This is equivalent to the latency of the operator.
Figure 3.13 (axes: amplitude against time)
It remains now to show how the primitive definitions are formulated
so as to interface with the simulator kernel. The simulator
model of a primitive is based on utilities to interface it to
the simulator kernel and on its word-level function (addition,
multiplication, etc.), together with appropriate adjustments to
maintain bit-level fidelity. This type of model is illustrated in the
following pseudo-code description of the MULTIPLY primitive:
multiply() {
    declarations
    retrieve control node number for this instance
    if (time of event on control node == time) {
        retrieve signal node numbers for this instance
        retrieve parameters for this instance
        check parameter values, warn if wrong
        compute latency and any other constants required
        if (time of event on any data node != present time) {
            timing_warning() and diagnostics
        }
        retrieve input signal values
        check formats, warn if out of range
        multiply input values to required degree of precision
        format bytes of product for output
        enter output values, event times and nodes on queue
    }
    else
        timing_warning() and diagnostics
}
=====================================================================
There is also a facility for storing and retrieving any primitive-
associated, internal-state information which may be needed. 
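What is meant by a word-level function with adjustments for bit-level fidelity can be illustrated by a small sketch. The number format here is an assumption and is not taken from the MULTIPLY primitive: data and coefficient are taken to be two's complement words, the coefficient is treated as a fraction with coeff_bits - 1 fractional bits, and the low-order product bits are discarded by truncation. The function names are hypothetical.

```python
def to_signed(word, bits):
    """Interpret an unsigned bit pattern as a two's complement integer."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word


def to_word(value, bits):
    """Wrap an integer into an unsigned bit pattern of the given length."""
    return value & ((1 << bits) - 1)


def multiply_model(data_word, coeff_word, data_bits, coeff_bits):
    """Word-level product under the assumed fixed-point format.

    The full-precision product is formed first; the adjustment for
    bit-level fidelity is the truncation (arithmetic shift) and the
    wrap to the output word length.
    """
    product = to_signed(data_word, data_bits) * to_signed(coeff_word, coeff_bits)
    return to_word(product >> (coeff_bits - 1), data_bits)
```

With 8-bit data and a 4-bit coefficient, the coefficient pattern 0100 represents one half, so the data word 100 gives 50, while the data word for -3 gives the pattern for -2 because truncation rounds towards minus infinity.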
3.10. Correctness and verification 
The stated aim is to produce a functionally correct system at 
the first attempt. It is unacceptable that time and money should 
be spent on fabricating a VLSI system that is anything other than 
correct. This holds especially for VLSI as a development medium 
for prototype or low-volume requirements. Conventional approaches 
to VLSI synthesis are distinctly hit-and-miss in this respect. 
Often as much resource is spent on verification as on the design 
process itself, and just as often the design will not function 
correctly at the first attempt. The general cause of this is 
attributable to the complexity of the VLSI design process. There 
may be many levels of representation, with scope for error in 
interpreting between each of them. The silicon compiler can avoid 
this situation by virtue of 'correctness by construction'.
There are two approaches to ensuring that chip designs are 
correct. One is to attempt a design and then test it, either by 
circuit extraction and simulation, or in the limit, by fabrication. 
If the result of the simulation (test) is unsatisfactory, the 
design is modified and this process is iterated until the simulated 
(tested) behaviour is correct. The cycle is illustrated in Figure 
3.14. The scheme becomes practically unwieldy for VLSI systems 
because the data structures involved become extremely large, and 
the computational demands of whole-system simulation are prohibi-
tive, if even possible. A more acceptable variant of this approach 
is to design hierarchically and support the resulting structure 
with a series of extractor/simulators which verify behaviour 
between levels. This technique can be made to work in practice for 
the correct design of complex circuits, but it is clearly costly to 
implement, and the many levels and iterations are demanding on 
expert skills, time and computational power. 
The alternative approach is to use only techniques of con-
struction that are known to be correct, and to complement these by 
automating their implementation and verifying this automation. In 
this way, synthesis can be made automatic, rapid and correct. To 
ensure correctness, therefore, it is necessary to verify the
techniques and their implementation. Three areas require
attention: the first is software verification, the second is the
underlying techniques and assumptions, and the third is the primitive
cell library. This leads to formal systems for specification and














verification. Research into applying these techniques has been 
started [16,17,18,19], but lies outwith the scope of this thesis. 
Such approaches to circuit design will, however, be of great
importance to the development of the next generation of design
methodologies and CAD tools.
A model of the structure of FIRST is shown in Figure 3.15. 
The requirement to verify the behaviour of each system (as above) 
is replaced by previously verifying a set of equivalences between 
the components of the physical design subsystem and the models and 
assumptions built into the simulator. These include: 
the primitive leaf cell layouts
the primitive composition routines
the assumed interface conventions
the chip wiring routines
the behavioural primitive models.
Figure 3.15 (surviving block labels: Floorplanner & Router, Final Artwork)
Once the software and the data structures are validated, there is 
an assurance that each system that is composed by them will be 
valid. Validation of the primitive library is of particular impor-
tance, since this component is not fixed, as are the compiler, lay-
out generator, and simulator. It is capable of expansion as 
required, or there may exist different versions for different tech-
nologies. 
Briefly, this validation is carried out using a benchmark test 
set. This contains examples of each primitive in alternative con-
figurations, to test different cases of the composition routines as 
well as the leaf cells. An HDL description of this test set is 
compiled. Circuit extraction and simulation are then used to vali-
date the primitives, their interface conventions, and the channel 
routing. A final proof is given by a fabrication-test run imposed 
on any primitive before acceptance into the public library. Fig-
ure 3.16 is a microphotograph of such a primitive verification 
chip. Further details may be found in chapter 7. 
Figure 3.16
3.11. Design for Test, ATPG and Self-Test
A substantial proportion of system development cost is
absorbed by testing. As system complexity increases, the test
problem increases exponentially, unless specific steps are taken to
redress this effect. Test can be divided into design testing,
parametric testing, initial product testing, in-service testing and
reliability testing. The problem of testing is to provide a means
of distinguishing good from faulty components and systems, at
minimum cost. This is achieved by the design of test input pat-
terns (TIPs) which exercise the system and reveal, by means of the
test output patterns (TOPs), the presence of any internal faults in 
the system. The extent to which internal faults can be detected 
for any given TIP is termed the test coverage of the TIP with 
respect to the system under test. The usual approach to TIP design 
is first of all to create fault models which are injected into a 
simulation model of the system. TIPs are then devised and run 
through the simulation and measurements of fault coverage are made 
against the resulting TOPs. Here again, as system size increases, 
the simulations required grow exponentially. This rapidly leads
to systems which are untestable, either because insufficient
information about the internal conditions can be deduced from TOPs,
or because to do so would require test times far in excess of the 
total system lifetime. Thus techniques have been devised in order 
to reduce test complexity and prevent the creation of such systems. 
These techniques involve reducing internal combinatorial and 
sequential depth by structuring system design from the outset with 
testability in mind. A review of established methods can be found 
in [20,21,22].
In the context of the design environment under consideration 
the requirements are as follows. Firstly, system synthesis guide-
lines are needed so that it is not possible to create HDL descrip-
tions which compile into untestable silicon structures. Secondly, 
an automatic test pattern generator is required, which can give an 
arbitrary but known test cover for any system under design. Within 
this environment parametric testing and reliability testing are not 
addressed since the resources to support investigation in these 
areas are not available.
In order to make an appropriate choice of test strategy and so 
as to extract system design guidelines, it is necessary to examine 
the properties of bit-serial primitives and of networks composed of 
such. The features of importance for testability are as follows. 
Firstly the fan-in and fan-out of internal nodes is low; what is 
more, this is equally true for primitives as it is for systems. 
Secondly the combinational and, thirdly, the sequential depth of all
primitives is small. Fourthly, there are very few primitives which
possess resistance to random pattern testing. (This occurs when nodes
are not affected by random pattern test inputs.) Fifthly, nearly all
primitives possess the property that they propagate random patterns.
Factors one, two and three contribute generally to easing
all forms of test. Factors four and five make bit-serial primitives,
and any architectures synthesised from bit-serial primitives,
particularly amenable to random pattern testing. The choice
of random pattern testing is made on the grounds of economy of TIP 
generation and the extent of test coverage available. Detailed 
development of the arguments is set out in [23,24,25]. In summary,
it has been established that short (1023 bit) pseudo-random bit
sequences (PRBS) can give one hundred per cent test cover for even
the most complicated primitives (against the single stuck-at fault
model). Further, it has been shown that, as a result of propaga-
tion of randomness, networks of such primitives can be tested with 
the same degree of cover by the same sequences. Furthermore, full 
system simulation to establish the degree of test cover is not 
necessary. 
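A 1023 bit PRBS is the maximal-length sequence of a ten stage linear feedback shift register, since 2^10 - 1 = 1023. The following sketch generates such a sequence in software. The feedback polynomial x^10 + x^7 + 1 is an assumption chosen because it is a well known primitive polynomial; the source does not state which polynomial was used, and the function name is hypothetical.

```python
def prbs10(seed=1, length=1023):
    """Return `length` bits from a 10-stage LFSR (assumed x^10 + x^7 + 1).

    The register must not start in the all-zero state, from which the
    sequence would never leave.
    """
    state = seed & 0x3FF
    if state == 0:
        raise ValueError("seed must be non-zero in its low 10 bits")
    bits = []
    for _ in range(length):
        new_bit = ((state >> 9) ^ (state >> 6)) & 1   # taps at stages 10 and 7
        state = ((state << 1) | new_bit) & 0x3FF
        bits.append(new_bit)
    return bits
```

Independent test input patterns for different input nodes could be obtained, for example, by using different seeds or different phases of the same sequence; which of these the ATPG used is not stated in the text.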
There are only two consequences for system design at the level 
of HDL description and above. Firstly, all loops must be broken so 
that the sequential depth is kept to a minimum. This allows inter-
nal states of such loops to be controlled and observed. The 
mechanism for breaking loops is to introduce a multiplex point with 
one external input and also to tap a node as output. Secondly, any 
primitives which do not propagate a PRBS must be isolated from the 
network during test. Again this may be done by the introduction of 
appropriate multiplex points. 
The problem of ATPG now reduces to generating independent PRBS
TIPs of suitable length. This can easily be done using a computer
program. The corresponding TOPs can be derived by injecting these 
PRBS TIPs as input to a BDS system simulation of the system HDL. 
The length of test sequences and the degree of test cover can be 
established from the primitive test characteristics and the princi-







A consequence of this approach to system test is that it can 
also be implemented in hardware at minimal cost. PRBS generators 
can be implemented as small circuits. Further, data compression 
techniques (signature analysis) and recognition of test-pass or 
test-fail can also be built into a system in hardware. This gives 
rise to an integrated, self-test capability. The architecture for 
achieving this is illustrated in Figure 3.17. A PRBS register and 
multiplex point is associated with each input pad. A feed-back 
register observer of data output (FRODO) is associated with each 
output pad. Since bit serial systems are not pad limited there is 
ample area for both in the pad ring. Further, the timing and con-
trol circuitry required is small and so can also be located in the 
pad ring. This results in self-test with no increase in chip area 
and an increase of the order of one per cent in active area. By 
appropriate configuration of the multiplex points, the architecture 
can be set up for individual chip test, for system test (including 
interconnect) and for system run-time configuration (cp. Figure 
3.18). The achievable fault coverage when using signature analysis 
is reduced below 100%, but can be made arbitrarily high by increasing
the length of the PRBS TIPs. Typically, a 1023-bit PRBS can give
greater than 99% coverage of single stuck-at faults.
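The generation of a PRBS and the compression of a response into a signature can be sketched in a few lines of Python. This is a minimal illustration only: the feedback taps (stages 10 and 7 of a 10-stage register), the seed values and the function names are assumptions made for the example, and the signature register is a generic serial-input design rather than the FRODO circuit itself. The device under test is taken to be a plain wire, so that the response equals the TIP.

```python
MASK = 0x3FF  # 10-stage shift register


def feedback(state):
    """XOR of stages 10 and 7 (an assumed maximal-length tap choice)."""
    return ((state >> 9) ^ (state >> 6)) & 1


def prbs(length=1023, seed=1):
    """Return `length` bits of the pseudo-random binary sequence."""
    if seed & MASK == 0:
        raise ValueError("an all-zero seed locks the register")
    state, bits = seed & MASK, []
    for _ in range(length):
        bit = feedback(state)
        bits.append(bit)
        state = ((state << 1) | bit) & MASK
    return bits


def signature(bits, seed=0):
    """Compress a serial bit stream into a 10-bit signature."""
    state = seed & MASK
    for b in bits:
        state = ((state << 1) | (feedback(state) ^ b)) & MASK
    return state


tip = prbs()                       # PRBS used as the TIP
good = signature(tip)              # reference signature of the fault-free response
faulty = list(tip)
faulty[500] ^= 1                   # a single corrupted response bit
assert signature(faulty) != good   # the corruption changes the signature
```

Because the signature register is linear, a single corrupted bit always alters the signature; multiple errors can occasionally cancel, which is why coverage with signature analysis falls slightly short of 100%.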
3.12. Summary 
This chapter has described a design environment comprising a 
design methodology, techniques for design automation and a collec-
tion of integrated software tools, for implementing bit-serial sig-
nal processing systems. The author has taken a major part in the 
specification, development and implementation of this design 
environment. In the next two chapters a generalised technique for 
mapping mathematical functions into HDL descriptions is given. 
This is followed by an evaluation of what can be achieved using the 
design environment and a review of hardware verification of the 
primitives. 
[Figure: block diagram of SIGNAL PROCESSOR chips linked through PAD cells containing PRBS and FRODO registers.]
P. S. Robertson, "The IMP-77 Language," Department of Computer 
Science Internal Report, University of Edinburgh (1977). 
B. W. Kernighan and D. M. Ritchie, The C Programming Language,
Prentice Hall (1978).
R. F. Lyon, "A Bit-Serial VLSI Architectural Methodology for
Signal Processing," pp. 131-140 in VLSI 81: Very Large Scale
Integration, ed. J. P. Gray, Academic Press (1981).
P. B. Denyer, "An Introduction to Bit-Serial Architectures for
VLSI Signal Processing," pp. 225-241 in VLSI Architecture, ed.
B. Randell and P. C. Treleaven, Prentice Hall (1983).
J. P. Gray, I. Buchanan, and P. S. Robertson, "Designing Gate
Arrays Using a Silicon Compiler," Proceedings, 19th Design
Automation Conference, pp. 377-383 (1982).
J. P. Gray, I. Buchanan, and P. S. Robertson, "Controlling 
VLSI Complexity Using a High-Level Language for Design 
Description," Proceedings, International Conference on Com-
puter Design, (1983). 
D. J. Rees, "SKIMP Mk II," Department of Computer Science 
Internal Report, University of Edinburgh (1980). 
A. V. Aho and J. D. Ullman, Principles of Compiler Design, 
Addison Wesley (1977). 
R. F. Lyon, "Simplified Design Rules for VLSI Layouts," Lambda
Vol. 2(1) pp. 54-59 (1st Quarter 1980).
A. M. Gundlach and R. Holwill, Edinburgh Microfabrication
Facility Design Rules: N-Channel Silicon Gate Process 2 (Revision
B). March 1981.
C. Mead and L. Conway, Introduction to VLSI Systems, Addison-
Wesley (1980).
J. Mavor, M. A. Jack, and P. B. Denyer, Introduction to MOS
LSI Design, Addison Wesley (1982).
J. Newkirk and R. Mathews, The VLSI Designers Library,
Addison-Wesley (1983).
J. G. Hughes, "VLSI Design Tools," Internal Report, University
of Edinburgh Department of Computer Science (1981).
N. W. Bergmann, "A Case Study of the FIRST Silicon Compiler,"
pp. 413-430 in 3rd Caltech Conference on VLSI, ed. R.
Bryant, Computer Science Press (1983).
M. Gordon, "A Model of Register Transfer Systems with Applica-
tions to Microcode and VLSI Correctness," Department of Com-
puter Science Internal Report, University of Edinburgh 
(1981). 
G. J. Milne, "CIRCAL A Calculus for Circuit Description," 
Department of Computer Science Internal Report, University of 
Edinburgh (1982). 
G. J. Milne, "A Simple Silicon Compiler and its Correctness,"
in Proc. 6th International Symposium on Computer Hardware
Description Languages and their Applications, ed. T. Uehara
and M. Barbacci, North-Holland Publ. Co. (May, 1983).
G. J. Milne, "CIRCAL A Calculus for Circuit Description,"
Integration Vol. 1(2&3) pp. 121-160 (October 1983).
M. T. M. Rene Segers, "The Impact of Testing on VLSI Design 
Methods," IEEE Journal of Solid-State Circuits Vol. SC-
17(3) pp. 481-486 (June, 1982). 
T. W. Williams, "Design for Testability," NATO A.S.I., CAD for 
VLSI Circuits, Sijthoff & Noordhoff, (1981). 
G. Grassi, "Design for Testability," NATO Advanced Study 
Course on VLSI Design, (1980). 
A. F. Murray, "On the Effectiveness of Random Pattern Self 
Test for Bit-Serial Signal Processors," IEEE Trans. Comp., (to 
be published). 
A. F. Murray, P. B. Denyer, and D. Renshaw, "Self-Testing in 
Bit-Serial Parts: High Coverage at Low Cost," Proc. IEEE 
International Test Conf., pp. 260 - 268 (Philadelphia, 
October 1983). 
P. B. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-
Serial Approach, Addison Wesley (to be published). 
CHAPTER 4 
ARCHITECTURAL DESIGN TECHNIQUES 
4.1. Introduction
The preceding chapter has described tools and techniques to 
assist in the production of sets of VLSI chip designs from func-
tional flow graphs. A successful implementation now depends on the 
formulation of a flow graph that correctly represents the function. 
System architectures are at issue here, especially with regard to 
the provision and use of concurrency. In this chapter some system 
architectures for real-time signal processing are examined. Ways 
are developed by which to derive both an appropriate computation 
scheme and a corresponding hardware architecture for a general 
class of signal processing algorithms. 
The material presented in this chapter has been developed by 
the author and has similarities to the work of other researchers 
[1,2,3,4,5]. It differs, however, in one significant aspect,
namely in the treatment of bandwidth matching and multiplexed
architectures. In this respect it incorporates and extends work
done by Lyon [6]. A revised form of this chapter is to be found in 
[7], which also contains further examples of system case studies. 
The approach is to formulate function as a recurrence or set 
of recurrences. In order to compute each output sample the opera-
tions of the recurrence are repeated a fixed number of times. Each 
repetition is termed an iteration and the operations carried out in 
each iteration are identical. A processor is defined to implement 
the operations required for a single iteration of the recurrence. 
The major issue of concern is concurrency and its responsible 
application. It is necessary to achieve specific real-time sample 
rates using adequate but not excessive arithmetic processing 
resources. The range of possibilities (in terms of processors per 
iteration) is as follows. Firstly, the system can use only one 
processor, as in a von Neumann machine. This type of architecture 
is termed fully serial. Secondly, several processors can be used
but in any case fewer in number than there are iterations. Archi-
tectures of this sort are termed multiplexed architectures. 
Thirdly, the system can use the same number of processors as there 
are iterations. These are termed full array architectures. An 
example of such an architecture is the systolic array 
[8,9,10,11,12,13,14]. Note that the term array is used here to 
signify a linear array; however, the ideas may readily be extended 
to two dimensional arrays. Finally, for very high speed applica-
tions, more processors than there are iterations can be used. 
These could be termed distributed highly parallel array architec-
tures. 
The specific area of real-time applications addressed can be 
implemented using multiplexed architectures. Thus, the area 
between the fully serial architecture and the full array architec-
tures is of primary concern. Generally, neither of these extremes 
is optimal for any given real-time application and the most effi-
cient choice is a multiplexed architecture. 
This chapter has a generality which extends beyond specific 
bit-serial constructions. The architectures could be applied 
equally to bit-parallel or other implementations with different 
timing and communication protocols (e.g., self-timed).
However, it should be observed that the design environment 
developed so far essentially reduces the design task to that of 
synthesising a high level hardware network to implement particular 
signal processing functions, that is of mapping function via compu-
tation scheme into an architecture; the rest of design has been 
automated. It is possible to take an entirely ad hoc approach to 
this task and then use a behavioural simulator of the type already 
described in chapter 3 for verification and correction. However, 
such an approach is error prone and time consuming. This chapter 
proposes a systematic way of mapping function into architecture for 
a class of functions and formulates architectures in such a way 
that they can be expressed conveniently and concisely as a hardware 
signal flow graph. The next chapter is concerned with the subse-
quent detailed issues of developing hardware flow-graphs specific 
to bit-serial implementations and the tools described in the previ-
ous chapter. 
The material developed here, together with an algorithm 
specification language, could form a basis from which to 
develop a next generation of silicon compiler with a higher, algo-
rithmic level of input specification. 
4.2. An Example. 
By way of an introduction, and as an example of the type of 
problem to be tackled, consider the implementation of a programm-
able transversal filter, sometimes referred to as a non-recursive 
finite impulse response (FIR) filter. The definition of this type 
of filter can be formulated in terms of the z-transform [Rabiner &
Gold] as
\[ H(z) = b_1 + b_2 z^{-1} + \dots + b_M z^{-M+1} \tag{4.1} \]
Equivalently the filter can be specified in terms of the following
equation, which expresses the filter output in terms of its
previous inputs and tap weights
\[ y_t = b_1 x_t + b_2 x_{t-1} + \dots + b_M x_{t-M+1} \tag{4.2} \]
The principal problem of implementation is to derive a computation 
scheme and a corresponding architecture which can eventually be 
expressed as a hardware flow-graph. For this particular problem 
there are many possible solutions, some of which are illustrated 
and discussed at the end of this chapter. 
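Equation 4.2 can be evaluated directly, one output sample at a time. The following plain Python rendering is given for illustration only; the function name, the zero initial condition for samples before the start of the input, and the sample values are assumptions.

```python
def fir_output(b, x, t):
    """Evaluate y_t = b_1*x_t + b_2*x_{t-1} + ... + b_M*x_{t-M+1}.

    `b` holds the tap weights b_1..b_M and `x` the input samples x_0, x_1, ...
    Samples before x_0 are assumed to be zero.
    """
    y = 0
    for i, weight in enumerate(b):      # i = 0 corresponds to tap b_1
        if t - i >= 0:
            y += weight * x[t - i]      # one multiply-add per iteration
    return y


taps = [1, 2, 3]                        # a 3-point filter, M = 3
samples = [1, 0, 0, 4]
outputs = [fir_output(taps, samples, t) for t in range(len(samples))]
```

Each pass through the loop is one multiply-add, which is the elementary operation that the following discussion assigns to a processor.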
Equation 4.2 can be reformulated as a recurrence since its
evaluation depends on repeated multiply and add operations; these
are the operations of a single iteration of the recurrence, which a 
processor must implement. From the hardware cycle time required to 
compute a single iteration and the input sampling frequency it is 
possible to determine how many such elementary operations can be 
carried out in one unit of sample time and therefore how many pro-
cessors will be needed for the complete recurrence evaluation in 
real-time. This corresponds to finding a computation scheme which 
has the correct bandwidth for the application so that the 
corresponding hardware will be capable of, but not greatly exceed, 
the required throughput rate. This is termed bandwidth matching. 
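The arithmetic of bandwidth matching can be sketched as follows. The function name and the numerical figures (one million iterations per second per processor, a 31.25 kHz sample rate) are hypothetical and serve only to illustrate the calculation; integer rates are used to avoid rounding effects.

```python
def processors_needed(num_iterations, iteration_rate_hz, sample_rate_hz):
    """Smallest processor count that sustains the required sample rate.

    A processor completing `iteration_rate_hz` iterations per second can
    carry out iteration_rate_hz // sample_rate_hz iterations per sample.
    """
    per_sample = iteration_rate_hz // sample_rate_hz
    if per_sample < 1:
        raise ValueError("an iteration takes longer than one sample period")
    # Ceiling division: enough processors to cover every iteration.
    return (num_iterations + per_sample - 1) // per_sample


# Hypothetical figures: 256 iterations, 1 MHz iteration rate, 31.25 kHz sampling.
k = processors_needed(256, 1_000_000, 31_250)
```

With these assumed figures each processor can perform 32 iterations per sample, so 8 processors suffice for 256 iterations; fewer would fall short of the throughput and more would exceed it.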
Suppose that an application requires a 256-point filter. This 
filter can be implemented at the one extreme with 256 multipliers 
and 256 adders etc; this would be a full array architecture (see 
Figure 4.1(a)). At the other extreme it can be implemented with
one multiplier and one adder; this would be a fully serial archi-
tecture (see Figure 4.1(b)). Between these extremes there is a 
variety of choice, e.g., 2, or 4, or 8,... multipliers and adders 
could be used, which would give multiplexed architectures (see Fig-
ure 4.1(c)). Since the arithmetic elements have a fixed maximum 
operating speed, the choice of architecture limits the external 
bandwidth of the full function. Conversely, given a maximum sam-
pling frequency for the filter, the choice of how much concurrency 
to use can then be made. 
There are some further important considerations at this stage 
of design. If more than one processor is needed, the costs of 
inter-hardware communication must be considered; in other words, 
what balance is required between computation bandwidth and input-
output bandwidth? What constraints are there on computational 
latency? Is there a requirement for extensibility and in what 
form; i.e. should there be a facility for adding more iterations 
by adding more hardware, requiring a modular solution, or is fixed 
order hardware required to cope with arbitrary order computation 
merely by alteration of the control structure? 
For the purpose of this illustration assume that an appropriate
solution is obtained by choosing to implement 32 iterations per
processor. Thus 8 such processors are required to effect 256 
iterations. Given this constraint, one possible computation scheme 
is then to have each processor evaluate successively 32 consecutive 
iterations of multiply accumulate: processors 1,2,3... evaluating 
iterations 1,33,65,... followed by 2,34,66... etc. These groups of 
iterations, carried out in one processor, are referred to as stages 
of the computation. The stages can be organised to be computed 
concurrently, as here, or sequentially. The computation is then 
completed by adding the accumulation sums from the 8 stages. This
computation can be set out in detail in the form of a
location-time-value graph, or hodograph (see section 4.4.1). The 
computation implicitly defines the connections required between 
processors as well as the memory requirements for each processor; 
thus, it defines the associated architecture. The hardware flow 
diagram of this architecture, shown in Figure 4.2, is derived by 
filling in these connections and memory elements explicitly. The 
memory required is particularly simple and can be implemented as 
FIFOs. 
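The computation scheme just described can be mimicked in software to check that it yields the same result as direct evaluation. The sketch below is an illustration under stated assumptions: the function names and the arbitrary integer data are invented for the example, `window[i]` stands for the input sample x_{t-i}, and the inner loop stands in for processors that would operate concurrently in hardware.

```python
def fir_direct(b, window):
    """Reference result: the sum of b_i * x_{t-i+1} over all M iterations."""
    return sum(w * x for w, x in zip(b, window))


def fir_multiplexed(b, window, k):
    """Evaluate the same sum with k processors, M/k iterations each.

    Processor p handles one stage of consecutive iterations.
    """
    M = len(b)
    if M % k:
        raise ValueError("k must be a factor of M")
    per_stage = M // k
    acc = [0] * k                               # one accumulator per processor
    for step in range(per_stage):               # sequential within a stage
        for p in range(k):                      # concurrent across processors
            i = p * per_stage + step            # zero-based iteration index
            acc[p] += b[i] * window[i]
    return sum(acc)                             # final addition of stage sums


M, k = 256, 8
b = [(7 * i) % 11 - 5 for i in range(M)]        # arbitrary integer tap weights
window = [(3 * i) % 13 - 6 for i in range(M)]   # arbitrary input samples
assert fir_multiplexed(b, window, k) == fir_direct(b, window)
```

At step 0 the eight processors work on iterations 1, 33, 65, ... (one-based), at step 1 on 2, 34, 66, ..., matching the ordering given in the text.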
4.3. From Function to Architecture via Algorithm 
4.3.1. Some Terminology 
The following terms are used in a specific sense throughout 
this chapter. 
The term function is used to mean a mathematical formulation 
of a signal processing transform. Such a formulation will not be 
interpreted as containing information about the way in which 
evaluation is to be carried out; it will be taken to define only 
the final result. 
In contrast, the term algorithm is used to refer to the 
detailed, multiple sequences of arithmetic operations which are 
required to evaluate a function. An algorithm includes time, pro-
cess location and control information and so determines the organi- 
sation of specific arithmetic operations. Algorithms can be 
expressed in a variety of ways, including parallel programming
languages.
[Figure 4.2. Surviving panel labels: Processor (b); Multiplexed architecture (c).]
The term architecture is used to denote the organisation of 
specific hardware to carry out an algorithm, or class of algo-
rithms. This organisation of hardware will have a direct 
correspondence with the organisation of concurrent computation. 
The main characteristics of an architecture are its "spatial" and 
"temporal" organisation. The former relates to the existence and 
interconnect of structural components, processors, memory, etc. 
The latter relates to their function in time and their control. 
A machine is the actual hardware implementation of an archi-
tecture. 
The class of functions for which architectures are discussed 
comprises those functions which can be formulated as recurrences. 
4.3.2. Recurrences
A very broad class of signal processing functions can be 
expressed as a set of recurrence equations, with possibly also some 
initial and/or final non-recurrent arithmetic processing. Such
formulations are in wide use, as for example in time difference 
equation formulation of signal processing functions. The formula-
tion of a function into a set of recurrence equations constitutes 
the first step of finding an algorithm for its computation. The 
heart of such a formulation is a basic set of arithmetic operations 
which are carried out many times for each evaluation of the function.
Each repetition of this basic set of operations is termed an
iteration. The total number of iterations required for one sample 
output of the function is referred to as the number of iterations. 
In the example of section 4.2 there are 256 iterations of the
multiply-add operation. The number of iterations is always fixed for
the range of functions considered here. The operations carried out
in one iteration define the processor function. In multiplexed
architectures, where groups of several iterations are evaluated in
a single processor, each such group is said to constitute a stage.
The sample cycle is defined by the duration of the arithmetic 
processing which must be carried out from the start of one input 
data set to the start of the next. Computation of iterations of
the recurrence (together with any non-recurrence, peripheral
processing, if any is required) occurs during this period. The
computation done during one sample cycle may comprise the evaluation of
all iterations for the current time sample, as in parallel or 
serial-parallel architectures. In this case the output data set 
will appear at the end of the current cycle, and the architectural 
latency is equal to the sample cycle, i.e., minimum. Alternatively 
the computation may comprise only one or a few iterations associ-
ated with the current time sample, as in pipelined architectures. 
In such cases the remaining iterations of the current time sample 
are evaluated previously and/or subsequently, using only part of 
the machine at a time. The data output set at the end of the 
current cycle may refer to the present or to some previous time 
sample. 
Indefinite repetition of the cycle constitutes the machine 
function. Other cycles which will be referred to are function 
cycle, idles cycle, stage cycle, iteration cycle, word cycle and 
single full recurrence cycle. Processor latency is defined to be
the time required for a processor to evaluate one iteration;
processor latency is equal to the iteration period. Recurrence latency
is defined to be the time required for the whole architecture to 
evaluate a single full recurrence, i.e., from present sample asso-
ciated input to present sample associated output. The recurrence 
latency is equal to the full recurrence period. 
Two specific recurrence formulations are possible: one in 
terms of a single time sample associated output of the function and 
the other in terms of the processor input-output "register" values. 
The former will be termed function recurrence equations and the 
latter processor recurrence equations. For function recurrence 
equations only time and iteration indices are needed; for processor
recurrence equations time, processor and sub-cycle (multiplex, or 
idles) indices are needed (see the following subsections). Usually 
design proceeds by formulating function recurrence equations, 
choosing an architecture type and then formulating the correspond-
ing processor recurrence equations. The equivalence of the two 
formulations can be established by induction on the processor 
recurrence equations. A hodograph (see section 4.4.1) gives a 
"visual" demonstration of the equivalence, and can be a useful 
development tool in formulating the processor recurrence equations.
An Index Notation for Iterations 
Each iteration of a general recurrence requires inputs and 
produces outputs. In order to be able to refer to input and output 
values for any particular iteration the following notation is used. 
Let the total number of iterations be M (in the introductory example
M=256). Let i be the index denoting the iteration and let t be
the index denoting the time sample association.
The iteration index i runs from 1 to M and will be interpreted 
as follows. The index i refers to the particular iteration under 
consideration. Any indices i-h refer to previous iterations and 
indices i+h refer to subsequent iterations. 
The index t runs from 1 to any arbitrary value; the machine is 
assumed to cycle for an indefinite time. Sample time indexing is 
interpreted as follows. The index t refers to the present time
sample, for which the full recurrence is being evaluated. 
For the evaluation of the iterations of a full recurrence cer-
tain restrictions can be imposed on the inputs to each iteration. 
The criterion governing these is that of non-contradiction. For 
example, the value of the present iteration cannot be used before 
it has been calculated. Similarly, in a recurrence where the value 
of the previous-iteration-present-time-sample is required as an 
input for the calculation of the present-iteration-present-time-
sample, the value of the next-iteration-present-time-sample cannot 
also be used as input. Suppose that the present-iteration-
present-time-sample is indexed (i,t) then, under the assumption 
that all previous time sample iterations have been completed before 
the present one is commenced, outputs associated with the following 
iterations and time samples can be used in association without con-
tradiction: 
    (i-h, t-s)
    (i+h, t-s)
    (i,   t-s)
    (i-h, t)
where h, s > 0 and 0 <= i <= M;
    i-h refers to previous iterations,
    i+h refers to subsequent iterations,
    t-s refers to previous cycles.
That is, the following outputs may be used: those from previous
iterations at previous times, subsequent iterations at previous
times, the present iteration at previous times and previous iterations
at the present time. The iteration index i = 0 refers to
external, initial value inputs, and M is the total number of
recurrence iterations in one recurrence evaluation of the function.
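The non-contradiction rule can be stated compactly as a predicate. The sketch below is illustrative only: the function name and argument order are assumptions, and it presumes ascending order of evaluation together with completion of all earlier time samples before the present one begins.

```python
def may_use(i, t, j, u, M):
    """True if output (j, u) may feed iteration (i, t) without contradiction.

    Index j = 0 denotes external, initial value inputs.
    """
    if not (1 <= i <= M and 0 <= j <= M):
        raise ValueError("iteration index out of range")
    if u < t:
        return True          # any iteration at a previous time sample
    if u == t:
        return j < i         # only previous iterations at the present time
    return False             # later time samples are never available
```

For example, with M = 8 the call `may_use(3, 5, 4, 4, 8)` accepts a subsequent iteration at a previous time sample, whereas `may_use(3, 5, 4, 5, 8)` rejects the same iteration at the present time sample.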
Figure 4.3 shows a representation of the present, previous and
subsequent iterations with their time sample outputs and illus-
trates these valid connection points diagrammatically. A particu-
lar algorithm "programmes" the connections from these points to the 
inputs, using only those required to define it. 
There are cases where it is necessary to relax the assumption 
that all previous time sample recurrences have been completed 
before the present one is commenced. In particular this is neces-
sary in pipelined architectures. However, when this is done, res-
trictions will generally be imposed on the values that h and s may 
take. 
Define the implementation latency to be the time taken to 
evaluate all iterations of the present sample time recurrence. 
[Figure 4.3: "Uncommitted" processor and delay chain bank showing iteration values available, assuming ascending order of evaluation and completion of previous time sample associated iterations. Columns: iterations i-2, i-1, i, i+1, i+2; rows: time samples t-1, t-2, t-3.]
Synchronisation of Recurrences with External I/O 
In implementing recurrences, it is usually assumed that 
inputs which are external to iterations are supplied in the 
required synchronisation, and that the rest of the system is con-
figured to ensure that this will happen. This must happen in 
real-time, so that firstly the computation bandwidth must be 
designed to be adequate and secondly any synchronising delay paths 
must be added. Often, particularly in adaptive systems, it is 
necessary to process the outputs from one cycle of a recurrence 
and feed back some resulting parameter(s) as input to the next 
cycle. The processing involved imposes a fixed delay between the 
end of one recurrence cycle and the start of the next. In order to 
ensure synchronisation of this output with the start of the next 
recurrence cycle there must be a phase of no operation for each 
processor during each cycle. In architectures which are continu-
ously clocked, it is necessary to introduce extra, "dummy" itera-
tions in the recurrence. These are carried out by the recurrence 
hardware during the imposed delay between the end of one recurrence 
cycle and the start of the next. For synchronisation reasons this 
delay must always be an integral number of iteration cycles. These 
will be referred to as idling iterations or idles. The recurrence
cycle is then divided, with respect to the total number of itera-
tions into a function sub-cycle and an idles sub-cycle which are 
contiguous and non-overlapping. 
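The number of idles follows from rounding the imposed delay up to a whole number of iteration cycles. The sketch below illustrates this arithmetic; the function names and the tick counts are hypothetical, with all times measured in clock ticks.

```python
def idle_iterations(feedback_delay, iteration_cycle):
    """Idles needed to cover `feedback_delay`, both given in clock ticks."""
    return -(-feedback_delay // iteration_cycle)   # ceiling division


def recurrence_cycle(function_iterations, feedback_delay, iteration_cycle):
    """Length of the recurrence cycle, in iteration cycles."""
    return function_iterations + idle_iterations(feedback_delay, iteration_cycle)


# Hypothetical figures: 32 function iterations, a 16-tick iteration cycle
# and a 40-tick feedback delay give 3 idles and a 35-iteration cycle.
idles = idle_iterations(40, 16)
cycle = recurrence_cycle(32, 40, 16)
```

The function sub-cycle here occupies 32 iteration cycles and the idles sub-cycle 3, so the idle overhead in this assumed case is 3 iteration cycles in 35.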
Clearly, idling iterations or no operation phases represent an 
unavoidable inefficiency in the implementation architecture. It 
is, therefore, necessary either to minimise the length of the idles 
cycle or to find a system use for the idles hardware during the 
idles cycle, where these cannot be avoided altogether. 
4.3.3. Recurrence Architectures 
The form of a recurrence may determine what possibilities 
there are for organising the iterations and the processors which 
carry them out and whether they have to be carried out sequen-
tially, or whether some or all might occur concurrently. Within 
the loose constraints imposed by the given function definition, 
there are usually several possible organisations for computation 
and architecture. These are now explored, so as to obtain a taxon-
omy of architectures. 
In the introduction to this chapter architectures have already 
been classified as full array, multiplexed or fully serial, accord-
ing to the number of processors used. This may be regarded as a 
kind of spatial classification. This is now extended further in 
relation to the distribution and sequencing of iterations within 
processors, i.e., in the direction of "temporal" classification. 
Consider the iterations being evaluated during a single sample 
cycle in the processors. If these all relate to the same time sam-
ple output then the architecture will be termed homogeneous. Here 
all the iterations relating to the present output sample are car-
ried out within the one recurrence cycle. If each iteration 
relates to a different time sample output the architecture will be 
termed separated. If some iterations refer to the same time sample 
output and some refer to different time sample outputs then the 
architecture will be termed grouped. Figure 4.4 illustrates this 
taxonomy of architectures. The above terms are chosen so as to 
avoid the ambiguity which more familiar designations hold. 
[Figure 4.4: Taxonomy of recurrence architectures. Surviving labels: ARCHITECTURE; FULL ARRAY; MULTIPLEXED; FULLY SERIAL; HOMOGENEOUS.]
However, they are somewhat cumbersome so the following conventions 
are adopted. Homogeneous full array architectures will be termed 
parallel architectures. Homogeneous multiplexed architectures will 
be termed serial-parallel architectures. Homogeneous fully serial 
architectures will be termed serial architectures. Grouped and 
separated architectures will be termed pipelined architectures. 
The use of the terms parallel, serial-parallel, serial and 
pipelined should not be confused with their more general designa-
tions. For example, there is parallel processor function in homo-
geneous, separated and grouped architectures (if they are not fully 
serial) but the parallel function operates differently with respect 
to the time association of the iterations being evaluated. In one 
case it may be parallel and in another serial etc. 
Let the number of processors in the architecture be denoted by 
the letter k, and let 1<= k <= M , where M is the total number of 
iterations. The architectures of particular interest are as fol-
lows. 
Homogeneous fully serial: Serial 
If k=1 there is just one processor and the present sample 
associated iterations must be computed serially in some order, usu-
ally either ascending or descending order. Ascending order sequen-
tial evaluation will be referred to as forward multiplexing of the 
processor to evaluate the present sample time associated itera-
tions. Descending order sequential evaluation will be termed back-
ward multiplexing. The implementation latency will be M processor 
cycles. 
Homogeneous Multiplexed: Serial-Parallel 
If 1<k<M there is more than one processor but not as many as 
one for each iteration of the recurrence. If, further, for the 
duration of the present time sample cycle the processors evaluate 
only present time sample associated iterations, then the organisa-
tion will be referred to as serial-parallel. Serial-parallel 
architectures can implement forward or backward multiplexing within 
each processor, all processors normally implementing the same type. 
The implementation latency will be M divided by k processor cycles. 
Serial-parallel architectures are instances of multiplexed
architectures. In the introductory example k=8 and M=256.
Homogeneous Full Array: Parallel 
If k=M there is one processor for each iteration of the
recurrence. If, further, for the duration of the present sample time
cycle each processor evaluates only present sample time associated
iterations, then the organisation will be termed concurrent or parallel.
The implementation latency will be only one processor cycle. 
Separated and Grouped: Pipelined 
If, for the duration of the present sample cycle, the proces-
sors evaluate not only the present sample iterations, but also oth-
ers, then the organisation will be termed pipelined. It is not 
usual to use a pipelined organisation with k=1. However, various 
pipelined organisations with k=M or with 1<k<M are possible. The 
implementation latency will depend on both the value of k and on 
the pipeline organisation. Often pipelining causes an increase in 
latency, but this may be an acceptable trade-off against increased 
efficiency in other areas e.g., throughput, communication etc. 
Pipelined architectures may be multiplexed architectures or full 
array architectures. 
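The naming conventions and implementation latencies given above can be collected into a small lookup. The sketch below is illustrative: the function names are assumptions, the flag `homogeneous` records whether all iterations in a sample cycle relate to the same time sample, and the requirement that k divide M anticipates the restriction on multiplexed implementations stated in the text.

```python
def architecture_name(k, M, homogeneous):
    """Name an architecture from its processor count and time association."""
    if not 1 <= k <= M:
        raise ValueError("expected 1 <= k <= M")
    if not homogeneous:
        return "pipelined"            # separated or grouped
    if k == 1:
        return "serial"
    if k == M:
        return "parallel"
    return "serial-parallel"


def implementation_latency(k, M, homogeneous):
    """Latency in processor cycles, or None where the text gives no formula."""
    if not homogeneous:
        return None                   # depends on the pipeline organisation
    if M % k:
        raise ValueError("k must be a factor of M")
    return M // k                     # M for serial, 1 for parallel
```

For the introductory example, `architecture_name(8, 256, True)` gives the serial-parallel case and `implementation_latency(8, 256, True)` gives 32 processor cycles.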
In a multiplexed implementation, whether pipelined or serial-
parallel, a single physical processor computes, sequentially, 
several iterations of the recursion and there are certain necessary 
restrictions on the values of k and M. The value k must be a fac-
tor of M. Clearly the choice of M as well as the choice of k has 
implications for multiplexed implementations. Often a signal pro-
cessing algorithm, when expressed as a set of recursions, gives the 
specification of M in terms of an inequality, M being greater than 
some fixed value. Thus, the choice of value for M has a degree of 
latitude and a suitable value, which can be factorised, may usually 
be found to suit the implementation requirements. Often M is associated
with function performance, so that it is possible to trade
off performance against architectural efficiency and convenience. 
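The latitude in choosing M and k can be explored with two small helpers. This is a sketch under assumptions: the function names are invented, and the lower bound `m_min` is taken to be the smallest acceptable value of M.

```python
def valid_processor_counts(M):
    """All k with 1 <= k <= M that are factors of M."""
    return [k for k in range(1, M + 1) if M % k == 0]


def smallest_valid_m(m_min, k):
    """Smallest M >= m_min that has k as a factor."""
    return -(-m_min // k) * k         # round m_min up to a multiple of k


counts = valid_processor_counts(256)  # powers of two from 1 to 256
padded = smallest_valid_m(250, 8)     # 250 rounds up to 256
```

A requirement of at least 250 iterations with 8 processors would thus be met by choosing M = 256, giving 32 iterations per processor.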
Where an implementation uses more than one processor, it is 
usual for the multiple processors to work in synchronism. It 
should be noted that this is not the only possible organisation;
other schemes include alternate processors active in antiphase over
a two phase cycle, pipelined staggering of processor cycles etc. 
The specification of these details will be referred to as the pat-
tern of processor activity. 
It is worth noting here that the organisation of the sequence 
of iterations determines the type of memory needed to support each 
processor and that both the architecture and the recurrence
determine the details of processor interconnect.
4.3.4. Recurrence Types 
Recurrences are now classified according to their input 
requirements for the evaluation of iteration i at time t. This 
is important because the range of possible architectures is con-
strained according to this classification. Recurrences are classi-
fied as follows 
Class A: (i-h, t) is not used. That is, outputs from
previous iterations in this time sample are not used for
the computation of any of the iterations during it.
Class B: (i-h, t) is used but (M, t-1) is not used as input
to (j, t) for any value of j. This class exhibits zero sample
time delay between iterations, but does not have recursive
feedback from the last iteration.
Class C: (i-h, t) is used and (M, t-1) is used as input to
(j, t) directly or indirectly for some value of j. This class
exhibits zero sample time delay between iterations and recursive
feedback from the last iteration.
4.3.5. Bandwidth Matching 
The key implementation issue has already been identified as 
that of concurrency, i.e., the issue of how many processors to use 
and how to organise them. Various types of architecture have now 
also been defined in relation to the number of processors used and 
in relation to the structuring of iterations. The issue of 
bandwidth matching is addressed next. 
So that various architectural alternatives for implementing
recurrence functions can be explored, a notional, or virtual,
processor is associated with each iteration of the recurrence. By
way of contrast, the final computational hardware for evaluating a
single iteration of the recurrence will be referred to as a physical
processor. In connecting up virtual processors no delay is associated
with their operation. Physical processors, on the other hand, have
a fixed positive latency.
If it is decided to use one physical processor for each vir-
tual processor, then the architecture must be either a parallel or 
a pipelined full array architecture. This usually results in 
excessive bandwidth for a given application. If this is the case
then it is better to use fewer physical than virtual processors.
The architecture is then multiplexed and may be either serial-parallel
or pipelined. Finally, if only one processor is used, the
architecture is serial.
Fixed rules can be developed for mapping full array architec-
tures of parallel or pipelined type into multiplexed architectures. 
These rules are dealt with in detail in a later subsection; they 
allow easy and flexible adjustment of an architecture to cover a 
wide range of bandwidths. The notion of a group of virtual proces-
sors being replaced by a group of physical processors can help to 
visualise the mapping method and ease design detailing. 
Recurrence Imposed Constraints on Bandwidth Matching 
Firstly, when considering bandwidth matching in relation to 
choice of architecture, there are some weak conditions imposed by 
recurrence type. These restrict the architectures which can be 
used to implement the various classes of recurrence and are as fol-
lows. 
Class A recurrences can be implemented by any of the architec-
tures. 
Class B recurrences cannot be implemented by serial-parallel 
or parallel architectures and must therefore be implemented 
either in pipelined or serial form. 
Class C recurrences can be implemented only serially. 
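
The three restrictions above amount to a small lookup from recurrence class to permitted architecture. The following Python fragment is a minimal sketch of that lookup. The dictionary layout and the function names are illustrative assumptions, not part of the methodology. The four architecture names are taken from the text.

```python
# Permitted architectures per recurrence class, as stated in the text.
# The data structure and function names are illustrative assumptions.
ALLOWED = {
    "A": {"serial", "serial-parallel", "pipelined", "parallel"},
    "B": {"serial", "pipelined"},
    "C": {"serial"},
}


def allowed_architectures(recurrence_class):
    """Return the set of architectures that can implement the given class."""
    return ALLOWED[recurrence_class]


def is_compatible(recurrence_class, architecture):
    """True if the architecture may be used for a recurrence of this class."""
    return architecture in ALLOWED[recurrence_class]
```

For instance, `is_compatible("B", "parallel")` is false, reflecting the exclusion of parallel and serial-parallel forms for Class B.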
Choice of Architecture & Multiplexing. 
Once the recurrence and peripheral processors have been 
designed, their latency and the system word-length can be deter-
mined. This information is needed in order to determine how many 
processors to use. In the meantime the key unknown quantities are
represented by the use of symbolic constants. An outline is now 
given of how to match the required system bandwidth with the pro-
cessor bandwidth in order to choose an architecture which has 
minimal hardware but meets the required performance. This usually 
turns out to be a multiplexed architecture, hence the association 
of the term multiplexing with bandwidth matching and choice of 
architecture. 
The following symbolic constants are the key independent vari-
ables in a determination of the architecture. 
Maximum system sampling frequency: this is usually determined
from the application bandwidth. Let B denote the maximum bandwidth 
required for the application and let F denote the minimum sampling 
frequency necessary to achieve this bandwidth. Then F is given by: 
F>=2B. The value F/2 is referred to as the Nyquist frequency. 
Often the sampling frequency is taken to be greater than this 
value. 
Input signal quantisation, Q, measured in bits. 
Hardware implementation maximum serial clock frequency, H, 
measured in Hz. 
From these it is possible to determine the choice of implemen-
tation as follows. 
Firstly, it is necessary to determine the processor latency, 
i.e., the processor bandwidth or cycle time. This is done by 
implementing the arithmetic processing required to calculate a sin-
gle iteration of the recurrence and finding its latency. Let L 
denote this latency and GLB(L) denote the greatest lower bound on 
this latency, both measured in bits. Again, the eventual value of 
L must make up an integer number of iteration sub-cycles whereas 
GLB(L) need not. The value of GLB(L) will be determined by the 
minimum processing latency for implementation of the arithmetic 
function, whereas the value of L must take account of other issues. 
Secondly it should be decided, from the inputs to the 
recurrence, whether it is necessary to have any idle iterations. 
Let z be the number of idle iterations. The value of z will be
determined by whatever arithmetic processing has to be done between 
the completion of a full recurrence and the start of the next one 
and can be determined from the minimum latency of the hardware 
which implements this arithmetic processing. Denote this minimum 
latency by GLB(l), standing for the greatest lower bound on l,
where l will be used to denote the latency of the hardware. The
value of l in bits must always make up an integer number of
sub-cycles, whereas GLB(l), also measured in bits, may not. The value
of z must be an integral number of sub-cycles. Consequently, the
hardware may have to be adjusted later, from having latency GLB(l),
to having the next larger integral sub-cycle as latency, in order
to satisfy this condition. This is easily done by the addition of
FIFOs of the required length to each output. The value of z is
given by:
z = GLB(l)//n 	 if rem(GLB(l),n) = 0
z = GLB(l)//n + 1 	if rem(GLB(l),n) ≠ 0
where n is the system word-length, where GLB(l) is the
greatest lower bound on the latency, where GLB(l)//n is the integer
quotient and where rem(GLB(l),n) is the remainder.
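
The rule for z is a rounding up of GLB(l) to a whole number of sub-cycles. The short Python sketch below expresses it. The function names are hypothetical. The second function, which gives the FIFO length needed to pad the hardware up to the next whole sub-cycle, is an assumption drawn from the remark about FIFOs above.

```python
def idle_iterations(glb_l, n):
    """Number of idle iterations z for minimum latency glb_l (bits)
    and system word-length n (bits): GLB(l)//n, plus one if there
    is a non-zero remainder."""
    quotient, remainder = divmod(glb_l, n)
    return quotient if remainder == 0 else quotient + 1


def fifo_padding(glb_l, n):
    """Extra delay in bits needed to raise the latency from GLB(l)
    to the next whole number of sub-cycles (assumed interpretation)."""
    return idle_iterations(glb_l, n) * n - glb_l
```

With a word-length of 16 bits, a minimum latency of 33 bits gives z = 3 and requires 15 bits of padding, whereas a latency of 32 bits gives z = 2 with no padding.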
Given the above groundwork, the following lemma expresses the 
number of processors to be used, in terms of the independent con-
stants. 
Let the number of iterations per stage be denoted by m, i.e.,
m is the number of iterations to be implemented by each physical
processor. Let the total number of physical processors required to
implement the M iterations of the whole recurrence be denoted by k.
The following lemma states how to find values for m and k from the
independent variables and expresses them in terms of the appropri- 
ate symbolic constants. It should be noted that a description of 
the recurrence in these terms does not, in fact, constrain the 
eventual solution to being a multiplexed implementation; if m=1 and 
k=M then there is one processor per iteration and the implementa-
tion can be concurrent. 
Lemma 
The solution sets for m and k are given by
$$ m + z \le \frac{H}{nF}, \qquad k = \frac{M}{m} $$
Note that both m and k must be positive integers. 
The above lemma states that the period required to calculate
one iteration is n/H and the sampling period is 1/F, so that the
number of iterations that can be calculated in one sample period is
H/(nF). Its significance is that this fixes the upper limit on the
total number of function iterations plus idle iterations that can
be evaluated in this time. Thus if a total of M iterations are to
be evaluated, where M is greater than m, then more than one
processor has to be used and the number of processors required is given
by k. This implies that the architecture must be multiplexed if m
is greater than one, or a full array architecture if m equals one.
Also it requires that the recurrence imposed constraints allow 
implementation by these types of architecture. If not then the 
recurrence cannot be implemented for these clock rates, word-
lengths and bandwidths. 
From the above lemma the solution set for m can be determined.
Where this solution set is non-empty the value chosen for m should
be the maximum allowable value, thus minimising hardware and
maximising multiplexing within the required bandwidth.
Note 1.
m = 1 	implies a parallel architecture
m = M 	implies a serial architecture
1 < m < M implies a multiplexed architecture
m < 1 	is NOT realisable
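
The selection procedure can be sketched in Python as follows. The sketch assumes, from the discussion above, that function plus idle iterations per sample period are bounded by H/(nF), that k must be a factor of M with M = km, and that the largest admissible m is preferred. The function name, the integer-valued arguments and the returned labels are illustrative assumptions.

```python
def choose_multiplexing(H, n, F, z, M):
    """Pick the largest m with m + z <= H/(nF) and m dividing M.

    H: serial clock frequency in Hz, n: system word-length in bits,
    F: sampling frequency in Hz, z: idle iterations, M: total iterations.
    Returns (m, k, kind) or None if no m >= 1 is realisable.
    """
    budget = H // (n * F)          # whole iterations per sample period
    m_max = min(budget - z, M)     # no point exceeding M iterations
    for m in range(m_max, 0, -1):
        if M % m == 0:             # k = M/m must be an integer
            k = M // m
            if m == M:
                kind = "serial"    # one processor does everything
            elif m == 1:
                kind = "parallel"  # one processor per iteration (Note 1)
            else:
                kind = "multiplexed"
            return m, k, kind
    return None                    # m < 1: not realisable
```

As an illustration with assumed figures, an 8 MHz serial clock, 16-bit words, a 50 kHz sampling rate, two idle iterations and M = 24 give a budget of ten iterations per sample, hence m = 8 and k = 3. The chosen m must still be checked against the recurrence imposed constraints.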
Note 2.
If the following conditions are satisfied, then the
recurrence is real-time implementable as a pipelined
array of multiplexed processing units.
m>1
all inputs for output indexed (i,s) have one of the
following types of index:
(i - h,s - t)
(i - h,s)
(i + h,s - t)
(i,s - t)
where h,t>0
no conflict exists between implementation type as
determined by the value of m and as determined by the
recurrence class.
Rules for Transforming Array Architectures 
into Multiplexed Architectures. 
It has already been mentioned that fixed rules for mapping
array architectures into multiplexed architectures can be 
developed. They are useful in taking a known high bandwidth 
array architecture and reducing both bandwidth and hardware 
in order to match specific lower bandwidth applications with 
minimum hardware. In this section a method for accomplishing 
this is outlined and some of the rules are stated. 
Here attention is restricted to two types of full array 
architecture: the parallel architectures and the separated 
full array architectures. The full array is to be thought of 
as a virtual architecture and a method is demonstrated for 
distributing the operations of several processors of the full 
array into a single physical processor by time multiplexing 
and the addition of appropriate memory and control. The 
sequence of iteration evaluation determines the data flow 
beyond that already fixed by the full array architecture and 
thus fixes the memory structures required. It is of 
paramount importance, where there is freedom of choice with 
respect to this sequencing, to choose so as to minimise and 
simplify the memory requirements. When this is done, in many
cases the memory can be implemented by nothing more
complicated than FIFOs. For functions and array architectures
where these simple memory structures cannot suffice, either
special custom memory architectures can be developed or the
conventional techniques using addressed RAM (random access
memory) can be applied. In such cases RAM address generation can
become a substantial part of the design.
Consider again a recurrence of M iterations implemented 
as a full array architecture. Suppose that the equations 
given are the processor recurrence equations associated with 
this full array. Further, consider this as a virtual archi-
tecture for which a multiplexed architecture is sought able 
to carry out the same function. Let this multiplexed archi-
tecture consist of k physical processors each performing m 
iterations per sample cycle. The output of each iteration 
must be stored from the physical processor in such a way as 
to make it available later for input as required. This is 
fixed by the function and the virtual architecture. The fol-
lowing special cases occur frequently in the signal process-
ing architectures of interest to us. They are listed 
together with their memory-multiplex-interconnect-control 
structure. The processor recurrence equations for the vir-
tual architecture are indexed (i,t) by processor and sample, 
with 1<=i<=M. Since the array interconnection is regular, 
the processor recurrence input requirements can be formulated 
in terms of the processors indexed (i+h,t-s) or (i-h,t-s) 
whose outputs must be made available to the processor (i,t). 
Each of the three instances stated below is a special case of 
this with h=1 and s=0 or s=1. In particular, it should be
noted that it is assumed for these cases that all processors 
in the multiplexed architecture are active for the whole of 
the sample cycle. Generalisation to other values of h and s 
is straightforward. The hardware and control required for 
other cases, not covered by this list, for example with other 
patterns of processor activity, can usually be derived easily 
from a hodograph of the full array architecture. 
In full array architectures requiring the value of
iteration (i-1,t) as input for the calculation of iteration
(i,t), m virtual processors are replaced by a single physical
multiplexed processor with output fed back to input via a
memory loop of length 1 iteration cycle and multiplexer
controlled by c2. The cascade tap out point is as shown and if
such multiplexed processors are cascaded then they will each
be operating on groups of iterations associated with
successive time sample outputs, see Figure 4.5.
In full array architectures requiring the value of iteration
(i-1,t-1) as input for the calculation of iteration (i,t), m
virtual processors are replaced by a single physical
multiplexed processor with output fed back to input via a memory
loop of length m+1 multiplex subcycles and multiplexer
controlled by c2. The cascade tap out point is as shown and if
such multiplexed processors are cascaded directly then they
can all operate on iterations associated with the same time
sample outputs, see Figure 4.6.
In full array architectures requiring the value of iteration
(i+1,t-1) as input for the calculation of iteration (i,t), m
virtual processors are replaced by a single physical
multiplexed processor with output fed back to input via a memory
loop of length m-1 multiplex subcycles and multiplexer
controlled by c2. The cascade tap out point is as shown and if
such multiplexed processors are cascaded directly then they
can all operate on iterations associated with the same time
sample outputs, see Figure 4.7.
[Figures 4.5 - 4.8: multiplexed processor structures for inputs
(i-1,t), (i-1,t-1), (i+1,t-1) and (i,t-1) respectively. Notes from
the figure legend: dm is an m bit delayed version of c2; the
multiplexer output is y=x0 for c2=0 and y=x1 for c2=1, where c2 is
the control; delay symbols represent a sample cycle delay and
delays of 1 and m multiplex subcycles; the cascade output and
cascade input points are marked.]
In full array architectures requiring the value of iteration
(i,t-1) as input for the calculation of iteration (i,t), m
virtual processors are replaced by a single physical
multiplexed processor with output fed back to input via a memory
loop of length m multiplex subcycles and multiplexer
controlled by c2. The cascade tap out point is as shown and if
such multiplexed processors are cascaded directly then they
can all operate on iterations associated with the same time
sample outputs. This type of loop is used to provide constant
coefficients associated with each iteration, see Figure 4.8.
Note that, in the full array architecture, the pattern of
activity for each processor can include only one active phase
and m-1 idle phases during a single sample cycle.
These special cases are now extended as follows. 
Represent the multiplexed architecture by a notional bank of 
"uncommitted" processor-memory chains as partially
illustrated in Figure 4.9. The following lemma states where to
find the correct tap out positions for input to processor 
(i,t) in the multiplexed architecture. 
Lemma 2.
Consider a full array architecture consisting of M pro-
cessors. Let processor recurrence equations for any output 
associated with the processor (i,t) be formulated in terms of 
[Figure 4.9: notional bank of uncommitted processor-memory chains,
showing tap positions for inputs (i±h,s-1) with h = am + b;
M = 24, k = 6, m0 = 2, 1 <= j <= 6.]
Table 4.1

Input        Quantity                  Main (C2=0)      Alternate (C2=1)
(i-h, s-r)   I/P Tap Position          j-a              j-(a+1)
             Distance from MPX point   r(m0+m)+b        r(m0+m)-m+b
             Selected during           b < t <= m0+m    0 < t <= b
(i+h, s-r)   I/P Tap Position          j+a              j+a+1
             Distance from MPX point   r(m0+m)-b        r(m0+m)+m-b
             Selected during           0 < t <= m-b     m-b < t <= m0+m
(i, s-r)     I/P Tap Position          j                not used
             Distance from MPX point   r(m0+m)          not used
             Selected during           0 < t <= m0+m    not used
Further, let the processor activity pattern be such that 
there is one function subcycle followed by m0 idle subcycles 
per single sample cycle (i.e., the processors are active 
together and idle together). Then this full array architec-
ture can be replaced by a multiplexed architecture consisting 
of k processors each implementing m iterations, where M=km. 
The single sample cycle length for the multiplexed
architecture will be (m0 + m) and the processor activity pattern
will be m function subcycles followed by m0 idle
subcycles. Further, if the j-th processor of the multiplexed
architecture evaluates the function of the (j-1).m +
1,...,j.m processors of the full array architecture in
ascending order then Table 4.1 gives the tap positions of the
notional multiplexed architecture according to input type.
Let h = a.m + b where 0<=b<m and let t be an integer such
that 1 <= t <= m0 + m. Table 4.1 gives the input tap
positions in terms of the processor to be tapped and the distance
of the tap point from the processor multiplex point. It also
gives the parts of the multiplex subcycle during which the
selection is valid. Figure 4.9 illustrates this for m=4,
m0=2.
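
The tap rules of Table 4.1 can be sketched in Python as below. The sketch assumes that the table's selection intervals are read as "b < t" for the main tap of an (i-h,s-r) input and "t <= m-b" for the main tap of an (i+h,s-r) input, and that processor j evaluates iterations (j-1).m+1,...,j.m in ascending order during subcycles 1,...,m. The function names and the encoding of the input type as -1, 0 or +1 are hypothetical. The second function locates the source iteration by direct enumeration and serves only as a cross-check on the first.

```python
def tap(direction, h, r, m, m0, j, t):
    """Tap (processor, distance from multiplex point) for processor j
    at subcycle t. direction is -1 for (i-h,s-r), +1 for (i+h,s-r)
    and 0 for (i,s-r); h is ignored when direction is 0."""
    cycle = m0 + m
    if direction == 0:
        return j, r * cycle
    a, b = divmod(h, m)                      # h = a.m + b, 0 <= b < m
    if direction < 0:
        if t > b:                            # main tap
            return j - a, r * cycle + b
        return j - (a + 1), r * cycle - m + b    # alternate tap
    if t <= m - b:                           # main tap
        return j + a, r * cycle - b
    return j + a + 1, r * cycle + m - b      # alternate tap


def tap_by_enumeration(direction, h, r, m, m0, j, t):
    """Same result for active subcycles 1..m, found by locating the
    source iteration in the full array and mapping it to a processor
    and subcycle of the multiplexed architecture."""
    i_src = (j - 1) * m + t + direction * h
    j_src = (i_src - 1) // m + 1
    t_src = (i_src - 1) % m + 1
    return j_src, r * (m0 + m) + (t - t_src)
```

With m0 = 0 and h = 1 the main taps reduce to the loop lengths of the special cases listed earlier: 1 for (i-1,t), m+1 for (i-1,t-1), m-1 for (i+1,t-1) and m for (i,t-1).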
4.4. Miscellaneous Related Topics 
In the foregoing sections an outline of a class of func-
tions and architectures has been given together with methods 
for matching a given function to an efficient known architec-
ture. This section covers two miscellaneous related topics. 
Firstly a technique is introduced which can be used for the 
verification of architectures and for developing new archi-
tectures. Secondly the main evaluation criteria for assess-
ing signal processing architectures are set out. 
4.4.1. Processor-Time-Value Graphs, Hodographs & 
Hardware Signal Flow Graphs. 
The above approach to modelling computation naturally 
leads to a location-time-value indexed notation, which 
expresses signal values as a function of the following
arguments, or indices:
processor index
sample time index
multiplex sub-cycle index
Generally, signal processing equations can be reformu-
lated, formally in terms of such notation, as processor 
recurrence equations. These recurrence equations imply 
corresponding hardware processors, memory requirements and 
interconnection of elements; if developed correctly they 
could be used as a high level form of hardware description. 
The time sample index is integer and arbitrary. Negative 
values will be used only to represent initial values of
internal state, etc. The processor index is integer and 
bounded between one and the total number of physical proces-
sors used in the implementation. The multiplex sub-cycle 
index values are integer modulo the multiplex sub-cycle 
length, and therefore lie between zero and one less than the 
sub-cycle length. The iteration index of the function 
recurrence equations is integer and bounded between one and 
the total number of iterations of the recurrence. The itera-
tion index is directly related to the processor and multiplex 
subcycle indices through the architecture. In concurrent
architectures, the iteration and processor index are identi-
cal, the multiplex cycle reduces to one iteration cycle, and 
the fundamental cycle is often also the same as these. Such 
notation allows the detailed function of the computation 
scheme to be represented and defined. In particular, the 
simultaneous or non simultaneous staggering of processor 
operation, the distribution of intermediate values between 
processors, the order of computation of partial results 
within processors and the data flow communication between 
processors can be defined unambiguously. 
One of the problems associated with such notation is 
that it involves a profusion of index detail and therefore 
tends to be tedious and error prone. Further, different pro-
cessor recurrence equations often evaluate the same function 
and that function may not be obvious from the recurrence 
equations. Thus it is possible to have two or more sets of
different recurrence equations, with corresponding different 
hardware implementations, to compute the same function and 
this fact will not be clear from the recurrence equation 
notation alone. Various techniques exist for making this 
processor-time-value notation easier to use. The representa-
tion of computation schemes can be expanded into tabular or 
graphical form by enumerating the processor index horizon-
tally, (conventionally left to right ascending), and the 
sub-cycle index vertically, (conventionally top to bottom 
ascending). The cycle index is constant over each full sub-
cycle count and then increments at the start of the next. 
Each sub-cycle count is evaluated modulo the sub-cycle length 
and therefore runs repeatedly from zero to one less than the 
sub-cycle length. The elementary unit of time is that of one 
processor operation. Within this framework the computation 
scheme is defined by entering the signal values computed at 
each processor-time position. This method of representation 
will be termed a processor-time-value graph of computation.
Automation of the processor-time-value graph notation can be 
used in a limited way to represent and manage values nota-
tionally by the use of a "symbolic" simulator. 
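
A tabular processor-time-value graph of the kind described can be generated mechanically. The Python sketch below is one possible rendering under stated assumptions: the processor index runs left to right, the sub-cycle index runs top to bottom from zero, processor j evaluates iterations (j-1).m+1,...,j.m in ascending order, and each entry holds the iteration index rather than a signal value. The function names and the use of "idle" for idle subcycles are hypothetical.

```python
def processor_time_value_table(k, m, m0=0):
    """Rows are sub-cycles 0..m+m0-1, columns are processors 1..k.
    Each entry is the iteration index evaluated there, or None for
    an idle subcycle."""
    rows = []
    for c in range(m + m0):
        row = []
        for j in range(1, k + 1):
            row.append((j - 1) * m + c + 1 if c < m else None)
        rows.append(row)
    return rows


def render(rows):
    """Format the table as text, one line per sub-cycle."""
    lines = []
    for c, row in enumerate(rows):
        cells = " ".join("idle" if v is None else f"{v:4d}" for v in row)
        lines.append(f"{c:2d} | {cells}")
    return "\n".join(lines)
```

For example, `print(render(processor_time_value_table(k=3, m=2, m0=1)))` lists iterations 1, 3, 5 in sub-cycle 0, iterations 2, 4, 6 in sub-cycle 1, and an idle row in sub-cycle 2.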
A further useful simplification is obtained by suppressing
the details of index notation and function value and
representing only the flow of data on the processor-time graph
grid. This can be done by the use of arrows to show data 
transfer in time and between processors. Such graphs are 
particularly useful for concurrent or pipelined, laminar 
flow, nearest neighbour, regular interconnection schemes, but 
can become unreadable for non local, or multiplexing induced 
turbulent flow communication schemes. In this form of 
representation vertical arrows represent data held within a 
processor but passed forward through time, and diagonal 
arrows represent data which is being passed from processor to 
processor through time. This form of representation 
emphasises the interprocessor communication structure but 
also shows how it behaves through time. Such graphs are 
termed hodographs. On a hodograph the feature of particular 
interest is where lines of flow intersect, as this is where 
the corresponding data will interact within processors. 
Diverging or parallel lines of flow represent data items 
which cannot interact to form partial results. The hodograph 
is a useful aid to visualising the interaction of data flow-
ing at different rates and in different directions through a 
linear array of processors. 
Hardware implementation of the computation scheme can be 
derived easily from any of the above representations by the 
addition of the required memory storage and correct intercon-
nection (including multiplex-switches) between the proces-
sors. In its initial form this is conveniently represented 
as a form of hardware signal flow graph or interconnection 
net of processor, memory and multiplex-switch elements. 
There is a choice in representation as to whether to inter-
pret and therefore represent processors as including latency 
or alternatively to interpret their function as instantaneous 
and represent the latency distinctly on each output. Each 
method has its advantages and disadvantages. The function of 
the memory and interconnect is to supply the correct input 
values to each processor at the correct time and is in 
essence defined implicitly by the computation scheme. 
Methods whereby this may be implemented are discussed below. 
Finally such hardware flow graphs can be coded directly into 
FIRST code for compilation. 
The representation methods outlined above furnish useful 
notational tools for developing or synthesising concurrent 
computing schemes, for defining them and also for verifying 
their function. Later sections of this chapter will indicate 
details of the derivation of hardware specifications from 
such computing schemes. The next sub-section outlines cri-
teria for evaluating the merits of any particular computing 
scheme and the following sub-section illustrates the use of 
what has been discussed by means of several examples. 
4.4.2. Evaluation Criteria. 
Evaluation of any computation scheme in relation to the 
proposed application is an important part of the design task. 
Such evaluation should be completed before final commitment 
to hardware and can be undertaken either before coding a 
hardware flow graph to implement the design or else on com-
pletion. The former is done if there is a well known 
theoretical approach for resolving all the issues, the latter 
if it is required to use a simulator in order to carry out 
some of this evaluation. Evaluation should include the fol-
lowing aspects and measure the associated attributes of the 
proposed architecture. 
Hardware Evaluation.
    total hardware required in each processor: number, type and use of primitives
    total local storage associated with processor
    processor latency
    processor modularity
    total number of processors
Evaluation of Communication.
    switching and interconnect features: extent, locality, bandwidth
    global communication overheads
Computation Evaluation.
    data flow, input/output, communication
    final output extraction
    response time
    throughput rate
    duty cycle: indication of use of processors
    local storage
    modularity
    expandability: via addition of hardware, via reconfiguration
Evaluation of arithmetic performance.
    system word-length
    use of dynamic range, signal distribution
    introduced non-linearities: overflow performance
    noise performance
    scaling overheads, format matching
    stability
The examples in the next section illustrate some of the 
evaluation points with regard to computation schemes. 
4.5. Examples. 
A specific example is now taken in order to illustrate 
the application of the ideas described above. The function 
chosen is that of the introductory section. This example is 
used for two reasons: firstly, it is a simple function so that
extraneous complexities will not obscure the principles 
involved and secondly it can be implemented by all the dif-
ferent types of architecture, giving a unified illustration 
for making comparisons. The various architectures that are 
set out below could be implemented in a variety of ways: 
bit-serial or bit-parallel, using discrete, programmable or 
custom parts. Some aspects of the evaluation of the computa-
tion architecture will depend partly on the implementation 
chosen. In particular, architectures with smallest 
input/output will be strongly favoured for bit-parallel sys-
tems whereas this may not be important for bit-serial sys-
tems, giving the designer an extra degree of freedom in find-
ing hardware efficiency. 
For the purposes of the following examples small, 
unrealistic values are chosen for k, m and M. This allows the
diagrams to be more easily followed. Each allows arbitrary 
extension and the form of the processor recurrence equation 
representation is general. 
The function, for which an architecture is to be found,
is defined as follows. Given constants
$w_1, w_2, \ldots, w_M$
and given signal values
$x_1, x_2, \ldots, x_N$
evaluate
$$ y_i = \sum_{j=1}^{M} w_j x_{i+j-1} \qquad \text{for } 1 \le i \le N+1-M \qquad (5.5) $$
Firstly observe, by comparing the data and partial 
results required for the computation of successive values 
that the function can be expressed as a class A recurrence. 
In the interests of illustrating a wide range of architec-
tures bandwidth matching is not considered here. Instead the 
full array architectures are explored and then the multi-
plexed architectures are considered. The fully serial archi-
tecture will not be discussed. The computation of a single 
sample output y requires the evaluation and summation of M
products $w_j x_{i+j-1}$. Thus in any architecture for this
algorithm, these are the iteration operations and the iteration
cycle is defined as the time required for the hardware to 
perform a single multiply-add. Answers to the following 
questions will then determine the architecture type and 
specific structure. 
How many processors are required to match the 
sampling frequency with the individual processor 
bandwidth? 
How are the iteration products and partial sums 
to be formed with respect to a single sample 
cycle? 
How are the data values to be located in the pro-
cessor array? 
What pattern of processor activity is to be used? 
Answers to these then determine the spatial and temporal 
aspects of the architecture fixing the data flow (i.e., they 
determine where data is used or formed during one cycle and 
where it must be sent for the next). From these requirements 
hardware can be derived directly. Apart from bandwidth res-
trictions, the range of possible architectures can be viewed 
in terms of the different ways in which the data can flow in 
order to compute the correct function, in this example the 
flow of the x, w and Y values to form the outputs y. The processors that
particular data has to flow through determine interconnections and
the rate at which data flows through them determines their 
latencies. A detailed representation can be worked out on a 
location-time-value graph or on a hodograph, a formal summary 
can be specified in terms of the processor recurrence equa-
tions and the corresponding hardware block diagram can also 
be given. 
Within an array architecture a range of alternatives can 
be built up according to the range of possible flows of the 
data involved and the pattern of processor activity required 
to compute the function using these flows. Often some 
specific flow patterns will exclude computation of the 
required function, or give rise to inefficient results. 
A Note on Flow and Processor Activity 
The flow of data with respect to a linear array of pro-
cessors can be forward, backward or stationary. Constant 
rates of flow can be represented by positive, negative or 
zero values and measured in terms of the processor associated 
latencies as 1/n where n is the latency of each processor.
Globally communicated data, corresponding to zero latency 
associated with processors, can be thought of as having 
infinite flow rate for notational purposes. 
The sample cycle processor activity must be defined in 
terms of the total number of iteration and idle cycles per-
formed between input samples. Processor activity can then be 
defined in terms of this cycle as the pattern of activity. A 
numerical value for this can be derived for each processor by 
coding the active and idle phases within the cycle. Active 
phases can be coded 1 and idle phases 0. For example, for a 
cycle consisting of five iteration and three idle cycles, a 
processor which is active during sub-cycles 1 and 3 would 
have code 00000101. For maximum efficiency, processors
should be active throughout the sample cycle, but some archi-
tectures rely on alternate processors being active during 
different phases in order for data to flow correctly to form 
the required function. Example 4 below illustrates an 
instance of this. 
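
The coding of the pattern of activity can be sketched in Python as follows. The bit ordering is an assumption inferred from the example above, in which sub-cycle 1 occupies the rightmost position of the code. The function names and the duty cycle helper are illustrative.

```python
def activity_code(cycle_length, active_subcycles):
    """Binary string with one bit per sub-cycle: 1 for active and
    0 for idle, sub-cycle 1 in the rightmost position."""
    bits = ["0"] * cycle_length
    for s in active_subcycles:
        if not 1 <= s <= cycle_length:
            raise ValueError("sub-cycle index outside the sample cycle")
        bits[cycle_length - s] = "1"
    return "".join(bits)


def duty_cycle(code):
    """Fraction of the sample cycle in which the processor is active."""
    return code.count("1") / len(code)
```

For the cycle of five iteration and three idle cycles quoted above, `activity_code(8, {1, 3})` reproduces the code 00000101, with a duty cycle of 0.25.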
The hardware flow diagrams for each example are given 
with the text in Figures 4.10 - 4.15.
Example 1: Homogeneous full array. 
Equation 5.5 can be computed directly by the use of M 
parallel data inputs to M multipliers and by using a parallel
M input adder to sum the products. This is, however, unsuit-
able for anything other than very small values of M, because 
of the hardware problems involved in parallel addition and 
parallel data input. A variant of this architecture is to 
pipeline the data input and implement it via a tapped delay 
line. The hardware for this architecture is shown in Figure
4.10(a); Figure 4.10(b) shows the function of the
repeated processors.
Example 2: Pipelined full array. 
The main problem of the parallel architecture for the 
function given in equation 5.5 is that of the fan-in associ-
ated with parallel addition. The way to overcome this is to 
pipeline the formation of the sums by formulating each as 
"the sum so far" plus the next term and then to form it 
sequentially. One way of doing this is illustrated in Figure 
4.11. 
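The idea of forming each sum as "the sum so far" plus the next term
can be checked with a small word-level simulation. This is a sketch
under stated assumptions: the function is taken to be the convolution
sum y(t) = sum over k of w(k) x(t-k), the data sample is broadcast to
all M processors, and one register holds the partial sum between
adjacent processors. The function names are hypothetical and the
register placement need not match Figure 4.11.

```python
def direct_convolution(weights, samples):
    """Reference result: y[t] = sum_k weights[k] * samples[t - k]."""
    out = []
    for t in range(len(samples)):
        acc = 0
        for k, w in enumerate(weights):
            if t - k >= 0:
                acc += w * samples[t - k]
        out.append(acc)
    return out


def pipelined_sum_array(weights, samples):
    """Each processor adds its product to the sum so far from its neighbour."""
    m = len(weights)
    regs = [0] * m                      # regs[k]: partial sum held after processor k
    out = []
    for x in samples:                   # x is broadcast to every processor
        new_regs = [0] * m
        for k in range(m):
            sum_so_far = regs[k + 1] if k + 1 < m else 0
            new_regs[k] = sum_so_far + weights[k] * x
        regs = new_regs
        out.append(regs[0])             # processor 0 delivers the completed sum
    return out


w = [1, 2, 3]
x = [1, 0, 0, 2]
print(direct_convolution(w, x), pipelined_sum_array(w, x))
```

Both functions return the same sequence; the pipelined form replaces
the M input adder by M two-input additions, at the cost of still
broadcasting each data sample.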
Examples 3 & 4: Pipelined full arrays. 
Example 2 exhibits pipelining of the sum formation but
still has the problem of fan out, for moderate or large 
values of M. This can be solved by pipelining both the sum 
formation and the data transmission and formulating them as 
recurrences. Several different forms of pipelining and pro-
cessor activity are possible and the range of these can be 
explored by a systematic investigation of the various flow 
patterns that can be used. Two cases are illustrated. Example 3,
in Figure 4.12, illustrates one such design in which the flow
rates are all non-negative and the processor duty cycle is 100%.
Example 4, in Figure 4.13, illustrates a radically different
design in which one flow is positive, one negative and the third
zero, and the processor duty cycle has to be 50% for correct
evaluation of equation 5.5. Other
variants of pipelined full array architectures can be 
developed along these lines, starting from the data flow. 
Example 5: Pipelined full array. 
One feature of the architectures illustrated so far is 
that in order to increase M additional hardware must be 
added. At best this is just the connecting up of identical 
hardware as in the case of modular architectures, at worst it 
involves component redesign. For some applications it may be 
possible to relax the requirements on sampling frequency in 
the interest of extending M without having to add hardware. 
In other words a degree of programmability by control is 
wanted, rather than programmability by hardware reconfigura-
tion. A simple example of such programmability by control 
can be seen in the comparison of extending a bit-parallel 
adder and a bit-serial adder to accommodate a longer word-length.
Example 5 illustrates a pipelined architecture with the same
number of processors as iterations, but which can be
reconfigured to operate with fewer processors than
iterations. The pipelining is arranged so that each proces-
sor forms a fixed time sample recurrence and during any one 
cycle the processors form batches of successive time sample
outputs. The weights and data samples are passed systoli-
cally from processor to processor at different rates; the 
partial results stay fixed and are accumulated in each pro-
cessor. Successive processors output, in succession, batches 
of successive time sample outputs. This is illustrated in
Figure 4.14.
Example 6: Multiplexed pipelined. 
So far all the architectures have been full array archi-
tectures. The remaining example, see Figure 4.15, shows a 
multiplexed architecture, suitable when the maximum bandwidth 
offered by the previous implementations is excessive and it 
is more important to minimise the amount of hardware. It 
also illustrates the technique for mapping a full array 
architecture into a multiplexed architecture, being a multi-
plexed version of the full array architecture given in exam-
ple 4. 
The examples given above by no means exhaust the range 
of possible architectures for evaluating the convolution sum 
of products but they do illustrate many of the alternatives
and their architectural features, merits and problems.
Further examples of this may be found in [15].











[Figures 4.10 - 4.15: hardware flow diagrams for Examples 1 - 6.]
References
1. H. T. Kung, "Let's Design Algorithms for VLSI Systems,"
Department of Computer Science Report, Carnegie Mellon
University (1979).
2. H. T. Kung and C. E. Leiserson, "Algorithms for VLSI
Processor Arrays," pp. 271 - 292 in C. A. Mead & L. Conway,
"Introduction to VLSI Systems", Addison-Wesley (1980).
3. D. Cohen, "Mathematical Approach to Computational Networks,"
Technical Report, University of Southern California,
Information Sciences Institute (November 1978).
4. L. Johnsson and D. Cohen, "A Mathematical Approach to
Modelling the Flow of Data and Control in Computational
Networks," pp. 213 - 68 in VLSI Systems and Computations,
ed. H. T. Kung, B. Sproull, G. Steele, Springer-Verlag (1981).
5. L. Johnsson, U. Weiser, D. Cohen, and A. L. Davis,
"Towards a Formal Treatment of VLSI Arrays," Technical
Report, California Institute of Technology, Computer
Science Department (January 1981).
6. R. F. Lyon, "A Bit-Serial VLSI Architectural Methodology
for Signal Processing," pp. 131 - 140 in VLSI 81: Very
Large Scale Integration, ed. J. P. Gray, Academic Press
(1981).
7. P. B. Denyer and D. Renshaw, VLSI Signal Processing: A
Bit-Serial Approach, Addison-Wesley (to be published).
8. R. P. Brent and H. T. Kung, "Systolic VLSI Arrays for
Linear-Time GCD Computation," Proc. VLSI 83, pp. 145 -
155 (Trondheim, August 1983).
9. K. Bromley et al., "Systolic Array Processor
Developments," pp. 273 - 284 in VLSI Systems and
Computations, ed. H. T. Kung, R. F. Sproull & G. Steele
Jr., Springer-Verlag (1981).
10. P. R. Cappello and K. Steiglitz, "Digital Signal
Processing Applications of Systolic Algorithms," pp. 245 -
254 in VLSI Systems and Computations, ed. H. T. Kung, R.
F. Sproull & G. Steele Jr., Springer-Verlag (1981).
11. A. L. Fisher and H. T. Kung, "Synchronising Large
Systolic Arrays," SPIE Real-Time Signal Processing V Vol.
341 pp. 44 - 52 (1982).
12. W. M. Gentleman and H. T. Kung, "Matrix Triangularisation
by Systolic Arrays," SPIE Real-Time Signal Processing
Vol. 298 pp. 19 - 26 (1981).
13. R. H. Kuhn, "Yield Enhancement by Fault Tolerant
Systolic Arrays," pp. 145 - 152 in Proc. USC Workshop on
VLSI Design and Modern Signal Processing, ed. S. Y.
Kung, (Los Angeles, 1982).
14. H. T. Kung, "Why Systolic Architectures?," Computer Vol.
15(1) pp. 37 - 46 (January 1982).
15. H. T. Kung, Lecture Notes for Advanced Course on VLSI
Architecture (Bristol, July 1982).
CHAPTER 5 
SYSTEM SYNTHESIS AND SYSTEM MAPPING TECHNIQUES 
USING FIRST 
5.1. Introduction 
This chapter describes the design steps necessary for 
translating a computational system architecture into a functioning 
FIRST HDL specification. The form of the procedure given was 
developed by the author as a result of system design and consulta-
tion with P. B. Denyer and S. G. Smith. This chapter appears in a 
revised form in [1]. 
The method throughout has been to create well-defined interfaces
at strategic boundaries in the system design hierarchy.
Transformation of data from one interface to the next can only be 
automated if there is a set of formal rules defining the transfor-
mation. Properties which are preserved from one interface to the 
next must be invariant under the transformation and the transforma-
tion must only generate valid new data. The ease with which the 
problems between interfaces can be solved depends on the judicious 
definition of the interfaces. Where this has been done success-
fully the interface definitions are adopted as a standard, as for 
example CIF, RS232, file transfer protocol, etc. The main virtue 
of such an approach is to give a structured method for managing 
system complexity. 
In the design hierarchy proposed here such interfaces are the func-
tion, the algorithm and its architecture, the flow-graph, the FIRST 
hardware description language and CIF. In Appendix III an example 
is given, showing how to transform a signal flow graph into the 
FIRST hardware description language, and the silicon compiler then 
transforms this into CIF and an associated functional behaviour. 
In chapter 4 some target system architectures have been
developed; these assist in transforming the function into an algo-
rithm and architecture. Having established the architectural con-
text and structure for system implementation all that is left is 
the need to complete the details of the flow-graph. In this 
chapter a synthesis methodology is derived to provide a guided path 
through from computation architecture to hardware flow graph. 
Overview 
The plan is to build systems by 
Identifying the target architecture; 
Building arithmetic engines (processors); 
Supporting the processors with state memory; 
Implementing the array of multiplexed processors. 
The process of identifying the target architecture has been 
dealt with in some detail in chapter 4. This chapter deals with 
the remaining steps. 
5.2. Implementing Arithmetic Engines 
The issues involved in arithmetic processor design apply 
equally to the design of both the recurrence processors and the 
peripheral processors. These processors may be so simple that only 
one level of definition is needed. More complicated processors can 
be structured using the hierarchy of operator, chip and subsystem, 
together with some conventions to aid assembly. Such conventions 
are enumerated, with explanation, in later sub-sections. 
The arithmetic processor in its "raw form" has some inputs, an 
interconnection of arithmetic processing elements and some outputs. 
The interconnection net defines the set of nodes. Each primitive 
input and each output is then connected to the appropriate node. 
The design and specification of a processor thus becomes simply the 
construction of a flow graph of primitives, together with the nodes 
associated with their inputs and outputs. The sequence of parallel 
operations to be carried out is assembled directly from the bit 
serial primitives available. Such a list must satisfy certain cri-
teria of correctness in order to represent a valid network. (These 
criteria are checked for by both the simulator and the chip layout 
generation programs of the FIRST compiler; they include checks for: 
fan in, fan out, undriven nodes, and overdriven nodes.) 
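A minimal sketch of such correctness checks on a net list is given
here. The data structure, the function name and the fan-out limit are
all hypothetical; they are not the checks implemented in the FIRST
simulator or layout programs, and only undriven nodes, overdriven
nodes and fan-out are covered.

```python
from collections import Counter


def check_net_list(primitives, external_inputs, external_outputs, max_fan_out=4):
    """Report simple connectivity faults in a net list.

    primitives: list of (name, input_nodes, output_nodes) tuples.
    max_fan_out is an illustrative limit, not a FIRST design rule.
    """
    drivers = Counter(external_inputs)   # how many sources drive each node
    loads = Counter(external_outputs)    # how many sinks read each node
    for _name, inputs, outputs in primitives:
        drivers.update(outputs)
        loads.update(inputs)
    nodes = set(drivers) | set(loads)
    return {
        "undriven": sorted(n for n in nodes if drivers[n] == 0),
        "overdriven": sorted(n for n in nodes if drivers[n] > 1),
        "excess_fan_out": sorted(n for n in nodes if loads[n] > max_fan_out),
    }


# A multiply-add processor whose partial-sum input has been left unconnected.
net = [
    ("multiply", ["x", "w"], ["p"]),
    ("add", ["p", "s_in"], ["s_out"]),
]
report = check_net_list(net, external_inputs=["x", "w"], external_outputs=["s_out"])
print(report)
```

The report names `s_in` as undriven, which is the kind of fault a net
list check is intended to catch before simulation or layout.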
5.2.1. Time Tagging 
Perhaps the largest task in designing a bit-serial arithmetic 
processor is that of accommodating latency through the various
signal paths. Time tagging is a useful technique to help manage this
process. A notional time tag is associated with each node in a net 
list. Time tags correspond to the arrival of new words at the 
hardware nodes. The tags are defined relative to some location 
chosen as the time tag origin and relative to some starting time. 
The location for the system time tag origin is usually taken 
to be the control generator output. This corresponds to global 
tagging. Local time tag origins can be defined anywhere, for example
at the inputs to a processor. The absolute starting time is
time zero. Alternatively, relative starting time may be used. 
This can be any value of absolute time when it is taken as zero 
locally. A time tag is relative or absolute depending on its 
starting time, and local or global according to its origin. 
Define the time tag of a node to be the latency, measured in 
bit time, from the chosen origin. In other words, the time tag is 
the number of bit times required for the LSB associated marker of a 
signal to travel from its origin to the node in question. The time 
tag is a kind of time metric for the system. For convenience, the 
time tag can be measured in higher order units or even mixed units 
e.g., cycles, or words and bits, if desired. Because each bit-
serial primitive is pipelined and has a finite latency equal to 
some integer multiple of the clock, or bit time, the processor 
nodes will in general have different time tags associated with 
them. 
Time tags are used to keep account of data synchronisation 
during the design process and to help in generating the control 
network. Local time tags are usually used during processor design; 
changes of origin are easily accomplished during synthesis. 
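Time tagging lends itself to a simple propagation procedure. The
sketch below assumes that primitives are listed in data flow order,
that each has a single output, and that every primitive expects all of
its operands to arrive with the same tag. The latency values used in
the example are illustrative, not those of FIRST primitives.

```python
def time_tags(primitives, origin_tags):
    """Propagate local time tags (in bit times) through a net list.

    primitives: list of (name, latency, input_nodes, output_node),
                given in data flow order.
    origin_tags: tags of the nodes at the chosen local origin.
    """
    tags = dict(origin_tags)
    for name, latency, inputs, output in primitives:
        arrivals = {tags[node] for node in inputs}
        if len(arrivals) != 1:
            raise ValueError(f"{name}: operands arrive at tags {sorted(arrivals)}")
        tags[output] = arrivals.pop() + latency
    return tags


# Illustrative latencies: multiply 6 bits, add 1 bit.
net = [
    ("multiply", 6, ["x", "w"], "p"),
    ("delay", 6, ["s_in"], "s_d"),       # compensating delay on the sum path
    ("add", 1, ["p", "s_d"], "s_out"),
]
tags = time_tags(net, {"x": 0, "w": 0, "s_in": 0})
print(tags)
```

Removing the compensating delay from the example causes the adder
check to fail, which is the synchronisation error that time tags are
intended to expose.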
5.2.2. Further Conventions 
To aid synthesis and speedy design of arithmetic engines it is 
convenient to adopt some further simplifying conventions for their 
construction. 
Maximal Concurrency, Minimal Processor Latency 
Firstly the organisation of primitives within a processor 
should exhibit maximal concurrency. That is to say, a fully 
concurrent flow graph of the process is preferred over one which 
multiplexes elements within it. This organisation maximises indi-
vidual processor bandwidth and simplifies both the control struc-
ture and the problems of synthesis. Further, an attempt is made to 
minimise processor latency by the judicious choice of parameters, 
the order of operations, and by avoiding, wherever possible, the 
use of "latency-expensive" primitives. 
Synchronous Interfaces 
The organisation of the system architecture, as given in 
chapter 4, is simplified if all data entering and leaving each 
arithmetic processor is synchronous. This is the second convention:
to ensure that inputs to a processor and outputs from a processor
are synchronous, corresponding to a principle of time aligning at
inputs and outputs. This requires equal latency along every path
from input to output. Thus, processors are arranged so that
all inputs can be given local time tag zero and all outputs will 
have the same value of time tag. 
This principle means that additional compensating, equalising 
or synchronising delay may have to be inserted into signal paths at 
various nodes. Clearly it is desirable to make this overall pro-
cessor latency as small as possible, i.e., processors should incor-
porate only the unavoidable storage function associated with their 
pipelined computation and all other memory is implemented exter-
nally to the processor. Such a convention considerably simplifies 
synthesis at the possible expense of hardware optimisation. Gen-
erally the reduction possible when such a convention is not used is 
only marginal. Further, this convention can be relaxed 
subsequently, to incorporate the optimal organisation. 
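The amount of equalising delay implied by the synchronous interface
convention follows directly from the path latencies. The sketch below
is illustrative only; the path names and latency values are
hypothetical.

```python
def equalising_delays(path_latencies):
    """Return (padding, processor_latency) for a set of input-to-output paths.

    padding[path] is the extra delay, in bit times, needed so that every
    path has the latency of the longest one.
    """
    processor_latency = max(path_latencies.values())
    padding = {path: processor_latency - latency
               for path, latency in path_latencies.items()}
    return padding, processor_latency


padding, latency = equalising_delays({"x->y": 9, "s->y": 1, "c->y": 4})
print(padding, latency)
```

The longest path sets the overall processor latency, so shortening it
reduces both the latency and the total padding on the other paths.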
Parameterisation Using Symbolic Constants 
In general, each primitive has some parameters associated with 
it. The third convention is that in specifying processors as net 
lists of primitives these parameters should be retained as symbolic 
constants. 
This may be necessary when some of the parameters (e.g. system 
word length) have not yet been determined and others may only have 
tentative initial values. This convention allows rapid re-
implementation at any later stage if some of the parameters have to 
be changed. In particular it is useful to parameterise all syn-
chronising delays in terms of the parameters on which they depend. 
Systematic adherence to this principle can save time later, espe-
cially if it is necessary to compare implementations with differing 
coefficient quantisation, or system word-lengths. 
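The benefit of retaining parameters symbolically can be illustrated
as follows. The latency formulas in this sketch are placeholders
invented for the example; they are not the latencies of any FIRST
primitive. The point is only that each synchronising delay is written
in terms of the parameters on which it depends.

```python
def processor_parameters(coeff_bits):
    """Derive dependent constants from one symbolic parameter.

    The latency formulas below are placeholders for illustration only;
    they are NOT the latencies of any FIRST primitive.
    """
    multiply_latency = coeff_bits + 2      # hypothetical formula
    add_latency = 1                        # hypothetical value
    # Delay on the partial-sum path so that it meets the product at the adder.
    sync_delay = multiply_latency
    return {
        "COEFF_BITS": coeff_bits,
        "SYNC_DELAY": sync_delay,
        "PROCESSOR_LATENCY": multiply_latency + add_latency,
    }


# Re-implementation for a different coefficient quantisation is a
# single change of argument; every dependent delay follows.
for bits in (8, 12):
    print(processor_parameters(bits))
```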
Arithmetic Aspects of Computation Architectures. 
Once an appropriate computing scheme has been devised it is 
necessary to take account of its detailed arithmetic operation and 
the nature of the signal inputs. The context for this is set by 
the arithmetic representation, and by the application. It is 
necessary to establish, for each calculation in the computation 
scheme, the number of bits of quantisation needed and the interpre-
tation of binary point position, the format or scaling, required. 
The system application will dictate a desired dynamic range and 
scaling for each input signal. Further, a knowledge of the nature 
of the signals involved should give the statistical distribution of 
the signal within this range. From this information the initial 
range and resolution of signal values can be determined. Then, for 
each computation the required range, resolution and expected dis-
tribution of the results can be estimated. Methods for achieving 
this include theoretical considerations, statistical simulations, 
existing knowledge of similar systems, if such exist, and the 
method of first approximation and design iteration. In particular 
this analysis will cover internal bit growth, internally generated 
overflow conditions and possible scaling incompatibilities which 
can be caused by fixed length, fixed point arithmetic representa-
tion of signals and their different scalings. 
One outcome of this analysis will be a target word-length at 
each point in the system determined by the arithmetic considera-
tions. From these it is possible to see the constraints and tar-
gets set arithmetically on the system word-length. Later, as 
hardware processors are designed, the hardware constraints on sys-
tem word-length will emerge from the design process. From both 
sets of constraints and targets the system word-length can then be 
chosen with some degree of optimisation. 
It is essential to establish values for the range and resolu-
tion required for the evaluation of each iteration as well as the 
approximate signal distribution within this range. If this 
analysis is not done during this phase of design then, when it 
comes to processor implementation, certain assumptions will have to 
be made concerning them. 
Two's complement, fixed point format arithmetic introduces 
some unavoidable representation problems, as does any other system 
of number representation. Firstly, there is the problem of bit 
growth resulting from multiplication operations. This is usually 
handled by rounding or truncation; where necessary, however, it is 
possible to use multiple precision product output, for example for 
accumulation of low significance bit values. 
Secondly, there is the problem of bit growth arising from 
addition or subtraction. This can be handled in the following 
ways. 
The system word-length can be chosen to be long enough to con-
tain all bit growth. This can lead to inefficient use of dynamic 
range, but in bit serial systems it can also be useful where pro-
cessor latency constraints and algorithm type necessitate the use 
of a long word-length. A variant of this solution is to catch the 
"overflow" in a separate higher order word, thus using multiple 
precision representation in the parts of the system where needed. 
This can reduce system word-length and speed throughput. The over-
head of going to multiple precision arithmetic is small in bit 
serial systems, in contrast to bit parallel systems where the over-
head of multiple width parallel buses is usually unacceptable. 
Knowledge of signal distribution within representation range is of 
particular value here, as it can be used to keep the probability of 
overflow arbitrarily small, while at the same time minimising the 
range needed. 
An alternative approach is to detect overflow and either clamp 
or warn of its occurrence. This technique can introduce non-
linearities into the system, and this may or may not be acceptable 
for some applications. 
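The difference between unchecked overflow and the detect-and-clamp
approach can be shown at word level. This is a sketch with
hypothetical function names; it models n-bit two's complement
arithmetic on Python integers rather than any particular hardware
primitive.

```python
def wrap(value, n):
    """Interpret value modulo 2**n as an n-bit two's complement integer."""
    value &= (1 << n) - 1
    return value - (1 << n) if value >> (n - 1) else value


def clamp(value, n):
    """Saturate value to the n-bit two's complement range."""
    lowest, highest = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lowest, min(highest, value))


def add_with_overflow_flag(a, b, n):
    """Add two n-bit values, clamping the result and warning of overflow."""
    total = a + b
    overflow = total != wrap(total, n)
    return clamp(total, n), overflow


print(wrap(100 + 100, 8))                  # unchecked addition wraps round
print(add_with_overflow_flag(100, 100, 8)) # clamped, with a warning flag
```

Clamping keeps the sign of the result correct, but the transfer
characteristic is no longer linear once the flag has been raised.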
Another alternative is to ensure that overflow does not occur 
by scaling signal values internally, wherever necessary. A general 
theory for doing this in the design of digital filters is primarily
due to Jackson, and a good presentation can be found in chapter 5 
section 12 et seq., of [Rabiner and Gold]. 
The general effect of rounding and scaling is to lose precision
and introduce noise into the system. For this reason an
evaluation of any implementation should investigate its noise per-
formance, that is the comparative degradation in performance over 
the theoretical algorithm (infinite precision) due to the way the 
implementation uses rounding, truncation, and scaling. 
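As a small illustration of why the choice between rounding and
truncation matters, the sketch below measures the mean error
introduced when the low-order bits of a two's complement value are
discarded. The function names are hypothetical, and the error is
expressed in units of the retained least significant bit.

```python
def truncate(value, shift):
    """Discard `shift` low-order bits (rounds towards minus infinity)."""
    return value >> shift


def round_half_up(value, shift):
    """Add half an LSB of the result before discarding the low-order bits."""
    return (value + (1 << (shift - 1))) >> shift


def mean_error(quantise, shift, values):
    """Average of (quantised - exact), in units of the retained LSB."""
    errors = [quantise(v, shift) - v / (1 << shift) for v in values]
    return sum(errors) / len(errors)


values = range(-256, 256)
print(mean_error(truncate, 4, values))
print(mean_error(round_half_up, 4, values))
```

For uniformly distributed inputs truncation shows a bias of nearly
half an LSB, whereas rounding leaves only a small residual bias; a
fuller noise evaluation would also consider the error variance.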
A third problem is that of fixed point format compatibility. 
If two signal values are to be added or subtracted, then they must 
have the same scale factor for the operation to be valid. On the 
other hand multiplication can accommodate differently scaled data 
and coefficient. Thus within a system some signals, e.g., product 
outputs, may have incompatible scaling with respect to others, to 
which they must be added for example. This requires careful 
"accounting"  of scale factor, or format, at the input to each 
arithmetic operation, and may necessitate the introduction of scal-
ing adjustment at various points. 
Signal Scaling 
The fourth convention is to label each node with an associated
signal scale factor. The FIRST target hardware and architecture
are built around a fixed word-length and fixed-point format,
two's-complement number representation. Each node, at each
word time, has an associated binary integer value. This takes the 
form of a bit-serial stream, marked by an associated LSB pulse. 
This integer value has an associated notional scale factor giving 
the interpretation of the binary point position. Conventionally, 
at the system data inputs, the interpretation is that numbers are 
represented as purely fractional, lying between plus and minus one. 
As a result of arithmetic processing, however, and because other 
inputs often require scaling, the internal representations may 
require to take values in excess of this limited range. The arith-
metic format is expressed as the scale factor required to change 
the integer interpretation into the required interpretation, or in 
more extended form as a pair of numbers giving the powers of two 
represented by the most and least significant bits of the integer 
word (cp. Appendix II). This format then gives the assumed 
interpretation of the integer bit streams to be found on the node. 
The need for such a notation is that, given assumed input formats 
for a processor, it is necessary to keep track of the arithmetic 
transformation through the processor. For example, two signals 
with different scaling cannot correctly be added. Therefore if, as 
a result of internal operations, e.g., multiplication, two values 
of different format are to be added, one or both must first be 
scaled. 
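The extended form of the format, a pair of powers of two for the most
and least significant bits, can be tracked mechanically. The sketch
below is an assumption about how such bookkeeping might look; the
class and function names are hypothetical, and the convention of
Appendix II may differ in detail. The product rule assumes that the
full double-length two's complement product is retained.

```python
from typing import NamedTuple


class Format(NamedTuple):
    msb: int   # power of two represented by the most significant (sign) bit
    lsb: int   # power of two represented by the least significant bit


def full_product_format(a, b):
    """Format of the full-precision product of two two's complement words."""
    return Format(a.msb + b.msb + 1, a.lsb + b.lsb)


def add_format(a, b):
    """Addition is only valid between identically scaled operands."""
    if a != b:
        raise ValueError(f"incompatible formats {a} and {b}: scale one operand first")
    return a   # any bit growth must be accounted for separately


def right_shift_needed(src, dst):
    """Number of low-order bits to discard to move the LSB from src to dst."""
    return dst.lsb - src.lsb


fractional = Format(0, -7)                       # 8-bit word, purely fractional
product = full_product_format(fractional, fractional)
print(product, right_shift_needed(product, fractional))
```

Here the product has format (1, -14), so it cannot be added to a
signal of format (0, -7) until it has been rescaled by seven bit
positions and its extra sign bit has been dealt with.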
A further arithmetic consideration is that of the dynamic 
range (i.e., the range and resolution) and the distribution of sig-
nal values at each node of the system. This is largely a question 
of efficiency and use of representation. Again this is not an 
issue that can be resolved at the level of processor assembly but 
can only be dealt with at full system, or algorithm computation 
level. Determination of parameter values, of appropriate formats, 
of system word length, dynamic range, quantisation, etc., is only 
possible from overall system level considerations. This issue can 
be resolved by theoretical studies, simulation or by design itera-
tion. 
All of the above information should be available from the 
analysis of the arithmetic aspects of the signal processing algo-
rithm and its computation scheme. However, if this has not been 
done or is incomplete, then first approximations can be used, the 
system synthesised and extensively simulated and the design 
iterated. This constitutes an alternative to analytical or 
theoretical approaches. 
Hardware Accounting 
The fifth convention is that the specification of each proces-
sor should include a summary of both hardware and arithmetic 
statistics. 
This means that a process of hardware and arithmetic "account-
ing" should complete the design of processors. This accounting 
includes a summary of processor function, parameterisation, 
hardware latency, arithmetic formats (as associated with each of 
the inputs and outputs), arithmetic performance, including bit 
growth, non-linearities, dynamic range etc. This information is 
used in various stages of design synthesis: during processor 
design, network synthesis, external interfacing, for checking and 
simulation output data interpretation. Systematic documentation at 
an early stage eases later retrieval. 
5.3. Setting the System Word-Length 
Systems are composed of arithmetic processors and state 
memory. Since the memory has no effect on word growth within the 
system, it follows that it is possible to evaluate signal ranges 
and set the system word-length as soon as all of the arithmetic 
processors have been designed. To do this there are two aspects of 
arithmetic processing to be considered: the processing local to a 
single processor and the total processing throughout the machine, 
as a result of the interconnection of processors and the consequent 
data flow. This sub-section discusses the issues involved in 
determining the system word-length. Definitive values, or at 
worst, first approximation values for the remaining, independent 
symbolic constants should be available from the arithmetic study of 
algorithm, computing scheme and system input signals. 
5.3.1. Factors 
The factors which affect system word-length are: 
input quantisation; 
internal bit growth through arithmetic processing; 
special input formats required by primitives. 
Additionally and depending on the recurrence type, a further 
factor which may affect the system word-length is: 
minimum processor latency. 
In particular this is the case for fully parallel architec-
tures, multiplexed architectures implementing type B recurrences 
(see chapter 4), and serial architectures. 
5.3.2. General Objectives 
In bit serial implementations, the system word-length or the 
processor latency will determine the stage sub-cycle length and 
hence the processor throughput rate. It is desirable to make both 
as short as possible, in order to maximise hardware bandwidth. 
Further, unless there are overriding reasons for not doing so, sys-
tem synthesis is simplified by making them equal, or at least by 
making the processor latency an integer multiple of the system 
word-length. On the other hand, it is necessary to ensure that the 
system word-length is long enough to allow for the anticipated bit 
growth. These opposing requirements on system word-length are 
resolved by ensuring that efficient use is made of the available 
dynamic range. Once the architectural decisions have been made, 
the method for choosing system word-length is to determine firstly
the values of all the lower bounds on it. Then the least value 
satisfying these conditions can be chosen as the system word-
length. 
5.3.3. Lower Bounds on Word Length 
Let outputs be indexed (i,t) according to stage and sample, 
respectively. Let n denote the fixed, internal system word-length. 
Let Q denote the fixed number of bits of input quantisation. Input 
quantisation forms a lower bound on the system word-length, 
Q <= n. 
For algorithms where the output indexed (i-1,t) is required 
before the computation of output indexed (i,t) can commence, the 
processor latency, L, is a lower bound on the system word-length, 
GLB(L) <= L = n 
where GLB() stands for the greatest lower bound. 
Even where the restriction of making L=n is not necessary 
there are often reasons of convenience for choosing the system
word-length to have this value, or perhaps an integer multiple of 
it. 
5.3.4. Bit Growth 
The choice of system word-length should take account of internal
arithmetic growth and can be used to incorporate guard bits to
deal with overflow/underflow. Any system must either incorporate
adequate guard bits in its word-length or provide some other
mechanism for accommodating the effects of bit growth. Internal bit
growth occurs only in the arithmetic processors (e.g., adders or
subtractors); no further bit growth can occur in the memory,
multiplex or interconnect hardware. When considering internal bit
growth it is necessary to distinguish the following types of input
condition:
single pass inputs 
multiple pass inputs (fixed, finite repetition) 
feedback inputs (unlimited repetition) 
The classification of inputs in this way is derived from the 
system architecture. Single pass inputs involve only explicit bit 
growth within the processor. Multiple pass inputs occur when there 
is repeated identical processing, either within one processor, by 
multiplexed recirculation, or by passing processed values on to 
further identical processors. In this case the maximum possible 
bit growth is a fixed multiple of the explicit bit growth which 
occurs within a processor. Feedback inputs, related to time aver-
aged or accumulated data, involve potentially unlimited bit growth. 
For each input the associated bit growth must be determined. 
A deterministic approach or, more likely, a statistical one, based
on expected signal distributions, can be taken. Where the proces-
sor incorporates a scheme of scaling it may not be necessary to 
allow for bit growth. 
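For the deterministic case, worst-case bit growth is easily
calculated. The sketch below assumes that growth arises from adding
values of equal word-length and that a multiple pass input sees the
same explicit growth on every pass; the function names are
hypothetical.

```python
def growth_bits_for_sum(terms):
    """Worst-case extra bits needed to add `terms` words of equal length.

    Equal to ceil(log2(terms)), computed with integer arithmetic.
    """
    if terms < 1:
        raise ValueError("at least one term is required")
    return (terms - 1).bit_length()


def multiple_pass_growth(growth_per_pass, passes):
    """Maximum growth for a fixed, finite number of identical passes."""
    return growth_per_pass * passes


print(growth_bits_for_sum(5))          # guard bits for a five-term sum
print(multiple_pass_growth(1, 10))     # ten passes of one bit growth each
```

A statistical treatment would normally give a smaller allowance than
these worst-case figures, at the price of a small probability of
overflow. No such bound exists for feedback inputs, which need
scaling or overflow handling instead.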
5.3.5. Format Restrictions 
Certain primitives have input format restrictions. These must 
be observed for the hardware primitive to function correctly.
Usually these are in the form of conditions on the sign bit; for
example, the modified Booth multiplier, discussed in Appendix II,
requires two sign bit repetitions for correct operation.
For convenience, the system word-length is often chosen to be 
an even number. 
Summary 
Having determined its various lower bounds, the system word-
length may then be chosen as the least possible value to satisfy 
all. If a long total system word-length is necessary, it may be 
partitioned into multiple precision bytes of shorter length. If 
feasible, the byte can then be redefined as the internal system 
word-length and processing is in multiple precision where neces-
sary. The cost of using this facility can be an increase in system 
hardware latency, which is traded off against the higher bandwidth 
operation. 
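The selection procedure can be summarised as taking the largest of
the lower bounds. The sketch below makes the simplifying assumption
that input quantisation, guard bits for growth and any extra format
bits add together to form one bound, with the minimum processor
latency as a second bound where the recurrence type requires it. The
function and argument names are hypothetical.

```python
def choose_word_length(input_bits, growth_bits=0, format_bits=0,
                       min_latency=None, force_even=True):
    """Return the least word-length satisfying all supplied lower bounds."""
    bounds = [input_bits + growth_bits + format_bits]   # arithmetic bound
    if min_latency is not None:
        bounds.append(min_latency)                      # latency bound, if it applies
    n = max(bounds)
    if force_even and n % 2:
        n += 1                                          # even length for convenience
    return n


print(choose_word_length(12, growth_bits=3, format_bits=1))
print(choose_word_length(12, growth_bits=3, format_bits=1, min_latency=21))
```

In the second call the latency bound dominates, so the chosen
word-length carries more dynamic range than the arithmetic alone
requires.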
5.4. Implementing State Memory, Multiplexing 
and Net Synthesis. 
The number of physical processors that will be needed can now 
be fixed, by using the proposition of chapter 4, together with the 
value for the system word-length. This achieves the processor-to-
system bandwidth matching necessary to give minimum hardware but 
adequate performance. The chosen architecture determines the 
sequence of intermediate result computation and gives the multiplex 
memory loop structure (cp., chapter 4). 
Having designed the processors and determined the symbolic 
constant values, the system synthesis is accomplished by adding the 
required memory, multiplex switching and interconnect. This opera-
tion should be subject to the restrictions outlined in sections 5.5 
and 5.6 below, in order to secure initialisation and testability. 
As a first pass at interconnect synthesis, a useful convention 
is to proceed as follows. Multiplexers are connected at processor 
inputs wherever they are needed to break loops. Compensating 
delays are inserted into the other input paths, as necessary. 
Associate with each such processor the fixed amount of "virtual" 
state memory required. This is the total memory required to make 
up the correct loop lengths needed by the algorithm (function and 
architecture). Implement this partially by the inherent processor 
latency and partially by the addition of physical memory to make up 
the remaining virtual memory. Each processor output is then con-
nected to this physical state memory and the state memory outputs 
are connected to the multiplexer inputs. 
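The split between inherent processor latency and added physical
memory amounts to a subtraction. In the sketch below the loop length
is assumed to be a whole number of words and all quantities are in bit
times; the function name and the example values are hypothetical.

```python
def physical_state_memory(loop_words, word_length, processor_latency):
    """Bits of added memory needed to complete a multiplex memory loop.

    virtual memory = loop_words * word_length (total loop length);
    the processor's own latency supplies part of it.
    """
    virtual = loop_words * word_length
    physical = virtual - processor_latency
    if physical < 0:
        raise ValueError("processor latency exceeds the required loop length")
    return physical


# A loop of four 16-bit words around a processor of latency 16 bits.
print(physical_state_memory(loop_words=4, word_length=16, processor_latency=16))
```

A negative result indicates that the processor is too slow for the
intended loop, so either the word-length or the architecture has to be
revised.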
The interconnection of such processor, multiplexer, memory 
structures to each other is then achieved by tapping outputs from 
the processors, or from somewhere in the state memory so as to pass 
data between these units in the correct synchronisation. The 
interconnect scheme can then be evaluated for hardware efficiency 
and where expedient may be rearranged, in particular by relocating 
the multiplexer position in relation to the processor and memory
within each loop.
5.5. Initialisation 
The initial internal state of all primitive hardware on power 
up is indeterminate. Two types of start up cover all possible
system requirements. These are
(1) start up where pipelines fill with valid data by propagation
via normal system function and
(2) start up from given initial conditions.
The former assumes that direct flow will flush out the initial 
states and therefore assumes that there is no internal feedback; 
the latter requires a preliminary initialisation phase to set up 
the correct initial conditions. In hardware terms this is achieved 
by ensuring that all feedback loops in the network are broken by at 
least one multiplex primitive which allows access for external sig-
nals. The multiplex primitive is controlled by an "event" control. 
This event, when valid, signifies initialisation and, when not 
valid, signifies run mode. The event control signal is activated 
during initialisation, or reset, by the use of the corresponding
event request input to the control generator. During initialisa-
tion the multiplex primitive breaks the feedback loop by taking an 
externally supplied signal, instead of the internally fed-back sig-
nal. In run mode the multiplex primitive closes the internal loop, 
as required for system function. 
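The behaviour of a loop-breaking multiplex primitive can be illustrated with a small behavioural simulation. This is a sketch under stated assumptions, not a model of any FIRST primitive: the function name, the four-bit loop delay and the use of None to stand for an indeterminate power-up state are all hypothetical.

```python
from collections import deque


def run_loop(steps, init_event, external, loop_delay=4):
    """Simulate a delay loop broken by a two-way multiplexer.

    init_event[t] True  -> the multiplexer passes external[t] (initialisation)
    init_event[t] False -> the multiplexer passes the fed-back signal (run mode)
    """
    # Power-up contents are indeterminate; None marks an unknown value.
    fifo = deque([None] * loop_delay)
    outputs = []
    for t in range(steps):
        fed_back = fifo.popleft()
        selected = external[t] if init_event[t] else fed_back
        fifo.append(selected)
        outputs.append(selected)
    return outputs


# Initialise for one loop length, then close the loop.
event = [True] * 4 + [False] * 8
data = [1, 0, 1, 1] + [0] * 8
print(run_loop(12, event, data))
```

If the event is never asserted, the loop simply recirculates its unknown power-up contents, which is why every feedback loop needs at least one such point of external access.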
5.6. Design-for-Test 
It is now widely accepted that proper consideration must be 
given to testing as part of the design process. Without this con-
sideration, the parts so designed may well be difficult or even 
impossible to test in practice. Ideally a simple design-for-test 
procedure is needed; one that can be applied at design time. The 
test and testability issues are examined in detail in [2,3], and
the conclusion of these studies is that bit-serial systems are easy 
to test and that a simple design-for-test process does exist. In 
this section a summary of the main conclusions of the aforementioned
studies is given, as far as they relate to system synthesis.
This is done to ensure that the systems designed are fully and 
easily testable. 
Primitives can be classified as amenable to random test, or 
not. For test purposes the system network must be broken into 
chains of random test amenable primitives; where a chain is defined 
to be a net of primitives which has no loops, that is no feed-back 
paths. Any primitives which are not random test amenable should be 
isolated from the chains so that they can be tested separately. 
Largely, the configuration into chains will already have been 
arranged for the purposes of initialisation, and is achieved by the 
introduction of multiplex elements. Isolation of primitives which 
are not random test amenable can be done by the use of multiplex 
elements or also by other, more efficient mechanisms. This allows 
external access to inputs and outputs for separate testing and will 
prevent non-linear growth of testing. 
In summary, system synthesis should ensure that, in composing
the interconnection net, all closed loops are configured so
as to have an externally accessible point of entry (this is
generally done in any case for initialisation) and that all
non-random-test-amenable primitives are isolated.
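The loop condition in this summary can be checked mechanically on a description of the net. The sketch below is an illustration only: the dictionary representation of the net, the primitive names and the function name are hypothetical. It treats the net as a directed graph, removes the multiplex primitives (modelling them as switched to their external inputs) and reports whether any feedback loop remains.

```python
def unbroken_loops_exist(net, multiplexers):
    """Return True if some feedback loop contains no multiplexer.

    net maps each primitive name to the list of primitives it drives.
    """
    # Removing the multiplexers models switching them to external inputs.
    remaining = {
        n: [s for s in succ if s not in multiplexers]
        for n, succ in net.items()
        if n not in multiplexers
    }
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in remaining}

    def visit(n):
        colour[n] = GREY
        for s in remaining.get(n, []):
            state = colour.get(s, WHITE)
            if state == GREY:          # back edge: a loop has been found
                return True
            if state == WHITE and visit(s):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in remaining)


# Hypothetical net: a multiplexer feeds an adder, whose output recirculates
# through a FIFO back to the multiplexer.
net = {"mux": ["add"], "add": ["fifo"], "fifo": ["mux", "out"], "out": []}
print(unbroken_loops_exist(net, {"mux"}))   # loop is broken by the multiplexer
print(unbroken_loops_exist(net, set()))     # no multiplexer declared
```

A net for which the check returns False is made up entirely of chains in the sense defined above.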
5.7. Implementation of the Control Network
Reviewing progress with system design and implementation thus 
far, it is to be observed that the following items have been 
derived: a system architecture, values for all the system parame-
ters, designs of arithmetic processors and a primary network of 
signal nodes connecting processors, state FIFOs and multiplexing. 
Each primitive in this primary network has certain control require-
ments. So far these exist in the design as "hanging" or uncon-
nected control inputs and outputs at each primitive. These unde-
fined control connections, together with the time tags associated 
with the corresponding signal nodes, implicitly define a secondary 
network of control nodes. This control network has no nodes in 
common with the signal network (other than supply and ground) and 
may be viewed as an "orthogonal" design exercise. The primary 
(signal) and secondary (control) networks "contact" each other in 
the primitives. The next task in the design process is to find an 
efficient implementation of this control network. 




Definition of virtual control loops; 
Reduction by origin rotation; 
Physical realisation in minimum length tapped delay 
lines. 
Time tagging has already been dealt with in section 5.2.1; the
remainder of the steps are discussed next.
There is a hierarchy of control, corresponding to the 
incidence of the start of a word (LSB), the start of a multiplex 
sub-cycle, the start and finish of an initialisation or test cycle 
etc., at each node. Each level of control requires, firstly, a 
single source node from the control generator. Secondly, variously 
delayed versions of each level of control are required to synchron-
ise correctly with the primitives in the primary network of signal 
nodes. There are two possible approaches to the construction of 
these delay line networks: 
(1) The delay elements can be synthesised as a separate control
network, which supplies all control requirements but uses none of
the hardware already in the primary network.
(2) The delay network can be synthesised as two subnets. One, the
processor-associated-control, consists of hardware, either existing
or additional, within each processor. The other, the
extra-processor-control, supplies the remaining control requirements.
It may appear that there is no essential difference between 
these two approaches, since they relate, apparently, only to 
specification groupings which are flattened out by the compiler. 
This is not the case. The use of processor associated control can 
result in duplication of control delay hardware, but can also utilise
any "free" hardware in the place of extra delay hardware, as in
the case of the multiplier primitive, which automatically delivers 
a delayed version of its input control. Separate control may be 
minimum in terms of delay hardware but can suffer fan-out problems 
and may not utilise any "free" delay associated with processor ele-
ments to replace delay elements. Each system should be evaluated 
so as to establish which method of implementation gives the best 
trade-off between hardware, fan-out and design complexity, though 
the margin of difference is usually small. 
There are two techniques for reduction of the control network 
hardware which can be applied. 
Firstly, due to the periodic nature of the control cycle 
pulses, the tapped points of the delay line will be equivalent 
modulo the cycle-length. This means that the maximum length of 
control delay line necessary for any level of control is always 
less than the cycle length for that level. Thus LSB control can be 
reduced modulo the system word-length. Usually the total latency 
required for the multiplex cycle level of control is less than the 
multiplex cycle length, so that this reduction technique is not 
usually useful above the LSB level of control. 
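Modulo reduction can be sketched as follows. The function name and the tap values are hypothetical; the sketch only shows that taps which are congruent modulo the cycle length collapse onto one physical tap, so that the longest delay needed is less than the cycle length.

```python
def reduce_taps_modulo(tap_delays, cycle_length):
    """Return the distinct reduced tap positions and the delay line length.

    tap_delays are the delays (in bits) at which a periodic control pulse
    is required; cycle_length is the period of that pulse in bits.
    """
    reduced = sorted({d % cycle_length for d in tap_delays})
    line_length = max(reduced) if reduced else 0
    return reduced, line_length


# Hypothetical LSB control taps for a 16-bit system word-length.
print(reduce_taps_modulo([3, 19, 35, 40], 16))
```

In this assumed example a line of 40 bits is replaced by one of 8 bits with two taps.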
Secondly, again due to the periodic nature of the control 
cycles, each cycle associated delay line can be reduced by rotating 
the origin. Recall that the origin is defined at the control gen-
erator output and is associated with the start of a new cycle. 
Note also that the various levels of control must be in synchroni-
sation at all locations. Given any level of control and any ori-
gin, once the control delay line has been constructed it can be 
notionally extended into a closed control loop if a further end-
block of delay is added so as to synchronise the final output with 
the initial input. (Note that the control generator does not 
incorporate a facility for receiving the returned signal.) The 
total length of the loop is then determined by the total latency of 
the hardware being controlled and is an integer number of words, 
corresponding to the delay of a signal moving through the whole 
path from start to finish. Further, the loop is characterised by 
groups of delay (usually unequal) separated by the tap out points 
required by the primary signal network. This shall be called a 
virtual control loop as it is sensible to implement only the delay 
line portion up to the last tap position, and not implement the 
end-block. This is a useful design concept for the following rea-
son. By rotating the origin around the loop, the unimplemented 
end-block can also be made to rotate. In this way, a judicious 
choice of origin can ensure that the largest delay in the loop is 
the end-block and so ensure minimum hardware. 
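Origin rotation can be sketched in the same spirit. The function below is an illustration under stated assumptions: its name, the tap positions and the loop length are hypothetical, and the new origin is assumed to be placed on an existing tap. It finds the largest gap between successive taps around the closed loop, makes that gap the unimplemented end-block, and re-expresses the taps relative to the rotated origin.

```python
def rotate_origin(taps, loop_length):
    """Choose the origin so that the largest gap becomes the end-block.

    taps: delays (in bits) from the current origin at which control is needed.
    Returns (new_origin, taps_from_new_origin, implemented_line_length).
    """
    if not taps:
        raise ValueError("at least one tap is required")
    pts = sorted({t % loop_length for t in taps})
    # Gap that follows each tap, going round the closed loop.
    gaps = [
        (pts[(i + 1) % len(pts)] - p) % loop_length or loop_length
        for i, p in enumerate(pts)
    ]
    i_max = max(range(len(pts)), key=lambda i: gaps[i])
    new_origin = pts[(i_max + 1) % len(pts)]  # tap just after the largest gap
    new_taps = sorted((p - new_origin) % loop_length for p in pts)
    return new_origin, new_taps, loop_length - gaps[i_max]


# Hypothetical loop of 48 bits with four tap points.
print(rotate_origin([2, 5, 30, 33], 48))
```

With the original origin the line must reach the tap at 33 bits; after rotation to the tap at 30 bits the same four taps are served by 23 bits of delay.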
This technique is not always applicable, but it is particu-
larly useful for dealing with multiplexed architectures, which 
always have signal loops associated with the multiplex-cycle level 
of control. Application of rotation in this context reduces both 
the delay hardware required for the multiplex level of control and 
also that required for the associated initialisation event control 
net. 
In passing, it should be noted that neither modulo reduction 
nor origin rotation are techniques that can be applied to event 
control, since events are non-periodic and do not have a defined 
cycle length. Examples of the computation of time tags, the imple-
mentation of control networks and their reduction using the tech-
niques outlined here, are to be found amongst the case studies in 
5.8. Partitioning 
Until now no distinction has been made to define any physical 
boundaries in the system. Instead global system design has been 
developed, independent of such considerations. This is a viable 
approach for bit-serial systems, where partitioning is feasible in 
any number of ways due to the low overhead incurred when any signal 
path crosses a chip boundary. 
In system design, chips are specified as containing a hierar-
chy of operators and primitives. The use of a hierarchy of opera-
tors eases the design task by allowing nested groupings of primi-
tives to be specified, as appropriate to the system architecture. 
This feature serves design convenience and efficiency; the silicon
compiler flattens out any hierarchy, and assembles the chip from an 
ungrouped list of primitives. Partitioning consists of determining 
how many and what operators to place on each chip. 
Partitioning is governed firstly by the maximum feasible chip 
size available, which is, in turn, determined by process yield and 
packaging economics. Secondly it is influenced by the architec-
tural groupings of primitives. In the early stages of design, 
before system synthesis has been completed, (during processor 
design and as a preliminary to system synthesis), chip floor plans 
can be generated in order to obtain likely chip sizes. This will 
establish what limitations are imposed on the architectural group-
ings of operators and primitives by the chip size restrictions. 
This should be done at an early stage of design because partition-
ing involves a latency overhead. There is a latency of one bit 
involved in leaving one chip and entering another, due to the 
inter-chip pipelining latency. Thus, the location of a primitive 
on a chip sets it down inside communication boundaries. Primitives 
cannot be moved over these boundaries without a modification of 
latency, if synchronisation is to be preserved. Iteration on floor 
plans with incomplete system specifications allows efficient parti-
tioning prior to and during final synthesis. 
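The effect of partitioning on latency can be sketched by counting chip boundary crossings along a signal path. The path representation, the chip names and the latencies below are hypothetical; the only figure taken from the text is the one-bit latency per crossing.

```python
def path_latency(path, boundary_latency_bits=1):
    """Total latency in bits along a signal path.

    path: list of (chip_name, primitive_latency_bits) in signal order.
    One extra bit is added each time the signal leaves one chip and
    enters another.
    """
    total = sum(latency for _, latency in path)
    crossings = sum(1 for (a, _), (b, _) in zip(path, path[1:]) if a != b)
    return total + crossings * boundary_latency_bits


# The same three primitives under two hypothetical partitionings.
print(path_latency([("A", 3), ("A", 2), ("B", 4)]))
print(path_latency([("A", 3), ("B", 2), ("A", 4)]))
```

The second partitioning crosses a boundary twice, so its latency is one bit greater and compensating delay would have to be adjusted elsewhere to preserve synchronisation.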
System synthesis can be done with purely architectural group-
ings of chip contents. If this approach is taken without any idea 
of resultant chip size then the resultant system may contain over-
size chips, or chips with unsuitable aspect ratios. This may force 
a redesign. The redesign will require redistribution of primitives 
across chip boundaries and so entail changing both the primitives 
and their interconnection net. Even this is not usually a major 
task and it can normally be accomplished quickly. In the interest 
of modular design it is generally convenient to partition systems 
so as to include whole stages of recursive computation schemes 
within chip boundaries. This is exemplified by the case studies in 
[1].
5.9. Completion 
All the major design issues have now been addressed. A system 
design can be completed routinely in the following stages. Some of 
these can be initiated from the start of design and be run along-
side the stages described above; for example, during processor 
design it is recommended that coding and compilation of flow graphs 
accompany their formulation. 
Node Naming 
Nodes define the interconnection between primitives. For 
FIRST source descriptions nodes must be given names. These may be 
descriptive or may follow some name-number scheme. The compiler 
generates the name-associated numerical tags, and lists these in an 
output file, which is used for specifying input and output nodes 
when using the simulator. 
FIRST Source File Writing 
The FIRST source code which describes the system is written 
next. Typically this consists of up to one or two hundred lines of 
code. The process of writing this code uses the information given 
in appendices I & II and can be undertaken piecewise from the 
outset. 
Source File Compilation and Debugging 
The source file can be compiled incrementally as it is written,
to remove syntax faults using the compiler's diagnostics.
Final compilation gives the files required for chip generation and 
simulation. 
Floor Plan Monitoring 
Throughout writing the source description, and during incre-
mental compilation, chip floor plans should be inspected, as sug-
gested in section 5.8, in order to steer partitioning. 
Preparation of Simulation Input Test Files 
The approach to the generation of simulation input files 
requires the system designer to write some code to generate input 
files with the correct format. This can be done in any high level 
language with which the designer is familiar and for which there is 
a compiler. Signal generator programs can be relatively easily 
written and are short, even for the generation of the more sophis-
ticated types of signals. 
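A signal generator program of the kind described might look like the following sketch. The simulator's input file format is not given here, so the layout used (one two's complement word per line, least significant bit first) is purely an assumption for illustration, as are the function names and parameter values.

```python
import math


def to_bit_serial(value, word_length):
    """Two's complement bits of an integer, least significant bit first."""
    lo = -(1 << (word_length - 1))
    hi = (1 << (word_length - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError("value does not fit in the word length")
    return [(value >> i) & 1 for i in range(word_length)]


def sine_words(n_samples, period, amplitude, word_length):
    """Quantised sine samples, each serialised LSB first."""
    return [
        to_bit_serial(round(amplitude * math.sin(2 * math.pi * n / period)),
                      word_length)
        for n in range(n_samples)
    ]


def format_lines(words):
    """One word per line as a string of 0s and 1s (assumed layout)."""
    return ["".join(str(b) for b in w) for w in words]


if __name__ == "__main__":
    for line in format_lines(sine_words(8, 8, 100, 16)):
        print(line)
```

The serialisation step is the only part specific to bit-serial systems; other signal types can be produced by replacing the sample expression.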
Once the complete system has been coded, and the signal input 
files have been prepared, the system can be simulated. Firstly, 
this gives full checks on the node net specified. Secondly it 
gives full LSB level synchronisation checking. Thirdly it gives 
verification of functional behaviour. Finally simulation gives 
sets of test vectors which may be retained for later use. 
Design Modification 
As a result of simulation, the design may be modified in order 
to remove specification or synchronisation errors or system design 
inadequacies, which cause incorrect performance. Revisions of the 
source description, for any of these reasons, can be quickly com-
pleted if design follows the guidelines set down in this chapter. 
Mask Generation 
Full mask geometries are generated from the final, verified 
system chip descriptions. These are then ready for any post-design 
pre-fabrication processing which may be required by the mask-making 
house. 
System Interfaces and Packaging 
In order to progress from a working chip set to a self-
contained, working system, chip cards, supplies, clock generators, 
and external interfacing, including sample and hold circuits, ana-
log to digital converters, RAM and ROM interfaces, etc., have to be 
assembled. A variety of standard configurations can be designed 
for all of these. It is suggested that this is best done by using
a minimal number of standard parts. 
Documentation 
The final stage of any design is completion of the documenta- 
tion. One of the advantages of the design process as set out here 
is that it may be self-documenting, including a design "history",
which records revisions and evolutions. A complete set of documen-
tation for a final system would include 
(1) An application and system description.
(2) A definition of the algorithm.
(3) A definition of the computation architecture.
(4) A set of hardware flow graph block diagrams.
(5) The FIRST source description.
(6) A set of chip floor plans.
(7) Simulation diagrams.
(8) Simulation input and output (for test vectors).
(9) Bonding diagrams.
(10) Description of system interfaces.
(11) System user documentation.
All of this material is generated during design. In particu-
lar, items 5-8 are "unavoidable" documentation, in the sense that 
files containing this information are either required for or gen-
erated by the FIRST compiler and are in "human readable" form. If 
a flow graph graphics pre-processor software package is used then 
the flow graphs (item 4) also become a part of the unavoidable 
documentation. In this way most of the documentation becomes an 
integral part of the design activity and is not a chore to be com-
pleted afterwards. The revision and evolution "history" will con-
sist of old versions of files and cover items 5-8. If the system 
designer also generates concise verbal explanations to accompany 
these files and cover the remaining items in the list, then the 
design is almost wholly self documenting. 
Summary 
This chapter has proposed a specific methodology for system 
synthesis which may be followed to generate complex custom systems 
in short time scales. Further, it presents a framework from which 
further design automation might be researched and developed. The 
important areas in this field are:
(1) Data-base or file-system management and verification. (The
approach used has been file system based.)
(2) Design documentation and history.
(3) Extension of automation to the generation of architectures from
algorithm specification.
(4) Development of a technology independent circuit specification
language and layout generators for specific technologies.
(5) More general solutions to the problem of generating the physical
design, within any one technology, so as to allow a greater
flexibility of layout topology and a greater degree of layout
optimisation and accounting.
An integrated solution to these problems may perhaps lead to a 
second-generation compiler embodying algorithmic specification as 
input. 
References 
[1] P. B. Denyer and D. Renshaw, VLSI Signal Processing: A
Bit-Serial Approach, Addison-Wesley (to be published).
[2] A. F. Murray, "On the Effectiveness of Random Pattern Self
Test for Bit-Serial Signal Processors," IEEE Trans. Comp., (to
be published).
[3] A. F. Murray, P. B. Denyer, and D. Renshaw, "Self-Testing in
Bit-Serial Parts: High Coverage at Low Cost," Proc. IEEE
International Test Conf., pp. 260-268 (Philadelphia,
October 1983).
CHAPTER 6 
A SURVEY OF CASE STUDIES IN SYSTEM DESIGN USING FIRST 
6.1. Introduction 
In this chapter system design capture is evaluated. In this 
way, the design environment and the design methodology described in 
the earlier chapters can be assessed in terms of the range of sys-
tems that have been successfully designed. 
For each system the principal features of interest are com-
plexity, performance and design time, as well as partitioning and 
distribution of resources (processing, memory, control etc.). The 
full data associated with each system comprises: 
(1) a functional definition of the system
(2) a performance specification for the system
(3) a block diagram (flow graph) of the system
(4) a FIRST HDL file
(5) a set of chip mask geometry definitions
(6) a list of implemented system statistics
(7) a set of simulations
(8) a set of test pattern input and output vectors.
As detailed description of each system lies outwith the scope of 
this thesis, this information is not presented in full but key 
points are summarised. Where available, references are given to 
papers which cover further details. Appendix III gives a detailed 
presentation of items 1-4 & 6 for one system. 
The system designs were carried out by a number of research-
ers. P. B. Denyer and the author acted as consultants during the 
process of initial implementation in order to introduce each 
designer to the methods and tools. The extraction of system 
statistics and the survey presented in this chapter are the sole 
work of the author, as was the automation of data extraction. 
6.2. System Statistics 
The system statistics available are extracted as a by-product 
of compilation from the FIRST HDL files and include:
system wordlength 
system multiplex cycle periods 
primitive count 
transistor count 
distribution of resources (processing, memory, control etc.) 
pad count 
concurrency factor 
number of operations per second 
node count 
list of primitives 
list of chips 
chip sizes and aspect ratios 
HDL and IDC file sizes
sampling frequency 
number of chips 
design time 
preliminary system description comments in the HDL file.
The design time can be seen as a measure of the design capture and 
automation efficiency. Chip count and transistor count are crude 
measures of system complexity and partitioning. The concurrency 
factor is measured by the number of arithmetic processing 
primitives (including formatting primitives but not memory or con-
trol) and is of architectural significance, as are the relative 
proportions of processing, memory and control. Pad count can be 
used as the basis for measuring communication bandwidth. Often the 
balance between communication bandwidth and processing bandwidth 
is of architectural interest. The number of operations per second
is measured by the product of clock frequency and concurrency fac-
tor divided by system wordlength. In this respect all arithmetic 
operations have an equal weighting. A break down of the actual 
operations and their proportionate use can be deduced from the 
lists of primitive types. System wordlength, given the clock
frequency, provides a measure of word rate; multiplex cycle length
provides information on the architectural use of processors. Node 
count provides a raw measure of the network complexity. Chip sizes 
(given in lambda units) provide an indication of practicality of 
partitioning and of implementation cost. Finally, the size of HDL 
and IDC files can be used as a measure of source data compactness. 
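The derived measures described above can be written out as a short calculation. The operations-per-second expression follows the definition given in the text. The sampling frequency expression (word rate divided by the multiplex cycle period) is an assumption inferred from the case studies below, and the clock frequency and other figures are hypothetical.

```python
def word_rate_hz(clock_hz, word_length):
    """Words per second for a bit-serial system with one bit per clock."""
    return clock_hz / word_length


def operations_per_second(clock_hz, concurrency, word_length):
    """Clock frequency times concurrency factor, divided by wordlength."""
    return clock_hz * concurrency / word_length


def sampling_frequency_hz(clock_hz, word_length, multiplex_cycle_words):
    """Assumed relation: one sample per multiplex cycle."""
    return word_rate_hz(clock_hz, word_length) / multiplex_cycle_words


# Hypothetical system: 8 MHz clock, 16-bit words, 19 arithmetic primitives,
# no multiplexing (multiplex cycle period of 1 word).
print(word_rate_hz(8e6, 16))
print(operations_per_second(8e6, 19, 16))
print(sampling_frequency_hz(8e6, 16, 1))
```

Since all arithmetic operations carry equal weighting, the result is a count of primitive operations rather than a measure of useful algorithmic work.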
6.3. Pilot Studies 
Four systems were chosen as pilot studies. The pilot studies 
were completed during approximately four months at the end of 1981 
and beginning of 1982, while the design environment was being con-
structed and before it had been completed. Their purpose was to 
identify problems in using the HDL, establish that useful systems
could be specified using it and test and develop the methodology. 
The studies started from functional and performance specifications 
and ended with HDL descriptions and compiled chip floorplans. They
were conducted independently by P. B. Denyer and the author and 
then in collaboration. The systems chosen were a programmable 
transversal filter, an adaptive transversal filter, an adaptive 
lattice filter and a programmable filter based on cascading biqua-
dratic filter sections. 
As a second phase, guest systems-designers were invited to 
implement a system in collaboration with P. B. Denyer and the 
author for the purpose of evaluating the approach and facilities. 
This collaboration phase led to a reworking of the above mentioned
systems together with the implementation of other systems. Each 
system has been fully implemented as HDL, compiled and verified by 
system simulation. Further, fine-tuning of implementation trade-
offs by redesign at an architectural level has been demonstrated. 
The rapidity with which this can be done using FIRST opens up cus-
tom silicon design for the prototype bread-boarding of systems. In 
the following sections of this chapter each system is reviewed. 
6.4. Complex to Magnitude Converters 
This system was developed as a final year student project by 
D. J. Talbot. Two versions were implemented (3% and 1% accuracy 
algorithms). Full details can be found in [1]. The system 
wordlengths used were 14 and 16 bits respectively, giving single 
chips with transistor counts of 1.25K and 2.19K. The system was 
configured to operate with no multiplexing (multiplex cycle period 
of 1 word). Maximum system word rates and sampling frequencies 
were therefore equal, being 570 KHz and 500 KHz respectively. The
chip sizes were 1012x1103 and 1621x1209. Proportions of arith-
metic, memory, control and formatting were 82%, 7%, 11%, 0% and 
83%, 7%, 10%, 0% respectively. The 3% system operated at 4.6 MOPS 
and the 1% system at 9.5 MOPS. Design time is estimated at 1 man-week,
and the system specification files in FIRST HDL comprised 70
and 80 lines of code. Simulation confirmed performance to better 
than 3% and 1% accuracy for each respective system. The 3% version 
was submitted for fabrication. 
6.5. Finite Impulse Response Filters 
This system was developed by S. G. Smith. Two versions were 
implemented, a real arithmetic 512 point version and a complex 
arithmetic 128 point version (2 channels). The system wordlengths 
used were both 14 bits. Chip counts were 10 (3 designs) and 18 (3 
designs) respectively, with transistor counts of 240K and 162K. 
The systems were configured to operate with multiplexing (multiplex
cycle periods of 32 and 8 words). Maximum system word rates were
both 517 KHz and maximum sampling frequencies were 40 KHz and 71 
KHz. Proportions of arithmetic, memory, control and formatting 
were 11.6%, 68.6%, 19.2%, 0.6% and 56.7%, 31.6%, 10%, 1.7% respec-
tively. The real arithmetic system operated at 93.7 MOPS and the 
complex arithmetic system at 246.9 MOPS. Design time is estimated 
at 1 man-week (including both versions), and the system specifica-
tion files in FIRST HDL comprised 140 and 200 lines of code. It 
should be noted that each design was extensible by cascading addi-
tional chips. The system was simulated but not prepared for fabri-
cation. 
6.6. Adaptive Lattice Filter 
This system was developed by M. J. Rutter, D. Renshaw and P. 
B. Denyer. Details can be found in [2,3]. Only one version was 
implemented. This was a custom IC re-implementation of an existing 
TTL design. The system wordlength used was 26 bits. The system 
had a total chip count of 5 (5 designs) and a total transistor
count of 30K. The system was configured to operate with a multi-
plex cycle of period 16 words. Maximum system word rate was 307 
KHz and maximum sampling frequency was 19.2 KHz. Chip sizes were 
2020x2086, 2174x2129, 1768x2093, 1348x1845 and 2076x1955. Propor-
tions of arithmetic, memory, control and formatting were 43%, 29%, 
13%, 15%. Operating performance was 10.2 MOPS. Design time was 1 
man-month and the system specification files in FIRST HDL comprised 
500 lines of code. Algorithmically, this was the most complicated 
system implemented. Simulations were completed but mask geometries 
were not prepared for fabrication. 
6.7. Echo Canceller 
This system was developed by S. G. Smith, D. Renshaw and P. B. 
Denyer. Details can be found in Appendix III. Four versions were 
implemented. The system wordlength used was 18 bits. The system 
had a total chip count of 18 (3 designs) and a total transistor 
count of 252K. The system was configured to operate with a multi-
plex cycle of period 39 words. Maximum system word rate was 444 
KHz and maximum sampling frequency was 11.4 KHz. Chip sizes were 
2167x1561, 2111x1867 and 1635x1103. Proportions of arithmetic, 
memory, control and formatting were 13%, 84%, 1%, 2%. Operating 
performance was between 54 MOPS and 59 MOPS. Design time was 1 
man-month and the system specification files in FIRST HDL comprised 
between 200 and 250 lines of code. Algorithmically, this system 
required the use of idling cycles. Simulations were completed but 
mask geometries were not prepared for fabrication. 
6.8. Wave Digital Filters 
This system was developed by N. Petrie, with assistance from 
D. Renshaw and S. G. Smith. It implements wave digital filter ele-
ments in the first version as separate parallel and serial three 
ports and in the second as a single, general purpose component, the 
universal adaptor, which can be configured to serve most functions 
required in building wave digital filters. Details can be found in 
[4,5,6]. Two versions were implemented. The system wordlengths 
used were 20 and 28 bits. The first system required two chip
designs, one each for a serial and parallel three port adaptor. The
second system was a single chip system. In both a control chip was 
used only for simulation purposes. Transistor counts were 9K and 
7.5K respectively. The system was configured to operate with a 
multiplex cycle of period 2 words. Maximum system word rates were
400 KHz & 285 KHz and maximum sampling frequencies were 200 KHz & 
142 KHz. Chip sizes were 1880x1341 and 2629x1596. Proportions of 
arithmetic, memory, control and formatting were 82%, 13%, 5%, 0% 
and 86%, 3%, 12%, 0%. Operating performance was 13.2 MOPS and 13.7 
MOPS. Design time was 1 man-week and the system specification 
files in FIRST HDL comprised 180 and 100 lines of code. Simulations
were completed but mask geometries were not prepared for
fabrication. 
6.9. Systolic Discrete Fourier Transform Engine 
This system was developed by G. H. Allen, with assistance from 
D. Renshaw and S. G. Smith. It implements two versions of a sys-
tolic real arithmetic discrete Fourier transform machine. Details 
can be found in [7].  Two versions were implemented, one based on a 
four multiplier butterfly, and the other on a reduced three multi-
plier structure. System wordlengths were 24 and 28 bits. Both 
systems required 34 chips (3 designs), to implement a 32 point 
transform, unmultiplexed, or the same number of chips multiplexed 
to implement 32 stages in order to evaluate a 1024 point transform. 
Transistor counts were 151K and 140K respectively. Maximum system 
word rates were 333 KHz & 285 KHz and the time required to compute 
one transform was 0.96 milliseconds & 1.12 milliseconds for the 
unmultiplexed systems. Chip sizes were 537x1185, 963x1280,
2125x2015 and 537x1185, 963x1340, 2528x1580. Proportions of
arithmetic, memory, control and formatting were 82%, 14%, 4%, 0% 
and 77%, 17%, 6%, 0%. Operating performance was 88.0 MOPS and 112
MOPS. Design time was 1 man-week and the system specification 
files in FIRST HDL comprised 140 and 160 lines of code. Simulations
were completed and mask geometries were prepared for fabrication
for the four multiplier version. A plot is shown in Figure 6.1.
6.10. Lossless Discrete Integrator Recursive Ladder Filters 
This system was developed by L. E. Turner, with assistance 
from D. Renshaw and S. G. Smith. It implements 3-rd and 5-th order 
ladder filter sections based on implementation using the Lossless 
Discrete Integrator. Details can be found in [8].  The system 
wordlength was 22 bits. Each was a single chip filter system. 
Transistor counts were 4K and 7.5K respectively. The system was 
configured to operate with a multiplex cycle of period 2 words.
Maximum system word rates were 363 KHz and maximum sampling fre-
quencies were 181 KHz. Chip sizes were 2881x1425 and 3105x1484. 
Proportions of arithmetic, memory, control and formatting were 
72.5%, 7%, 20.5%, 0% for both. Operating performance was 7.6 MOPS 
and 15.3 MOPS. Design time was 1 man-month and the system 
specification files in FIRST HDL each comprised 160 lines of code. 
Simulations were completed and mask geometries were prepared for 
fabrication of the 3rd order filter section. A plot is shown in 
Figure 6.2. During this project a preprocessor was written, which 
accepts an arbitrary filter specification from order 2 to order 10 
and generates the required FIRST HDL. This demonstrated that 
special purpose, very high level preprocessors are practical for 
special applications. 
6.11. Fourier Transform Engines 
These systems were developed by S. G. Smith, as a practical 
learning exercise for studying the FFT and DFT. Four versions were 
implemented. These were a 16-point, parallel, constant-geometry, 
decimation in time, radix 2 FFT system, using complex arithmetic; a 
64-point, pipelined, Cooley-Tukey, decimation in frequency, radix 4 
FFT system, using complex arithmetic; a 3-point radix 3 DFT system 
based on [9]; and a 6-point radix 6 DFT system, based on [10], 
using a complex number arithmetic based on the cube roots of unity. 
The system wordlengths used were 24, 20 and 20 bits. The parallel 
system was a practical design exercise, the others were simulation 
exercises, where partitioning into sensible chip sizes was ignored. 
Transistor counts were 160K, 87K, 15K and 35K, respectively. For 
the parallel engine the word rate was 5.3 MHz and the time for one 
transform was 0.1 microseconds. For the pipelined engine the word 
rate was 1.6 MHz and the time for one complete transform was 10 
microseconds. For the parallel system 43 chips (6 designs) were 















































































































are that, if design capture and mask geometry generation are sup-
ported by rapid, cheap access to silicon fabrication using a proven 
cell library, then silicon bread-boarding and prototyping of com-
plex digital signal processing systems is a practical proposition. 
Furthermore, design and redesign at an architectural level, with 
immediate feed-back on the implementation costs and implications 
has been demonstrated. Full advantage can be taken of freedom 
from lengthy delays in obtaining such feed-back on design, together 
with the elimination of unnecessary and excessive detail at the 
transistor level, which have been features of previous methods of 
custom design. It has been demonstrated that these tools are 
capable of design capture for systems whose complexity requires 
in excess of a quarter of a million transistors. This success 
suggests that the one million transistor system on a chip is a 
practical possibility from the point of view of design. 
1.  D. J. Talbot, "An Integrated Complex to Magnitude Converter," 
    B.Sc. Project Report, University of Edinburgh, Department of 
    Electrical Engineering (May 1982). 
2.  M. J. Rutter, Ph.D. Thesis, University of Edinburgh, Department 
    of Electrical Engineering (Sept. 1983). 
3.  M. J. Rutter, P. M. Grant, D. Renshaw, and P. B. Denyer, 
    "Design and Realisation of Adaptive Lattice Filters," Proc. 
    IEEE ICASSP83, pp. 21 - 24 (Boston, April 1983). 
4.  H. M. Reekie, N. Petrie, and J. Mavor, "An Automated Design 
    Procedure for Frequency Selective Wave Filters," SARAGA 
    Colloquium on Electronic Filters, IEE (March 1983). 
5.  H. M. Reekie, J. Mavor, N. Petrie, J. Mavor, and P. B. Denyer, 
    "An Automated Design Procedure for Frequency Selective Wave 
    Filters," ISCAS 83 (1983). 
6.  H. M. Reekie, N. Petrie, J. Mavor, P. B. Denyer, and C. H. 
    Lau, "The Design and Implementation of Digital Wave Filters 
    using Universal Adaptor Structures," CRISP, IEE, Special 
    Issue (in preparation). 
7.  G. H. Allen, P. B. Denyer, and D. Renshaw, "A Bit-Serial 
    Linear Array DFT," Proc. IEEE ICASSP84, pp. 41A.1.1 - 
    41A.1.4 (San Diego, March 1984). 
8.  L. E. Turner, P. B. Denyer, and D. Renshaw, "A Bit Serial LDI 
    Recursive Digital Filter," Proc. IEEE ICASSP84, pp. 41A.3.1 
    - 41A.3.4 (San Diego, March 1984). 
9.  E. Dubois and A. N. Venetsanopoulos, "A New Algorithm for the 
    Radix-3 FFT," IEEE Trans. ASSP Vol. ASSP-26 pp. 222 - 225 
    (June 1978). 
10. S. Prakesh and V. V. Rao, "A New Radix-6 FFT Algorithm," IEEE 
    Trans. ASSP Vol. ASSP-29 pp. 939 - 941 (August 1981). 
CHAPTER 7 
HARDWARE FABRICATION AND TEST 
7.1. Introduction 
This chapter describes some hardware results of design. The 
exercise of fabricating and testing designs was organised to pro-
vide redesign information and verification for the cell library as 
well as an evaluation of the design environment which has been 
developed and set out in the previous chapters. Originally a pro-
gramme of two or three fabrication runs per year was envisaged. 
These were planned to take place during the second and third 
years of the project. The first two or three runs were intended 
primarily as cell library test runs for redesign and verification 
purposes. Subsequent runs were seen as a vehicle for prototype 
system evaluation. Problems were encountered with the fabrication 
support substructure, however. Thus only the results of the first 
fabrication run are described here. 
7.2. FIRST Hardware Assumptions 
FIRST hardware for systems is designed on three main assump-
tions. Firstly it is assumed that two phase non-overlapping clock-
ing ensures valid bit-serial data communication. Secondly it is 
assumed that primitive hardware implements the primitive functions 
that are defined in the simulator. Thirdly it is assumed that the 
arbitrary composition of functioning cells which have functioning 
interfaces gives rise to a functioning design. This third assump-
tion will be referred to as the composition principle. 
The first assumption is well established subject to the limi-
tations of clock distribution [1].  The second assumption requires 
verification. This is the objective of the primitive design test-
ing programme described in this chapter. The third assumption is 
apparently reasonable. In practice, however, it is true only sub-
ject to the elimination of unpredictable interactions at interfaces 
and to their adequate tolerancing. The assumption, however, 
governs composition and communication in FIRST compiled layout in a 
fundamental way. Thus, in cell design and testing, care must be 
taken to ensure that nothing done there causes the composition 
principle to be violated. 
7.3. Verification of Primitives 
Both system assembly from primitives and primitive assembly 
from leaf cells is carried out on the above mentioned principle of 
composition. The design of primitives must therefore take account 
of this. Design is conducted so as to minimise indeterminacies 
both at interfaces and internally, across the range of processing 
outputs and operating conditions. The object of testing is to 
demonstrate that this has been achieved. The degree to which it 
can be achieved is largely a function of cost and time. In prac-
tice the best that can be done is to reduce the probability of 
failure and construct specific-case, existence proofs. It is pos-
sible that the assumption will be refuted for some cases. In such 
cases it is necessary to be able to determine the causes and 
mechanisms of failure, so that testing can yield diagnostic infor-
mation for redesign. 
It is necessary to define a test strategy. There are two main 
alternatives. Firstly each leaf cell circuit can be fabricated and 
tested for functionality. Secondly primitives can be constructed, 
by the compiler, choosing parameter values so as to create a test 
cover. 
The first strategy is confronted by several problems. Each 
leaf cell has to be buffered (in order for it to drive an output) 
and therefore the interface cannot be observed directly. The leaf 
cell library contains approximately 100 elements each with, on 
average, around 3 inputs and 2 outputs (excluding the supplies); 
thus any testing will be severely limited by the consequent input 
and output bottle-necks. In addition, the compiler assembly and 
composition is neither used nor tested in such a strategy. 
The second strategy has the advantage of testing the compiler 
assembly and composition output. It also tests leaf-cell assem-
blies interfacing in the environment in which they are designed to 
function. The problem associated with this strategy is how to 
define an adequate and cost effective test cover. A full test of 
all primitives with all possible parameter values is prohibitive. 
A reduced set immediately gives less than 100% cover. However, a 
cost effective reduced set can be constructed by a judicious choice 
of the parameter value set for each primitive. This, if properly 
chosen, will give only very slightly reduced probability of 
undetected primitive failure. The cover chosen tests all leaf 
cells, and all unrepeated interface combinations for each primitive 
in the set. Additionally this test organisation leads to an 
affordable primitive test programme. 
7.4. Testing 
The following modes of testing are normally undertaken in con-






In the context of the development of FIRST, however, only func-
tional testing can be addressed adequately. This is so for the 
following reasons. Design for performance is done on the basis of 
full, guard-banded, electrical parameters. These are supplied by 
the processing house. SPICE [2] models are then tailored on an 
empirical basis to match processing. The processing offered 
through SERC is not supported in this way. There is no guarantee 
to meet the provisional values given. Further, process characteri-
sation information does not include all the parameters. Thus 
"back-substitution" of actual values in place of assumed values and 
resimulation, though time-consuming, is not even possible. For 
these reasons no particular effort has been expended on trying to 
characterise performance. 
Yield testing is an important exercise in developing a good 
cell library. To include this in the test programme is, however, 
not possible because it presupposes adequate yield characterisation 
of a stable process. Without this, it is not possible to charac-
terise and separate design-yield factors from process-yield fac-
tors. In the areas of yield and performance there is no reference 
from which to calibrate. Additionally, much larger sample sizes 
would be needed for significant results. 
The problems associated with performance testing are relevant 
to parametric testing. Although parametric testing could have been 
undertaken it is argued that, against the time and effort spent, 
this would only be of value if the process were fully character-
ised. 
Reliability testing requires resources which exceed those 
available to both the research programme and the Department. For 
this reason it was not undertaken. 
Thus the objective of the test programme is to cover func-
tional testing of hardware in order to verify the primitive cell 
library as assembled by the compiler. Included in this there must 
be a component capable of deriving diagnostic information about the 
causes of failure, where faulty cells are discovered. The purpose 
of this is to allow redesign to be undertaken. Yield and perfor-
mance measurements cannot be interpreted for the reasons already 
given above. 
Test Pattern Generation 
Test pattern generation is simply accomplished as a result of 
the design-for-test methodology which has been developed. Short 
pseudo-random input sequences are generated, by a simple computer 
program. These constitute the input test vectors. They are input, 
with the test chip specification, to the simulator. The resulting 
output patterns constitute the expected test vector outputs. Thus 
test pattern generation is automated simply as part of the design 
process. 
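
The flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used in the project: the wordlength, the seeds and the function `reference_add` (which stands in for the FIRST simulator applied to a single adder) are assumptions made for the example.

```python
import random

WORDLENGTH = 8  # hypothetical wordlength, chosen only for illustration


def make_test_vectors(n_words, seed=1):
    """Generate a short, repeatable pseudo-random sequence of input words."""
    rng = random.Random(seed)
    return [rng.randrange(1 << WORDLENGTH) for _ in range(n_words)]


def to_serial(word):
    """Expand a word into its bit-serial form, least significant bit first."""
    return [(word >> k) & 1 for k in range(WORDLENGTH)]


def reference_add(a, b):
    """Stand-in for the simulator: addition modulo 2**WORDLENGTH."""
    return (a + b) % (1 << WORDLENGTH)


def expected_outputs(a_words, b_words):
    """Run the input vectors through the reference model."""
    return [reference_add(a, b) for a, b in zip(a_words, b_words)]


def compare(expected, measured):
    """Return the indices of the words at which the device disagrees."""
    return [i for i, (e, m) in enumerate(zip(expected, measured)) if e != m]


if __name__ == "__main__":
    a_in = make_test_vectors(16, seed=1)
    b_in = make_test_vectors(16, seed=2)
    golden = expected_outputs(a_in, b_in)
    # In practice `measured` would come from the test station.
    measured = list(golden)
    print("mismatching words:", compare(golden, measured))
```

Because the same seeded sequence drives both the reference model and the device under test, the expected outputs never have to be written by hand.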
7.5. Organisation of Primitive Design and Test 
It was decided that testing should be organised as follows. 
The fundamental set of primitives was designed as quickly as 
possible at the start of the project and was compiled into a covering 
set of test chips. It was necessary to do this to allow enough 
time for fabrication, testing and redesign cycles within the time 
scale of the project. 
Initially the design environment for constructing integrated 
circuits included only a graphics editor, geometric design rule 
checker, mask post-processors and SPICE. During the course of the 
project, a variety of switch level and timing simulators [3,4] and 
latterly a circuit extractor [5] were added. The availability of 
such tools, together with automatic translation, from the outset 
would have given a very high probability of success for first time 
cell designs. However, since a substantial part of the project had 
to run without such tools, it was reckoned that at least one 
redesign cycle would have to be incorporated, as a necessary part 
of the process of obtaining a set of fully functioning primitives. 
Thus, at the outset, a programme to design, assemble and test 
an initial, reduced set of primitives was organised as phase 1 of 
commissioning the primitive library. Phase 2 was planned to cover 
redesign together with trial system design. In addition, a period 
of evaluation was planned for the end of each phase. The remainder 
of this chapter outlines the activities and results obtained during 
these phases of design and testing. 
7.6. Phase 1 














The primitives are grouped according to designer, the author being 
responsible for the first group and assisting in supervising stu-
dents who designed the other groups. Within groupings the primi-
tives are ordered in ascending order of complexity. 
On conclusion of leaf cell design the primitives were 
incorporated into the compiler. This involved setting up a cell library 
and writing, testing and debugging composition routines. 
Thereafter the test cover was defined and test chips generated. 
The test chips were documented and submitted for fabrication. On 
return they were tested. 
The pads and input output buffers were incorporated into 
earlier designs and were already known to function. The control 
generator was designed and tested by Neil Henderson [6]. Worddelay 
was designed by H. S. C. Wallace [7]. Dshift, absolute and order 
were designed by D. J. Talbot [8]. The remaining primitives were 
designed by the author who also wrote, tested and debugged all the 
composition routines. Checking, correction, chip generation, 
submission and testing as well as redesign were also done by the 
author. 
Pre-submission checking revealed a number of layout and func-
tional faults in the leaf cell designs for dshift, absolute and 
order. These were remedied and the modified circuit logic checked 
and used in place of the original designs. Between design submis-
sion and the return of fabricated parts the first switch level 
simulator became available. An extensive set of switch level 
simulations of the multiplier was done in connection with 
investigations into testability and self test [9,10]. These revealed a fault 
in the multiplier composition and relative output latencies. This 
knowledge led to a modification of the approach to testing the 
multiplier. The known faulty configuration was tested against the 
fabricated part to check for any further undetected errors. 
7.7. Phase 1 Designs 











Further to this, a simple trial system using instances of some 
of these was also included for test. 
complex to magnitude 3 percent 
Definition of Test Cover for Each Primitive 
A test cover of a primitive is defined to be a list of 
instances of the primitive with the primitive parameters so chosen 
that all leaf cell components occur at least once. Clearly this 
does not test all possible instances of a primitive, which is not 
practically feasible, but it does ensure that all leaf cells for 
the primitive are tested for correct function, and that they com-
bine to give correct primitive function. Further, because of the 
highly constrained way in which leaf cells are assembled and com-
municate, it is argued that this organisation is justified. 
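
The definition above amounts to a covering problem: choose parameter values so that every leaf cell is instantiated at least once. A minimal sketch of one way to choose them is given below, using a greedy selection. It is offered only as an illustration; the thesis does not state how the parameter values were picked, and the mapping from parameter values to leaf cells (`example_primitive`, with its cell names) is entirely hypothetical.

```python
def greedy_test_cover(leaf_cells_for):
    """Pick parameter values until every leaf cell occurs at least once.

    leaf_cells_for maps a parameter value to the set of leaf cells that
    the compiler would instantiate for that value (hypothetical data).
    """
    uncovered = set().union(*leaf_cells_for.values())
    cover = []
    while uncovered:
        # Take the value that introduces the most not-yet-covered cells;
        # ties go to the smallest parameter value.
        best = max(sorted(leaf_cells_for),
                   key=lambda p: len(leaf_cells_for[p] & uncovered))
        cover.append(best)
        uncovered -= leaf_cells_for[best]
    return cover


# Hypothetical primitive: which leaf cells each parameter value uses.
example_primitive = {
    4: {"head", "body_even", "tail"},
    5: {"head", "body_odd", "tail"},
    6: {"head", "body_even", "tail"},
    7: {"head", "body_odd", "tail_sat"},
}

if __name__ == "__main__":
    print("parameter values in the cover:",
          greedy_test_cover(example_primitive))
```

A greedy choice does not guarantee the smallest possible cover, but it keeps the number of instances, and hence the number of test chips, low.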
Table 7.1 lists, against each primitive, the parameterisations 
necessary to give a test cover for each. 
Primitives 
ABSOLUTE[i] 	i=6,8 














The organisation of layout was governed by the following cri-
teria. Each operator was configured to be independently testable, 
including an independent power supply. Common inputs were used to 
minimise pin count, maximise the number of primitives per chip and 
minimise the total number of chips needed. Chip sizes were 
arranged to fit in 5.08 mm standard EMF frames [11].  The test 
chip floor plan is compiler generated, in all cases. 






CHIP3   multiplex[i]     1 <= i <= 12
CHIP4   add[i]           2 <= i <= 6
CHIP5   add[i]           7 <= i <= 9
        bitdelay[i]      1 <= i <= 5
        absolute[i]      i = 6,8
CHIP6   order[i]         i = 4,6,8
        scalefixed[i]    i = 1,2
CHIP7   scalefixed[i]    i = 3,4
        worddelay[i,j]   (i,j) = (15,8),(16,8),(17,8),(18,8)
CHIP8   worddelay[i,j]   (i,j) = (19,8),(16,9),(18,9),(17,10)




For reasons of yield four chips per mask were used. CHIP0 - 
CHIP3 were submitted on mask-set 1, and CHIP4 - CHIP7 on mask-set 
2. CHIP8 and CHIP9 will have to be submitted together with other 
designs at a later phase of primitive verification. 
The details of chip description source files, sizes, floor 
plans, bonding diagrams and pin out details can be found in [12]. 
7.8. Phase 1 Testing 
During design the test equipment had to be requisitioned and 
commissioned so that it would be available when needed. At the 
time, the Tektronix DAS 9100 series test station [13] was assessed 
as being the most suitable within the available budget. All func-
tional testing and the limited performance and yield measurements 
were done using this equipment. Test jigs and miscellaneous items 
were also designed, constructed or earmarked for use. 
The phase 1 designs were completed and submitted in January 
1983. Devices EU260-EU263 were received in August 1983 and devices 
EU264-EU267 in September 1983. This over-long delay (caused by an 
upgrade-shutdown of the EMF facility) unfortunately reduced the 
turnaround from a hoped-for, two to three design cycles per year to 
just one. 
Functional testing was organised in progressive stages. Ini-
tial testing was done at 1 MHz with reduced test patterns. The 
initial test was, then, repeated in steps of increasing frequency 
until failure. If these tests were successful secondary tests 
using the full test patterns were carried out. If unsuccessful 
then reconfiguration for diagnostic debugging was attempted. 
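
The staged procedure can be summarised as a short control loop. The sketch below is an illustration under stated assumptions, not the procedure as run on the test station: `run_test` is a hypothetical hook that applies a pattern set at a given clock frequency and reports pass or fail, and the step size, the upper frequency limit and the frequency used for the full-pattern test are not given in the text and are chosen arbitrarily.

```python
def staged_functional_test(run_test, reduced, full,
                           start_hz=1_000_000,
                           step_hz=1_000_000,    # assumed step size
                           limit_hz=10_000_000): # assumed upper limit
    """Follow the staged test order described in the text.

    run_test(freq_hz, patterns) -> bool is a hypothetical tester hook.
    """
    # Stage 1: initial test at 1 MHz with the reduced patterns.
    if not run_test(start_hz, reduced):
        return {"stage": "diagnostic", "max_hz": None}

    # Stage 2: repeat in steps of increasing frequency until failure.
    max_hz = start_hz
    freq = start_hz + step_hz
    while freq <= limit_hz and run_test(freq, reduced):
        max_hz = freq
        freq += step_hz

    # Stage 3: secondary test with the full patterns
    # (run here at the starting frequency, which is an assumption).
    if run_test(start_hz, full):
        return {"stage": "passed", "max_hz": max_hz}
    return {"stage": "diagnostic", "max_hz": max_hz}


if __name__ == "__main__":
    # Fake device that works up to 3 MHz, used only to exercise the loop.
    fake_device = lambda freq_hz, patterns: freq_hz <= 3_000_000
    print(staged_functional_test(fake_device, reduced=[0, 1], full=[0, 1, 2]))
```

A result of "diagnostic" corresponds to the case where reconfiguration for diagnostic debugging would be attempted.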
7.9. Phase 1 test results 
EU260 - EU263 
Five bonded samples of each chip design were received for 
testing. Initial testing provided the following results. The 
adder on chip EU261 failed on initial test, giving unstable outputs 
on all samples. The subtracter on chip EU262 also failed, giving 
outputs similar to those of the adder, again for all samples. All 
the multiplexers on chip EU263 gave consistent zero outputs for all 
samples, independent of test input. Of the multipliers on chips 
EU261 and EU262 two samples out of ten were found to be 
functioning, on the initial reduced test inputs; the remainder exhibited 
obvious stuck-at faults. Because of the failure of the add and 
subtract primitives the system chip EU260 could not be expected to 
function. 

DAS 9100 is a trademark of the Tektronix Company. 
Diagnostic Tests for Design Debugging 
The second phase of testing was set up to establish the causes 
of failure and acquire and verify redesign information. The 
results were as follows. Initial inspection of the adder and 
subtracter test outputs indicated that some kind of race hazard 
might be the cause. By manually circuit extracting the layout it 
was found that the internal carry feedback had been wrongly con-
nected. The feedback was supplying the inverted carry prior to a 
phi1 clock control, resulting in both faulty logic and a race 
condition. These adder and subtracter faults were traced to mistaken 
translation of circuit schematic into layout; the circuit schematic 
had been correct. 
It was seen from the analysis that both the adder and the 
subtracter could be reconfigured to test their logic and to check 
that there were no further errors which might have been overlooked. 
This was done by hard-wiring the ci control to the supply voltage 
Vdd and so suppressing the internal carry. By supplying 
appropriate external versions of the carry with the inputs, both adder and 
subtracter were tested and were found to give the expected outputs. 
In this test configuration the adder circuit gave a yield of 4 sam-
ples out of 5 and the subtracter 2 samples out of 5 functioning 
correctly. 
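
The principle of this reconfigured test can be illustrated with a behavioural model. The sketch below is an assumption-laden illustration, not a description of the actual circuit: it models a bit-serial adder as a full adder whose carry is normally fed back from one bit time to the next, and models the test configuration as the same full adder with the feedback replaced by an externally supplied carry stream. How the external carry was physically applied is not detailed in the text.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def serial_add(a_bits, b_bits):
    """Normal operation: carry fed back internally, LSB first."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out


def expected_carries(a_bits, b_bits):
    """Carry that a correct adder would present at each bit time."""
    carry = 0
    carries = []
    for a, b in zip(a_bits, b_bits):
        carries.append(carry)
        _, carry = full_adder(a, b, carry)
    return carries


def serial_add_external_carry(a_bits, b_bits, c_bits):
    """Test configuration: feedback suppressed, carry supplied per bit."""
    return [full_adder(a, b, c)[0] for a, b, c in zip(a_bits, b_bits, c_bits)]


if __name__ == "__main__":
    a = [1, 0, 1, 0]  # 5, least significant bit first
    b = [1, 1, 0, 0]  # 3, least significant bit first
    c = expected_carries(a, b)
    print(serial_add(a, b), serial_add_external_carry(a, b, c))
```

If the sum logic is sound, supplying the carry stream of a correct adder from outside reproduces the output of normal operation, which is what allows the logic to be checked independently of the faulty feedback path.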
The multiplexer output was clearly indicative of a fatal 
stuck-at fault. The same layout fault was found in two separate leaf 
cells. The leaf cells involved were the two alternative versions 
of the output stage and both had the ground connection of the same 
transistor source missing, resulting in the observed stuck-at 
fault. Further, one of the cells had an open circuit in the ground 
supply. Since the stuck-at fault occurred in a circuit element 
which was in series with the output, it was fatal and no further 
testing or observation was possible. No performance or yield 
measurements were possible either. Again, as in the case of the adder 
and subtracter, the faults were traced to mistaken translation of 
the circuit schematic into layout. The circuit schematic had been 
correct. 
EU264 - EU267 
Five bonded samples of each of the designs EU264 - EU267 were 
received. On testing it was found that all outputs were per-
manently low. Further, it was found that none of the chips dissi-
pated any appreciable power. Initially postulated causes for this 
behaviour were: power to ground supply shorts, high depletion 
thresholds, pin mismatch on supply pad bonding or missing overglaze 
contacts. Visual inspection under the microscope failed to reveal 
any signs of short-caused burn-outs and the bonding also appeared 
to be correct. Further, process measurements for the depletion 
thresholds were within the expected limits. In a previous case 
[14] where IC output observation was inadequate for fault 
diagnosis, fault mechanisms had been found by using the technique of 
voltage contrast scanning electron microscopy (SEM) [15,16]. 
Experiments were set up to examine each chip with voltage contrast 
under the SEM, at low magnification. It became immediately 
apparent that the cause of failure was discontinuity in the supply 
lines. This occurred in different places on the different samples 
indicating a processing and not a design problem. Figure 7.1 shows 
a SEM voltage contrast picture illustrating such sites of discon-
tinuity; Figure 7.1(a) has 0.0 Volts applied to the centre (VDD) 
pad, Figure 7.1(b) has 5.0 Volts applied to the same pad, revealing 
the lines of discontinuity. One sample was examined at high mag-
nification without voltage contrast at the site of a discontinuity 
in order to reveal the physical structure at the site. This exami-
nation was delayed till all other inspection was complete as it has 
to be done at a higher voltage than the functioning circuit will 
tolerate, and is therefore destructive. The inspection revealed 
bad step coverage, the break occurring at sites of metal crossing 
polysilicon. Figure 7.1(c) shows one such site where the break in 
the aluminium track is quite clear. Regrettably the whole batch 
was useless for functional testing because of the processing prob-
lem identified above. This meant that no redesign information 
could be obtained for the remaining cells. In particular it was 
not possible to complete the functional test of the adder or 
subtracter, since the delay cells used for adding latency could not be 





7.10. Phase 1 Redesign 
The redesign identified as being necessary has been outlined 
in the context of testing in the sections above. For reference, it 
is summarised here. 
The add primitive had one faulty circuit element in two leaf 
cells. The redesign of these cells was subsequently circuit 
extracted and simulated. 
The subtract primitive redesign was identical to that for the 
adder. 
The multiplex primitive had one faulty circuit element in two 
of its leaf cells and also one geometrical design rule viola-
tion and an open circuit in the Vss supply. 
The multiplier had one composition routine bug and one 
geometrical design fault. 
Because of faulty processing, none of the other primitives 
could be tested, so that redesign information was not available. 
7.11. Implications 
The implications of the above described test results are now 
outlined. These fall into three categories: implications for the 
cell design environment, implications for the chip testing environ-
ment and implications for the FIRST compiler design method. 
7.12. Implications for Cell Design 
The experience gained in this and in previous projects rein-
forced the view that successful design depends both on having the 
right design methodology and on having the right tools environment. 
If components of either are missing then success in design is 
unlikely, unless several redesign iterations are undertaken. 
For the first part of the project neither a switch level 
simulator nor a circuit extractor was available. Nor was there 
automatic translation between the various levels of data 
representation. It was foreseen that this would cause problems and steps 
were taken to try to obtain both as soon as possible. 
Unfortunately, with the limited resources available, neither was in 
service for phase 1 design and submission. Thus it was accepted 
that phase 1 design was likely to require at least one redesign 
iteration. This turned out to be true. For phase 2, mixed mode 
simulation was available. After submission of phase 2, a circuit 
extractor became available, but automatic translation between 
representations was still incomplete. 
On the separate issue of mask geometry generation and mask 
geometry post-processing few problems were encountered in preparing 
phase 1 submissions. Post-processing was done partially using the 
Department of Computer Science VLSI software [17] and partly using 
GAELIC [18] through the SERC facility. The only problem of poten-
tial concern was the partial failure of the merge program of the 
GAELIC post processing software. 
7.13. Implications for Test 
The estimates for yield gave values which, on extrapolation, 
looked very unfavourable. The chance of obtaining one functioning 
system, assuming good design and these values, on the basis of even 
ten bonded samples was predicted to be "infinitesimally" small for 
all practical purposes. This view was consistent with the results 
obtained by two other researchers [19,20]. Thus the test strategy 
was reviewed and revised. 
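
The kind of extrapolation referred to above can be illustrated with a simple calculation. The sketch below is an assumption, not the method used in the thesis: it treats block failures as independent, takes the sample fractions reported in the phase 1 results (adder 4 out of 5, subtracter 2 out of 5, multiplier 2 out of 10) as per-block yields even though they were measured under different test conditions, and applies them to a purely hypothetical system of four multipliers, four adders and two subtracters.

```python
from math import prod


def system_yield(block_yields):
    """Yield of a system that works only if every block works
    (assumes independent block failures)."""
    return prod(block_yields)


def p_at_least_one_good(sys_yield, n_samples):
    """Probability that at least one of n_samples systems functions."""
    return 1.0 - (1.0 - sys_yield) ** n_samples


# Sample fractions taken from the phase 1 results, used here as yields.
ADDER, SUBTRACTER, MULTIPLIER = 4 / 5, 2 / 5, 2 / 10

# Hypothetical system composition, chosen only for illustration.
blocks = [MULTIPLIER] * 4 + [ADDER] * 4 + [SUBTRACTER] * 2

if __name__ == "__main__":
    y = system_yield(blocks)
    print("system yield:", y)
    print("P(at least one good in 10 samples):", p_at_least_one_good(y, 10))
```

Even for this modest composition the product of the block yields is very small, which is why testing only a handful of bonded samples was judged unlikely to produce a working system.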
It was decided that, instead of accepting bonded samples on 
the basis of visual inspection, all wafers would be subjected to 
comprehensive functional probe testing. A Teledyne TAG probe test 
station was available in the department so it was decided to 
configure the phase 2 test programme around it and the Tektronix DAS 
9100, used during phase 1. 
7.14. Implications for FIRST 
None of the hardware test results refuted the assumptions or 
the overall concept of the FIRST design methodology and tools 
environment. The compiler was apparently problem free. There 
were, however, as anticipated, "bugs" in the primitive cell library 
designs. 
Postscript 
Phase 2 designs have been completed and submitted for 
maskmaking. It is estimated that the results of phase 2 will not be 
available before August or September 1984. These will be reported 
elsewhere when available. As a separate development, an automated 
verification procedure has been devised using the compiler, general 
UNIX facilities, a circuit extractor and timing simulator. 
Finally, methods for the speedy design of new primitives must be 
explored. 
1.  C. Mead and L. Conway, Introduction to VLSI Systems, 
    Addison-Wesley (1980), pp. 65, 218 - 242. 
2.  A. Vladimirescu, A. R. Newton, and D. O. Pederson, SPICE 
    Version 2G Users Guide, University of California, Berkeley 
    (1980). 
3.  K. P. Coplan, "User Guide to the Switch Level Simulator," 
    Internal Report to SERC: "VLSI Circuits for Digital Signal 
    Processing: First Progress Report" Appendix 4, University of 
    Edinburgh Department of Electrical Engineering (1983). 
4.  L. D. Smith, "A Tutorial Guide to MIXSIM-1," Internal Report, 
    University of Edinburgh Department of Computer Science 
    (October 1982). 
5.  K. P. Coplan, "A User Guide to CEX," Internal Report, 
    University of Edinburgh Department of Electrical Engineering 
    (March 1984). 
6.  N. A. Henderson, "A Control and Timing Generator for FIRST," 
    M.Sc. Project Report, University of Edinburgh Department of 
    Electrical Engineering (September 1981). 
7.  H. S. C. Wallace, "Design and Implementation of a Word-delay 
    Integrated Circuit," B.Sc. Project Report, University of 
    Edinburgh Department of Electrical Engineering (May 1982). 
8.  D. J. Talbot, "An Integrated Complex to Magnitude Converter," 
    B.Sc. Project Report, University of Edinburgh Department of 
    Electrical Engineering (May 1982). 
9.  A. F. Murray, "On the Effectiveness of Random Pattern Self 
    Test for Bit-Serial Signal Processors," IEEE Trans. Comp. (to 
    be published). 
10. A. F. Murray, P. B. Denyer, and D. Renshaw, "Self-Testing in 
    Bit-Serial VLSI Parts: High Coverage at Low Cost," IEEE 
    International Test Conference - Cherry Hill, pp. 260 - 268 
    (October 1983). 
11. A. M. Gundlach and R. Holwill, Edinburgh Microfabrication 
    Facility Design Rules: N-Channel Silicon Gate Process 2 
    (Revision B) (March 1981). 
12. D. Renshaw, "FIRST Primitive Verification Phase 1," Internal 
    Report to SERC: "VLSI Circuits for Digital Signal Processing: 
    First Progress Report" Appendix 3, University of Edinburgh 
    Department of Electrical Engineering (1983). 
13. DAS 9100 Series Operator's Manual, Tektronix Inc. (November 
    1982). 
14. D. Renshaw, "A Digital Signal Averaging L.S.I.C.," M.Sc. 
    Project Report, University of Edinburgh Department of Electrical 
    Engineering (September 1980). 
15. E. Menzel and E. Kubalek, "Fundamentals of Electron Beam 
    Testing of Integrated Circuits," Scanning Vol. 5 pp. 103 - 122 
    (1983). 
16. A. R. Dinnis, "Colour Display of Voltage Contrast in the SEM," 
    Scanning Vol. 3 pp. 172 - 176 (1980). 
17. J. G. Hughes, "VLSI Design Tools," Internal Report, 
    University of Edinburgh Department of Computer Science (1981). 
18. GAELIC User Manual, Compeda (1980). 
19. W. S. Blackley, M. A. Jack, and J. R. Jordan, "Built-In Test 
    and Self-Repair Mechanisms in a Digital Correlator Integrated 
    Circuit," NATO AGARD 47th Symposium of the Avionics Panel on 
    Design for Tactical Avionics Maintainability, pp. 289 - 294 
    (1984). 
20. I. R. MacTaggart, Private communication (1983). 
CHAPTER 8 
CONCLUSION 
The problems of implementing digital signal processing systems 
using custom integrated circuit technology have been presented. 
For VLSI, the traditional methods fail with respect to complexity 
management, cost, design time, debugging, and redesign cycle 
length. Further, system design in such an environment tends to be 
constrained in prototyping by TTL and standard part breadboarding. 
This results in mapping such prototype systems into silicon. The 
consequence is gross architectural inefficiency in silicon 
design. Thus a need for system design prototyping directly into 
silicon has been identified. A mechanism for achieving this must 
allow architectural and partitioning flexibility at the system 
level. At the same time it must lead to efficient silicon design. 
Further, the complex details of the silicon circuit design must be 
hidden from the system designer. The development of an environ-
ment which supports these demands has been the objective of this 
research. 
A review of recent developments in design methodology and CAD 
has been presented. From this, important techniques have been 
identified for application in solving the problems identified 
above. A suitable class of signal processing architectures has 
been identified and related to the successful results of other 
researchers using this architecture to implement systems. 
A prototype design environment fulfilling the requirements 
identified above has been developed. This design environment 
features: 
a constrained but appropriate architecture 
a method for mapping algorithms to architectures 
a high level language for system design capture 
a compiler 
bound function, behaviour & physical design 
automation using modern software architectures 
a minimal cell library 
parameterised, procedural generation of masks 
simple but efficient floor-planning and wiring 
in-built design-for-test and ATPG
fast simulation with bit-level accuracy 
This design environment has been evaluated with respect to 
system-design capture, evaluation and simulation. The significant 
advantages of the approach are thus demonstrated. It is argued 
that the simplification and systematisation of design using these 
methods contribute towards improving dissemination and communica-
tion of design skills. Also hardware, for cell library verifica-
tion, has been fabricated and tested. 
Finally the problems encountered with this approach have been 
identified. One class is associated with the design and verification
of a primitive cell library, especially in the context of
re-implementing in other technologies. The other class of problems
concerns access to cheap, fast-turn-around silicon.
The limitations of this design environment are the constrained 
architecture it supports, and the technology specific nature of the 
cell library together with the cost of its design and verification. 
It should be noted that, in a project of the type chosen, there is 
a balance between the relative costs and complexities of the 
hardware (cell library) and software components. 
Scope for future work lies in several areas. Firstly, the 
area of cell library design and verification requires more develop-
ment. Secondly, a rapid technique for mapping existing cells into 
other technologies is needed. Thirdly, an extension of the inter-
nal architecture to allow greater flexibility is possible. Such an 
extension would address several areas including partial internal 
parallelism, data driven control structures, greater programmabil-
ity and scope for reconfiguration. Fourthly, a more general con-
trol structure should be implemented; this would provide both for 
architectural extensions and more flexible multiplexing. 
APPENDIX I : FIRST HARDWARE DESCRIPTION LANGUAGE 
ABSTRACT 
This appendix describes the FIRST hardware description
language (HDL). The HDL specification was jointly
worked out by P. B. Denyer, D. Renshaw and N. Bergmann,
in consultation with I. Buchanan and J. P. Gray. The 
contents of this appendix were written by D. Renshaw; it 
appears also as a part of an appendix to "VLSI Circuits 
for Digital Signal Processing: First Progress Report", 
Integrated Systems Group, Department of Electrical 
Engineering, University of Edinburgh. 
1. The FIRST language and its compiler. 
1.1. Informal introduction. 
The FIRST language allows a system designer to specify a system
within its scope concisely and completely, as a hierarchy of inter-







Primitives are the predefined functional elements of FIRST. 
(cp manual section 6. part III) Functional elements of all other 
types are user defined. 
Definition of functional elements (apart from primitives) can 
follow top down decomposition into assemblies of simpler elements, 
or bottom up synthesis, or a mixture of both. However, in the for-
mal specification all functional elements (apart from primitives) 
must be defined (or declared) before being used (or called) in the 
language description (or program). The definition of a functional 
element consists of a type definition statement, followed by the 
definition body, followed by an end statement. The definition body 
consists of instantiations of previously defined elements (possibly 
including primitives). Connections between elements are specified 
implicitly by using the same node name for all inputs and the one 
output on each node. 
Instantiation of an element is achieved by calling the ele-
ment. This call statement must include values for each of the 
element attributes (viz.: the input & output connection names, 
parameter values, etc.). 
In general, each type of functional element may take some or 







where the notation { } means optional occurrence. 
The identifier is the predefined primitive name, or a user chosen
name which is used for defining and calling the element.
Parameterisation can be, e.g., the number of words or bits of storage
in a FIFO memory, the number of bits of quantisation for a multiplier
coefficient, or the maximum count value for a control cycle.
Inputs and outputs are identified by a user chosen label which is 
declared as the name of the node to which they connect. 
Note that chips are not allowed to be parameterised. 
The structure described above is embedded in the language syn-
tax as follows 
For a declaration 
Type Identifier [Parameter list] (Control list) Signal list 
For an instantiation 
Identifier [Parameter list] (Control list) Signal list 
A FIRST program (or source file) can be thought of as consist-
ing of a sequence of statements which are either declarations or 
instantiations of functional elements of one of the five types. 
The form of the language is closely related to the structural 
design languages developed by Gray and Buchanan (see Reference [79] 
in the reference list for chapter 2). 
The function of the FIRST language compiler is to reduce the 
hierarchy of system, subsystems, chips and operators into a list of 
invocations of primitives grouped into chips, and to associate a
unique identifier code with each primitive, node name and control
name. The resulting description is the FIRST intermediate form or 
I.F., for short. 
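The reduction described above can be pictured with a minimal sketch, written here in Python rather than in the language of the FIRST compiler itself. The dictionary layout, the function name flatten and the primitive name ADD are assumptions made purely for illustration; node names and control names, which the real compiler also codes, are left out.

```python
from itertools import count


def flatten(name, definitions, primitives, ids=None):
    """Reduce a hierarchy of element calls to a list of (id, primitive) pairs.

    definitions maps each user-defined element to the ordered list of
    elements it calls; primitives is the set of predefined element names.
    Both structures are hypothetical stand-ins for the compiler's tables.
    """
    ids = ids if ids is not None else count(1)
    if name in primitives:
        # A primitive is a leaf: give it the next unique identifier code.
        return [(next(ids), name)]
    flat = []
    for callee in definitions[name]:
        flat.extend(flatten(callee, definitions, primitives, ids))
    return flat
```

For a chip CHIP1 that calls an operator F1 (two BITDELAY primitives) followed by one ADD, the sketch yields three numbered primitive invocations in call order.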
The FIRST language compiler is a single pass, four stage 
recursive descent compiler which takes a FIRST language program and 





I.F. code generation 
During each stage various checks are carried out and diagnos-
tics are issued on the occurrence of detectable errors. 
FIRST I.F. code consists of a coded, expanded list of the 
functional element calls together with the associated interconnec-
tion net lists, as defined by the FIRST source program file. This 
form is defined so as to give a clean interface into physical 
design compilation and behavioural description compilation. 
There now follows a more detailed and formal description of 
the FIRST source language together with an enumeration of its 
features, the formal definitions. 








Comments
Declarations
    Functional elements
    Constants
    Signals & Controls
Scope
Calls
    Functional elements
    Repeated calls
Assignment replacement
Control generator instantiation
Special form of Pad instantiation
Pad order statement
Statements
Program structure
Checks, diagnostics & warnings
Formal definition of language
1.2. THE FIRST LANGUAGE 
A FIRST program consists of a sequence of keywords, identifiers
and constants together with arithmetic operators and separator 
characters. 
1.2.1. KEYWORDS 
A keyword is a sequence of upper or lower case letters (which 
are not distinguished) forming one of the reserved words. A list of 
reserved words and special characters is given in table 1. Keywords 
must not be used for user defined names. 
TABLE 1 : KEYWORDS & SPECIAL CHARACTERS




PADORDER            EVENT    )
CONSTANT            CYCLE
END                 PIN
ENDOFPROGRAM        POUT     ->
CONTROLGENERATOR    PCIN     +





together with all Primitive names 
(see section 6, part iii of this manual). 
wordlength 
(used by the simulator, cp. constant declaration). 
1.2.2. IDENTIFIERS
An identifier consists of a string of upper or lower case 
letters (no distinction is made) and digits, of any length less 
than 256 characters. The first character must be a letter. Special
characters or spaces are not permitted. In the formal definitions 
<NAME> is used to denote an identifier. 
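The identifier rules above can be checked mechanically. The following Python sketch is one possible reading of them, not the compiler's own code. The set RESERVED holds only the keywords that are legible in table 1, and the function names are assumptions made for illustration.

```python
import re

# Assumption: only the keywords legible in table 1 are listed here; the
# full reserved-word list and the primitive names would be added to it.
RESERVED = {
    "PADORDER", "CONSTANT", "END", "ENDOFPROGRAM", "CONTROLGENERATOR",
    "EVENT", "CYCLE", "PIN", "POUT", "PCIN",
}

# A letter followed by any number of letters or digits.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_valid_identifier(name):
    """Return True if name satisfies the identifier rules of this section."""
    if len(name) >= 256:  # length must be less than 256 characters
        return False
    if _IDENTIFIER.fullmatch(name) is None:
        return False
    return name.upper() not in RESERVED


def same_identifier(a, b):
    """Upper and lower case letters are not distinguished."""
    return a.upper() == b.upper()
```

Under these rules sig31 is accepted, while 3sig (leading digit), my sig (embedded space) and end (a keyword) are rejected.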
Identifiers refer to primitives, operators, chips, subsystems, 
system, control nodes, signal nodes, constants, parameters, etc.. A 
list of predefined identifiers is given in table 2. As with keywords,
the predefined identifiers may not be used for any other
designation. 




all Primitive names (see section 6, part III of report).
1.2.3. TYPES 









All primitives and the three signals/controls VDD, GND, NC are
pre-defined; everything else is user defined.
Note that only one identifier of type "system" is permitted to
occur in any one program. Note also that VDD, GND, NC are
the only nodes that can be both signal and control. All other
nodes can be either signal or control, but not both.
1.2.4. CONSTANTS 
Constants appear as parameters which characterise functional 
elements (excepting chips and the system), and as operands for 
arithmetic expressions. Constants are integer valued. 
1.2.5. ARITHMETIC
FIRST uses only integer arithmetic and supports only the fol-
lowing integer arithmetic operations 
( ) 	brackets 
- 	unary minus 
* , / 	integer multiply, divide 
+ , - integer add, subtract 
Operators are listed in order of precedence, highest first, and 
operators of equal precedence are on the same line. Brackets are 
used to alter the order of evaluation. Brackets should be used to 
ensure correct evaluation (as evaluation of operations of equal 
precedence is right to left in this version, but may change in 
later versions). 
1.2.6. SEPARATORS
Spaces, commas and newlines act as separators / terminators 
variously, as required. 
1.2.7. EXPRESSIONS 
An arithmetic expression consists of a sequence of operands 
separated by operators, possibly preceded by an unary operator. 
Operands may be integer constants, previously declared constant 
identifiers or expressions (sub-expression). Note that a single 
identifier or constant is counted as being an expression. Brackets 
may be used to alter the inherent order of evaluation of an expres-
sion, by changing precedence, and should be used to ensure correct 
evaluations. All operations are integer arithmetic operations. 
Thus care must be taken in writing expressions so that the intended 
evaluation is actually carried out. E.g.: 
Let B be any even integer then: 
3/2*B = 3/(2*B) 
is not the same as 
3*B/2 = 3*(B/2) 
is not the same as 
B*3/2 = B*(3/2) 
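The three cases above can be reproduced with a small evaluator. The Python sketch below is an illustration only and is not the FIRST compiler. It groups operators of equal precedence right to left, as described for this version of the language. Two points are assumptions: integer division is taken to truncate toward zero, and the function names are invented for the example.

```python
import re

# Tokens: unsigned integers, identifiers, operators and brackets.
_TOKEN = re.compile(r"\s*(\d+|[A-Za-z][A-Za-z0-9]*|[-+*/()])")


def _tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def _divide(a, b):
    # Assumption: integer division truncates toward zero.
    quotient = abs(a) // abs(b)
    return quotient if (a < 0) == (b < 0) else -quotient


def evaluate(text, constants=None):
    """Evaluate an integer expression, equal precedence grouped right to left."""
    table = {name.upper(): value for name, value in (constants or {}).items()}
    tokens = _tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def unsigned():
        token = take()
        if token == "(":
            value = expr()
            if take() != ")":
                raise ValueError("missing closing bracket")
            return value
        if token.isdigit():
            return int(token)
        if token.upper() in table:
            return table[token.upper()]
        raise ValueError(f"undeclared constant or misplaced symbol: {token}")

    def factor():
        # An optional unary sign binds more tightly than * and /.
        if peek() in ("+", "-"):
            sign = take()
            value = unsigned()
            return -value if sign == "-" else value
        return unsigned()

    def term():
        left = factor()
        if peek() in ("*", "/"):
            operator = take()
            right = term()  # right operand evaluated first: right to left
            return left * right if operator == "*" else _divide(left, right)
        return left

    def expr():
        left = term()
        if peek() in ("+", "-"):
            operator = take()
            right = expr()  # right operand evaluated first: right to left
            return left + right if operator == "+" else left - right
        return left

    value = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return value
```

With B = 4, the sketch gives 0 for 3/2*B, 6 for 3*B/2 and 4 for B*3/2, so the three orderings do differ. Likewise 10-4-3 evaluates as 10-(4-3) = 9, which is why brackets are recommended.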
1.2.8. COMMENTS 
Lines which start with the exclamation mark character ! are 
comment lines. The compiler ignores all characters of text up to 
and including the next statement separator (viz.: newline). 
1.2.9. DECLARATIONS 
All identifiers (except those which are predefined, or are
dummy identifiers) must be declared before being called. Identifiers
may be declared within an operator, chip, subsystem or system
declaration body, or they may be declared outside such. In the
former case they are LOCAL identifiers, in the latter case they are
GLOBAL identifiers.
1.2.9.1. FUNCTIONAL ELEMENT DECLARATION 
A functional element (operator, chip, subsystem, system) is 
declared by a statement of the form 
<type><name>{<param list>}{<ctrl list>}{<sig list>} 
where the abbreviations are 
param - dummy parameter
ctrl - dummy control
sig - dummy signal
and { } denote optional occurrence. 
followed by a definition body, followed by an end statement. 
The dummy control and signal lists are lists of dummy identif-
iers for signal and control nodes, which are replaced by defined 
control and signal identifiers, whenever the element is called. 
Dummy signal and control identifiers are treated as if they were 
local signal and control identifiers by the definition body. Simi-
larly the dummy parameter list is a list of dummy parameter iden-
tifiers (not expressions) which act as local constants within the 
body of the functional element definition but are replaced on any 
functional element call. 
The body of the definition consists of declarations of local 
type (which may be declarations of constants, signals, controls, or 
pads but may not be declarations of functional elements) and calls 
of predefined functional elements. 
Whenever the functional element being declared is later 
called, each of the calls within the declaration is itself called 
in turn, and the dummy parameters and nodes are replaced by the 
corresponding actual parameters and nodes included in the call 
statement. 
Functional element definitions or declarations MAY NOT BE 
NESTED OR RECURSIVE, but can call any previously defined functional 
elements of any allowed type (see note 1).
The end statement is of the form 
END 
A declaration of type chip or of type system is also impli-
citly an instantiation of that chip or system. A functional element 
declaration is always global (see scope). 
Operators may call only primitives and user-defined operators, 
chips may call only primitives and user-defined operators, subsys-
tems may call only chips or subsystems, and the system may call 
only subsystems and chips. Any other structure of calls is illegal. 
TABLE OF PERMITTED CALLS
Functional element    Permitted calls
-----------------------------------------
Primitives
Operators             primitives, operators
Chips                 primitives, operators
Subsystems            chips, subsystems
System                chips, subsystems
Chips must include pad calls and a pad order statement in 
their declaration body. 
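The table of permitted calls lends itself to a direct lookup. The Python sketch below shows one way such a check might be coded. The names PERMITTED_CALLS and call_is_permitted are illustrative assumptions and do not come from the FIRST compiler.

```python
# Which element types each type of functional element may call,
# transcribed from the table of permitted calls.
PERMITTED_CALLS = {
    "primitive": set(),
    "operator": {"primitive", "operator"},
    "chip": {"primitive", "operator"},
    "subsystem": {"chip", "subsystem"},
    "system": {"chip", "subsystem"},
}


def call_is_permitted(caller_type, callee_type):
    """Return True if an element of caller_type may call one of callee_type."""
    return callee_type.lower() in PERMITTED_CALLS[caller_type.lower()]
```

A chip calling an operator passes the check, whereas a subsystem calling a primitive directly, or a chip calling another chip, does not.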
1.2.9.2. CONSTANT IDENTIFIER DECLARATION 
Functional elements which admit parameterisation by integer 
valued parameters (e.g.: number of bits delay etc.) can be called 
with different values in each call. In order to allow more flexible 
use of these parameters, identifiers may be used to name integer 
constants for subsequent use as parameters. Upon declaration, such 
constant identifiers are given an associated integer constant value 
and thereafter the use of the identifier is equivalent to the use 
of this integer constant. These identifiers are NOT integer vari-
ables and are assigned ONCE only in the program, at the point of 
declaration. They cannot be reassigned within the scope of their 
lifetime. The use of this mechanism greatly simplifies consistent 
parameter changes (e.g.: system wordlength, coefficient lengths
etc.) in a system specification, during the functional-behavioural
evaluation part of system development.
The declaration of a constant takes the form 
CONSTANT <assignment list> 
where the assignment list is a list of assignments, each assignment 
separated by commas and of the form 
<name> = <expression> 
Integer constants are strings of numeric characters
0, ..., 9 which represent integers in base ten, up to a maximum
numerical value of 2147523647. Constant declarations may be
local or global.
There is one constant identifier which has a special signifi-
cance for the simulator. It is the name 
wordlength 
which is taken as the fixed internal system wordlength. It is sug-
gested that, at the top of the source program, there is a constant 
declaration of wordlength for the use of the simulator. If this 
declaration is missing then the simulator will request it. If, however,
the declaration of wordlength has any other meaning then simulation
will be faulty.
1.2.9.3. SIGNAL & CONTROL NODE DECLARATIONS 
Signal node identifiers are declared by a statement of the 
form 
SIGNAL <name list> 
where <name list> is one or more identifiers, separated by commas. 
The identifiers listed are declared to be the names of signal 
nodes. Signal declarations may only be local (see scope). 
Control node identifiers are declared by a statement of the 
form 
CONTROL <name list> 
Control declarations may only be local (see scope). 
1.2.10. SCOPE 
The SCOPE of an identifier is the region of the program in 
which the identifier is defined and can be used. The scope of a 
LOCAL identifier is from immediately after its declaration to the 
end of the functional element in which it is declared. The scope 
of a GLOBAL identifier is from immediately after the point of 
declaration until the end of the program. 
Signals and controls are local. Constants may be global or 
local. Primitives, operators, chips, subsystems, and the system are 
global. If a local identifier is declared with the same name as a 
previously declared global identifier (viz.: redeclared) then the 
global identifier cannot be accessed within the scope of the local 
identifier (even if the two identifiers are of different types). A 
name always refers to its most local incarnation. All identifiers 
of the same level (local or global) must be unique. 
The name of a functional element is considered to be declared 
after its associated end statement. Names of primitives are con-
sidered to be declared from the start of the program as are the 
other predefined names. Signal, control and dummy parameter argu-
ments are considered to be local identifiers of the appropriate 
types, with scope from the point of declaration until the end of 
the functional element with which they are associated. 
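The scope rules of this section can be summarised by a two-level symbol table. The Python sketch below is an illustration under stated assumptions and not the compiler's data structure: the class and method names are invented, and nesting is limited to one local level because functional element declarations may not be nested.

```python
class SymbolTable:
    """Global identifiers plus the locals of at most one element body."""

    def __init__(self):
        self._globals = {}
        self._locals = None  # None while outside any functional element body

    def enter_element(self):
        self._locals = {}

    def leave_element(self):
        # Local identifiers go out of scope at the end statement.
        self._locals = None

    def declare(self, name, kind):
        key = name.upper()  # case is not distinguished
        if self._locals is None:
            if kind in ("signal", "control"):
                raise ValueError("signals and controls may only be local")
            table = self._globals
        else:
            table = self._locals
        if key in table:
            raise ValueError(f"{name} is already declared at this level")
        table[key] = kind

    def lookup(self, name):
        key = name.upper()
        # A name always refers to its most local incarnation.
        if self._locals is not None and key in self._locals:
            return self._locals[key]
        if key in self._globals:
            return self._globals[key]
        raise KeyError(f"{name} is used before being declared")
```

Declaring a local signal with the same name as a global constant hides the constant until the element body is left, after which the global meaning is visible again.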
1.2.11. INSTANTIATION 
1.2.11.1. FUNCTIONAL ELEMENTS 
Functional elements are called or invoked by a statement of 
the form 
<name> {<param list>} {<ctrl list>} {<sig list>} 
where : { } denote optional occurrence and : <name> is a 
primitive, a (previously declared) operator, chip, or subsys-
tem identifier. 
{<param list>} is an optional list of arithmetic expressions, 
separated by commas and enclosed in square brackets. The 
expressions, when evaluated, yield integer values which con-
trol the parameterisable attributes of the functional element. 
Chips and the system cannot be parameterised. 
{<ctrl list>} is an optional set of control node identifiers 
of the form 
(<input list> -> <output list>)
where <input list> & <output list> are lists of control
identifiers separated by commas. Note that -> may be omitted
if <output list> is empty.
{<sig list>} is an optional set of signal node identifiers of 
the form 
<input list> -> <output list> 
where <input list> & <output list> are lists of signal 
identifiers separated by commas. Again, 	-> may be omitted 
if <output list> is empty. 
The number of parameters, controls and signals associated with 
any functional element are fixed (except in the special case of the 
control generator - see section on control generator). Calls of a 
given element must have expressions and names which match, in 
number and type, the declaration. Although the lists are described 
as optional, using the notation { } , this refers only to the 
fact that e.g.: if an element has no parameters then the parameter 
list will be absent in its syntax. For any given element the number 
of parameters, signals, and controls is fixed. 
Chips and the system are called implicitly by declaration. 
This convention is used to output I.F. code for the physical design 
subsystem and for simulation using the behavioural description com-
piler. 
1.2.11.2. REPEATED CALLS 
Any element, most notably chips, can be called with a fixed 
repetition number. The form of the repeat call is to add to the end 
of the call itself a statement of the following form 
"TIMES" <const> "WITH" <ctrl list> <sig list> 
where the controls and signals in <ctrl list> and <sig list> are 
cascaded with the connection of outputs to inputs as defined by 
these lists, (cascaded connection). Any inputs (or outputs) which 
do not occur in these lists, but do occur in the functional element
call, are all connected to the same node (global connection). The 
assignment replacement statement must be used in order to obtain 
distinct inputs or outputs (see next section) from globally con-
nected inputs or outputs. Note that outputs should never be left 
globally connected and must always be replaced subsequently to get 
distinct outputs. Only inputs can be permitted global connection. 
The effect of this call is the same as having <const> separate
calls of the functional element, thus giving a compact notation for 
repetitive identical structures. This is particularly useful for 
the specification of filters which may have, e.g., 32 or more 
identical chips cascaded, cp. appendix 1. 
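The expansion of a repeated call into separate calls can be sketched as follows. This Python fragment is an illustration only. It covers global connection and a simple forward cascade of one output to one input, and the internal node names it generates (such as out_in_1) are hypothetical, since the manual does not say how the compiler names the hidden nodes between neighbouring elements.

```python
def expand_repeat(element, inputs, outputs, times, cascade=None):
    """Expand a repeated call into a list of (element, inputs, outputs).

    cascade maps an output name to the input name of the next copy that
    it drives. Nodes not mentioned in cascade stay globally connected,
    i.e. every copy refers to the same node name.
    """
    cascade = cascade or {}
    calls = []
    for i in range(times):
        ins, outs = list(inputs), list(outputs)
        for out_name, in_name in cascade.items():
            if i > 0:
                # Driven by the previous copy through a hidden node.
                ins[inputs.index(in_name)] = f"{out_name}_{in_name}_{i}"
            if i < times - 1:
                # Drives the next copy through a hidden node.
                outs[outputs.index(out_name)] = f"{out_name}_{in_name}_{i + 1}"
        calls.append((element, ins, outs))
    return calls
```

With an empty cascade all copies share the nodes in and out. With the cascade out -> in only the first copy sees in and only the last copy drives out, in keeping with the rule that cascaded connections are accessible only at the two ends of the repetition.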
1.2.12. ASSIGNMENT REPLACEMENT 
The assignment replacement statement is a special statement of 
the form 
<name> = <list> 
where <name> is a signal, or control identifier, and <list> is 
a list of identifiers of the correct type separated by commas. 
The effect of this statement is to replace each previous 
occurrence of <name> in the present functional element definition 
(within the scope of <name>) by a name in <list>. The names in
list are taken in order and replace each occurrence of <name> , in 
order, starting with the first and continuing the replacement down 
to the assignment replacement statement itself, or until the members
in <list> are exhausted. The types must be the same. If the 
number of entries in <list> is less than the number of previous 
occurrences, then the remaining occurrences are left and are not 
replaced. If the number of entries in <list> is greater than the 
number of occurrences then a fault diagnostic is issued. 
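The replacement rule can be stated compactly in code. In the Python sketch below, the list occurrences stands for the node names met so far in the current functional element definition, in textual order. That flat-list representation and the function name are assumptions made for illustration.

```python
def assignment_replace(occurrences, name, replacements):
    """Apply  name = replacements  to the earlier occurrences of name.

    Earlier occurrences are replaced in order, one per entry of
    replacements. Occurrences left over when the list runs out are not
    replaced. Too many replacements is reported as a fault.
    """
    positions = [i for i, node in enumerate(occurrences)
                 if node.upper() == name.upper()]
    if len(replacements) > len(positions):
        raise ValueError(
            f"{len(replacements)} replacements given for "
            f"{len(positions)} occurrences of {name}")
    result = list(occurrences)
    for position, new_name in zip(positions, replacements):
        result[position] = new_name
    return result
```

For three repeated elements that each refer to in and out, the statement in = in1, in2, in3 gives each element its own input while leaving every out untouched.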
Usually the identifier <name> is a globally connected node 
in the repeated element structure. There is no access to referenc-
ing cascaded connections except into the first element and out of 
the last element of the repetition. Examples in appendix 1 show 
how various connections of repeated structure can be specified 
using this syntactical construct. 
The purpose of the replacement statement is to allow non-global,
non-cascaded inputs or outputs to be specified in a repeat
statement. The replacement being of cascaded connections subsequent 
to a repeat statement. 
The identifiers used for non-global, non-cascaded connection 
nodes of repeated elements should be unique, in order to avoid any 
unexpected side effects. 
Note : It is recommended that the assignment replacement con-
struct be used together with a foregoing repeat statement as 
the only statements within an operator or subsystem declara-
tion, viz.: assignment replacement should always be embedded 
inside an operator or subsystem. The <name> which is replaced 
by the names in <list> can then be an identifier which is 
local to the operator or subsystem. This requirement is not 
mandatory but it is strongly advised that it be observed in 
system specification in the interest of correct synthesis. 
Care should be exercised if assignment replacement is used in 
any other way. This restriction does not apply if the repeat 
statement does not have an associated assignment replacement 
statement. 
1.2.13. CONTROL GENERATOR INSTANTIATION 
The control generator construct is both a declaration and an 
instantiation. This is because the parameterisation of the control 
generator has a special, variable form. The control generator con-
struct defines the system control. For simulation it is mandatory 
to have a control generator in the system. The control generator 
may be included, together with the system, on a single chip, if this
is feasible, or may be on a chip on its own or with other components
for a multiple chip system.
Cycle operators indicate levels of control. 
An event operator may follow any cycle operator. 
Each control block has one output, except the first which also
has an inhibit output.
An event operator has one input and one output. 
In the input/output lists at the head of the controlgenerator 
declaration, all inputs and outputs are in the order of the 
blocks below it. 
The compiler gives a single block (called CG) in the intermediate
form file, hence CG is a reserved primitive name.
A control generator must be used to generate cLSB. The first 
cycle statement of the control generator will have cycle 
length equal to system word length. This will be the value 
which should be declared as wordlength for the simulator. 
The form for the control generator statement is as follows. 






Where in1, in2, ..., inx are the event request inputs.
Where out1, out2, ..., outy are the cycle and event outputs, in
order.
Where x is the number of event requests.
Where y is the total of the number of cycles plus the number
of event requests.
Where period is replaced by an integer or constant value.
1.2.14. SPECIAL FORM OF PAD INSTANTIATION 
The normal syntax for instantiating a pad is a single primi-
tive call of one of the following 
Padin(externalctrl -> internalctrl) 
Padout(internalctrl -> externalctrl) 
Padin externalsigi -> internalsigi 
Padout internalsigi -> externalsigi 
This can necessitate many pad statements in a chip definition. 
A shortened form, with multiple controls and/or signals is permit-
ted, so that only one padin statement and one padout statement need 
be used. If a statement of this shortened form is used, then it is 
expanded out by the compiler to the equivalent multiple calls of 
padin and padout. 
The syntax is as follows. 
Padin(<ctrl list1> -> <ctrl list2>) <sigl list1> -> <sigl list2>
Padout(<ctrl list1> -> <ctrl list2>) <sigl list1> -> <sigl list2>
Where the lists are lists of node identifiers of arbitrary
length, and where <ctrl list1> and <ctrl list2> must have the
same number of control node identifiers, and <sigl list1> and
<sigl list2> must also have the same number of elements.
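The expansion of the shortened pad form into single pad calls might look as follows. This Python sketch is illustrative only: the function name, the argument layout and the order in which the expanded calls are emitted (controls first, then signals) are assumptions, as the manual does not specify them.

```python
def expand_pads(kind, ctrl_in=(), ctrl_out=(), sig_in=(), sig_out=()):
    """Expand a shortened Padin or Padout statement into single pad calls.

    kind is "Padin" or "Padout". The two control lists must be of equal
    length, and likewise the two signal lists.
    """
    if len(ctrl_in) != len(ctrl_out):
        raise ValueError("control lists differ in length")
    if len(sig_in) != len(sig_out):
        raise ValueError("signal lists differ in length")
    calls = [f"{kind}({a} -> {b})" for a, b in zip(ctrl_in, ctrl_out)]
    calls += [f"{kind} {a} -> {b}" for a, b in zip(sig_in, sig_out)]
    return calls
```

A Padin statement with one control pair and two signal pairs therefore expands to three calls, one in the control form and two in the signal form.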
1.2.15. PAD ORDER STATEMENT 
In order to facilitate bonding and printed circuit board chip 
interconnection, it is possible to specify the pad order required. 
Since there is no default pad ordering it is mandatory to have a 
pad order statement within each chip definition. There are three 
fixed positions: VDD, GND, CLOCKS. This gives three fields within
which the remaining pads can be placed. The physical design compiler
attempts to space the pads equidistant with respect to each
other, within each field.
The pad order statement has the following form. 
Padorder VDD,(xxx,...,xxx,)GND,(xxx,...,xxx,)CLOCK,(xxx,...,xxx)
Where xxx is replaced by an external node identifier, and the 
number of xxx replacements within each field is arbitrary. 
The choice of pad order will be determined by system wiring
requirements on chips and by the consequences of the pad order
for floor plan and chip area.
1.2.16. STATEMENTS 
Statements are generally written on one line and are either 
declarations, instantiations, end statements, comments, assignment 
replacements or pad order statements. Where necessary a statement 
may be broken over more than one line, by using a continuation 
character (-) as the last character on the line to be continued. The
continuation character may be omitted if the character preceding it 
is a comma. 
The Continuation Character is the character - 
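The line-joining rule can be sketched in Python as below. This is an illustration, not the compiler's scanner. It assumes that comment lines (starting with !) and blank lines are dropped wherever they occur, and it treats any trailing - as a continuation character, so a line that really ended in a subtraction operator would be misread.

```python
def logical_statements(lines):
    """Join physical source lines into complete statements."""
    statements, pending = [], ""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # assumption: blank and comment lines are skipped
        if line.endswith("-"):
            # Explicit continuation: drop the character and keep reading.
            pending += line[:-1].rstrip() + " "
        elif line.endswith(","):
            # A trailing comma continues the statement implicitly.
            pending += line + " "
        else:
            statements.append(pending + line)
            pending = ""
    if pending:
        raise ValueError("source ends inside a continued statement")
    return statements
```

A repeated call whose line ends in WITH - is thus merged with the cascade phrase on the following line, and a long signal list broken after a comma is merged in the same way.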
1.2.17. PROGRAM STRUCTURE 
A program consists of a set of statements; its structure 
requires declaration before invocation. A program should (by con-
vention) start with some comment statements giving identification 
etc., which are followed by the necessary declarations and calls in 
sequence. No nesting or recursion is permitted. Functional element 
declarations must each have their own corresponding end statement. 
The final line of the program must be the end of program statement 
(apart from subsequent comments, which are to be ignored). 
1.2.18. CHECKS, DIAGNOSTICS AND WARNINGS 
Because human generated coding is error prone, it has been a 
high priority objective to incorporate as much checking together 
with diagnostic warning of detectable errors as can be usefully 
implemented in each of the compilers and at each stage of compila-
tion, as appropriate. The purpose of these checks and diagnostics 
is twofold 
to exclude, or at least warn of, faulty constructions during
each stage of implementation. 
to aid in system design debugging. 
Definitive verification of system function is given by the 
behavioural description compiler, but the earlier checks and warn-
ings help to eliminate most faults as early as possible, often 
before simulation. The checks carried out by each compiler are 
detailed in the appropriate section. 
The language compiler includes checks (and warns on the 
occurrence of error conditions) for the following 
checks on names 
checks on types 
checks on declarations 
checks on parameters 
checks on signals & controls 
checks on pads 
checks on connections 
checks on statements 
checks on syntax 
There are approximately fifty kinds of check made during 
language compilation. 
Table 3 shows the formal syntax definitions and the last section
gives a worked example to illustrate each of the language constructs
as used for system specification.














<PARAMLIST>="["<PLIST>"]", ;
<PLIST>=<EXPR>,<PLIST>","<EXPR>;






<CONSTDEFS>=<ASSIGN>,<CONSTDEFS>","<ASSIGN>;
<ASSIGN>=<NAME>"="<EXPR>;
<EXPR>=<TERM><ADDOP><EXPR>,<TERM>;
<TERM>=<FACTOR><MULTOP><TERM>,<FACTOR>;
<FACTOR>=<ADDOP><UNSIGNED>,<UNSIGNED>;
<UNSIGNED>=<NAME>,<CONST>,"("<EXPR>")";
<ADDOP>="+","-";
<MULTOP>="*","/";
<OPTYPE>="OPERATOR","CHIP","SUBSYSTEM","SYSTEM";
$ 
1.3. Upgrades and Enhancements 
A number of new features have been added to the language since 
this version. The underscore character has been included as a 
legal character for names. Modifications and additions have been 
made to the repeated call and assignment replacement syntax, and a 
shorthand for long lists of node names has been added. These
changes are outlined below. 
1.3.1. Shorthand for long signal lists 
Frequently the designer specifies an object which is in turn 
composed of many identical objects working concurrently on a long 
input data list, producing a long output data list. The SUBSYSTEM 
Outcolumn in our example is one such object, taking in sixteen real 
and sixteen imaginary signals, and producing sixteen magnitude out-
puts. These lists can usually be named in a repetitive manner, and 
FIRST allows their representation in shorthand using the THROUGH 
statement. A nodelist of the form alphanumber, alphanumber+1, ...,
alphanumber+N may be represented by alphanumber THROUGH number+N,
e.g.:
sig31, sig32, sig33, sig34, sig35, sig36
may be represented by:
sig31 THROUGH 36
The first line of the declaration of Outcolumn, naming forty-
nine nodes, looks like this in shorthand: 
SUBSYSTEM OutColumn (c1) re1 THROUGH 16,
im1 THROUGH 16 -> mag1 THROUGH 16
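The THROUGH shorthand can be expanded mechanically. The Python sketch below shows one possible reading. It assumes that the number part of a name is its longest run of trailing digits, and that the final number must not be smaller than the first; the function names are invented for the example.

```python
import re

# prefix, first number, keyword THROUGH, last number
_THROUGH = re.compile(
    r"^([A-Za-z][A-Za-z0-9_]*?)(\d+)\s+THROUGH\s+(\d+)$", re.IGNORECASE)


def expand_through(item):
    """Expand 'sig31 THROUGH 36' into the individual node names."""
    item = item.strip()
    match = _THROUGH.match(item)
    if match is None:
        return [item]  # an ordinary node name, left as it is
    prefix, first, last = match.group(1), int(match.group(2)), int(match.group(3))
    if last < first:
        raise ValueError(f"empty range in {item!r}")
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def expand_node_list(text):
    """Expand every entry of a comma separated node list."""
    nodes = []
    for item in text.split(","):
        if item.strip():
            nodes.extend(expand_through(item))
    return nodes
```

Applied to the input list of OutColumn, the two THROUGH phrases expand to thirty-two names, which with the sixteen magnitude outputs and the one control gives the forty-nine nodes mentioned above.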
1.3.2. Shorthand for repeated instantiations and linear arrays 
Modular architectures often require repeated instantiations of 
identical objects. FIRST provides a syntax for condensing such 
repeated instantiations into one statement. This is done by 
appending, to a normal instantiation, a phrase which defines the 
repetition and connection structure. The form of this phrase is as 
follows: 
TIMES constant WITH cascade 
where cascade is a phrase whose form and function is explained 
below. The full form of a repeated instantiation is: 
name [param list] (ctrl list) siglist TIMES constant WITH cascade 
The number of repetitions is defined by the value of constant and 
the different types of connection are defined by cascade. 
Cascade can be a null string. In this case corresponding 
inputs and outputs of the repeated element are connected in common
to the nodes named in the instantiation. An example of this is 
illustrated in Figure 1 and is represented syntactically as: 
OPERATOR F1 in -> out
BITDELAY [1] in -> out TIMES 3 WITH
END
in 
'k 'El out 
Figure 1 
This syntax may describe a legal structure for inputs (though 
fan in may be a problem if the constant value is large) but the 
connection of more than one output to any node is not normally a 
legal construct. The mechanism for severing the corresponding out-
put or input connections from this common node is a statement of 
assignment replacement, which follows the repeated instantiation. 
The form of this is: 
name = list 
where name is a node identifier (occurring in the repeated instan-
tiation) and list is a list of the new node identifiers to be used 
as replacements. The effect of this statement is to replace each 
occurrence of name by the next name in the list, starting from the 
first occurrence and the first name. This substitution continues 
down to the assignment replacement statement or until the list is 
exhausted. An example of this type of connection of repeated ele-
ments is illustrated in Figure 2. The syntax for this is: 
OPERATOR F2 in1, in2, in3 -> out1, out2, out3
SIGNAL in, out
BITDELAY [1] in -> out TIMES 3 WITH
in = in1, in2, in3
out = out1, out2, out3
END
[Figure 2: three BITDELAY elements with distinct inputs in1, in2, in3 and distinct outputs out1, out2, out3]
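The effect of TIMES with assignment replacement can be sketched in Python as below. This is a sketch under stated assumptions rather than the FIRST implementation: the function `expand_times` and its argument layout are hypothetical, and it models only the rule described above, in which the k-th copy takes the k-th name from each replacement list and keeps the original (common) name when no replacement is given or the list is exhausted.

```python
def expand_times(inputs, outputs, times, replacements=None):
    """Expand one repeated instantiation into `times` copies (hypothetical helper).

    replacements maps a node name to the list of names that replace it,
    one per copy, mirroring the 'name = list' statement.
    """
    replacements = replacements or {}
    copies = []
    for k in range(times):
        def substitute(name, k=k):
            names = replacements.get(name, [])
            # Use the k-th replacement if the list has not run out.
            return names[k] if k < len(names) else name
        copies.append(([substitute(n) for n in inputs],
                       [substitute(n) for n in outputs]))
    return copies


# Figure 1 style: null cascade, all copies share 'in' and 'out'.
common = expand_times(["in"], ["out"], 3)

# Figure 2 style: distinct inputs and outputs for each copy.
distinct = expand_times(["in"], ["out"], 3,
                        {"in": ["in1", "in2", "in3"],
                         "out": ["out1", "out2", "out3"]})
for ins, outs in distinct:
    print("BITDELAY [1]", ", ".join(ins), "->", ", ".join(outs))
```

The first call reproduces the common-node case, which is why it is only legal for inputs; the second shows how the replacement lists separate the nodes.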
Thus we have a syntax for describing global and distinct con-
nections in repeated instantiations. In order to describe other
modes of connection for repeated elements it is necessary to use
the cascade phrase. This allows the description of four different
types of nearest neighbour connection, one or more of which may be





The phrase cascade has four components, one for specifying the con-
nections of each type. (Some or all may be empty, according to 
whether or not there are connections of the particular type). Fig-
ure 3 illustrates each of these types of connection, and the 
corresponding syntax is as follows. 
Forward cascade
OPERATOR F3 in -> out
BITDELAY [1] in -> out TIMES 3 WITH -
out -> in
END
Forward tapped cascade
OPERATOR F4 in -> tap1, tap2, tap3
SIGNAL out
BITDELAY [1] in -> out TIMES 3 WITH -
out => in = tap1, tap2, tap3
END
Backward cascade
OPERATOR F5 in -> out
BITDELAY [1] in -> out TIMES 3 WITH -
in <- out
END
Backward tapped cascade
OPERATOR F6 in -> tap1, tap2, tap3
SIGNAL out
BITDELAY [1] in -> out TIMES 3 WITH -
in <= out = tap1, tap2, tap3
END
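As a rough illustration of the simple forward cascade only, the Python sketch below chains the copies so that the output of each copy feeds the input of the next. This is an interpretation, not the compiler's algorithm: the function `forward_cascade` is hypothetical, and the generated internal node names (prefixed with an underscore) are invented here, since in FIRST the internal nodes of a simple cascade are not externally named.

```python
def forward_cascade(in_node: str, out_node: str, times: int):
    """Return (input, output) node pairs for a chain of `times` copies.

    Assumes 'out -> in' means the output of copy k drives the input of
    copy k+1; internal node names are invented for illustration.
    """
    if times < 1:
        raise ValueError("at least one copy is required")
    internal = [f"_{out_node}_{k}" for k in range(1, times)]
    nodes = [in_node] + internal + [out_node]
    return [(nodes[k], nodes[k + 1]) for k in range(times)]


for src, dst in forward_cascade("in", "out", 3):
    print("BITDELAY [1]", src, "->", dst)
```

Under this reading, operator F3 behaves as three unit bit delays in series between in and out.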
Each of the cascade phrases takes the form of lists of nodes on 
either side of the cascade assignment symbol ( ->, =>, <-, or <= ). 
In the case of the simple forward and backward cascades the inter-
nal signal nodes are not externally named. However, a new set of 





Figure 1.2 5th order LDI magnitude response (curves for 6 bit, 8 bit and 10 bit quantization)





[Plot residue: coefficient annotations, partly legible as a=19, b=47, c=17, d=24, e=24, m=128 on one plot and a=20, b=50, c=5, m=128 on the other]






1.5. Multiplier Multiplexing
An advantage of bit serial arithmetic over 
parallel arithmetic is obtained when arithmetic 
elements can be shared by using multiplexing. If 
the signal bandwidth of the final product without 
sharing arithmetic elements is greater than 
necessary then a silicon area saving can be 
realized. Since the serial multipliers are the 
largest single arithmetic element they are the 
most likely candidate for sharing. 
In the case of LDI filters, it is desirable 
that the filter operate as fast as possible to 
allow a variety of applications. However, an 
efficient LDI digital filter implementation is 
possible if one multiplier is shared between stages
of a second order section. This type of sharing
is impractical in parallel LDI filter implementations
because of the increase in the number of
interconnections required.
To explain the multiplier sharing used in
LDI filters consider the LDI digital filter
implementation given in figure 1.4. If every
second stage in the signal flowgraph in figure 1.4
is updated alternately then the same signal 
flowgraph as in figure 1.1 is implemented. This 
alternate clocking scheme has the advantage of 
simplifying the LDI filter signal flowgraph (regular 
structure) in addition to saving one multiplier for 
every two sections. 
1.6. Initialization 
The FIRST system allows an external event
to be recognized by the LDI filter implementation.
This external event is used for two purposes: all
state registers are cleared and the filter
coefficients are loaded into registers before the
filter begins operation.
1.6.1. State Registers
Since the LDI digital filter is a recursive
system it is necessary that the filter state
registers be set to zero before the filter is
allowed to operate. The state registers in the
FIRST implementation of the LDI digital filter are
implemented as bit delay elements associated with
an adder element. To set these bit delay
elements to zero, multiplexers are used to open
the feedback loop and set the value entering the 
adder to zero. Note that because of the 
alternate clocking scheme used, in some cases only 
one multiplexer is needed as the input is already 
zero during the initialization. 
1.6.2. Filter Coefficients 
The filter coefficients are loaded as serial 
unsigned binary values during the initialization 
phase. The same control signals used to set the 
state registers to zero are used to load the filter 
coefficients. Multiplexers and signal bit delays are 
used to recirculate the multiplier coefficient with 
the correct timing and synchronization. 
1.7. FIRST Simulation 
The FIRST simulator is event driven, thus
efficiently allowing both control and data signals
to be simulated. This is an essential step as all
control signals used by each arithmetic element
must be synthesised by the device implemented.
The LDI filter was simulated by using a
single large valued ( 100,000 ) input to
approximate an impulse input. The resulting
simulated response was transformed to the
frequency domain using a fast Fourier transform
program. The results of the third and fifth order
LDI digital filter simulations are shown in figures 
1.5 and 1.6 respectively. 
1.8. Conclusion 
The design of a recursive digital LDI 
ladder filter using the silicon compiler FIRST has 
been described. The silicon compiler currently 
supports the implementation of devices using a 5 
micron silicon gate NMOS process. The clock rate
is eight MHz. With a system wordlength of 22
bits a sample rate of approximately 180 kHz is
possible.
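The quoted sample rate can be checked with a short calculation. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and the factor of two is assumed to come from the alternate clocking scheme of section 1.5, in which one multiplier serves two stages so that each sample occupies two word times.

```python
def sample_rate_hz(clock_hz: float, wordlength_bits: int,
                   multiplex_factor: int = 1) -> float:
    """Sample rate of a bit serial system processing one bit per clock.

    multiplex_factor is the number of word times spent per sample
    (assumed to be 2 for the alternately clocked LDI filter).
    """
    return clock_hz / (wordlength_bits * multiplex_factor)


rate = sample_rate_hz(8e6, 22, multiplex_factor=2)
print(f"{rate / 1e3:.1f} kHz")
```

With an 8 MHz clock and a 22 bit word this gives roughly 182 kHz, consistent with the "approximately 180 kHz" stated above; without the assumed factor of two the figure would be about twice that.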
1.9. References 
[1] Bruton, L. T., "Low sensitivity digital ladder
filters", IEEE Trans., 1975, CAS-22, pp. 168-176.
[2] Denyer, P.B., Renshaw, D. and Bergmann, N.,
"A silicon compiler for VLSI signal processors",
Proc. European Solid State Circuits Conference,
Brussels, 1982.
[3] Denyer, P.B. and D. Renshaw, "Case studies
in VLSI signal processing using a silicon
compiler", IEEE ICASSP 83, 1983, Boston,
pp. 939-942.
[4] Liu, E. S. K., L. E. Turner, and L. T.
Bruton, "Exact Synthesis of LDI and LDD
ladder filters", accepted for publication by
the IEEE Trans. on Circuits and Systems.
[5] H. Orchard, "Inductorless filters", Electronics
Letters, vol. 2, pp. 224-225, June 1966.
[6] Vaughan-Pope, D.A. and L.T. Bruton,
"Transfer function synthesis using generalized
doubly terminated two-pair networks", IEEE
Trans. on Circuits and Systems, vol. CAS-24,
no. 2, pp. 79-88, February 1977.
[7] Bruton, L.T. and R.H. Khan, "Multirate
multiplierless discrete ladder filters",
Electronics Letters, vol. 14, pp. 814-815,
1978.
[8] Lyon, R.F., "A bit-serial VLSI
Architectural Methodology for signal
processing", VLSI 81, University of
Edinburgh, August 1981, pp. 131-140.
P. B. Denyer and D. Renshaw are with 
the Integrated Systems Group, Department 
of Electrical Engineering, University of 
Edinburgh, The Kings Buildings, Edinburgh, 
UK. 
L.E. Turner is with the Department of 
Electrical Engineering, University of Calgary, 
Calgary, Alberta, Canada. 
Removal of the unrealizable advance
operators in the filter terminations does not
eliminate the desirable sensitivity characteristics of
this filter. However, the digital filter can no
longer be designed exactly by direct transformation
from an analog prototype filter. (If the values of
the inductor and capacitor elements in the analog
filter are used to define the multiplier coefficients
in the digital filter, errors will occur in the
transfer function realized.) Exact methods of
synthesis are available for the LDI lowpass ladder
filter[6] including a complete design program for
LDI filters[4]. In addition, optimization methods
have been used to design LDI filters[7].
1.3. Finite Precision Effects 
In any practical recursive digital filter 
implementation, signal levels are represented by
finite, quantized binary values. The quantized 
binary values are also used to represent the values 
of the filter multiplier coefficients. The discrete 
approximation of a continuous integrator is actually 
implemented as a digital integrator. Any digital 
implementation is limited in accuracy by the finite 
quantized nature of the digital signals on which it 
operates. The binary filter multiplier coefficient 
value is a quantized approximation of the ideal 
filter 	multiplier 	coefficient 	value. 	This 
quantization causes a deviation in the actual filter 
transfer function as compared to the ideal filter 
transfer function. 
Nonlinearities associated with adder overflow
and signal level quantization occur in any 
implementation of a recursive digital filter. 	These 
nonlinear operations often lead to undesirable 
autonomous oscillations or limit cycles. 
1.3.1. Coefficient Quantization
The effects of coefficient quantization in
an LDI digital filter implementation are less
severe than in other digital filter structures[1].
This is due to the low sensitivity property of the 
analog prototype filter. The ideal magnitude 
response of a fifth order LDI digital filter is 
given in figure 1.2. For comparison, in the same 
figure, the magnitude response obtained using six, 
eight and ten bit coefficient quantization is 
given. 
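To make the idea of coefficient quantization concrete, the following Python sketch rounds an ideal coefficient to a given number of fractional bits. It is only an illustration: the function name, the rounding rule (round to nearest) and the example value 0.3 are assumptions, not the scheme or the coefficients used for the filter in figure 1.2.

```python
def quantize_coefficient(value: float, fractional_bits: int) -> float:
    """Round value to the nearest multiple of 2**-fractional_bits."""
    scale = 2 ** fractional_bits
    return round(value * scale) / scale


ideal = 0.3  # hypothetical ideal coefficient, for illustration only
for bits in (6, 8, 10):
    q = quantize_coefficient(ideal, bits)
    print(bits, q, abs(q - ideal))
```

With round-to-nearest the error is bounded by half of one least significant bit, that is 2^-(bits+1), so each extra pair of bits reduces the worst-case coefficient error by a factor of four.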
1.3.2. Signal Wordlength
If the required signal value at any node in
the LDI filter becomes larger than the binary
wordlength available, then an overflow occurs.
Overflow causes very severe signal distortion in
the filter. In any practical filter implementation
it is necessary to ensure that under expected
operating conditions, overflow does not occur.
This is done in most filters by scaling the input
signal to the filter such that the gain to any
internal node is less than unity. In LDI filters it
can be shown that scaling is not required since
the gain from the input to any internal node is
always less than unity. This assumes that the
signal paths in the digital integrator are larger
than the input signal wordlength by an amount
equal to the coefficient quantization.
1.3.3. Limit Cycle Oscillations 
Limit cycle oscillations are undesirable 
nonlinear effects which are due to the finite 
quantized 	nature 	of 	the 	digital 	filter 
implementation. Two types of limit cycle 
oscillations are recognized. These are overflow 
oscillations and granularity oscillations. Providing
that the LDI filter terminations are
implemented with infinite precision (unity gain),
then no serious limit cycle oscillations are known 
to occur. 
1.4. FIRST Implementation
The signal flowgraph of the bit serial LDI
digital filter implemented using FIRST is shown in
figure 1.3. This is a third order LDI digital
filter organized as a first order input section and
a separate second order section. This filter is
designed so that fifth, seventh or higher odd order
filters can be implemented by cascading the
second order sections. Using the FIRST compiler
has several advantages. Multiplexers can be
efficiently used to share arithmetic elements
because of the use of bit serial arithmetic. Both
the system wordlength and the multiplier
coefficient wordlength can be parameterized to
allow easy modification for different applications or
testing. The user does not need to know the
fine detail of the arithmetic element
implementation.
1.4.1. Pipelined Serial Arithmetic 
The FIRST compiler generates an integrated
circuit design with arithmetic elements which use
bit serial pipelined arithmetic[8]. The operations
of multiplication, addition, subtraction, multiplexing,
and bit delay are used in this filter
implementation. Each arithmetic element has a
known latency. Latency is the time required (in
terms of bit delays) for each arithmetic element
to complete its operation. Because of this
pipelining each FIRST element also requires control
signals which identify when the least significant
bit of the data signals will arrive.
The use of pipelined serial arithmetic 
allows all filter operations to be performed as 
soon as data from previous elements is available. 
The recursive nature of the LDI filter places a 
restriction on the latency or delay around any 
feedback loop. That is, the latency around any 
feedback loop must be equal to the system 
wordlength. This requires that additional bit 
delays be incorporated into both the signal and 
control paths. 
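The loop latency constraint amounts to a simple balance: the element latencies around a feedback loop, plus any padding bit delays, must add up to the system wordlength. The Python sketch below shows that bookkeeping; the function name and the example latencies are hypothetical and are not taken from the FIRST cell library.

```python
def padding_delays(element_latencies, system_wordlength: int) -> int:
    """Bit delays to add so that a feedback loop's latency equals the wordlength."""
    total = sum(element_latencies)
    if total > system_wordlength:
        raise ValueError(
            f"loop latency {total} exceeds system wordlength {system_wordlength}")
    return system_wordlength - total


# Hypothetical loop: a multiplier, an adder and a multiplexer.
loop = [12, 1, 1]          # latencies in bit delays, illustrative values only
print(padding_delays(loop, 22))
```

A loop whose elements already exceed the wordlength cannot be balanced by padding, which is why the sketch raises an error in that case; the practical remedy would be a longer system wordlength.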
1.4.2. Parameterization 
The FIRST compiler allows parameters to 
be included in the filter description. These 
parameters appear as named variables. The 
system wordlength and the coefficient wordlengths 
have been parameterized in the LDI filter design.
This allows filters with different wordlengths to be
easily implemented and tested. The restriction that 
any feedback loop have a latency equal to the 
system wordlength is included. 
A BIT SERIAL LDI RECURSIVE DIGITAL FILTER
L. E. Turner, P. B. Denyer* and D. Renshaw*
Dept. of Electrical Engineering, The University of Calgary, Calgary, Canada
*Integrated Systems Group, Dept. of Electrical Engineering, University of Edinburgh, U.K.
Abstract 
The practical implementation of a bit serial
lossless discrete integrator (LDI)[1] recursive ladder
filter suitable for implementation as a single
integrated circuit is described. The low
coefficient sensitivity and simplicity of the LDI
signal flowgraph make the filter structure suitable
for implementing high quality digital filters. A
bit serial integrated circuit filter has been
designed and simulated using the silicon compiler
system FIRST [2,3].
The LDI filter coefficients which yield a
Chebyshev transfer function characteristic are
found using an exact synthesis method[4]. The
finite precision time domain response of the filter
is simulated using the FIRST simulator and the
magnitude response is verified by calculating the
Fourier transform of the filter unit sample
response. The LDI filter implementation makes
use of an alternate clocking scheme which
simplifies the signal flow graph. This is a form
of multiplexing which is easily implemented using
bit serial arithmetic.
1. LDI Recursive Ladder Filter Design and
Implementation
1.1. Introduction 
The implementation of a high order
recursive LDI digital filter[1] as a single
integrated circuit using parallel arithmetic is
unlikely using currently available integrated circuit
technologies. Digital LDI lowpass ladder filters
have two input signal paths and two output signal
paths for each integrator stage. The large
number of parallel internal interconnections required
makes very poor use of the integrated circuit
area. Each integrator section requires a parallel
multiplier which occupies the majority of the chip
area. These constraints make the integration of a
high order parallel architecture single chip LDI
ladder filter impractical. Integrated LDI digital
ladder filter implementations have been constructed.
However, these implementations use single chip
parallel integrator sections with external
connections and avoid the need for multipliers by
using a multirate clocking scheme.
Bit serial arithmetic implementations of 
recursive digital filters do not have this 
interconnection problem. In addition, the serial 
arithmetic elements are much smaller than the 
equivalent parallel elements. Fifth or higher order 
single chip recursive LDI ladder filters are possible 
using bit serial arithmetic. 
The LDI lowpass ladder filter is known to
exhibit low transfer function sensitivity to the
filter coefficients[1]. This is due to the excellent
low sensitivity characteristics of the resistively
terminated analog ladder filter from which it is
designed. (The analog filter is almost insensitive
to variations in the inductor and capacitor values
providing that the passband insertion loss is almost
zero[5].)
This section describes the implementation of
a bit serial recursive LDI ladder filter using the
silicon compiler system FIRST[2,3]. A single chip
integrated circuit LDI digital filter is designed and
simulated using FIRST.
1.2. LDI Ladder Filter Implementation
The flowgraph of a digital LDI ladder
filter is obtained from the voltage and current
(VI) equations of a resistively terminated inductor,
capacitor, ladder filter.
To implement a digital ladder filter, each 
continuous integrator element is replaced with a 
discrete approximation. This is equivalent to
applying the transformation
$$ s \rightarrow \frac{1}{T}\left(z^{1/2} - z^{-1/2}\right) $$
to the Laplace transform transfer function of the
analog filter ( H(s) ). Replacing z with -1/z in
the transformation above yields the same
transformation. Therefore a digital filter
designed using this transformation will have 
transfer function poles both inside and outside the 
unit circle. This difficulty can be overcome by 
scaling 	the 	transfer 	function. 	Algebraic 
manipulation of the transfer function or flowgraph 
manipulation of the signal flowgraph equivalent to 
impedance scaling the network by exp ( -sT/2) 
yields the scaled LDI given in figure 1.1. The
unrealizable advance operators have been removed 
to ensure that the resulting transfer function is 
stable. 







Figure 3a. Simulated DFT with 32
modules (N=32). Rectangular input
waveform and the output has been
Figure 3b. Simulated DFT with 32 modules for an input of two tones:
fundamental and third harmonic.
Figure 4a. Chip floorplan produced by the FIRST silicon
compiler for the DFT module with coefficient length
of 16 bits and wordlength of 28 bits with complex
multiply having four real multiplies.
Figure 4b. DFT chip floorplan for coefficient length of 16 bits and wordlength 30 bits
for the complex multiply performed using three real multiplies. Figures 4a and 4b are
not to the same scale. This chip has smaller area than the chip in Figure 4a.
Figure 1. The basic DFT module. Latching of
coefficient data is controlled by







































Figure 2. Example of the linear array DFT
for an N=3 point transform. The partial
result entering at the left is zero and the
transform accumulates as these X0, X1 and X2
move through the array.
point DFT with input being two cycles of a
rectangular waveform. This figure shows the delay
through the array between arrival of data and
production of the DFT result.
The floorplan of the chip for this Fourier
transform module is shown in Figure 4. By time
multiplexing the arithmetic hardware in the
modules using multiplexors and additional storage,
the number of chips required could be reduced by
the multiplexing factor for applications which
could afford the reduced rate of data throughput.
CONCLUSION 
This paper has demonstrated the use of a
linear systolic array as a discrete Fourier
transform engine. The number of processor modules
required is proportional to the number of points
in the transform, N. The delay between acceptance
of data and production of the output and the data
storage required are also proportional to N. The
interconnection of modules is simple with data and
the coefficients W_N supplied at one end of the
array and the transform result available at the
other. For simplicity of interconnection and
storage of input data and intermediate results,
this scheme is superior to the area-time efficient
algorithms discussed by Thompson [2].
ACKNOWLEDGEMENTS Part of the work reported here
was performed while one of the authors (G.S.A.)
was a Visiting Scientist at the Research
Laboratory of Electronics at the Massachusetts
Institute of Technology. The FIRST compilation
and simulation was performed in the Department of
Electrical Engineering at the University of
Edinburgh. The FIRST compiler has been developed
under grants to the University of Edinburgh from
the U.K. Science Research Council.
REFERENCES 
[1] H.T. Kung, "Special purpose devices for
signal and image processing: an opportunity in
very large scale integration (VLSI)", SPIE Vol.
241, Real Time Signal Processing III, pp. 74-84,
1980.
[2] C.D. Thompson, "Fourier transforms in VLSI",
IEEE 1980 Conf. on Circuits and Computers, ed.
N.B. Guy Rabbat, Port Chester, N.Y., Oct 1-3, pp.
1046-1051, 1980.
[3] P.B. Denyer, D. Renshaw, and N. Bergmann,
"A silicon compiler for VLSI signal processors",
ESSCIRC '82 Digest of Technical Papers, pp.
215-218, 1982.
[4] P.B. Denyer and D. Renshaw, "Case studies in
VLSI signal processing using a silicon compiler",
ICASSP 83, IEEE International Conference on
Acoustics, Speech and Signal Processing, Paper
20.6, pp. 939-942, 1983.
in reverse order. This can be arranged by passing 
the data through the array and latching 
coefficient data successively in each module. 
The function of each of the modules in the 
array is described by Figure 1. The partial
result, the W factor and the synchronizing signal
move through the array with the same delay in each
module. This delay time is the time to perform
the complex multiply and add operations. This is
referred to here as the module time. The timing
requires that each coefficient be delayed an extra
module time in each module. In this way x(N-1) is
latched into the first module one module time
before x(N-2) is latched into the second module
and so on until x(0) is latched into the last
module.
The timing diagram for a three point (N=3)
DFT (Figure 2) illustrates the flow of data and
results through the array. The data, W factors
and partial results contained in registers in each
of the modules (across the diagram) are shown at
successive times (down the diagram). The instant
of latching of data into the appropriate register
in the modules is indicated by an asterisk. Data
are shown moving through the array while the
previous batch of DFT values are being computed.
Following the arrival of the last datum of
a batch, x(N-1), into the first module, the
computation of the DFT for that batch of data
proceeds while data of the next batch move through
the array. The first output for a batch, X(0), is
obtained N module times after the arrival of the
last datum for the batch, x(N-1), and this is 2N
module times after the arrival of the first datum
x(0). The last output for the batch, X(N-1), is
available N-1 module times after X(0).
The linear array has the useful feature that 
the data are stored within the array. This makes 
direct comparison in an area-time efficiency sense 
with the algorithms described by Thompson [2]
difficult since with this algorithm data storage 
is included whereas with most of the algorithms 
considered by Thompson data storage is not 
included. 
The linear array also has the desirable 
feature that data enter the array at the same rate 
as the results are produced. This means that the 
transform system can be interposed between a data 
sampling system and the recording system to 
produce a discrete Fourier transform of the data, 
the only penalty being a delay equal to twice the 
transform length N.
The development to this stage does not depend 
on the nature of the arithmetic to be used. Since 
bit serial arithmetic for this array leads to 
fewer interconnections between modules than for a 
bit parallel implementation and since it is
possible to design the complex multiply and add
operations for bit serial arithmetic on a single
chip for 5 or 6 micron VLSI, the algorithm was
implemented in serial arithmetic using the FIRST
silicon compiler [3].
The module consists mainly of a complex 
multiply and accumulate operation. While it is 
possible to configure the complex multiply using 
only three real multiplies, such a scheme leads to 
a longer wordlength in both the serial and 
parallel implementations because of the extra 
addition and subtractions which become necessary. 
The saving in the use of three multipliers instead 
of four is partly lost in the larger multipliers 
required for the larger wordlength in addition to 
a time penalty for the arithmetic operations. 
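The trade-off between four and three real multiplies can be seen in a small Python sketch. It uses one standard three-multiply formulation of the complex product; the paper does not say which variant was used in the DFT module, so the particular arrangement and the function names below are assumptions for illustration.

```python
def cmul4(a, b, c, d):
    """(a + jb)(c + jd) using four real multiplies and two add/subtracts."""
    return a * c - b * d, a * d + b * c


def cmul3(a, b, c, d):
    """(a + jb)(c + jd) using three real multiplies and five add/subtracts."""
    k1 = c * (a + b)      # pre-addition: operand is one bit wider
    k2 = a * (d - c)      # pre-subtraction on the coefficient side
    k3 = b * (c + d)      # pre-addition on the coefficient side
    return k1 - k3, k1 + k2


print(cmul4(3, 4, 5, 6))
print(cmul3(3, 4, 5, 6))
```

Both functions give the same product, but the three-multiply form adds or subtracts operands before multiplying. Those sums can be one bit wider than the original operands, which is the source of the longer wordlength mentioned above.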
IMPLEMENTATION AND SIMULATION USING FIRST
The FIRST silicon compiler is a system of
computer programs which enables a designer to
specify a signal processing task as a set of
primitive operations, simulate the task and obtain
a 6 micron NMOS VLSI layout of the chips in the
system [4]. For the linear array DFT modules the
primitives used were serial multiply, add,
subtract and delay together with a specification
of the control signals.
For example, the description of the complex 
multiply is as follows: 
OPERATOR XMULT [coeff] (cin -> cout) rd, id, rc, ic
-> rout, iout
SIGNAL r1, r2, i1, i2
CONTROL c1
MULTIPLY [1,coeff,0,0] (cin -> c1) rd, rc -> r1, NC
MULTIPLY [1,coeff,0,0] (cin -> NC) id, ic -> r2, NC
MULTIPLY [1,coeff,0,0] (cin -> NC) rd, ic -> i1, NC
MULTIPLY [1,coeff,0,0] (cin -> NC) id, rc -> i2, NC
SUBTRACT [1,0,0,0] (c1) r1, r2, GND -> rout, NC
ADD [1,0,0,0] (c1) i1, i2, GND -> iout, NC
CBITDELAY [1] (c1 -> cout)
END
where the parameters in round brackets ( ) are 
control lines, the parameters in square brackets 
[] are functional parameters and signal lines are 
rd, id, rc and ic representing the real and
imaginary parts of the multiplier and coefficient, 
and rout and iout representing the real and 
imaginary parts of the product. 
By use of control primitives to effect 
multiplexing of data lines, bit delays and 
specification of connections to pads of the chip, 
the chip representing the module in Figure 1 was 
described in approximately fifty lines of code 
similar to that given above. For the simulation, 
the modules were cascaded so that the output 
resulting from sequences of input data could be 
obtained. A typical output is shown in Figure 3 
which is the magnitude of the simulated thirty-two 
A BIT SERIAL LINEAR ARRAY DFT 
Gregory N. Allen†, Peter B. Denyer* and David Renshaw*
†Department of Electrical and Electronic Engineering,
James Cook University of North Queensland, Australia 4811.
*Department of Electrical Engineering, University of Edinburgh,
The King's Buildings, Mayfield Road, Edinburgh, EH9 3JL.
ABSTRACT 
A linear array which computes the DFT in a
pipelined fashion is described. The algorithm is
derived from the batch processing array proposed
by H.T. Kung [1] but has been modified to allow
continuous operation. This computation of the DFT
is a complex polynomial evaluation on the unit
circle using Horner's method having the data for
the polynomial coefficients. Data and the N-th
complex roots of unity are input at one end of the
array and the DFT sequence is output from the
other. The polynomial coefficients are stored in
successive modules in the array and a new batch is
latched successively with a synchronising signal.
In its simplest form the design has a single
system part which is replicated N times for an
N-point transform. For time multiplexed modules
the system throughput and hardware can be
optimised for given applications. A bit serial
layout for 6 micron NMOS VLSI has been designed
and simulated using the FIRST silicon compiler at
the University of Edinburgh.
INTRODUCTION 
Fourier transform hardware is widely used in 
signal processing. From area-time considerations, 
the fast Fourier transform algorithms have enjoyed 
a major role (see for example Thompson [2]),
especially with sequential processors. This paper
examines a different point of view and considers a
direct implementation of the discrete Fourier
transform (DFT) which is not area efficient but
has very simple implementation using a cascade of
identical parts. This work is an extension of the
systolic array DFT structure of Kung [1] which
included a loading phase and a computational 
phase. The extension is the prescription of a 
system which loads and computes simultaneously. 
The data are input at one end of the array and 
delayed Fourier transformed data emerge from the 
other end in natural order. 
THE DISCRETE FOURIER TRANSFORM ALGORITHM
The N point DFT of a data sequence x(n),
n=0,1,2,...,N-1 is computed from the equation (1)
$$ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn} \qquad (1) $$
for k=0,1,2,...,N-1 with $W_N = \exp(-j2\pi/N)$.
This equation can be rearranged using Horner's
method to obtain
$$ X(k) = \bigl(\cdots\bigl(x(N-1)\,W_N^{k} + x(N-2)\bigr)W_N^{k} + \cdots + x(1)\bigr)W_N^{k} + x(0) \qquad (2) $$
Kung [1] has shown how to perform this computation
using a linear systolic array of processors. The
DFT values X(k) are computed sequentially by
passing the kth partial result together with the
complex value W_N^k into a computational module
containing the coefficient x(j), multiplying the
kth partial result by W_N^k and adding x(j) to
produce the next kth partial result. When the
initial kth partial result entering the array is
set at zero and the complex values W_N^k are
supplied in the sequence W_N^0, W_N^1, W_N^2, ..., W_N^(N-1),
the result emerging from the linear array will be
X(0), X(1), ..., X(N-1) provided that the
coefficients x(j) in the modules are in the order
x(N-1), x(N-2), x(N-3), ..., x(1), x(0).
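The Horner form of the DFT can be checked numerically. The Python sketch below is a software model of the arithmetic only, not of the systolic timing: the function names are hypothetical, and the inner loop plays the role of the chain of modules, each multiplying the incoming partial result by W_N^k and adding its stored datum, starting from x(N-1).

```python
import cmath


def dft_direct(x):
    """Direct evaluation of equation (1)."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]


def dft_horner(x):
    """Evaluation of equation (2): one pass through the module chain per k."""
    n_points = len(x)
    result = []
    for k in range(n_points):
        w = cmath.exp(-2j * cmath.pi * k / n_points)   # W_N^k
        partial = 0j                                   # zero enters the array
        for datum in reversed(x):                      # modules hold x(N-1) ... x(0)
            partial = partial * w + datum
        result.append(partial)
    return result


data = [1.0, 2.0, 0.5, -1.0]
for a, b in zip(dft_direct(data), dft_horner(data)):
    print(abs(a - b) < 1e-9)
```

The two evaluations agree to within rounding error, which illustrates why each module needs only one complex multiply and one complex add.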
The data entering the linear array are 
provided x(0) first, x(l) next, and so on until 
x( N-1) arrives. However the data coefficients to 
be stored starting with the first module should be 
Proc IEEE ICASSP'84, San Diego, March 1984. 
In the TTL hardware word growth was accommodated 
by hard limiting in the arithmetic unit. 	It was 
not feasible to incorporate this at each calculation 
in the integrated design, and as the likely word 
growth in the lattice was not known in advance, it 
was assumed that the signals were of Gaussian
distribution and room was provided accordingly for 
word growth. 
The probabilities of an overflow occurring, and 
thus the room allowed for word growth, were set 
according to the penalties which overflow would 
incur. 	Too great an input signal sample voltage 
would be clipped by the input ADC, causing signal 
distortion. The probability for this was set to
1 in 100. The possibility of overflow which
causes wrap-around within the filter signal path and
results in a burst of errors during equalisation,
was held below 1 in 10^7. As the system can not recover
from a wrap-around in the modified stepsize recur-
sion, it was prevented completely using limiters. 
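The relationship between an overflow probability and the headroom it implies can be sketched as follows, assuming (as the text does) a Gaussian signal distribution. The Python below is an illustration only: it assumes a zero-mean signal, a two-sided per-sample overflow probability, and a hypothetical function name; the source does not state how the authors converted these probabilities into word growth.

```python
import math
from statistics import NormalDist


def headroom_sigmas(p_overflow: float) -> float:
    """Magnitude threshold, in standard deviations, exceeded with probability p_overflow."""
    # Two-sided: half of the probability lies in each tail.
    return NormalDist().inv_cdf(1.0 - p_overflow / 2.0)


clip = headroom_sigmas(1e-2)    # ADC clipping, 1 in 100
wrap = headroom_sigmas(1e-7)    # wrap-around, 1 in 10^7
extra_bits = math.log2(wrap / clip)
print(round(clip, 2), round(wrap, 2), round(extra_bits, 2))
```

Under these assumptions the clipping level sits near 2.6 standard deviations and the wrap-around level near 5.3, so moving from the first probability to the second costs only about one extra bit of headroom relative to the same signal level.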
A comparison with the CORDIC integrated lattice 
filter design (8) developed for speech analysis, 
which is based on 16 bit coordinate rotation 
arithmetic shows that the latter uses an area 
smaller than our chip set. 	The CORDIC normalised 
lattice approach has a different arithmetic 
capability to our FIRST approach, avoiding the
need for stepsize recursion. 
In order to realistically apply CORDIC techniques, 
even to 8 kHz sample rate signals, one has to 
employ parallel processing (8) to minimise the 
internal clock rate and permit multiplexing. 	A 
12 stage vocoder was reported (8) as requiring a 
two chip lattice cascade, operating at an 11.4 MHz 
internal clock rate, plus further external control 
circuits. 	In comparison our 5 chip bit serial 
approach (9) offers 16 lattice stages with enhanced 
22 kHz sample rate for a conservative 8 MHz 
internal clock rate. 	Progress to the reduced 
feature dimensions of VLSI will reduce the overall 
chip count in both these approaches. 
Application as Adaptive Equaliser 
A study was also made of extending our integrated 
lattice filter design to the gradient equaliser 
of (5). 	This would require the addition of a 
side-tap structure and a recursion to set the 
tap weight values. 	Since the G weight stepsize 
is normally double the K weight stepsize, the 
same recursion would suffice for both, resulting 
in a requirement for only two additional chips. 
However decision feedback would provide a 
serious bottleneck for such a highly concurrent 
system. 	The error signal for the first stage 
can only be calculated after the output from 
the final stage has been obtained. 	However 
the succeeding input sample cannot be processed 
until the new tap value is calculated from the 
previous error. 	This problem may be solved 
by allowing idle cycles in the structure or by 
updating the taps one sample later, but neither 
of these solutions is particularly elegant. 
A comparison was also made with the exact least 
squares lattice reported in (9). 	The extra 
operations include: a second PARCOR coefficient 
per stage; an extra multiplication and 
division per PARCOR recursion; an extra lattice 
structure for calculation of PARCOR normalising 
factors (c); an extra recursion for the log 
likelihood variable. 	Using the rules developed 
for the prediction-error structure, this would 
require an additional 4-5 chips, raising the 
chip count for an exact least squares equaliser 
to about 10 chips, if implemented with our 5 µm
NMOS LSI bit-serial arithmetic approach.
CONCLUSIONS 
We have discussed the lattice filter as one of many 
orthogonalising transforms used to maintain a 
constant rate of convergence for a linear combiner 
structure. 	We have discussed the relative 
advantages of lattice and transversal adaptive 
filters, showing that the gradient lattice is by no 
means ideal for all applications. 	Prototype TTL 
hardware has been described, which was constructed 
to investigate aspects of fixed µ gradient lattice
implementation. 	A lattice prediction error filter 
chip set has been designed and compared with the 
CORDIC approach, and the former's development into
variable µ gradient and exact least squares lattice
equalisers has been discussed.
ACKNOWLEDGEMENT 
The sponsorship of the British Science and 
Engineering Research Council is gratefully 
acknowledged. 
REFERENCES 
(1) Widrow, B. et al, Proc. IEEE, 1975, 63, pp.1692-1716.
(2) Gersho, A., BSTJ, Vol.48, Jan. 1969, pp.55-70.
(3) Reed, F.A. et al, IEEE Trans. ASSP-29, No.3,
pp.770-775, June 1981.
(4) Griffiths, L.J., ICASSP 1978, pp.87-90.
(5) Satorius, E.H. and Alexander, S.T., IEEE Trans.
COM-27, No.6, June 1979, pp.899-905.
(6) Ungerboeck, G., IBM J. Res. Dev., 16, No.6,
Nov. 1972, pp.546-555.
(7) Denyer, P.B. et al, "Case Studies in VLSI Signal
Processing Using a Silicon Compiler", ICASSP 83.
(8) Ahmed, H.M. et al, ICASSP 1981, pp.641-653.
(9) Satorius, E.H. and Pack, J.D., IEEE Trans. COM-29,
No.2, pp.136-142, February 1981.
Figure 3. Discrete TTL GAL equaliser schematic.
With an arithmetic unit cycle-time of 2.2 MHz, eight 
operations per stage and sixteen stages, the signal 
sample rate is 17 kHz providing a usable bandwidth 
of 5-8 kHz. Approximately 160 ICs are incorporated
in the equaliser. The component count was
not minimised as the aim was for flexibility rather 
than economy. 
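The quoted sample rate follows directly from time-sharing one arithmetic unit across all operations of all stages. A minimal sketch of that arithmetic, with illustrative function and parameter names:

```python
def sample_rate_hz(cycle_rate_hz: float, ops_per_stage: int, stages: int) -> float:
    """Signal sample rate when one arithmetic unit serves every stage in turn."""
    return cycle_rate_hz / (ops_per_stage * stages)


# Figures quoted for the TTL prototype: 2.2 MHz, 8 operations, 16 stages.
rate = sample_rate_hz(2.2e6, ops_per_stage=8, stages=16)
print(f"sample rate = {rate / 1e3:.1f} kHz, Nyquist limit = {rate / 2e3:.1f} kHz")
```

The Nyquist limit printed here is an upper bound; the usable bandwidth quoted in the text is lower.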
Figure 4 shows the convergence of the 16 stage 
equaliser onto a real raised-cosine channel whose
impulse response is 0.27 + z^-1 + 0.27 z^-2, providing
an eigenvalue ratio of 11 (5). 	The data is 
convolved with this channel and then input to the 
equaliser's input, figure 1, while the d' port
receives a synchronised undistorted binary training 
signal. 	The y output shows the emergence of the 
equalised signal as the filter converges, with 
adaption started j ms after the start of the trace. 
The error output shows this reducing as convergence 
is achieved. The 4 ms convergence time of this
fixed µ GAL equaliser is similar to that achieved
in an adaptive transversal equaliser. Faster
convergence is obtained by incorporating the
variable µ calculation for each stage (5) as
described later under the applications of the 
following integrated GAL PE filter design. 
GRADIENT LATTICE CHIP SET 
We have designed an adaptive stepsize GAL PE filter 
based on a set of LSI chips implemented with the 
fast implementation of real time signal transform 
(FIRST) silicon compiler (7). This is a set of
software tools based on bit-serial arithmetic which 
permitted the rapid realisation and checking of an 
integrated lattice filter design. 	The filter was 
partitioned into 5 distinct chip designs: lattice
stage, PARCOR recursion, stepsize (µ) recursion (2
chips) and input and output multiplexing for a 16
stage filter. Figure 5 shows the floor plan of the
lattice chip, which at 25 mm² was the largest











Figure 4. 	Shows adaption of GAL equaliser on a 
synthetic channel.  
Figure 5. Floor plan of lattice stage implemented
as a 5 mm x 5 mm NMOS integrated
circuit with 5 µm feature size.
The stepsize recursion (5), if implemented directly,
would have required 1 chip of similar area to the 
PARCOR recursion followed by a second operation 
which implemented the inversion. 	The FIRST system 
does not include the division operation necessary 
for the inversion. As a result the stepsize
recursion itself was inverted using a
Taylor expansion of the form:
C_n(t) = f_n(t)^2 + b_n(t)^2    (5)
µ_n(t+1) = µ_n(t) . (1 + A - µ_n(t).C_n(t))    (6)
where C_n(t) is the channel power estimate and A is a
scaling constant. 	This took only I of the total 
50 sq mm maximum area permitted for the pair of 
chips. 
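The division-free form of equation (6) can be compared numerically with a recursion that does use division. The sketch below assumes the undivided recursion is 1/µ(t+1) = (1 - A)/µ(t) + C(t), which is not stated in the text but whose first-order expansion gives equation (6); the per-stage subscript is dropped and a constant power estimate C is used for simplicity.

```python
def mu_exact(mu: float, c: float, a: float) -> float:
    """Assumed exact recursion: 1/mu' = (1 - a)/mu + c (needs a division)."""
    return 1.0 / ((1.0 - a) / mu + c)


def mu_taylor(mu: float, c: float, a: float) -> float:
    """First-order expansion, as in equation (6): multiply and add only."""
    return mu * (1.0 + a - mu * c)


def iterate(step, mu0: float, c: float, a: float, n: int) -> float:
    """Apply one of the recursions n times with a constant power estimate c."""
    mu = mu0
    for _ in range(n):
        mu = step(mu, c, a)
    return mu


A, C = 0.05, 2.0  # hypothetical scaling constant and channel power
print(iterate(mu_exact, 1e-3, C, A, 2000), iterate(mu_taylor, 1e-3, C, A, 2000))
```

Both forms settle at µ = A/C for constant C, which is why the approximation can stand in for the inversion when no divider is available.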
Figure 1. 	Adaptive lattice equaliser. 
adds appreciable algorithm noise. Figure 2 shows
equaliser error power (MSE) for various convergence 
factors. 	As the stepsize becomes smaller, both 
lattice and transversal structures approach the 
Wiener solution. However, as the stepsize is
increased to gain faster convergence, the lattice 
algorithm noise grows more rapidly. 	Thus although 
the lattice has more reliable convergence properties 
than the transversal equaliser, it is unsuitable 
for applications requiring long equalisers or low 
converged errors. 
Figure 2. 	Converged error-vs-stepsize for constant 
lattice and transversal equalisers. 
Ungerboeck (6) has shown that the error-signal power 
level (MSE) of an LMS algorithm in the steady state 
is a function of the Wiener error (e^2), the number
of taps (N), the stepsize (µ) and the signal input power
level (S^2).
MSE = 2 E<e^2> / (2 - µ.N.E<S^2>)    (4)
where E <.> is the expectation operator. 
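Equation (4) is easy to evaluate for a given design. The sketch below reads the garbled denominator as 2 - µ.N.E<S^2>, which is an inference from the surrounding text, and uses illustrative parameter values.

```python
def steady_state_mse(wiener_error: float, n_taps: int,
                     stepsize: float, input_power: float) -> float:
    """Steady-state LMS error power from equation (4).

    wiener_error is E<e^2>, input_power is E<S^2>.
    """
    denom = 2.0 - stepsize * n_taps * input_power
    if denom <= 0.0:
        raise ValueError("stepsize too large: the expression no longer applies")
    return 2.0 * wiener_error / denom


# Hypothetical values: 16 taps, unit input power, Wiener error of 0.01.
for mu in (0.0, 0.01, 0.05, 0.1):
    print(mu, steady_state_mse(0.01, 16, mu, 1.0))
```

As the stepsize tends to zero the result tends to the Wiener error, and it grows as the stepsize is raised, which matches the trend described for Figure 2.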
In transversal filters, where the optimum error is 
low, the algorithm noise is held to a minimum. 	In 
the lattice orthogonalising structure, however, the 
prediction error power output from each stage is 
dependent on the spectral characteristics of the 
input signal. 	For very low eigenvalue ratios it 
hardly drops at all. 	This leads to large 
fluctuations in the PARCOR 	values, which in 
turn leads to a high algorithm noise on the output 
of the stage. 	The algorithm noise is then summed 
as it passes through successive stages in the PE 
filter. 	For the complex equalisation which is 
required in a high-speed modem, the increase in the 
degrees of freedom in the adaptation algorithm also
raises the noise by about 3 dB per channel.
Our conclusions on the relative applicability of
adaptive lattice and transversal filters are as 
follows: 
If the channel eigenvalue ratio is unpredictable and a
reliable rate of convergence is required then the lattice
is the preferred approach. However lattice filters
can only be applied in short equalisers where the 
algorithm noise is low or in applications where the 
background noise is so high that it masks the filter 
noise. 	The equalisation of telephone channels, 
where the input is often noisy, appears to be 
appropriate to either technique, and the final 
selection is dependent on the precise system 
specification. 
TTL PROTOTYPE EQUALISER 
We have developed prototype hardware to demonstrate 
the operation of a fixed µ gradient lattice filter
with real signal inputs. This is based on a 12-bit
TRW multiplier, the MPY-12HJ. The hardware was
designed by partitioning the algorithm into eight
multiply-and-add or multiply-and-subtract operations
per stage. 	Twelve-bit arithmetic was selected as 
this represented the maximum practical working 
accuracy, commensurate with component availability. 
24-bit accuracy was selected for the coefficient 
weight storage to increase recursion accuracy. 
The arithmetic unit, figure 3, was designed as a 
multiply-and-add structure with hard limiting on 
overflow. 	Four twelve-bit busses are used for 
maximum throughput - three input and one output 
with two of them extended to 24 bits during 
adaptive recursion. 	The memory unit was designed 
to interface with the arithmetic unit via the bus 
structure. 	The backward-channel signal samples 
and filter coefficients which had to be retained 
and operated on repeatedly were stored in Schottky 
RAM. 	Other variables, such as the error and 
forward signal samples, which rippled through the 
filter, were kept in latching registers. 
DESIGN AND REALISATION OF ADAPTIVE LATTICE FILTERS
M.J. Rutter, P.M. Grant, D. Renshaw and P.B. Denyer
Department of Electrical Engineering 




This paper compares the relative advantages of dis-
crete and integrated circuit transversal and lattice 
adaptive filter designs. 	It discusses the trade- 
offs between the traditional transversal filter and 
the gradient adaptive lattice (GAL), showing that 
the reliable rate of convergence of the lattice is 
offset by a greater complexity and algorithm noise. 
Two 16-stage hardware implementations are described. 
One is a TTL GAL equaliser based on a single 12-bit 
parallel multiplier. 	The second is a suite of 
5 custom NMOS bit-serial arithmetic LSI chips,
which together make a lattice prediction-error 
filter. 	Both implementations offer 12-bit 
precisions and bandwidths of over 8 kHz. This
shows that the increased complexity of the lattice 
approach can easily be accommodated in custom VLSI 
circuit designs. 
INTRODUCTION 
Data modems are used to transfer data at high 
speeds over analog telephone channels of limited 
bandwidth. 	The precise frequency response of the 
channel is seldom known in advance of the connection 
and hence equalisation is needed before data can be 
accurately decoded. 	Equalisation is initially 
performed prior to data transmission by sending a 
pre-arranged training sequence, and adjusting the 
equaliser tap weights to minimise intersymbol 
interference. 	Data is then transmitted and the 
equaliser operates in a decision directed mode. 
Thus minimisation of the training period with a 
rapidly converging equaliser increases modem 
efficiency or throughput rate. 
The most common method of realizing an adaptive 
equalizer is to use a finite impulse response (FIR)
transversal filter, where the tap-weights are 
calculated by the least mean squares (LMS) adaptive 
algorithm (1). 	Unfortunately, the convergence rate 
of the transversal equaliser is restricted by tap-
weight interactions which arise from the correlation 
of signal components, and channels in most need of 
equalisation are the slowest to converge (2). 
Various transforms, such as Fourier (3) and Walsh, 
may be applied to the input signal in order to 
reduce the correlation between components and thus 
maintain equaliser convergence rate. 	This paper 
discusses the use of the gradient lattice 
structure (5) as an alternative orthogonalising 
adaptive equalizer and demonstrates lattice filter 
hardware and a VLSI chip set designed for this 
application. 
LATTICE EQUALISER 
The lattice structure (4), shown in the lower part of 
figure 1, is a decorrelating transform, based on 
a family of prediction error (PE) filters. 	A PE 
filter alone may be used to equalize a signal with 
only pre or post echoes. 	However, we exploit the 
property of the PE filter that the backward channel 
outputs, b(t), are mutually orthogonal, and connect 
these outputs to a linear combining structure to
implement an adaptive lattice equalizer (5). This
alone would speed up convergence if implemented with
an LMS algorithm to select the combiner weights.
However, the mutual orthogonality of the signals
may be further used to distribute the algorithm
into many one-tap adaptive filters. 	This results 
in an economical distributed orthogonalising 
structure feeding a distributed linear combiner, 
figure 1. 
The LMS adaptive equations which control the lattice
filter PARCOR values, K_n(t), and combiner weights,
G_n(t), for stage n at time instant t are:
G_n(0) = K_n(0) = 0    (1)
G_n(t+1) = G_n(t) + 2µ.e_n(t).b_n(t)    (2)
K_n(t+1) = K_n(t) + µ.(f_n(t).b_{n+1}(t) + b_n(t-1).f_{n+1}(t))    (3)
The step size µ may be fixed, or varied optimally
according to the selected algorithm (5).
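A fixed-stepsize PARCOR update of the kind in equation (3) can be sketched in a few lines. The lattice recursions themselves are not written out in this section, so the code assumes the common form f_{n+1}(t) = f_n(t) - K_n b_n(t-1) and b_{n+1}(t) = b_n(t-1) - K_n f_n(t); the test signal, stepsize and function names are illustrative, and the combiner weights G_n are omitted.

```python
import random


def gal_filter(samples, n_stages: int, mu: float):
    """Run a fixed-stepsize gradient lattice PE filter; return final PARCOR values."""
    k = [0.0] * n_stages               # K_n(0) = 0, as in equation (1)
    b_prev = [0.0] * (n_stages + 1)    # b_n(t-1) for every stage
    for x in samples:
        f = [x] + [0.0] * n_stages     # f_0(t) = x(t)
        b = [x] + [0.0] * n_stages     # b_0(t) = x(t)
        for n in range(n_stages):
            # Assumed lattice recursions (not given explicitly in the text).
            f[n + 1] = f[n] - k[n] * b_prev[n]
            b[n + 1] = b_prev[n] - k[n] * f[n]
            # PARCOR update in the form of equation (3).
            k[n] += mu * (f[n] * b[n + 1] + b_prev[n] * f[n + 1])
        b_prev = b
    return k


def ar1(a: float, count: int, seed: int = 0):
    """First-order autoregressive test signal x(t) = a.x(t-1) + w(t)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(count):
        x = a * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


k = gal_filter(ar1(0.5, 20000), n_stages=2, mu=0.0005)
print(k)
```

For this test signal the first PARCOR value should settle near the autoregressive coefficient and the second near zero, since one stage is enough to whiten a first-order process.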
APPLICATIONS OF THE GRADIENT LATTICE 
In the GAL filter the prediction error filters in 
the lattice structure must converge first to 
provide the orthogonal output samples which feed 
the distributed linear combiner. This requirement
for a fast converging PE filter structure
Paper No. 205 prepared for IEEE ICASSP Boston April 1983. 
Figure 10: A multichip system in 'run' mode. 
diagnostic accuracy determined only by the 
ability to decode different signatures. 	If a 
small increase in chip complexity can be 
tolerated, each chip can be capable of 
recognising 'good-chip', 'good subsystem', 
'good system' (etc.) signatures, whereupon the
location of a fault can be accurately 
pinpointed. 
The chip test is essentially an off-line
test involving an explicit test phase of
0.25 ms. So short is the test time, however,
that it may be possible in certain
applications to perform a pseudo-online
self-test by 'cycle-stealing'.
We have also created a regime within 
which fault-tolerant systems may be designed. 
If redundant elements are included in a 
system, a voting arrangement based on the 
results of an offline or pseudo-online self-
test can replace faulty elements by 
automatically switching in the redundant 
parts. Looking to the future, when systems 
on-a-wafer [17] become more of a reality, it 
will be possible to use a similar redundancy 
arrangement to permanently configure a wafer
with extra processing elements to utilise 
only devices which pass a self-test. 
5 Conclusions 
We have developed and demonstrated a 
methodology and design approach for the 
inclusion of self-verification in bit serial
signal processing chips. 	Fault coverage can 
be as high as necessary, is determined 
without fault simulation, and overheads are 
extremely low. 
The test development is carried out 
within the unified framework of a silicon 
compiler, which streamlines and ensures the 
integrity of the design process. The 
resultant chips can become part of a fault-
tolerant and/or hierarchically self-testing 
multichip system, within which on-chip and 
interconnect faults can be detected, 
pinpointed and dealt with. 
References 
[1] R. M. Sedmak, "Design for Self-Verification:
An Approach for Dealing with Testability
Problems in VLSI Designs", Proc. IEEE Test
Conference, 112-120 (1979).
[2] R. M. Sedmak, "Implementation Techniques
for Self-Verification", Proc. IEEE Test
Conference, 267-278 (1980).
[3] B. Konemann, J. Mucha and G. Zwiehoff,
"Built-In Logic Block Observation
Techniques", Proc. IEEE Test Conference,
37-41 (1979).
[4] D. R. Resnick, "Testability and
Maintainability with a New 6K Gate Array",
VLSI Design, Vol 4 No 2 (1983).
[5] R. A. Frohwerk, "Signature Analysis, a New
Digital Field Service Method", Hewlett-Packard
Journal, May 1977.
[6] S. W. Golomb, "Shift Register Sequences",
Pub Holden-Day Inc. San Francisco (1967).
[7] E. J. McCluskey and S. Bozorgui-Nesbat,
"Design for Autonomous Test", Proc. IEEE
Test Conference, 15-21 (1980).
[8] J. E. Smith, "Measures of the Effectiveness
of Fault Signature Analysis", IEEE Trans.
Comput. C-29, 510-514 (1980).
[9] P. B. Denyer, D. Renshaw and N. Bergmann,
"A Silicon Compiler for VLSI Signal
Processors", Proc. 8th European Solid-State
Circuits Conference, 215-218, Brussels
(1982).
[10] P. B. Denyer and D. Renshaw, "Case
Studies in VLSI Signal Processing using a
Silicon Compiler", Proc. Int. Conf. on
Acoustics, Speech and Signal Processing,
Boston, 939-942 (1983).
[11] N. Bergmann, "A Case Study of the FIRST
Silicon Compiler", Proc. 3rd Caltech
Conference on VLSI, 473-430 (1983).
[12] V. D. Agrawal, "Sampling Techniques for
Determining Fault Coverage in LSI Circuits",
Journal of Digital Systems 5, 189-202
(1981).
[13] J. Savir, "Random Pattern Testability",
13th Int. Symposium on Fault-Tolerant
Computing, Milan (to be published, 1983).
[14] R. F. Lyon, "Two's Complement Pipeline
Multipliers", IEEE Trans. Communications
24, 418-425 (1976).
[15] D. Renshaw, Edinburgh University
Internal Report (1983).
[16] C. Mead and L. Conway, "Introduction to
VLSI Systems", Addison-Wesley (1980).
[17] S. L. Garverick and E. A. Pierce, "A Single
Wafer 16-Point 16-MHz FFT Processor", Proc.
IEEE Custom Integrated Circuits
Conference, 244-248 (1983).
Figure 9: 	Floorplan of self-testing design (64-point cascadable finite impulse response (FIR) 
filter). 
increased by around 4%. This might be 
expected to decrease the yield by about 1%. 
Each PRBS/FRODO register will have a 
worst-case power consumption (from analogue 
SPICE simulation) of 5mW, and about 8mW for 
the timing circuitry. The overall power 
overhead for the trial design is thus under 
60mW. With an estimated overall chip power 
consumption of 700 mW this represents a 9% 
overhead. This could be reduced at the 
expense of speed performance or by powering 
down the test circuitry in run mode, but it 
should be emphasised that this is a worst-
case figure. 
If the complete system does not have to 
be self-testing, no increase in system 
complexity is necessary, as the chip self-
test may be initiated manually. In the more
likely event that some degree of system self-
test is required, a controller chip or module 
will be required, to define the mode of 
operation (test or run) and to act upon the 
results of the self-test. 	The details of 
this controller will depend upon the way in 
which the result of the test is to be used. 
This topic will be pursued in section 4. 
4 System Implications 
In section 3 we described a methodology  
for including self-test, at the chip level, in 
bit serial processors. There are clearly 
several stages in the device's lifetime at
which the ability to verify its own fault-free 
status is valuable. We shall discuss only the 
implications as far as the system designer is 
concerned. 
It is possible to envisage at least
three modes of operation within a multichip 
system. 	In figure 10 we illustrate the 
configuration in which such a system will 
perform its normal processing function. 
Switching all the chips to 'test' mode allows 
a chip-test to be performed. Less obviously, 
switching the first chip in the pipeline to 
'test' mode and all the others to 'run' mode
configures a system test, within which 
pseudorandom patterns propagate through the 
entire chain of processors and their 
interconnects to yield a 'good system' 
signature in the final FRODO register. This
is a very powerful technique, testing not only 
the processor's correctness but the integrity 
of the interconnect, which generally accounts 
for a high percentage of failures. Flushing 
out of the system is not necessary if the chip 
test is performed first to flush out the 
individual chips. 
This two-level test could be expanded to 
test many levels in a system's hierarchy, with 
In designing the test length, therefore, we 
need only consider the most random pattern 
resistant operator in the processor. This 
can be identified by studying a series of 
graphs such as figure 7 taken from full fault 
simulations of each operator, which show the 
clock cycle in which the single stuck faults 
actually appear at the operators' output. 
Although this resembles closely the 
activation curves of figure 6, it actually
represents a much more significant result
(both computationally and in content). From
the graph it can again be seen that the 
sequence length chosen (1023 bits) ensures 
100% coverage of single-stuck-faults in the 
multiplier and thus a similar degree of 
coverage for the less random pattern 
resistant elements. It can also be seen that 
considerably shorter sequences could be used 
(say 255 bits) if only 98.5% confidence is 
required. We have thus reduced the problem 
of test pattern design to that of assessing, 
from the testability characteristic graphs 
for the operators in the processor (c.f.
figure 7), the length of PRBS needed to 
provide any desired degree of coverage. The 
testability issue can therefore be viewed at 
a system level, in a manner consistent with 
the steps involved in arriving at an operator
network and flow diagram. The testability of 
each operator need only be measured once by 
full simulation to produce its testability 
graph, to which the system designer may then 
refer. 
3.4 	A Trial Implementation 
The particular processing function 
chosen for a trial implementation is not
important except that it should be
characteristic of the class of bit-serial 
operators. The 64-point cascadable FIR 
filter section chosen is typical of the genre 
of functions for which the bit serial 
architecture is ideal and at which the FIRST 
compiler is aimed. 
From figure 3 it is clear that
the fixed floorplan of a FIRST chip left a 
significant area of unused silicon in the pad 
channel. 	It is a characteristic (indeed one 
of the most attractive characteristics) of 
bit serial elements that the communication 
overhead is low and that the gate/pin ratio 
is consequently high. 	It is likely, 
therefore, that a less fixed floorplan would 
still result in unused silicon in the 
periphery of a bit serial signal processing 
element. We have therefore designed PRBS and 
FRODO elements and the timing logic in such a
way that it can be placed in the pad channel 
of the FIRST chip. The timing circuit forms 
an extra entity to be inserted, but the PRBS 
and FRODO registers can be incorporated with
minimal difficulty by simply redefining the 
signal input and output pads. We have
designed all these elements using λ design
rules [16], with λ = 2.5 µm, and we present
the layout of a FRODO register, complete with
signature comparator, in figure 8. This is a
conservative design approach and the use of
real design rules and a more competitive
process would result in much smaller circuit
elements. Nevertheless, the entire test 
circuitry can be placed in the pad channel of 
our trial design with ease (figure 9). 	It 
should be noted that the chip size of the 
design has not been increased at all by the 
self-test circuitry.
Figure 8: Layout of FRODO register.
We have shown that very high levels of 
fault coverage, and thence reliability, can be 
achieved. It remains to us now to count the 
cost. 
3.5 Cost 
It is not a trivial matter to persuade 
designers to conform to restrictive rules or 
otherwise add to the difficult enough task of 
designing a chip. 	Whether a system is 
designed manually or using a compiler, the 
design difficulty is not significantly
increased by our methodology for the inclusion 
of self-test, provided a 'standard-cell' 
approach is adopted. In other words, if a 
chip is regarded as a collection of standard 
layout cells connected together, it is a 
simple matter to redefine the input/output pad 
cells to include PRBS/FRODO registers
respectively, and to include a simply-connected
timing cell.
Silicon is precious even as VLSI 
densities become realities, and a heavy cost 
in area and therefore yield is not tolerable. 
We have seen that the chip area is not 
increased at all by the inclusion of self-
test, and that the active silicon area is 
signature analysis rather than the more 
normal 16-bit length. The LFSR polynomials 
for coefficient and multiplicand generators 
are 10000001001 and 10010000001 respectively. 
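Both polynomials can be checked for maximal length with a small shift-register model. The sketch below reads each bit string as the coefficients of a degree-10 polynomial, highest power first (an assumption about the notation; reading it in the other direction gives the reciprocal polynomial, which has the same period), and counts the number of shifts before the register state repeats.

```python
def lfsr_period(poly_bits: str) -> int:
    """Period of a Fibonacci LFSR whose characteristic polynomial is poly_bits.

    poly_bits lists the coefficients from the highest power down to x^0.
    """
    degree = len(poly_bits) - 1
    # Exponents below the degree whose coefficient is 1.
    taps = [i for i, c in enumerate(reversed(poly_bits)) if c == "1" and i < degree]
    state = [1] + [0] * (degree - 1)      # any non-zero seed will do
    start = list(state)
    period = 0
    while True:
        new_bit = 0
        for t in taps:
            new_bit ^= state[t]
        state = state[1:] + [new_bit]     # shift by one bit
        period += 1
        if state == start:
            return period


for name, poly in (("coefficient", "10000001001"), ("multiplicand", "10010000001")):
    print(name, lfsr_period(poly))
```

A period of 2^10 - 1 = 1023 for both generators is what the quoted PRBS length requires.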
[Figure 6 plot: panels for 8-bit, 4-bit and 2-bit multipliers; vertical axis, percentage of nodes toggled (0 to 100%); horizontal axis, clock cycle (0 to 400).]
Figure 6: 	Nodal activity in a modified 
Booth's algorithm multiplier under 
pseudorandom stimulation. 
To give an insight into the nodal
activity within a multiplier under
pseudorandom stimulation, we present in
figure 6 the results of simulation 
measurements on 2, 4, 8 and 16-bit 
multipliers. The vertical axis represents 
the percentage of all nodes which have 
toggled (undergone a change of state). The 
rapid rise between 0 and 20 cycles represents 
the 'flushing-out' of the shift registers and 
much of the adder circuitry, while the slower
rise to 100% activation by 300 cycles (414 
for the 16-bit operator) represents the 
exercising of the recoder logic, and its 
resultant effects on the adder. The 
similarity between these "activation curves" 
for different multiplier lengths is striking, 
if not surprising in view of the 
aforementioned modulatory and propagative 
randomness property of the architecture. The 
major effect of increasing the multiplier 
length is to smooth the curve, as the 
presence of a larger number of nodes reduces  
the relative significance of any individual 
event. The similarity between these curves is 
very important as it allows us to draw
conclusions from the more detailed simulation
of a chosen length of multiplier which are
relevant to all reasonable lengths of 
multiplier. 	It also confirms the propagative 
randomness property, as any significant 
deviation from randomness at the input to any 
multiplier module would affect the activation 
curves of figure 6 by changing the pattern of 
activity in subsequent modules. This would be 
manifest as a slowness, or even failure to 
reach the point where all the nodes have been 
toggled. 
The fact that a given pseudorandom 
sequence causes a transition (or transitions) 
on a given node is a necessary but not 
sufficient condition for both types of stuck-
at fault to be detected by the sequence. This
is because there are, in general, "don't-care" 
states with respect to any given node, and the 
effect of a fault may not be propagated to an 
output to appear as an error. We have 
therefore performed an exhaustive fault 
simulation for the 8-bit multiplier (a useful 
length, and one which yields tolerable 
simulation times). The coverage of single 
stuck-at faults is 100%. 	With a 10-bit 
signature analyser the net rate of fault 
capture will be 99.9%. 
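The 99.9% figure is consistent with the usual estimate for signature registers, in which an erroneous data stream has a chance of about 2^-n of producing the fault-free signature in an n-bit analyser. The sketch below applies that estimate; it is a commonly used approximation for long sequences, not a result derived in this paper.

```python
def signature_capture_rate(register_bits: int) -> float:
    """Approximate probability that a faulty data stream changes the signature."""
    return 1.0 - 2.0 ** (-register_bits)


for bits in (10, 16):
    print(f"{bits}-bit analyser: {100 * signature_capture_rate(bits):.3f}% capture")
```

With 100% of single stuck-at faults reaching the multiplier output, the analyser aliasing term alone sets the net capture rate.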
We are committed, by the design 
methodology for self-test, to ensure that the 
property of propagative randomness is obeyed
by all operators. If this cannot be arranged,
a minor reconfiguration of the operator may be
necessary in test mode to restore the
randomness of its output (e.g. the operator's
output may be XOR'ed with its random input to
restore randomness). With the further
knowledge that fault free operators obey the 
well-known 'rubbish in - rubbish out' maxim, 
we can be confident that a single fault in 
(say) a multiplier will cause an error at the 
multiplier's output which will be propagated 
through subsequent fault-free operators to 
show up as faulty data at the signature 
analyser. 
[Figure 7 plot: 8-bit multiplier; single stuck faults against length of pseudorandom sequence (bits).]
Figure 7: Error detection pattern in an 8-bit 
modified Booth's algorithm multiplier.
is, however, the fault coverage achieved. 
The degree of coverage is generally 
measured by performing a large number of 
switch-level circuit simulations. First, a 
'good machine' simulation must be performed 
with a particular set of input data to 
produce a set of good output data for 
reference. Subsequent simulations will be 
performed with stuck-at-one and stuck-at-zero 
faults injected at each circuit node in turn,
and the same input data. A fault is said to
be 'covered' by the given input stream if its
injection causes the output data stream to 
differ from that of a good machine. 
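The procedure just described (a good-machine run, then one run per injected stuck-at fault, comparing outputs) can be illustrated on a toy circuit. The sketch below uses a hypothetical gate-level one-bit full adder rather than the switch-level multiplier models used in the paper; the netlist and names are invented for illustration.

```python
from itertools import product

# Hypothetical netlist: (output node, operation, input 1, input 2).
GATES = [
    ("n1", "xor", "a", "b"),
    ("s", "xor", "n1", "c"),
    ("n2", "and", "a", "b"),
    ("n3", "and", "n1", "c"),
    ("cout", "or", "n2", "n3"),
]
INPUTS = ("a", "b", "c")
OUTPUTS = ("s", "cout")
OPS = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y, "or": lambda x, y: x | y}


def simulate(pattern, fault=None):
    """Evaluate the netlist; fault is (node, stuck_value) or None for a good machine."""
    values = dict(zip(INPUTS, pattern))
    if fault and fault[0] in values:
        values[fault[0]] = fault[1]
    for out, op, i1, i2 in GATES:
        values[out] = OPS[op](values[i1], values[i2])
        if fault and fault[0] == out:
            values[out] = fault[1]        # force the stuck node
    return tuple(values[o] for o in OUTPUTS)


def fault_coverage(patterns):
    """Fraction of single stuck-at faults whose output differs from the good machine."""
    nodes = list(INPUTS) + [g[0] for g in GATES]
    faults = [(node, v) for node in nodes for v in (0, 1)]
    good = [simulate(p) for p in patterns]
    covered = sum(
        any(simulate(p, f) != g for p, g in zip(patterns, good)) for f in faults
    )
    return covered / len(faults)


print(fault_coverage(list(product((0, 1), repeat=3))))
```

Each fault needs a full pass over the input data, which is why the cost grows so quickly with circuit size.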
It is obvious that this procedure is
computationally intensive and therefore very 
expensive and time consuming for all but the 
simplest circuits. Methods are being 
developed which will reduce this overhead by 
only injecting faults on 1000-2000 of the 
circuit nodes [12]. This smaller number of 
simulations will still produce a highly
accurate estimate of the fault coverage if 
the sample nodes are carefully chosen, but 
long simulation times are still required. 
Other techniques are based on assessing the
probability that a fault on each circuit node 
is stimulated by random pattern inputs and 
subsequently propagated to an observable
output point [13].  These techniques are very 
exciting as far as random pattern testability 
of general networks is concerned. For the 
particular case of bit serial signal 
processors however, the property of 
propagative randomness exhibited by the
operators allows us to adopt a semi-statistical
approach in keeping with the
hierarchical compiler philosophy. We assert
that it is only necessary to ensure that 
individual operators are well tested by our 
chosen length of random pattern, and that the 
propagative randomness property is preserved. 
This impacts upon the design of the building 
blocks but does not restrict the system 
designer at all. The most complicated and 
useful operator in the FIRST system is the 
multiplier, and we have used it in our trial 
design. Therefore we have chosen to study 
the multiplier in detail. 	In fact, it is 
clear that the delay operators (shift 
registers) in figure 2 will be exhaustively 
tested. The only other operator used is the 
adder, which is, as we shall see, included as 
a subprimitive operator in the multiplier. 
We can therefore be sure that the fault 
coverage measured for the multiplier 
represents a lower bound on the coverage for 
our 64-point filter. 
We have studied a bit-serial
multiplier, implemented using a modified 
Booth's algorithm to reduce computational 
latency [14,15], as shown schematically in 
figure 5. The design is modular, in that the  
two-bit element shown in figure 5 may be 
cascaded to give 2n-bit coefficient lengths. 
Each two-bit element contains around 200 
transistors, the exact number depending on the 
particular details of the implementation. The 
hierarchical nature of the design approach is 
clear even from this subprimitive circuit 
element, as it can be seen to contain the same 
add/subtract module that would be used to 






[Figure 5 block labels: partial product sum (in), add/subtract module, partial product sum (out), multiplicand.]
Figure 5: 	Schematic representation of one 
stage of a bit serial multiplier using a 
modified Booth's algorithm. 
Multiplication proceeds by recoding 3 
contiguous bits of the coefficient word every 
2n cycles, (where 2n is the coefficient 
wordlength). This recoded bit pattern is used 
to control the calculation of partial product 
sums in the adder/subtracter block, which are 
subsequently passed to the next module for 
further summation. The overall computational 
latency (delay between input and output) is 3n 
+ 2 bits. The shift registers are 
exhaustively tested by any nontrivial input 
sequence longer than 3n + 2 bits. The adder 
block has some input pattern dependence, 
however, and the recoder only changes state 
every 2n clock cycles. One might therefore 
feel that the multiplier would be resistant to 
random pattern testing. It is alarming to
note that for (say) an eight-bit multiplier,
there are 2 possible combinations (at least) 
of coefficient and multiplicand. We shall 
show that even a small random subset of these 
combinations (around 70 words) detects a very 
high percentage of single stuck faults, and 
thus represents a reliable self-test. 
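The recoding step described above can be illustrated with a minimal sketch of radix-4 (modified Booth) recoding as it is commonly described. The function names are hypothetical, and the sketch is arithmetic only: it does not model the bit-serial timing, the latency of 3n + 2 bits, or the actual FIRST cell design.

```python
def booth_digits(coeff, nbits):
    """Recode a two's-complement coefficient of nbits (even) into radix-4 digits.

    Each digit is formed from 3 contiguous bits of the coefficient and lies in
    {-2, -1, 0, 1, 2}; digit i carries weight 4**i.
    """
    if nbits % 2:
        raise ValueError("coefficient wordlength must be even (2n bits)")
    pattern = coeff & ((1 << nbits) - 1)   # two's-complement bit pattern
    padded = pattern << 1                  # implicit 0 below the least significant bit
    digits = []
    for i in range(0, nbits, 2):
        triplet = (padded >> i) & 0b111    # overlapping group of 3 contiguous bits
        low = triplet & 1
        mid = (triplet >> 1) & 1
        high = (triplet >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def booth_multiply(coeff, multiplicand, nbits):
    """Accumulate one partial product per two-bit element (add or subtract)."""
    acc = 0
    for i, digit in enumerate(booth_digits(coeff, nbits)):
        acc += digit * multiplicand * 4 ** i
    return acc
```

For an eight-bit coefficient (2n = 8) the recoding yields four digits, one per cascaded two-bit element, and `booth_multiply(c, m, 8)` agrees with `c * m` for every coefficient in the range -128 to 127.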
We have chosen to use 10-bit LFSR's for 
both pseudorandom sequence generation and 
predictable state for the test to begin. We 
shall adopt the approach that test length, 
and therefore PRBS length, will be determined 
from high-level considerations in a manner 
similar to the construction of the data-flow 
diagram and timing of figure 2. The details 
are explained and the approach justified in 
section 3.3.
There are two phases of the test.
During the first phase, the processor is
'flushed-out' by PRBS stimulation to set all
the internal registers to known states. On
completion of this phase, the signature
analysers are 'switched-on', and the actual
test proceeds until fault coverage is
adequate.
The test sequence length is determined
by the computational latency of the processor
(which impacts upon the 'flushing-out' time)
and on the length of PRBS necessary to
adequately exercise the circuitry. The
latency of our chosen test vehicle is 32 x 14-bit
words, so the flushing-out time must be
> 448 cycles. In fact we will see that a
ten-bit LFSR for both stimulation and 
signature analysis gives an adequate PRBS 
length (1023 bits for flushing out, and a 
further 1023 for the test proper), and a high 
rate of fault capture at the signature 
analyser (99.9%). It should be emphasised 
that this is merely a convenient length in 
this case, and in no way precludes the use of 
longer (or shorter) LFSRs in a future 
application. For a system clocking at the 
current maximum rate of 8.3 MHz for a FIRST 
chip, this represents a test cycle of 0.25ms 
duration. 
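The figures quoted above can be checked with a short sketch. It assumes a 10-bit LFSR with feedback taps at stages 10 and 7 (a standard maximal-length choice; the taps actually used on the chip are not stated here), and it assumes that the quoted capture rate follows the common 1 - 2^-n aliasing estimate for an n-bit signature register.

```python
def lfsr_period(width=10, taps=(10, 7), seed=1):
    """Count the clock cycles before a Fibonacci LFSR returns to its seed state."""
    mask = (1 << width) - 1
    state = seed
    steps = 0
    while True:
        feedback = 0
        for tap in taps:                       # taps are numbered from 1
            feedback ^= (state >> (tap - 1)) & 1
        state = ((state << 1) | feedback) & mask
        steps += 1
        if state == seed:
            return steps


PRBS_LENGTH = lfsr_period()                    # maximal length is 2**10 - 1
FLUSH_MINIMUM = 32 * 14                        # latency of the test vehicle in bits
TEST_CYCLES = 2 * PRBS_LENGTH                  # flushing out plus the test proper
CLOCK_HZ = 8.3e6                               # quoted maximum FIRST clock rate
TEST_DURATION_MS = TEST_CYCLES / CLOCK_HZ * 1e3
CAPTURE_RATE = 1 - 2 ** -10                    # assumed aliasing estimate

print(PRBS_LENGTH, FLUSH_MINIMUM, TEST_CYCLES)
print(round(TEST_DURATION_MS, 2), round(100 * CAPTURE_RATE, 1))
```

Under these assumptions the PRBS length of 1023 bits exceeds the 448-cycle flushing requirement, and 2046 cycles at 8.3 MHz take roughly a quarter of a millisecond.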
In figure 4 we show in some detail how 
the chosen methodology has been implemented. 
This diagram is essentially an expanded 
version of figure 1. It can be seen that a 
definite sequence of transitions (unknown - 0 
- 5V) should result at the 'Test Out' output. 
This alleviates partially the problems of 
faulty test circuitry causing bad chips to 
appear to pass their self-test, although good 
devices with (say) a faulty flip-flop in the 
test circuitry may be rejected. As the
self-test circuitry occupies only 4% of the
chip area, the probability that a chip is good
and the self-test circuitry is faulty is
around % (for 30% overall yield). It has
been assumed that the 'Test In' 0-5V edge is 
asynchronous, to preserve generality. 
A failed self test will occur if one or 
more of the signatures at the output nodes 
does not match the expected signature. 
Failure of the test circuitry to produce a 
'Test Complete' pulse will also result in a 
failed self-test. Grounding the 'Test In' pin
puts the chip in 'run' mode. It would be
possible to arrange for the test circuitry to
be powered down in run mode, although its low
power consumption scarcely warrants this extra 
complexity. 
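The pass/fail decision described above can be sketched as a serial signature register that compresses an output bit stream and compares the result with an expected signature. This is a minimal illustration only: the polynomial x^10 + x^3 + 1, the function names and the randomly generated response stream are assumptions, not details of the FIRST test circuitry.

```python
import random

POLY = 0x409      # x^10 + x^3 + 1, an assumed primitive polynomial
WIDTH = 10        # signature register length in bits


def signature(bits):
    """Compress a serial bit stream to the remainder of its polynomial modulo POLY."""
    reg = 0
    for bit in bits:
        reg = (reg << 1) | bit
        if reg >> WIDTH:          # degree-10 term present: subtract the polynomial
            reg ^= POLY
    return reg


rng = random.Random(1)
response = [rng.randint(0, 1) for _ in range(1023)]   # stand-in for a fault-free output
golden = signature(response)                          # expected signature


def self_test_passes(observed):
    """The self-test passes only if the observed signature matches the expected one."""
    return signature(observed) == golden


faulty = list(response)
faulty[500] ^= 1                                      # a single erroneous output bit
print(self_test_passes(response), self_test_passes(faulty))
```

Because the compression is linear and the polynomial has a nonzero constant term, any single-bit error in the stream changes the signature; multiple errors can alias, which is the less than perfect capture rate discussed in the text.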
3.3 Coverage 
We have shown how self-verification can 
be incorporated into bit-serial processors
with minimal extra design effort. The 





Figure 4: Operation of the self-test circuitry 
can be derived from signal processing 
algorithms by the use of a set of synthesis 
rules. These will ensure, amongst other 
things, that all loops are broken by a 
multiplex point (to reduce sequential depth 
to testable proportions). 
The resulting coded file can then be 
compiled for layout and simulation. Floor 
plans give chip sizes to aid partitioning and
the simulator allows verification of the 
system source description. Each of the 
compilers incorporates extensive checking and 
diagnostic warning for illegal constructs, as 
an aid to design. These include, e.g., 
checks on syntax, parameter values, fan in, 
fan out, undriven or unconnected nodes, 
synchronisation etc. 
Efficient automation of mask geometry 
is achieved by the use of constrained 
architectures, which generate the floor plan. 
The actual format for the floor plan is 
determined also by the constraints of the 
technology. An example of a FIRST floor plan 
for a five micron polysilicon gate NMOS 
process is shown in figure 3. 	The main 
features are a single central wiring channel 
off which bit serial primitives are placed,
and into which they communicate. The most
important feature here is the low
communication area which bit serial systems
permit and the consequent ease of
partitioning which this allows.
Figure 3: Floorplan of a simple FIRST chip. 
The layout is constructed from a proven 
set of circuit blocks, which are assembled 
into the primitive operators of FIRST 
according to the parameters given in the high 
level specification. For instance, blocks 
each representing two binary bits of 
multiplication are combined to give 
multiplication of binary words of length 2n 
(n = 2, 3, etc). 	From the viewpoint of 
testability, the principal features of these 
primitive operators are that even the most 
complex exhibits low average gate fan-in, and
that most can be shown to propagate random 
signals. In other words, random input 
sequences lead to random output sequences. If 
an operator violates this property of 
propagative randomness it can be modified by 
means of a multiplex point to restore 
randomness. As a consequence of these 
features, it is possible to have confidence in
the reliability of random pattern testing in
systems of interconnected primitives without
fault simulation of the entire system, 
provided the component primitives have been 
shown to be so testable. 
3.2 The Design Methodology 
As discussed in section 1, we have set 
out to define a means of approaching the 
design tasks embodied in the FIRST silicon 
compiler which will result in a verified 
processor where a self-test capability is 
included automatically. The compiler ensures 
correct functionality and layout integrity, 
and the self test methodology must allow 
design to be performed to give any reasonable 
desired degree of confidence in the self-test. 
The hallmark of a successful methodology is
that it can be easily understood and applied 
by system designers who are not expert in 
test/self-test methods. We have developed 
such a methodology. 
The primary virtue of our approach is 
that the architecture and functionality of the 
central system are substantially unaltered. 
The only constraints are that looped data 
paths must be openable, and that access can be 
gained to such opened loops for inspection of 
the pseudo-output point generated by the 
break. In this way long time constants 
associated with feedback loops are avoided. 
This increases the system's latency, 
(pipelined delay between input and output) and 
determines its value. Feedback loops can 
often be broken by existing control inputs,
and the inclusion of circuitry to break those 
which cannot should not represent a serious 
overhead or added complexity. 
Control signals are either operated in 
the normal mode or used to force open loops as 
described above. Although this results in an 
incomplete test of some of the control 
circuitry, it is necessary to ensure a 
Register Output Data Observer (FRODO) to
distinguish it from a PRBS register. With
this nomenclature, we can represent the
proposed self-testing system by figure 1.
Switching the multiplexer from PAD to FRODO
configures the device in test mode.
[Figure 1 block labels: SIGNAL, PAD, PRBS, PROCESSOR]





Figure 1: Schematic representation of a 
PRBS/signature analysis self-testing system. 
The self-verification capability has 
been implemented at chip level, such that 
individual devices are capable of self-test. 
It will be seen that this enables a 
hierarchically structured system self-test
covering on-chip and interconnect failures.
Cost and coverage are areas of concern 
regarding self-test. We will show that the 
cost in our approach is very low. Coverage 
is reduced by two factors. The first, and 
most significant, is the incomplete 
exercising of the circuit's logic, and the 
second is the less than perfect rate of error
capture by signature analysis [5,8]. We will
not address directly the latter problem,
except to note that the use of maximum length
FRODO registers maximises the rate of capture
of both independent and 'repeated-use' errors
[8]. The nontrivial problems of 'burst' and
other classes of error we leave to the
experts. We do address in section 3.3 the
question of fault coverage by random 
patterns, developing a set of design rules 
for ensuring a very high degree of coverage 
of the (as yet irreplaceable) single stuck 
faults. 
3 Self Test for Bit Serial Signal Processors 
3.1 Bit Serial Signal Processing and FIRST 
Custom VLSI offers many advantages for 
the implementation of real-time signal 
processing systems. These advantages include 
small size, low power dissipation, and good 
product security. The principal disadvantages 
have been high costs and long design cycles. 
As VLSI technology has been developed, the 
principal problem has become that of design 
and the management of design complexity. The 
solution of this problem lies in the evolution 
of appropriate CAD tools, which offer a way of 
handling complexity, and reducing design 
cycles and hence cost. 
Two main approaches have developed. One 
is based on graphics design tools backed up by 
post design analysis of mask geometries for 
initial debug, design checking and correction. 
More recently, an alternative approach has 
been developed, namely that of silicon 
compilation, which allows a high level program 
specification to be compiled correctly into 
the appropriate corresponding mask geometries. 
The normal minimum requirements of a silicon 
compiler are that it offers a high level input 
language, a mask geometry compiler and a 
guarantee of correctness. 	Most silicon 
compilers also seek to offer a simulator which 
models the compiled design behaviour. A 
further stringent requirement is that mask 
geometries must use 'silicon real estate' 
reasonably efficiently. A further condition, 
not usually considered but in our view
essential, is that all structures generated
must be testable and must have an economic 
yield. 
A silicon compiler called FIRST is in
development in Edinburgh. FIRST offers:
- A methodology for mapping signal processing
  algorithms into bit serial networks,
  including rules which guarantee ease of
  testability.
- A high level language and compiler for
  specifying hierarchical net lists of
  primitives.
- A mask geometry compiler.
- A behavioural simulator.
- Automated generation of test patterns
  during design.
Since details of FIRST have been 
reported elsewhere [9, 10, 11], only those
features relevant to the present paper will be 
summarised. 
The algorithm, or signal processing 
function to be implemented can be represented 
as a signal flow graph, an example of which is 
shown in figure 2. This flow graph can then 
be coded concisely using the FIRST high level 
language. Correctly formulated flow graphs 






Figure 2: Section of a signal flow graph for 








Self-Testing in Bit Serial VLSI Parts: High Coverage at Low Cost.
Alan F. Murray*, Peter B. Denyer† and David Renshaw†.
* Wolfson Microelectronics Institute † Dept. of Electrical Engineering,
University of Edinburgh, Mayfield Road, Edinburgh EH9 3JL. 
ABSTRACT 
This paper presents a methodology for 
the inclusion of random pattern/signature 
analysis self-test in bit serial signal 
processing chips, within the unifying 
framework of a silicon compiler. 	Fault 
coverage is very high, and is determined 
without full fault simulation. The cost in 
silicon, power, complexity and design 
difficulty is extremely low, as is shown by a
trial design. A hierarchical system test can 
be performed, leading to the possibility of 
fault tolerance. 
INTRODUCTION
Self-test is widely accepted as a
valuable tool for preventing the unrestrained
growth of test time and cost and for enabling
the design of fault-tolerant systems [1]. Much
work has concentrated on generalised
techniques for the inclusion of self-test
[2,3] and there are successful examples of
particular self-testing systems (mostly
microprocessor but more recently, gate arrays
[e.g. 4]). In this paper we describe a
rather different approach, in which the
adoption of a particular architecture for a
broad class of systems (signal processors)
allows the development of a coherent self-test
strategy. We show that this approach
leads to a design discipline within which
fault coverage can be estimated accurately
without fault simulation, and the self-test
circuitry placed automatically by a high-level
silicon compiler. The self-test method
is the well known combination of pseudorandom
stimulation and signature analysis.
In section 2 we describe briefly the
advantages of self-test as a route to
testability and of signature analysis as a
self-test methodology. In section 3 we give
details of the chosen signal processor
architecture and an associated silicon
compiler. This is followed by a discussion
of the inclusion of the self-test capability
and the fault coverage achieved, along with a
trial chip implementation to final artwork
stage of a cascadable 64-point FIR filter,
leading to an assessment of the cost in
increased design difficulty, complexity and
silicon real estate.
The final section (4) outlines the
system implications and conclusions to be
drawn from this work.
2. Self-Test, Random Patterns and Signature
Analysis
Self-test represents the ability of a
system, subsystem or chip to provide an
indication of fault-free status with power and
clock signals being the only external stimuli
required [1]. The advantages of self-test are
many and well documented [1,2] and we tabulate
them below merely for the sake of
completeness.
- Every test during a device's lifetime is
  made easier by self-test.
- Expensive automatic test equipment is made
  less essential.
- The addition of other techniques, such as
  LSSD, is not precluded by self-test and
  indeed may be necessary under some
  circumstances.
- Test proceeds at the system clock rate.
For non-microprocessor products,
pseudorandom pattern stimulation combined with
signature analysis [5] is rapidly becoming the
pre-eminent approach to self-verification.
Pseudorandom patterns are provided by linear
feedback shift registers (LFSRs) [6,7] and
signature analysis uses similar LFSRs to
compress the processor's response data to a
relatively short signature word or words which
indicate whether or not the system is
functioning correctly. Pseudorandom binary
sequence (PRBS) generators and signature
analysers are often implemented as distributed
LFSRs using existing latches from within the
circuit, which are generally of the
reconfigurable BILBO (Built In Logic Block
Observer [3]) type. This technique normally
results in the most efficient implementation
of self-test. For the particular case of bit-serial
processors we shall show that the
versatility of the BILBO approach is neither
necessary nor desirable. Accordingly, we have
implemented the PRBS register and the
signature analyser as compacted circuit
elements separate from the internal latches.
We shall refer to this non-distributed
signature analysis register as a Feedback
The IEEE International Test Conference - Cherry Hill, Philadelphia, Oct. 1983. 
of the rapid availability of physical chip size 
estimates, 	in this case 40% of the system (2 
chips) is used to generate the adaptive step-size 
estimate. 	This might be considered excessive, and 
could initiate a reexamination of the algorithm to 
find a more efficient estimator. 	At least until the 
devices are submitted for maskmaking and fabrication,
no significant commitment is made to any
device; the physical implications of new algorithms 
can be explored for little real cost. 
5. Case Study 2: An Adaptive LMS FIR Filter
Adaptive FIR filters find application as echo,
distortion and interference cancellers. For this
study we have attempted a speech-bandwidth 256-point
echo canceller, using a full (unclipped) version
of Widrow's LMS algorithm [5]. In this case
the higher computational requirement merits a linear
array approach, in which the total filter is formed
from a cascade of shorter sections. Accounting
for the serial computation rate, we find that each
physical bit-serial LMS processing unit may support
32 virtual filter points at a sample rate of 12kHz.
The integrated system partitions conveniently
such that one 32-point LMS section occupies a
single chip, using 5-micron nMOS technology.
Thus the completed filter comprises a cascade of
eight 32-point sections (in 16-pin packages), plus
one 24-pin device for output summation, and one
other 24-pin chip for error computation and system
control. The total transistor count is approximately
35,000.
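For readers unfamiliar with the LMS update, a minimal floating-point sketch follows. It uses the textbook form of the algorithm with the full (unclipped) error, a 4-tap filter and an invented echo path; the step size, the test signal and all names are illustrative assumptions, and none of the bit-serial, fixed-point or cascaded-section details of the chip set are modelled.

```python
import random


def lms_identify(x, d, taps, mu):
    """Adapt FIR weights so that the filter output tracks the desired signal d."""
    weights = [0.0] * taps
    history = [0.0] * taps                     # most recent input sample first
    for sample, desired in zip(x, d):
        history = [sample] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = desired - estimate             # full (unclipped) error
        weights = [w + mu * error * h for w, h in zip(weights, history)]
    return weights


rng = random.Random(0)
echo_path = [0.5, -0.3, 0.2, 0.1]              # hypothetical echo impulse response
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
d = [
    sum(h * (x[n - k] if n - k >= 0 else 0.0) for k, h in enumerate(echo_path))
    for n in range(len(x))
]
weights = lms_identify(x, d, taps=len(echo_path), mu=0.1)
print([round(w, 3) for w in weights])
```

With a noise-free echo the adapted weights settle on the echo path itself, so the residual after subtraction tends to zero. Here the factor of 2 that appears in some statements of the algorithm is absorbed into `mu`.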
6. Further Case Studies
The integration of several other systems is under
investigation with the Silicon Compiler. These
include:
- a pipelined FFT
- a large time-bandwidth complex matched
filter system
- an implementation of Lyon's computational
model of the cochlea [6].
- bandpass wave-digital filter structures.
Conclusions 
We have demonstrated the potential of the
Silicon Compiler as a powerful design tool for custom
VLSI systems. Development times are comparable
with any high-level software approach, yet
efficient custom hardware results. The compiler
permits an effective identity between abstract algorithm
and physical representation which encourages
the exploration of new algorithms with no significant
commitment to any device set except the one finally
commissioned.
We argue that this approach is a most expedient
route to the implementation of complex
real-time VLSI signal processing systems, even for
low-volume or prototype requirements. Through
this work we hope to encourage system designers
everywhere to consider silicon as a flexible and attractive
implementation medium.
References
1. P. B. Denyer, D. Renshaw and N. Bergmann, 'A Silicon Compiler for VLSI
Signal Processors,' ESSCIRC '82, Vrije Universiteit, Brussels, September 1982,
pp. 215-218.
2. R. F. Lyon, 'A Bit-Serial VLSI Architectural Methodology for Signal
Processing,' VLSI 81, University of Edinburgh, August 1981, pp. 131-140.
3. E. H. Satorius and S. T. Alexander, 'Channel Equalisation Using Adaptive
Lattice Algorithms,' IEEE Trans. Comm., Vol. COM-27, 1979, pp. 899-905.
4. M. J. Rutter, P. M. Grant, D. Renshaw and P. B. Denyer, 'Design and
Realisation of Adaptive Lattice Filters,' To be published in these
proceedings, ICASSP '83.
5. B. Widrow, in Aspects of Network and System Theory, Holt, Rinehart and
Winston, 1970, ch. Adaptive Filters.
6. R. F. Lyon, 'A Computational Model of Filtering, Detection and Compression
in the Cochlea,' ICASSP '82, IEEE, 1982, pp. 1282-1285.
Synthesis of this system has proceeded exactly
along the lines indicated above. The first step involves
mapping the computational processes as bit-serial
flow-graphs. For example, Figure 4-1 gives




Figure 4-1: 	Flow-graph of lattice section 
The blocks marked 'del' are included to equalise
the natural latency through the serial multipliers.
The total latency of this stage is estimated as 22
bits, and this becomes the system wordlength, although
within this range the expected significance
of the F and B signals occupies only the lower 12
bits. For bit rates of 8MHz, the stage computation
time is thus 2.75 microseconds. Now we are able
to multiplex this single, physical processing unit to
effect 16 virtual lattice stages at an external sample
rate of 20kHz. To implement this filter, the recursive
states identified above are saved and circulated
through the lattice computation unit via FIFO delay
primitives.
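The timing argument above reduces to a few lines of arithmetic, sketched here with illustrative variable names. Integer nanoseconds are used so that the quoted 2.75 microseconds is reproduced exactly.

```python
BIT_PERIOD_NS = 125            # one bit at the quoted 8 MHz bit rate
STAGE_LATENCY_BITS = 22        # estimated latency of one lattice stage
VIRTUAL_STAGES = 16            # stages multiplexed onto one physical unit
TARGET_SAMPLE_RATE_HZ = 20e3   # external sample rate quoted in the text

stage_time_ns = STAGE_LATENCY_BITS * BIT_PERIOD_NS        # 2750 ns = 2.75 us
sample_period_ns = stage_time_ns * VIRTUAL_STAGES         # time for all 16 stages
max_sample_rate_hz = 1e9 / sample_period_ns               # highest supportable rate

print(stage_time_ns, sample_period_ns, round(max_sample_rate_hz))
print(max_sample_rate_hz >= TARGET_SAMPLE_RATE_HZ)
```

Sixteen stage computations take 44 microseconds, which corresponds to a maximum sample rate a little under 23 kHz and so accommodates the 20kHz external rate.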
The total system flow-graph is completed by
implementing each of the required processes in this
way, adding the appropriate multiplexing and control
networks. The resulting flow-graphs are then entered
to the compiler for simulation and chip
generation. The compiler checks for syntax and
network integrity before instantiating the finished
layouts.
In this case the system is partitioned around its
three main functional groups; one chip implements
the lattice, one chip implements the parcor coefficient
estimator, a chip pair is required to compute
the step size estimate, and a final chip contains
the central control generator and some
remaining multiplexing circuitry. None of the chips
exceeds a packaging requirement of 16 pins (this
is another nice feature of bit-serial architectures),
and the largest device measures 5.1 x 4.9 mm in 5
micron nMOS technology. Figure 4-2 shows floor
plans generated by FIRST for the entire chip set.
The completed system, containing around
16,000 transistors, was commissioned in just four
weeks by our 'silicon-naive' designer. The five
chip LSI set replaces an original system containing
160 TTL parts, with a power consumption of 10
Watts, which took over six man-months to design
and debug. More than any other argument, we
hold this as a powerful vindication of the Silicon
Compiler approach. Further details of this system
are given in [4].




Figure 4-2: 	FIRST-generated floor plans for the 
5-chip adaptive lattice filter. 
Figure 2-1: A fully instantiated chip layout
generated by FIRST. 
As part of the first-time-right philosophy, FIRST
also supports a system simulator that enables the
user to verify and debug his system description,
prior to compilation. This simulator runs from the
same high-level system description file that is used
by the Compiler. This completes the design loop
and ensures an error-free system implementation.
3. Designing Bit-Serial Systems
Historically, bit-serial architectures have not
received the same attention as bit-parallel
schemes. However, through the many advantages
that bit-serial systems appear to offer to VLSI
implementation, we anticipate that this situation will
be rectified. In the meantime, we have found it
necessary to develop generalised design methods
for real-time bit-serial systems. These involve not
only the synthesis of bit-serial computation units,
but also higher issues concerning real-time concurrent
processing arrays, and at lower levels the effects
and containment of word growth in fixed-point
systems. To cope with these issues we have
evolved an overall methodology whose explanation
does not lie within the scope of this paper;
nevertheless a summary is appropriate. We identify
six steps to system synthesis:
1. Formulation of the signal processing algorithm
as a mathematical recursion.
2. Derivation of a computational flow-graph
to implement each statement of the
recursion.
3. Bit-serial interpretation of the flow-graph,
including fixed-point arithmetic
effects. Calculation of computational
latency.
4. Determination of concurrency required to
match computational and signal
bandwidths. Implementation of multiplex
network and instantiation of physical
processing array.
5. Addition of initialisation and test
capabilities.
6. System partitioning for optimum chip
count and size.
The resulting system flow-graph is then ready for
direct implementation (or simulation) through the
Silicon Compiler.
We highlight two case studies out of several
currently in progress. By coincidence, these are
both adaptive systems, although they are selected
for no other reason than that they represent more
complex system forms than their non-adaptive counterparts.
As indicated later, the capabilities of the
compiler are extremely general, and are by no
means restricted to these systems.
4. Case Study 1: An Adaptive Lattice Filter
Our first case study is also the most definitive,
since it was attempted by a system designer with
no previous integrated design experience. The
adaptive lattice filter which is the subject of this
investigation is primarily intended for application as
a fast equalizer for modem polling systems;
however it may also be used for linear predictive
coding (of speech), or as part of a spectral
analysis system. The filter is based on a gradient
algorithm presented by Satorius [3], with the exception
that a division in the power estimator is approximated
to first order by a Taylor series
expansion. This trick enables a potentially expensive
division to be replaced by a multiply and add.
The resulting recursive definition of the filter, in
terms of stage (N) and time (T), falls into three
parts:
Lattice:
    Fin(N,T) = Fout(N-1,T)
    Bin(N,T) = Bout(N-1,T-1)
    [Fin(1,T) = Bin(1,T) = SigIn(T)]
where for all N,T:
    Fout = Fin - K*Bin
    Bout = Bin - K*Fin
Parcor coefficient:
    K(N,T+1) = K(N,T) - Mu(N,T)*CORR(N,T)
where for all N,T:
    CORR = Fout*Bin + Fin*Bout
Step size:
    Mu(N,T+1) = Mu(N,T)*(1 - A*POWER(N,T))
where for all N,T:
    POWER = Bin^2 + Fin^2
and A is a small +ve constant.
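The lattice part of this recursion can be sketched in a few lines. The sketch below is an assumption-laden illustration: it holds the parcor coefficients K fixed, so the coefficient and step-size adaptation are not modelled, it uses floating-point rather than bit-serial fixed-point arithmetic, and the function name is hypothetical.

```python
def lattice_filter(signal, k):
    """Run a fixed-coefficient lattice and return (Fout, Bout) of the last stage.

    k[n] is the parcor coefficient of stage n + 1; it is held constant here.
    """
    stages = len(k)
    previous_bout = [0.0] * stages       # Bout(N, T-1) for every stage N
    outputs = []
    for sample in signal:
        fin = bin_ = sample              # Fin(1,T) = Bin(1,T) = SigIn(T)
        current_bout = []
        for n in range(stages):
            fout = fin - k[n] * bin_     # Fout = Fin - K*Bin
            bout = bin_ - k[n] * fin     # Bout = Bin - K*Fin
            current_bout.append(bout)
            fin = fout                   # Fin(N+1,T) = Fout(N,T)
            bin_ = previous_bout[n]      # Bin(N+1,T) = Bout(N,T-1)
        previous_bout = current_bout
        outputs.append((fout, bout))
    return outputs


result = lattice_filter([1, 2, 3, 4, 5], k=[0.0, 0.0, 0.0])
print(result)
```

With all coefficients set to zero the forward output reproduces the input and the backward output is the input delayed by one sample per stage boundary, which makes the role of the T-1 term in the Bin recursion easy to see.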
CASE STUDIES IN VLSI SIGNAL PROCESSING 
USING A SILICON COMPILER 
Peter B. Denyer
David Renshaw 
University of Edinburgh 
Abstract 
We advocate a custom approach to VLSI signal
processing using a Silicon Compiler. Through the
Compiler, system designers with no previous VLSI
experience may enjoy all the advantages of an integrated
system implementation. At the same time,
cost and timescales are dramatically lower than
those conventionally associated with custom design.
To illustrate these advantages we present case
studies of some VLSI systems currently in development
at Edinburgh.
1. Introduction
The complexity of many interesting signal
processing algorithms prohibits their realisation in
real time. Often the required process exceeds the
capabilities of off-the-shelf programmable parts, while
assemblies of standard MSI/LSI parts can be both
physically and economically unattractive. Under
these circumstances, we believe that the most appropriate
medium for development is custom-designed
silicon. Traditionally however this route is
characterised by excessively long (and often
repeated) design cycles. The enormous commitment
that this entails has led to an impression that
silicon is an inflexible medium, to be approached
only after careful prototyping, and with a large
market in view. This impression is contradicted
with the advent of the Silicon Compiler, a tool
which may automatically, and virtually instantaneously,
generate an entire chip design from a
concise high-level system description. Such a
facility changes our perception of silicon as a
development medium. The designer is freed to
work at the architectural level, where his creative
talents are arguably most effective. The Compiler
is then accessed frequently for rapid feedback on
physical chip sizes and power consumption.
With no prior knowledge of integrated circuit
design, VLSI design cycles around four man-weeks
are typical. By virtue of their automatic construction,
the systems so produced are error-free,
guaranteeing a first-time-right implementation.
In this paper we review the development of VLSI
systems using the Silicon Compiler FIRST [1]. We
hope to show that custom VLSI systems enjoy not
only considerable size and performance advantages
over alternative forms of implementation, but also
that VLSI design with a Silicon Compiler can be
economically attractive, even for prototype systems.
2. The Silicon Compiler
We have developed a Silicon Compiler to support
dedicated (as opposed to programmable), bit-serial
architectures, following the work of Lyon [2]
and others. This restricted architectural choice
enables the Compiler to work efficiently, and at the
same time encourages the development of a powerful,
unified synthesis technique. This is reflected
in an effective, structured design style which contributes
greatly to a speedy design cycle.
FIRST (for 'Fast Implementation of Real-time
Signal Transforms') supports a library of predesigned
primitive operators (multiply, add, delay,
etc.), which are called by the user as a network,
or flow-graph, of the system to be implemented.
For example, the user may call and connect four
multipliers, an adder and a subtracter to implement
a complex multiplier. Such a routine may become
a new system operator, to be called in future as







Throughout this hierarchy, modules obey rigid communication
conventions which make their interconnection
both straightforward and failsafe.
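The complex multiplier mentioned above, built from four multipliers, an adder and a subtracter, can be sketched as an ordinary function. The names are illustrative, and the sketch shows only the arithmetic composition, not the FIRST language or the latency matching that a bit-serial network would require.

```python
def complex_multiply(a_real, a_imag, b_real, b_imag):
    """Compose a complex product from four real multiplies, one subtract and one add."""
    p1 = a_real * b_real        # multiplier 1
    p2 = a_imag * b_imag        # multiplier 2
    p3 = a_real * b_imag        # multiplier 3
    p4 = a_imag * b_real        # multiplier 4
    return p1 - p2, p3 + p4     # subtracter gives the real part, adder the imaginary


print(complex_multiply(3, 4, 5, -2))
```

Once defined, such a routine can be treated as a new operator and reused wherever a complex product is needed, which is the hierarchical style the text describes.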
FIRST generates chips by instantiating all of the
requested primitives and placing them within a target
floorplan. The flow-graph network is then implemented
by connecting the primitives along a
central communication core. It is an advantage of
bit-serial systems that, because signals occupy
single wires, this core does not dominate the chip
area as might be the case with a bit-parallel architecture.
After adding pads, clocks, and power
supplies the chip is complete. An example of a
compiled chip from one of the following case
studies (the LMS filter) is given in Figure 2-1.
The initial primitive library is designed for 5 micron
NMOS processes running at bit-rates of 8MHz. At
this technology level each chip typically contains
2,000 to 5,000 transistors and is specified by
some 20 to 50 lines of system description. In all
cases presented here the upper limit on chip complexity
is purely technology-limited.
Proc. IEEE ICASSP'83, pp.939-942, Boston, April 1983. 
Figure 1 (b) Compiled chip layout (with Multiplier detail). 
simulation and design iteration at the system level - but rather as an indication of a virtually instant identity 
between an abstract system description and its ultimate physical representation. 
Conclusion 
Silicon compilation offers massive cost and time reductions over traditional LSI design techniques. It 
also offers the system designer, who may have no LSI design experience, a capability for implementing 
complex integrated systems with every expectation of first-time success. 
We have implemented such a compiler for signal processors using bit-serial architectures. A range of 
initial examples covering speech, sonar and telecommunication applications have demonstrated the
feasibility of this technique for producing efficient chip layouts within development timescales of a few
man-days.
Acknowledgements 
The authors would like to acknowledge helpful contributions by J. Gray, I. Buchanan and D. Myers.
References 
1. Lyon R.F., "A Bit-Serial VLSI Architectural Methodology for Signal Processing", in VLSI 81, ed.
Gray, Academic Press, 1981.
2. Mead C. and Conway L., Introduction to VLSI Systems, Addison-Wesley, 1980.











[Listing of the FIRST system description file for Figure 1(a).]
Figure 1. An example of silicon compilation using FIRST. Figure shows one part of a 3-chip LSI set for 
adaptive speech echo cancellation. 
(a) 	FIRST system description file. 
to compose more powerful operators (FFT, Biquad, etc.) by procedural definitions of primitives. Once
composed, these can be made available to other users, rather like a library of useful subroutines.
Table I lists the primitive operators currently supported by FIRST; it is desirable that this list be as short 
as possible. Table II gives a library of the procedural definitions that have been generated to date. 





















The primitive operators are parameterised to include functional options - for example in the Multiply 
primitive these specify the coefficient word length and the output format (rounded or truncated). These 
parameters determine the composition procedure that will be used to construct the operator during 
assembly. In general, parameters may be specified as explicit numbers, variables, or arithmetic expressions.
Thus it is also possible to parameterise the procedural definitions. For example, the Biquad procedure
will compose a set of M filters, each containing N second-order sections, with data words of d bits and
coefficient words of c bits.
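The difference between the rounded and truncated output options of a multiplier can be illustrated in ordinary C. The sketch below is not FIRST code: the function name fixed_multiply and its arguments are hypothetical, and it operates on whole words rather than bit-serially. It assumes both operands are 2's complement values carrying frac_bits fractional bits.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Multiplies two fixed-point values that each carry frac_bits fractional
 * bits and returns a result with frac_bits fractional bits.
 * rounded != 0 : half an LSB is added before the low bits are discarded.
 * rounded == 0 : the low bits are simply discarded (floor).
 * Assumes 0 < frac_bits < 31. */
int32_t fixed_multiply(int32_t a, int32_t b, int frac_bits, int rounded)
{
    int64_t product = (int64_t)a * (int64_t)b;
    int64_t scale = (int64_t)1 << frac_bits;

    if (rounded)
        product += scale / 2;

    /* Floor division written out explicitly, so that negative products
     * behave identically on every compiler. */
    int64_t q = product / scale;
    if (product % scale < 0)
        q -= 1;
    return (int32_t)q;
}
```

With one fractional bit, 1.5 x 1.5 (both stored as 3) gives 4 (2.0) when truncated and 5 (2.5) when rounded.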
Internally, FIRST supports two physical descriptions of each primitive operator: geometric and 
behavioural. The geometric description contains CIF code for a collection of hand-packed cell layouts that 
are assembled by the compiler to form the operators. The behavioural description takes the form of bit-
level procedures for perfect functional simulation. (The clock period is a satisfactory minimum time metric. 
and the boolean logic levels are a suitable amplitude metric, once the primitives have been functionally 
proven at the desired maximum bit rate.) 
There is scope within the FIRST system to accommodate automatic test pattern generation, although
we have not yet implemented this feature. 
FIRST time right 
The simulator and compiler work from the same system definition input file. Under these conditions 
it is possible to guarantee that the physical implementation will match the input system description, and 
also that the simulation will match the physical implementation. Together these features offer a level of 
confidence that existing design techniques have never permitted. 
FIRST Implementation 
We have implemented, and are verifying, a library of primitives (Table I) for 5 µm nMOS processes,
operating to a fixed-point 2's complement format with a maximum bit-rate of 8 MHz.
An Example 
Figure 1 shows one device from a three-part LSI chip set that implements an adaptive FIR filter for 
automatic echo cancellation in real-time speech systems. The merit of the concise system description 
language is evident in the short FIRST input file shown in the upper half of the figure. This is typical for 
an LSI part, and clearly indicates the potential of this approach for handling VLSI systems of much greater
complexity.
A plot of the resulting chip, generated by the FIRST compiler from the system description file, is shown
in the lower half of the figure. The operators are indicated on this plot by bounding boxes, but to give 
some idea of the efficiency of the layout one of the operators (a Multiplier) has been plotted in full. It 
is evident from the plot that very little silicon area is wasted (we estimate 20-25% typically) - an interesting 
result for critics of automatic layout who point to irresponsible use of silicon area. 
Experience to date in the design of four systems, totalling 10 LSI parts, suggests a timescale of the order 
2 mandays per part for this type of system development, given a base level understanding of signal 
processing with fixed-point arithmetic. No knowledge of MOS circuit design is required.
We do not suggest this timescale as a future guideline - a much greater allowance must be made for
A SILICON COMPILER FOR VLSI SIGNAL PROCESSORS 
P.B. Denyer, D. Renshaw, N. Bergmann
Departments of Electrical Engineering and Computer Science, University of Edinburgh, UK.
Abstract 
This paper reports a silicon compiler that allows the rapid implementation of LSI/VLSI signal processors 
from a high level system description language. 
Introduction 
The advantages of custom LSI design have, in the past, been offset by the high costs associated with
long development timescales and multiple design iterations. These overheads make custom design 
unattractive for medium and low volume markets, and for products requiring rapid development. 
Moreover, the ad hoc design techniques in use today cannot be extended to cope with VLSI complexities 
without greatly exaggerating these disadvantages. 
Past attempts at overcoming these problems have led to fragmented solutions which only in small
measure diminish the overheads, and often introduce further problems of their own. A major disadvantage
usually lies in the requirement for a high degree of MOS circuit design and layout skill, making it difficult
or impossible for the systems designer to implement new structures in silicon. Typically, he may build a
breadboard of SSI, MSI and LSI parts to prove the system before investigating its integration. This is a
restrictive practice that prohibits the proper exploitation of silicon as a development medium for a host 
of new VLSI systems and architectures. 
An alternative approach is to automate the entire implementation procedure, and offer a single
system-level interface to the user. Such an automatic design system, coupled with rapid, low-cost maskmaking
and fabrication, should encourage the development of prototype systems in integrated form, dramatically
cutting the total development cost and timescale. In addition to these economies of cost and time, it may
also be possible to guarantee a correct implementation by virtue of automatic construction. 
This paper reports on such a 'silicon compiler' for LSI and VLSI signal processors. It is called FIRST 
—Fast Implementation of Real-time Signal Transforms. 
A Framework 
To obtain results in the form of efficient chip layouts, the compiler must be tailored to a complete design 
style, incorporating specific architectural, topological, timing, circuit and layout conventions. The FIRST
compiler is based on such a methodology for bit-serial architectures, reported recently by Lyon [1].
Bit-serial processors generally offer optimal area-time tradeoff and are compact enough to allow
concurrent implementations of complex systems within reasonable chip areas. The overhead costs of 
communication and control are minimal, allowing the majority of the chip area to be devoted to the 
computational processes. System partitioning into multiple chip sets is eased by the single-wire serial 
communication format. In addition, a variety of multiplexing levels may be used to properly match signal 
and processor bandwidths for real-time applications. 
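To make the term "bit-serial" concrete, the following C sketch models an adder that sees one bit of each operand per clock and keeps only a single bit of carry state between clocks. It is an illustration rather than a description of any FIRST cell: the function name is hypothetical, and least-significant-bit-first ordering is assumed.

```c
#include <stdint.h>

/* Illustrative model of a bit-serial adder: one full adder plus a
 * one-bit carry register, fed one bit of each operand per clock,
 * least significant bit first (an assumption of this sketch).
 * The result wraps modulo 2^wordlength. Assumes 1 <= wordlength <= 32. */
uint32_t bit_serial_add(uint32_t a, uint32_t b, int wordlength)
{
    uint32_t sum = 0;
    uint32_t carry = 0;   /* the only state carried between clocks */

    for (int k = 0; k < wordlength; ++k) {
        uint32_t abit = (a >> k) & 1u;
        uint32_t bbit = (b >> k) & 1u;
        uint32_t s = abit ^ bbit ^ carry;
        carry = (abit & bbit) | (carry & (abit ^ bbit));
        sum |= s << k;
    }
    return sum;
}
```

The hardware cost is independent of the wordlength; only the number of clocks per word changes, which is why many such processors can share one chip.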
FIRST uses a fixed floor-plan format based on a central communication channel that supports ranks 
of bit-serial processors to either side. Each of these processors obeys strict geometric and electrical 
conventions at the wiring channel interface, but is otherwise unrestricted in form. This well-defined
interface has enabled us to delegate the implementation of individual processors to a range of designers. 
It also allows new processors to be added easily at any time. 
The processors operate synchronously from a global 2-phase non-overlapping clock that is distributed 
from the edges of the communication core. 
As in [1] the nMOS circuit, layout and timing conventions suggested by Mead and Conway [2] have been 
adopted for the lower levels of the design hierarchy, though it is possible to replace these to suit other 
technology variants. 
System Design and the Silicon Compiler 
FIRST supports a high-level system description language as the single user input medium. This takes 
the form of a net list of bit-serial 'operators' that represent a flow graph of the system to be implemented. 
The task of the compiler is to compose a complete chip layout by assembling and placing these operators, 
and then routing the network. 
FIRST maintains a restricted library of primitive operators (multiply, add, delay, etc.) that includes all
of the basic processes needed to build a wide range of signal processing systems. It also allows the user 
ESSCIRC '82 Digest of Technical Papers
Brussels, September 1982, pp. 215-218
APPENDIX IV 
LIST OF PUBLICATIONS 
LMS Adaptive Echo Canceller - Wiring Schematic 
Figure 16: SUBSYSTEM ZFIR 
Figure 17: SYSTEM SECHOCANC 
Figure 15: CERR floor plan 
- 111.18 - 
Figure 13: CLMS floor plan 
Figure 14: CTREE floor plan 
Figure 12:. CHIP CERR 
CHIP CERR2 (preset —, pcib. pcic. pc2b, pan) pain, pint, pin2, pins. 
pin4. pdin. ptmuin —, peout, paoUt, pupout. pyout 
SISNAL a, din, tmuin. upout. .out, ml, in2, in3, in4. gout 
CONTROL cia.cib, cic, cid, ci., c2a, c2b, c2c, c2d, c2e, 
.3, c2bi, civset, c2:t 
CONSTANT awl 	IE 
CONSTANT C 
PAOIN (preset —) cre.et) pdin, ptmuin. pain —) din, tmujn, a 
PADOIJT (cib, 'dc, c2b, e3 —> pclb, pClC, 
pc2b, pe3) eout, upout. a —) peout, pupout. plout 
PADORDER VDD, preset, pcib. pdic, pc2b, p.3, pint, pin2, pmn3, pin4. 
OND, pain, pdin, ptmuin, CLQC'., peout, pupout, psaut. pijout 
PADIN pint, pin2, ptn3. pin4 —) ml, inc, tn3, in4 
PADOUT gout —> pout 
OUTCEN Ceul) (c(d. c2 —, dl) ml, jn2, in3, jn4 —) gout 
ERRCEN CswI,c) (do, c2e —)) jout, din. tmujn —> *cut, VpOut 
COITOELAY CI) (cIa —> cib) 
CDITDELAY Cfl (cib —) dc) 
COITOELAY C] (dc —> cid) 
COITOELAY Ci) (c2a —) c2b) 
COCTOELAY 1243 (c2b —) c2bl) 
COtTDELY 1231 (c2bl —> c2c) 
Ct'JORDDELAY 112.03 (cia. c2c -> c2ci) 
C).CFDDELAY 112.03 (cIa. c2ct —, c2d) 
C1'0RDDELAY ClI.) (cia. c2d -> c2e) 




ENDCON ThOLQENER ATOP 
END 
end ofprogram 
Figure 11: CHIP CTREE 
CHIP CTREE1 (pc1i -> pc1o) pa1, pa2, pa3, pa4, pa5, pa6,
pa7, pa8, pb1, pb2, pb3, pb4, pb5, pb6, pb7, pb8 -> psum
SIGNAL a11, a12, a13, a14, a15, a16, a17, a18,
b11, b12, b13, b14, b15, b16, b17, b18,
a21, a22, a23, a24, b21, b22, b23, b24,
a31, a32, b31, b32, a41, b41, sum
CONTROL c11, c12, c13, c14
PADIN (pc1i -> c11) pa1, pa2, pa3, pa4, pa5, pa6, pa7, pa8, pb1, pb2,
pb3, pb4, pb5, pb6, pb7, pb8 -> a11, a12, a13, a14, a15, a16, a17,
a18, b11, b12, b13, b14, b15, b16, b17, b18
PADOUT (c14 -> pc1o) sum -> psum
PADORDER VDD, pa1, pb1, pa2, pb2, pa3, pb3, pa4,
pb4, pa5, pb5, pa6, GND, pb6, pa7,
CLOCK, pb7, pa8, pb8, psum, pc1i, pc1o
ADD [1,0,0,0] (c11) a11, b11, GND -> a21, NC
ADD [1,0,0,0] (c11) a13, b13, GND -> a22, NC
ADD [1,0,0,0] (c11) a15, b15, GND -> a23, NC
ADD [1,0,0,0] (c11) a17, b17, GND -> a24, NC
ADD [1,0,0,0] (c11) a12, b12, GND -> b21, NC
ADD [1,0,0,0] (c11) a14, b14, GND -> b22, NC
ADD [1,0,0,0] (c11) a16, b16, GND -> b23, NC
ADD [1,0,0,0] (c11) a18, b18, GND -> b24, NC
ADD [1,0,0,0] (c12) a21, b21, GND -> a31, NC
ADD [1,0,0,0] (c12) a23, b23, GND -> a32, NC
ADD [1,0,0,0] (c12) a22, b22, GND -> b31, NC
ADD [1,0,0,0] (c12) a24, b24, GND -> b32, NC
ADD [1,0,0,0] (c13) a31, b31, GND -> a41, NC
ADD [1,0,0,0] (c13) a32, b32, GND -> b41, NC
ADD [1,0,0,0] (c14) a41, b41, GND -> sum, NC
CBITDELAY [1] (c11 -> c12)
CBITDELAY [1] (c12 -> c13)
CBITDELAY [1] (c13 -> c14)
END
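The CTREE netlist above describes a four-level binary adder tree: sixteen inputs are summed in pairs, then the eight partial sums in pairs, and so on down to a single output. A word-level C sketch of the same reduction is given below. The function name is hypothetical, and the sketch ignores the one-bit latency of each ADD and the control delays that track it.

```c
#include <stdint.h>

/* Word-level sketch of a 16-input, four-level binary adder tree.
 * Each pass halves the number of live partial sums, mirroring the
 * 8 + 4 + 2 + 1 ADD primitives of the chip netlist. */
int32_t adder_tree_sum16(const int32_t in[16])
{
    int32_t level[16];

    for (int k = 0; k < 16; ++k)
        level[k] = in[k];

    for (int width = 16; width > 1; width /= 2) {
        /* Pair up neighbours; the write index k never overtakes the
         * read indices 2k and 2k+1, so the update is safe in place. */
        for (int k = 0; k < width / 2; ++k)
            level[k] = level[2 * k] + level[2 * k + 1];
    }
    return level[0];
}
```

A tree of this kind needs 15 adders for 16 inputs, but its depth, and hence its latency, grows only with the logarithm of the number of inputs.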





Figure 10: CHIP CLMS 
CHIP CL.NS (pcl..pc2.p.3) Ps, pup —> pro. py 
SIGNAL 	xnu, 	xi. 	x. 	xi 	z, xoid, 	w, wL. 	'uoid, 	up, 	update, 	j, 
xxi, xx2, xx3, xx4, xx, rx, x17, %x8, xx9 xqi, uli 
CONTROL cli, 	c12, 	c2, 	e31, e3.. 	e3 
CONSTANT 	wiiB 
PADOROER VOD, 	px. 	pro, 	py. QND, 	pup. CLOCK, 	p•i 
pci. pc2 
PADIN 	(pci.pc2.pe3 —> c1l.c2.e31) Pr, pup 	—) 	mew, 	up 
PADOUT xi, 	j -) 	pro. 	pu 
CDITDELAY C13 	(cii —> cia) 
CEITDELAY C203 31 —> e32) 
CDITDELAY C20 	(e3 	—> e33) 
SITOELAY Csw1 	x —> 	xxi 
DITOELAY 	CruiJ xxi —> xx2 
SITDELAY Cswl3 	xx2 —) xx3 
DITOELAY 	Cswl3 xx3 —) xx4 
SITOELAY Csu)13 	xx4 —, 	xx 
3ITDELAY CswiJ rx3 —> xx 
ITOELAY 	CwiJ 	xxó —) xx7 
DITDELAY [sw13 xx7 — 	xx8 
DITDELAY 	Cswl) 	xxS —> xx9 
DITOELAY [71 zrP —, 	xi 
L4CRDDELAY 	129. 	wl, j (ell) xi 	—> 	x'.i 
ITELAY C71 Zyl 	—) 	x2 
WORDDELAY 	C37,swl,13 (c11) wi 	—> ULi 
BITDELAY 121 wil 	—> u,old 
BITDELAY 	Cs,,l-11 x2 —) 	x3 
3ITDELAY 111 	up 	- update 
MULTIPLEX 	11.0.01 	(e3) 	w. OND 	-) xii 
MULTIPLEX 11.0.01 (c2) xi mew —> 
LMSPART1 CswlI (c12) x, update, x2, woid —> y. xi 
END 
Figure 9: OPERATOR EBBGEN 
OPERATOR ERROEN Cswl,c3 (cter. c2r, ->) Vin, din, 
trnujfl -> eout, upout 
SIGNAL djf, prod , tmu, recirc 
CONTROL cU, c2 
SUBTRACT 	,O,O.ci (cl,r) din, gin. 	NO -> dif, NC 
FLIMIT Cw1, 1. O )cll) dif -> ecut 
MULTIPLY Cl. 12' 0,03 	-> NC) cout. tmu -) prod. NC 
MULTIPLEX Cl,O.O (c2er) prod. recirc •-> upout 
OITOELAY Cswl-13 upout -> recirc 
OCTOELAY [Z~c3 trnuin -, tmu 
COITLAY Cl (cli -, c12) 
COITOELAY Ci+cJ (cier -> cii) 
ND 









Figure 8: OPERATOR OTJTGEN 
OPERATOR GUTOEN Cswl] c1i, c2i. -, do) in1 	jr2 jri3, jri4 -> yout 
SIGNAL sumj, sum 	sum, ;umi sum, ad, acm
CCNTRCL c2, c3. c14. cZZ, :21 
ADD 11,3.0.03 (cli) ml, in2. OND -> auml., NC 
ADD 11,0.0.03 (Cli)nD, in4. GND -> um2, 	C 
ADD 11,0,0,03 (ciZ) sum L sum2. OND -> sum, NC 
SIDNEXTEND (c1) sum -> sumi. summ 
DPACCUNUL,ATE (swl3 (c14. c211 -) ) sumi, sumr -) ad. acm 
FFCRMAT2'rQl Iswl,0,0,0,O3 (do) ad, acm -> YOut 
CDITDELAY Cl) (cli -> c12) 
'COITOELAY Cl) c13 -> c14> 
CDITDELAY Cl] (c12 -> ct) 
C3ITDELAY Cl] (c14 -) do) 
CDITOELAY CD) (ci •-> c211) 
ND 
Figure 7: OPERATOR DPACCTJMULATE 
OPERATOR DPACCUMULATE [swl] (c1dp, c2dp1 -> ) suml,
summ -> acl, acm
SIGNAL al, am, int, fl, fm
CONTROL c22
MULTIPLEX [1,0,0] (c2dp1) fl, GND -> al
MULTIPLEX [1,0,0] (c22) fm, GND -> am
ADD [1,0,0,0] (c1dp) al, suml, GND -> acl, int
ADD [1,0,0,0] (c1dp) am, summ, int -> acm, NC
BITDELAY [swl-2] acl -> fl
BITDELAY [swl-2] acm -> fm
CBITDELAY [swl] (c2dp1 -> c22)
END
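The idea behind a double-precision accumulator is that the running total is held as two words, with the carry out of the less significant addition feeding the more significant one. The C sketch below shows this at word level. The type dp_word and the function dp_accumulate are hypothetical names, and the sketch does not model the feedback delays or the multiplexers that clear the accumulator.

```c
#include <stdint.h>

/* Hypothetical two-word value: each field holds one word of `bits` bits. */
typedef struct {
    uint32_t low;    /* less significant word */
    uint32_t high;   /* more significant word */
} dp_word;

/* Adds `in` to `acc`, propagating the carry from the low-word addition
 * into the high-word addition. Both words wrap modulo 2^bits.
 * Assumes 1 <= bits <= 31. */
dp_word dp_accumulate(dp_word acc, dp_word in, int bits)
{
    uint32_t mask = (1u << bits) - 1u;
    uint64_t low_sum = (uint64_t)acc.low + (uint64_t)in.low;
    uint32_t carry = (uint32_t)(low_sum >> bits) & 1u;
    dp_word out;

    out.low = (uint32_t)low_sum & mask;
    out.high = (acc.high + in.high + carry) & mask;
    return out;
}
```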
Figure 6: OPERATOR LMSPART 
OPERATOR LNSPART1 Csw1 (cub) x updateS xa1d wod 	y, w 
SIGNALu, u2 wo1d1, ill
CONTROL cli, c12, c13 
COITDELAY Cl] (cli —> c1) 
CDITOELAY CL] Cc12 —> clD) 
DTTDELAY C193 z —) ill 
DITOELP1 C201 ill —> xi 
SITDELAV C11 wold —) woldi 
MULTIPLY Cl 12.0,0] (cli —) N(:) w. xl —> y. NC 
FLIMIT Cswl,i3O] (c12) u2 —) w 
ADD 11,0,0,03 (cii) ul, wcldi. GND —> u2 N 
MULTIPLY C1,12,0,01 (clsb —> cii) update xoid —> ul, NC 
END 































Figure 5: error generator 
d(t) 
Figure 1: Widrow LMS adaptive filter 
Id 







Figure 3: initial system plan 
References
[1] B.Widrow et al., "Adaptive noise cancelling: principles and applications,"
Proc. IEEE, Vol. 63, pp. 1692-1716 (December 1975).
[2] D.R.Morgan, "Adaptive multipath cancellation for digital data communications,"
IEEE Trans. Commun., Vol. COM-26, (9) pp. 1380-1390 (September 1978).
[3] P.B.Denyer, D.Renshaw, and N.Bergmann, "A silicon compiler for VLSI
signal processors," Proc. ESSCIRC '82, pp. 213-218 (September 1982).
3. 	ERR - a block in which OUTGEN sums and formats the outputs of 4
TREE operators (i.e. 2048 filter points), accumulates in
double-precision, and re-formats to single precision. The output is
passed to ERRGEN, which performs subtraction to produce the error
sample, limits, and multiplies by 2μ.
The system is now partitioned into feasible chips. The three chip 
designs (Figs. 10 - 12), corresponding to the three operators listed 
above, take the names CLMS, CTREE and CERR. The compensating delay c, 
and the control timing are adjusted to accommodate the bit-delays intro-
duced across chip boundaries. c is calculated to be 15 bits. The con-
trol generator is placed on CERR. 
The system flow diagram, up to chip level, is now complete, and may 
be entered to the silicon compiler for chip composition. Dimensions of 
the completed chips are as follows: 
chip     width (λ)   height (λ)
CLMS     2531        1807
CTREE    1565        1050
CERR     2412        1582
Chip floor plans are given in Figures 13 - 15. Unfortunately it is 
the largest chip which must be repeated many times within the system, 
although we feel that even this device is not excessively large. 
Extending the hierarchy to the remaining levels (subsystems and 
systems), the outstanding block diagrams were produced. Fig.16 shows
the subsystem ZFIR, containing 16 cascaded CLMS chips, and Fig.17 the
entire system, SECHOCANC. However FIRST language files have not at this
stage been produced for these levels of the design.
The system of Fig.17 represents a 512-point filter. Should the
full length of 2048 be required, 4 ZFIRs, each of which feeds a CTREE,
may be cascaded. Operator OUTGEN gives CERR the facility to accept
inputs from up to 4 CTREE chips. 
7. System simulation 
At the time of preparation of this report, the system had not yet 
been simulated. 
8. Conclusions
An echo canceller has been described which is cascadable, in blocks
of 512, to length 2048 filter points. A set of three LSI chips has been
specified, designed and laid out in approximately three man-weeks, by a
first-time user of the FIRST silicon compiler. It is anticipated that 
another three man-weeks should see the project through to completion. 
critical path length from this figure. The critical path length is the
length of the algorithm recursion loop, i.e. the number of word delays 
between the appearance of the final multiplexed signal sample on the LMS 
block and the appearance of the error-update sample. This critical path 
length must be accommodated at the end of each filter output cycle, 
since the next cycle may not commence until the error has been computed 
and returned to the LMS blocks. As each primitive, operator etc. in the 
path has an intrinsic delay, this figure may be easily estimated. 
The critical path length was estimated at around 8 words, giving a 
multiplexing potential of 47. However, in order to reduce the multi-
plexing memory requirement (and hence the chip size), and also relax the 
8 MHz clock rate (which is near the process limit), the multiplexing 
level was chosen to be 32. This leads to a revised system clock rate of 
(32 + 8) x 8 kHz x 18 = 5.76 MHz.
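The multiplexing arithmetic used in this section can be checked with a few lines of C. The two functions below are a sketch with hypothetical names. They take the clock rate, wordlength, sample rate and critical path length quoted in the text as inputs, and work in whole words, so the potential of 55.55 is truncated to 55 before the critical path is subtracted.

```c
/* Whole number of filter points one processor can serve per sample
 * period, after setting aside the words lost to the critical path. */
int multiplexing_potential(long clock_hz, int wordlength_bits,
                           long sample_rate_hz, int critical_path_words)
{
    /* Words available per sample period, rounded down. */
    long words_per_sample =
        clock_hz / ((long)wordlength_bits * sample_rate_hz);
    return (int)words_per_sample - critical_path_words;
}

/* Clock rate needed once a multiplexing level has been chosen:
 * (level + critical path) words per sample, each of wordlength bits. */
long system_clock_hz(int multiplex_level, int critical_path_words,
                     long sample_rate_hz, int wordlength_bits)
{
    return (long)(multiplex_level + critical_path_words)
           * sample_rate_hz * (long)wordlength_bits;
}
```

With an 8 MHz clock, 18-bit words, an 8 kHz sample rate and an 8-word critical path, the first function returns 47. With the multiplexing level set to 32, the second returns 5760000, i.e. 5.76 MHz.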
As the critical path length is not an integer number of system 
wordlengths, a block of compensating delay must be introduced into the 
critical path, to give it an integer length. Whilst this may be accom- 
plished by simply inserting a bit-delay block (the delay being 
parameterised for the present), a more elegant solution is to use the 
compiler's facility for entering arbitrary delay after such primitives
as ADD and SUBTRACT, thereby saving on chip floorspace. The subtractor 
which produces e(t) from d(t) and y(t) is ideally suited for introduc-
tion of this compensating delay. 
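The size of the compensating delay follows from simple modular arithmetic: the path is padded up to the next whole number of system words. The C sketch below shows the calculation. The function name is hypothetical, and the path lengths used to exercise it are illustrative values, not figures taken from this design.

```c
/* Number of bit-delays that must be added to a path of path_bits so
 * that its total length is a whole number of words.
 * Assumes path_bits >= 0 and wordlength_bits > 0. */
int compensating_delay_bits(int path_bits, int wordlength_bits)
{
    int remainder = path_bits % wordlength_bits;
    return (remainder == 0) ? 0 : wordlength_bits - remainder;
}
```

For an 18-bit wordlength, a hypothetical path of 129 bits (7 words and 3 bits) would need 15 bits of padding to reach 8 whole words, while a path that is already a multiple of 18 bits needs none.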
Having completed the system description, and estimated the parti- 
tioning into chips, the next step was to create source files for the 
FIRST silicon compiler. 
6. System implementation using FIRST 
Having designed the system from the top down, it could now be 
implemented from the bottom up. The first step was to create the arith-
metic operators. These were identified as follows: 
LMSPART - an LMS processor block 
DPACCUMULATE - a double-precision accumulator 
OUTGEN - a block which produces and formats y(t) by accumulating 
the outputs of all LMSPART blocks, and 
ERRGEN - a block which produces 2μe using y(t), d(t) and 2μ.
Figs. 6 - 9 depict these operators as computation engines con-
structed from FIRST primitives. The FIRST source file is displayed in 
each case. As the primitive FFORMAT2TO1 had not at this stage been
entered to the compiler, it was substituted by the operator SIGNEXTEND,
which consisted of a bit delay and a multiplexer. 
The next step was to construct the following three operators: 
LMS - an LMSPART operator multiplexed to produce 32 filter points, 
with associated multiplexing memory 
TREE - an adder tree capable of summing the outputs of 16 LMS 
operators, i.e. 512 filter points, and 
be multiplexed over as many time divisions as system constraints would 
allow. It was assumed that a single LMS processor plus the multiplexing 
memory would fit on to a single chip. The level of multiplexing attain-
able was then a function of tolerable chip dimensions, processor 
latency, system clock rate and system bandwidth. The error generator in 
turn fed the LMS blocks, thus completing the system loop. Fig.3 is the 
initial chip plan of the system. 
5. System description 
Starting from the algorithmic equations (eqns. 1 - 3), block
diagrams were produced, detailing the construction of a single LMS
processor block, and the error-update block. Figs. 4 and 5 show these
blocks. Initially it was thought that the multiplier used in the filter
output convolution should produce a double-precision output, which
should be accumulated in triple-precision to allow for word growth.
However, the nature of echo-cancellation is such that the filter pro-
duces a noisy estimate of the echo path, and it was decided that a 
single-precision rounded-output multiplier, accumulated to double- 
precision, would suffice. 
The next step was to determine the significance and format of words 
at each point in the system. The longest desired significance in the 
system is the 16-bit tap weight. Assuming no further restrictions on 
wordlength occur (which turns out to be true) then the system wordlength
may be set at 2 guard bits above 16, i.e. swl = 18. Thus the word rate
for the echo canceller engines is 8 MHz divided by 18 bits, i.e. 444.4 
kHz. 
Figs.4 and 5 display the significance (single arrows) and format 
(double arrows) of words at important points of the system. Format and 
significance in the adder tree is not treated, as the double precision 
nature of this structure precludes overflow and other undesirable 
phenomena. 
The system can be said to work in three distinct, cycling modes:
1. Operation, i.e. performing convolution to produce an output sample
(cf. eqn. 1)
2. Error-update generation, i.e. subtraction of the formatted output
sample from the training signal (cf. eqn. 2) then multiplying by
2μ, and
3. Adaptation, i.e. multiplication of 2μe by each signal sample and
accumulation in the tap weight registers (cf. eqn. 3)
Due to the pipelining inherent in bit-serial systems, mode 3 may be
executed in parallel with mode 1. Mode 2, however, cannot proceed until
mode 1 starts to produce the final accumulated output word, and mode 3
must await the appearance of the result of mode 2 before commencing.
Thus mode 2, whilst computationally trivial in comparison to the other
modes, provides an unavoidable system bottleneck during which the other
modes must idle.
The level of multiplexing in the LMS blocks may be decided at this 
stage. 	The full multiplexing potential of the system, i.e. ignoring 
idles and latencies for the present, is simply the ratio of system word 
rate to system sample rate. In this case, the full potential is 444.4 / 8,
i.e. 55.55. To obtain the actual potential, we must subtract the
$$ e^2(t) = d^2(t) - 2d(t)\,S^T H + H^T S S^T H $$
and differentiating w.r.t. H gives
$$ \nabla e^2(t) = -2d(t)\,S + 2 S S^T H $$
Eqns. 1 and 2 give
$$ \nabla e^2(t) = -2 S e(t) $$
The updates to the tap weights are performed by subtracting these
correlation estimates (multiplied by a convergence factor μ) from each
tap weight. The size of μ determines the accuracy and stability of the
final solution, and the speed of convergence to that solution. The
maximum bound on μ is governed by input signal statistics [1]. The
updating process may be expressed as
$$ H' = H - \mu \nabla e^2(t) $$
where H' is the tap weight vector at time t+1, i.e.
$$ H' = H + 2\mu S e(t) \qquad (3) $$
Eqns. 1 - 3 represent the form of the algorithm to be implemented in the
echo-canceller under design.
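Eqns. 1 - 3 translate directly into a short loop. The C sketch below performs one iteration of the algorithm in floating point. It is intended only to show the order of the three steps: the function name and argument names are hypothetical, and the chip set itself works in fixed-point, bit-serial arithmetic with the wordlengths discussed later.

```c
#include <stddef.h>

/* One iteration of the LMS algorithm on an n-point transversal filter.
 *   h  : tap weights, updated in place
 *   s  : signal samples currently in the filter
 *   d  : training (conditioning) sample
 *   mu : convergence factor
 * Returns the error sample e(t). */
double lms_step(double *h, const double *s, size_t n, double d, double mu)
{
    /* Eqn. 1: filter output is the inner product of signal and weights. */
    double y = 0.0;
    for (size_t k = 0; k < n; ++k)
        y += s[k] * h[k];

    /* Eqn. 2: error is the training sample minus the filter output. */
    double e = d - y;

    /* Eqn. 3: move each weight by 2*mu*e times its own signal sample. */
    for (size_t k = 0; k < n; ++k)
        h[k] += 2.0 * mu * e * s[k];

    return e;
}
```

Calling the function repeatedly with the same inputs shows the error shrinking from one iteration to the next, provided mu is small enough for the signal power involved.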
4. System specification and initial chip plan 
From a designer's point of view, the echo canceller is manifestly a
straightforward adaptive filter. However, due to the nature of the echo 
path in long-distance telephone networks, the canceller may be required 
to exhibit a long impulse response, requiring two thousand or more 
filter points to achieve accurate modelling. This involves the designer 
in problems of word growth over the filter accumulation, and necessi-
tates partitioning of the system into identical cascadable blocks. 
The echo-canceller chip set of this case study was specified as 
follows: 
Filter bandwidth = 4 kHz (i.e. sample rate = 8 kHz)
Filter length = 256 in basic form, up to 2048 in full form
Converter resolution = 12 bits
Tap weight resolution = 16 bits
The implementation of the canceller was undertaken as a case study 
in system design using the FIRST silicon compiler [3]. The compiler
forms a chip plan (using bit-serial architecture) from a high level list 
of connected primitives, operators etc., generated by the user. The 
target process for the compiler is currently the Edinburgh Microfabrica-
tion Facility's 6-micron nMOS process. The compiler itself specified a 
system clock rate of 8 MHz. 
The system was initially proposed as a set of identical cascaded 
LMS processor blocks feeding an adder tree which in turn fed an error-
generating chip containing the control generator. Each LMS block was to 
1. Introduction
Recent years have seen the emergence of many digital adaptive
filter designs, a great proportion of which use variations of the Widrow
LMS algorithm [1]. The functional block diagram for such a filter is
shown in Fig.1. With the advances in fabrication technologies and
computer-assisted design techniques, more and more of these systems are
being realised as VLSI chips or chip sets.
A particular application area for adaptive filters is echo
cancellation in telephone networks. Due to various reasons [2], the received
signal is corrupted by delayed versions of the transmitted signal. The
solution is to adaptively model the echo path, and subtract the filtered
received signal from the transmitted signal. If this process is carried
out at both ends of the line, echo-free communication may be enjoyed.
Fig.2 shows the usual configuration for a telephone line echo canceller.
2. List of symbols
The following symbols are used:
N = number of filter points
n = variable scanning the set of N filter points
t = time
s(t-n) = the signal at filter point n, at time t
h(n,t) = the tap weight at filter point n, at time t
d(t) = the training or conditioning signal at time t
y(t) = the filter output signal at time t
e(t) = the filter error signal at time t
S = the vector of signal samples in the filter at time t
H = the vector of filter tap weights at time t
H' = the vector of filter tap weights at time t+1
S^T = the transposed signal vector S
3. Theoretical background
A transversal, or finite impulse response (FIR) filter produces its
output sample by convolving its signal and tap weight vectors, i.e.
$$ y(t) = S^T H \qquad (1) $$
The Widrow LMS algorithm operates as follows. Firstly the output sample
y(t) is subtracted from a training signal d(t) to produce the error
sample e(t), i.e.
$$ e(t) = d(t) - y(t) \qquad (2) $$
The algorithm forms an estimate of the correlation between the error 
sample and the signal sample at each tap in the filter, and updates each 
tap weight in such a direction as to reduce this correlation. 	The 
filter is said to have converged when the error signal and the signal 
vector are orthogonal. 
A useful cost function, describing the distance of the tap weight 
vector from the optimal solution (Wiener vector) [1], is mean-square 
error (MSE). The object of the algorithm is then to minimise the MSE. 
The Widrow LMS algorithm forms a crude estimate of the MSE by squaring a 
single error sample, and minimises this by gradient methods.
Substituting eqn.(1) in eqn.(2) and squaring gives
ABSTRACT 
A chip set is described for echo cancellation on tele- 
phone lines. 	Use is made of the FIRST silicon compiler, 
which drastically reduces design time. A set of three chips 
is built from a collection of FIRST operators, both chips 
and operators being described in detail. Some insight into 
the design process is given. Finally floor plans of the 
three IC designs are presented, along with projections for 
the completion of the project. 
Case Studies in System Design 
with a Silicon Compiler 
An LMS Adaptive Transversal Filter 
for Speech Echo Cancellation 
Report No. ISG/83/11 
INTERIM REPORT 
January, 1983 
Stewart G. Smith 
David Renshaw 
Peter B. Denyer 
Integrated Systems Group, Department of Electrical Engineering, 
University of Edinburgh, The King's Buildings, Edinburgh, UK. 
test chip (mentioned in connection with testing the code defining 
the behaviour of the primitive). This compiled chip is then fabri-
cated and tested to demonstrate equivalence of function, and to 
verify performance. Failure on this testing necessitates cell 
redesign. 	Once a primitive has been verified it can be accepted 
into the cell library. 
declarations allocates a constant associated with design style. 
This is used to orient and place the final composed primitive. The 
second grouping of lines retrieves the primitive parameter values. 
The third group of lines is concerned with generating a unique name 
for the primitive; this is based on the primitive name and parame-
ter values. Next a test is made to determine whether the primitive 
with these parameter values has already been generated. 	If this 
has not been done then a symbol with the appropriate name is gen-
erated. The code for this is contained between the line: 
symbol(f_name);
and the line: 
endsymbol; 
The fifth, sixth and seventh groups of lines of code are between 
these lines. The fifth group places the appropriate input connec-
tions, or predelay buffers conditionally selecting on parameter 
value. 	The sixth group of lines places the output buffers. The 
seventh group of lines is the main composition and selects 
appropriate cells depending on the parameter value, in this case 
the latency, and places them. This portion of code involves condi-
tional selection and placement as well as limited repetition, 
dependent on parameter value. Of the two remaining groups of lines 
the eighth determines the dimensions of the finished primitive and 
returns this information to the main placement programs and the 
ninth does the same for input and output port positions. 
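The third and fourth steps described above, building a unique name from the primitive name and its parameter values and then checking whether that symbol has already been generated, can be sketched as follows. This is not the compiler's own code: the function names, the "NAME_p1_p2" naming scheme and the fixed-size table are all assumptions made for the illustration.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 64   /* illustrative table size */
#define NAME_LEN    64   /* illustrative name length limit */

static char generated[MAX_SYMBOLS][NAME_LEN];
static int generated_count = 0;

/* Builds a name such as "ADD_3_1_0" from a primitive name and its
 * parameter values. Output is truncated if it does not fit. */
void make_symbol_name(char *out, size_t out_len, const char *primitive,
                      const int *params, int nparams)
{
    size_t used = (size_t)snprintf(out, out_len, "%s", primitive);
    for (int k = 0; k < nparams && used < out_len; ++k)
        used += (size_t)snprintf(out + used, out_len - used,
                                 "_%d", params[k]);
}

/* Returns 1 if the name has not been seen before (the caller should
 * now generate the symbol), or 0 if it already exists and can simply
 * be re-used. */
int register_symbol(const char *name)
{
    for (int k = 0; k < generated_count; ++k)
        if (strcmp(generated[k], name) == 0)
            return 0;
    if (generated_count < MAX_SYMBOLS) {
        strncpy(generated[generated_count], name, NAME_LEN - 1);
        generated[generated_count][NAME_LEN - 1] = '\0';
        ++generated_count;
    }
    return 1;
}
```

Keying the table on both name and parameter values means that two instances of a primitive with identical parameters share one layout symbol, while instances with different parameters are composed separately.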
Final verification of a primitive is achieved by compiling the 
endsymbol ; 













/* dimensions of the finished primitive */
if (r == 0 || r == 1)
    *f_height = 216;
else
    *f_height = 234;
while (i < n) {
    *f_height += 36;
    i += 4;
}
if (n == 2*(n/2))
    *f_height += 32;
else
    *f_height += 14;
*f_width = 140;
/* input and output port positions */
f_places[1 + 1] = 122;
f_places[2 + 1] = 80;
f_places[3 + 1] = 94;
f_places[4 + 1] = 38;
f_places[5 + 1] = 66;
f_places[6 + 1] = 108;
if (n > 1) {
    for (i = 1; i <= 6; ++i)
        f_places[i + 1] += 14;
Table 111.5 
Again the lines of code have been grouped, in order to show the 
overall structure, which is as follows. The first line after the 
draw("ADDDEL",x,y);
x=x+136;
if(n == 2)
    draw("ASDELA",x,y);
if(n == 3)
    draw("ASDELB",x,y);
if(n == 4)
    draw("ASDELC",x,y);
else
if(n > 4) {
    q=(n-5)/4; r=(n-5)-(4*q);
    i=8;
    if(r == 0 || r == 1) {
        draw("ASDELD",x,y);
        x=x+38;
    }
    else {
        draw("ASDELG",x,y);
        x=x+56;
    }
    while(i < n) {
        draw("ASDELH",x,y);
        y=127;
        if(r == 0 || r == 1) {
            draw("ASDELI",x,y);
            x=x+36; y=0;
        }
        else {
            draw("ASDELJ",x,y);
            x=x+36; y=0;
        }
        i += 4;
    }
    if(n == 2*(n/2)) {
        draw("ASDELF",x,y);
        y=127;
        if(r == 0 || r == 1)
            draw("ASDELM",x,y);
        else
            draw("ASDELN",x,y);
    }
    else {
        draw("ASDELE",x,y);
        y=127;
        if(r == 0 || r == 1)






designer = RENSHAW;
n = f_plist[1];
inA = f_plist[2];
inB = f_plist[3];








box(88,16,92,20);
if(inA == 1)
    drawrot("IPD",0,14,3);
else
    drawrot("IP",0,14,3);
layer(METAL);
box(88,2,92,6);
if(inB == 1)
    drawrot("IPD",0,56,3);
else
    drawrot("IP",0,56,3);
layer(METAL);





box(88,30,92,34);
drawrot("OPBI",0,84,3);
drawrot("OPBI",0,112,3);
x=92; y=0;
if(n == 1)
    draw("ADD",x,y);
else
any cell design environment comprising cell editors, a circuit 
extractor, circuit simulators, etc. using well documented methods 
for IC design. In designing the leaf cells their internal composi-
tion interfaces must be formalised and verified. This results in a 
formal expression of the rules of composition. 	Again this is 
interfaced to the compiler as a piece of code that can be compiled 
and linked with the rest of the compiler programs. The multiplier 
has been illustrated in Chapter 3 (q.v.). 
The parameters associated with the ADD primitive are the
latency and the predelay on each of the inputs. Thus the selection
and placement of appropriate input connections or predelays must be
made according to these parameter values. Then the appropriate
choice of adder followed, if necessary, by the correct number of 
repeated blocks of delay has to be made, again according to the 
appropriate parameter value. Such informal rules of composition 
are formalised into code as shown in Table 111.5. 
systems can currently be developed and tested as multiple chip sys-
tems, in older technologies. 
Each leaf cell is represented as a separate numbered symbol. 
A complete list of the numbered symbol definitions together with 
their bounding box sizes is stored in a file for use by the compiler.
The portion of this relating to the adder primitive is shown in 
Table 111.4. 
!external symbol spec definitions
external symbol spec("IP",1,0,0,14,92)
external symbol spec("IPD",2,0,0,14,92)
external symbol spec("OPBI",3,0,0,28,92)
external symbol spec("OPBN",4,0,0,28,92)
external symbol spec("ADD",31,-92,0,144,124)
external symbol spec("SUB",32,-92,0,144,124)
external symbol spec("ADDDEL",33,-92,0,144,124)
external symbol spec("SUBDEL",34,-92,0,136,138)
external symbol spec("ASDELA",35,0,0,46,138)
external symbol spec("ASDELB",36,0,0,46,138)
external symbol spec("ASDELC",37,0,0,46,138)
external symbol spec("ASDELD",38,0,0,38,138)
external symbol spec("ASDELE",39,0,0,14,138)
external symbol spec("ASDELF",40,0,0,32,138)
external symbol spec("ASDELG",41,0,0,56,138)
external symbol spec("ASDELH",42,0,0,36,127)
external symbol spec("ASDELI",43,0,0,36,11)
external symbol spec("ASDELJ",44,0,0,36,11)
external symbol spec("ASDELK",45,0,0,2,4)
external symbol spec("ASDELL",46,0,0,2,11)
external symbol spec("ASDELM",47,0,0,19,11)
external symbol spec("ASDELN",48,0,0,19,11)
Table 111.4 
Methods for the design of leaf cells are not discussed here as 
they are adequately dealt with elsewhere. This may proceed within 
should be defined in terms of a technology independent intermediate 
form for specifying circuits. Secondly, programs for generating 
technology specific layouts from these descriptions should then be 
available. At the time of implementing FIRST, such intermediate 
forms and generators were not available, so it was necessary to 
define the cells specifically at the mask level, and therefore at a 
technology specific level. Current research shows promise of pro-
gress in the direction of technology independent specifications for 
the future. 	The advantage that such organisation confers is that 
the task of translating the primitive library is reduced by orders 
of magnitude. The cost is that the resulting layouts may not be as 
area efficient as those of a custom designed library. 
The design of the leaf cells is done in the context of a modu-
lar floor plan capable of accepting their appropriate composition 
according to the relevant parameter values. The final form of the 
circuit blocks is the layout definition in the form of CIF code 
specifying the detailed mask geometry. The library discussed in 
this chapter has been constructed according to enhanced Mead-Conway
design rules, for 4, 5 or 6 micron polysilicon gate NMOS process
technology. This choice of technology was dictated by the
facilities available for the development and proving of FIRST. For 
state of the art VLSI systems, FIRST will require 2 micron or sub 2 
micron two level metal CMOS technology or 1 micron or sub 1 micron 
NMOS technology. 	However, since bit serial architectures and the 
FIRST methodology simplify system partitioning, future one chip 
sigout, and param. The fourth group of lines checks for synchroni-
sation and warns of LSB mismatching of input signals, flagging node 
and time for ease of diagnostic purposes. The fifth group of lines 
fetches the values on each of the inputs by accessing the relevant 
parts of the record arrays and then computes the function of the 
primitive, using these values. In this context there is also a 
mechanism for retrieving an arbitrary depth of previous state 
information, if needed, and also for updating the state informa-
tion. This is not needed to model the adder function. If 
required, a further group of lines of code can be inserted to check 
the values for format etc, and diagnostic warnings can be issued. 
This is shown in the sixth group of lines. This can be a useful 
feature during system design and although the hardware will not 
perform this check, it gives a useful mechanism for flagging 
detectable conditions which might give rise to unexpected system 
behaviour. Usually such conditions are less easily traced during 
simulation when there are no such explicit diagnostic warnings. 
The seventh group of lines enters the primitive output values onto
the appropriate data records and event queue by means of the function
enter_event. Finally the else condition of the initial if is
evaluated to check primitive input activity when the control is not 
active. 
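The word-level computation and the diagnostic range check described above can be illustrated in isolation. The following is a minimal sketch, not the FIRST simulator code: the names add_word, sign_extend_word and the add_status codes are assumptions, and the word length is passed explicitly rather than read from simulator state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes, not part of the FIRST simulator. */
enum add_status { ADD_OK = 0, ADD_OVERFLOW = 1, ADD_UNDERFLOW = 2 };

/* Interpret the low `wordlength` bits of `raw` as a two's complement value. */
int64_t sign_extend_word(uint32_t raw, int wordlength)
{
    uint32_t sign_bit = (uint32_t)1 << (wordlength - 1);
    uint32_t mask = (sign_bit << 1) - 1;      /* low `wordlength` bits */
    int64_t value = (int64_t)(raw & mask);
    if (raw & sign_bit)
        value -= (int64_t)1 << wordlength;
    return value;
}

/* Word-level a + b + carry_in. The full-precision total is returned and
 * `status` reports whether it fits in `wordlength` bits, which mirrors
 * the diagnostic warning rather than any hardware behaviour. */
int64_t add_word(uint32_t a, uint32_t b, int carry_in, int wordlength,
                 enum add_status *status)
{
    assert(wordlength >= 1 && wordlength <= 32 && status != 0);
    int64_t max = ((int64_t)1 << (wordlength - 1)) - 1;
    int64_t min = -max - 1;
    int64_t total = sign_extend_word(a, wordlength)
                  + sign_extend_word(b, wordlength)
                  + (carry_in & 1);
    *status = total > max ? ADD_OVERFLOW
            : total < min ? ADD_UNDERFLOW
            : ADD_OK;
    return total;
}
```

As in the text, the check is a simulation aid only: the hardware adder would simply wrap, so the status flag is where a warning message would be raised.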
The next stage of primitive design is to create the logic and 
circuit blocks or leaf cells required to make up the primitive. 
Ideally this should be done in two stages. 	Firstly the circuit 
    else
        total = aval - (bval + cin);

    if((total > mantissa) && (co <= NC)) {
        sprintf(mess,"OVERFLOW IN ADD/SUB %d @ %s",n,
            bit_time(time));
        warning(mess);
    }
    if((total < sign_extend) && (co <= NC)) {
        sprintf(mess,"UNDERFLOW IN ADD/SUB %d @ %s",n,
            bit_time(time));
        warning(mess);
    }

    enter_event(s,time+delay,total);
    enter_event(co,time+(wordlength<<1),carry);
}
else
    if((pda + pdb + pdc) == 0)
        timing_warning(n);
Table 111.3 
The lines of code are grouped with consecutive groups separated by 
a single blank line. The general structure is as follows. The
first group of lines after the declarations invokes the standard
simulator functions ctlin and param to find the primitive's LSB
control node number and the values of the predelay parameters. 
These functions, together with others mentioned below, are the 
means whereby the primitive function code communicates with the 
rest of the simulator code. The values returned are allocated to 
local variables as access to them in this way is faster than 
repeated function calls. Next a test is made to determine whether 
there is an event on the primitive control node. The third group 
of lines fetches the remaining node numbers and parameter values 
associated with the primitive by means of the functions sigin, 
add(n)
int a, b, s, delay, total, carry, cin;
int c, co, pda, pdb, pdc, clsb, aval, bval;
char *bit_time(), mess[64], mess1[256];

clsb = ctlin(n,1);
pda = 2 * param(n,2);
pdb = 2 * param(n,3);
pdc = 2 * param(n,4);

if(nodelist[clsb + j_max_ctl].time == time) {

    a = sigin(n,1);
    b = sigin(n,2);
    c = sigin(n,3);
    s = sigout(n,1);
    co = sigout(n,2);
    delay = param(n,1) << 1;

    if(((a > NC && (nodelist[a + j_max_ctl].time + pda) != time) ||
        (b > NC && (nodelist[b + j_max_ctl].time + pdb) != time) ||
        (c > NC && (nodelist[c + j_max_ctl].time + pdc) != time)) &&
        time >= op_start) {
        timing_warning(n);
        sprintf(mess1,"----> Node %d VALID @ %s",a,
            bit_time(nodelist[a + j_max_ctl].time + pda));
        sprintf(mess,",Node %d VALID @ %s",b,
            bit_time(nodelist[b + j_max_ctl].time + pdb));
        strcat(mess1,mess);
        sprintf(mess,",Node %d VALID @ %s",c,
            bit_time(nodelist[c + j_max_ctl].time + pdc));
        strcat(mess1,mess);
        trap(mess1);
    }

    aval = nodelist[a + j_max_ctl].value;
    bval = nodelist[b + j_max_ctl].value;
    cin = nodelist[c + j_max_ctl].value & 1;
    total = aval + bval + cin;
    if((total & carry_bit) == 0)
        carry = 0;
    else
        carry = 1;
    if((aval & sign_bit) != 0)
        aval |= sign_extend;
    if((bval & sign_bit) != 0)
        bval |= sign_extend;
    if(sign == PLUS)
        total = aval + bval + cin;
of this function. The testing of the code can be done by instal-
ling it into the FIRST compiler (by compilation and linking) and 
then by making up a system specification consisting of a chip, or 
chips with instances of the primitive. These are then stimulated 
by appropriate test patterns to check their response. 
An example of the form such code takes, and the way it inter-
faces into the overall structure of the compiler is shown in Table 
111.3 which gives the code of the functions defining the ADD primi-
tive. 
addition of new primitive syntax to the compiler a trivial matter. 
The form required for this is illustrated in Table 111.2 which 
shows the definitions for ADD. Each consists of a single line per 
primitive containing, in order, the primitive name (alphanumeric), 
the primitive type code, the number of parameters, signal inputs, 
signal outputs, control inputs and control outputs. Each of these 
fields of information in the line is separated by a space, or 
spaces. 	This allows the compiler to recognise the primitive name, 
where it occurs and to check that it is being used correctly, and 
with the right number of parameters, inputs, outputs, etc. 
ADD 2 4 3 2 1 0 
Table 111.2 
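To make the single-line definition format concrete, the following C sketch reads one such line into a record. It is an illustration only: the structure primitive_syntax, its field names and the function parse_primitive_line are assumptions, and the 31-character limit on the name is arbitrary. The field order follows the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of the primitive syntax file. */
struct primitive_syntax {
    char name[32];   /* primitive name (alphanumeric) */
    int type_code;   /* primitive type code           */
    int n_params;    /* number of parameters          */
    int n_sig_in;    /* number of signal inputs       */
    int n_sig_out;   /* number of signal outputs      */
    int n_ctl_in;    /* number of control inputs      */
    int n_ctl_out;   /* number of control outputs     */
};

/* Parse one space-separated definition line.
 * Returns 1 if all seven fields were read, 0 otherwise. */
int parse_primitive_line(const char *line, struct primitive_syntax *p)
{
    assert(line != NULL && p != NULL);
    int fields = sscanf(line, "%31s %d %d %d %d %d %d",
                        p->name, &p->type_code, &p->n_params,
                        &p->n_sig_in, &p->n_sig_out,
                        &p->n_ctl_in, &p->n_ctl_out);
    return fields == 7;
}
```

Applied to the ADD line, this yields four parameters, three signal inputs, two signal outputs, one control input and no control outputs, which is the information the compiler needs to check each use of the primitive.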
In designing a primitive, the first task is to define the 
function required. This is achieved by formulating the word level 
behaviour of the primitive. This definition may have to be itera-
tively changed as design proceeds, since the final form will 
reflect on the one hand the arithmetic function required by the 
system designer and on the other hand the tradeoffs which the cir-
cuit designer can make to achieve hardware efficiency and minimal 
circuitry. In any event the final interface to the compiler is a
piece of program code in the form of a function (or routine) which 
can be compiled and linked to the main compiler programs. Verifi-
cation requires that the composition and layout information com-
bine, when processed, to yield the behaviour captured in the code 
1.8. Primitive Design and Entry 
Primitive design consists firstly of designing a set of cir-
cuit elements from which a primitive can be assembled. These
are termed leaf cells. This must be done in the context of a modu-
lar floor plan which allows appropriate composition of leaf cells 
according to parameter value(s). The second design task is the 
formulation of the procedural rules of composition. A necessary 
part of design at this level is the verification of the internal 
composition interfaces. 	The third design task is to extract the 
generalised primitive behaviour from the circuits which result from 
composition of leaf cells. As a result of these design tasks the 
formal interfaces to the FIRST silicon compiler can be defined. 
These are: 
primitive syntax 
primitive functional behaviour 
primitive composition 
primitive leaf cell definition 
As an example of the structure of each of these consider the 
definition of the ADD primitive, giving the full definition of each 
interface. 
Primitive syntax is defined for the compiler in a data file 
reserved for this purpose. This file is read at the start of com-
pilation and sets up the definition and attributes of the prede- 
fined primitives. 	This information is held as a file to make 
1.7.3. Clock generator 
Description: 
During layout  generation there is a run time option allowing selec-
tion of one of the following configurations: 
either two clock pads and corresponding clock distribution for 
input of externally generated two phase clocks, 
or alternatively one clock input with on chip two phase clock 
generation and distribution. 
Circuit Design 
Not yet available. 
1.7.4. Substrate bias generator 
Description: 
During layout generation there is a run time option allowing selec-
tion of one of the following configurations: 
no substrate bias generation (for use with multiproject EMF 
standard frames) 
on chip substrate bias generation. 
Circuit Design 
Not yet available. 
Restrictions: 
1. 	Padout inputs must not be connected to GND or VSS, if the 
simulator is to function correctly. 
Simulator Checks and Warnings 
Node connectedness checks are made. 
Circuit Design 
The circuit design of padout is that of a modified Xerox-Parc out-
put pad, the modification involving only the addition of a clocked 
buffer and the necessary changes to metalisation to allow rotated 
placement. 
1.7.2. Padout
Description: 
Padout is the mechanism for getting signals and controls off a chip 
and provides the correct drive and buffering capability for doing 
this. Communication between chips is pipelined, each chip boundary 
involving one half phase of the bit clock for data transfer. 
Syntax: 
padout (cc1 -> c1)
padout ss1 -> s1
padout (c1,c2,... -> cc1,cc2,...) s1,s2,... -> ss1,ss2,...
Note
The first form is for a single control output. cc1 is the on-chip,
internal input node name and c1 the off-chip, external output node
name. Similarly the second form is for a single signal output,
with ss1 and s1 corresponding to cc1 and c1 respectively. The
third form is the combined condensed form for multiple control and
signal outputs.
Function: 
output = corresponding input 
latency = 1/2 bit 
Restrictions: 
1. 	For the simulator to function correctly all inputs should be 
connected to an appropriate signal or control source. This may be 
either an output pad output node or an input file, corresponding to 
externally supplied signal generation. 
Simulator Checks and Warnings 
Node connectedness checks are made. 
Circuit Design 
The circuit design of padin is that of a modified Xerox-Parc input 
pad, the modification involving only the addition of a clocked
driver and the necessary changes to metalisation to allow rotated 
placement. 
1.7.1. Padin 
Description: 
Padin is the mechanism for getting signals and controls onto a chip 
and provides the correct drive and buffering capability for doing 
this. Communication between chips is pipelined, each chip boundary 
involving one half phase of the bit clock for data transfer. 
Syntax: 
padin (c1 -> cc1)
padin s1 -> ss1
padin (c1,c2,... -> cc1,cc2,...) s1,s2,... -> ss1,ss2,...
Note
The first form is for a single control input. c1 is the off-chip,
external input node name and cc1 the on-chip, internal output node
name. Similarly the second form is for a single signal input, with
s1 and ss1 corresponding to c1 and cc1 respectively. The third
form is the combined condensed form for multiple control and signal
inputs.
Function: 
output = corresponding input 
latency = 1/2 bit 
1.7. MISCELLANEOUS PRIMITIVES 
There are two types of service primitive: the I/O pad primitives
and the clock and substrate bias generators. The former are
visible at the level of the FIRST HDL; the latter are not, being
selectable as options during run time of the physical design
subsystem.
param  min  max  constraint  meaning
1      1    32   integer     system word-length
2      1    n-1  integer     limit position
3      0    1    integer     predelay on input
Function:
output = input     if -k < input < k
output = k         if input >= k
output = -k        if input <= -k
latency = swl + 1
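The function definition above amounts to a symmetric clamp. A minimal word-level sketch in C follows; the name flimit_word is an assumption, and the limit k is passed in directly rather than derived from the limit position parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a word-level value to the symmetric range [-k, +k],
 * following the three cases of the function definition. */
int32_t flimit_word(int32_t input, int32_t k)
{
    assert(k >= 0);
    if (input >= k)
        return k;
    if (input <= -k)
        return -k;
    return input;
}
```

Note that the lower limit is -k and not -(k + 1), which is the symmetric placement about zero mentioned in the description.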
Restrictions: 
For correct function parameter 1 must be the same as the sys-
tem word-length. 
ctrl must be c1 control.
Simulator Checks and Warnings
Circuit Design
The circuit implementation of flimit comprises the following
components: a storage shift register, maximum and minimum value 
generation circuits, over-flow detection circuitry and the final 
output select and control circuits. Modularity ensures that these 
components can be assembled according to the user chosen values of 
parameters 1 and 2 to carry out the defined function of flimit. 
Flimit 
Description: 
Signal maximum and minimum values are defined by the user chosen 
values of parameter 1. 	If the input signal exceeds either then 
flimit hard limits the output signal to the appropriate limit
value. 	Otherwise the output value is the same as the input value. 
Note that the maximum and minimum values are symmetrically placed, 
about zero, not as in strict two's complement representation. 
Flimit is designed to deal with problems of overflow and dynamic 
range limitation. 





Flimit[swl,lim,del1] (ctrl) input -> output
Parameters:
Let n=swl, let m=lim and let k=2^m - 1
Function: 
LSWout = input 
MSWout = 0 	if MSB of previous input = 0 
MSWout = 1 	if MSB of previous input = 1 
LSWout latency = 1 bit
MSWout latency = (swl + 1) bits
Restrictions:
1. ctrl must be c1 control.
Simulator Checks and Warnings 
Checks are made for input synchronisation timing. 
Circuit Design 
Not yet available. 
1.6.2. Format1to2
Description:
Format1to2 accepts one input and outputs this input on LSWout with
a one bit latency. The MSB of each input is latched and output on
MSWout for a full c1 cycle. Thus format1to2 takes in a
single precision format signal value and outputs the corresponding
double precision format value, with a full word of MSB repetitions.
It interprets the input as a two's complement number.
Flow Diagram & Simulation Pattern: 
[Figure: FORMAT1TO2 block with inputs ctrl and input, and outputs LSWout and MSWout.]
Syntax: 
Format1to2[del] (ctrl) input -> LSWout, MSWout
Parameters: 
param min max constraint meaning 
1 	0 	1 	integer 	input predelay 
Simulator Checks and Warnings 
Checks are made for illegal parameter values (not system word-
length, or out of range), and for input synchronisation timing. 
Circuit Design 
Not yet available. 
X = x_0.2^0 + x_1.2^1 + ... + x_{n-2}.2^{n-2} + x_{n-1}.2^{n-1}
Note the unsigned integer interpretation of X, the LSWin.
Let
Y = y_0.2^0 + y_1.2^1 + ... + y_{n-2}.2^{n-2} - y_{n-1}.2^{n-1}
and let
f = f_0.2^0 + f_1.2^1 + ... + f_{2n-2}.2^{2n-2} - f_{2n-1}.2^{2n-1}
Define
g = g_0.2^0 + g_1.2^1 + ... + g_{2n-2}.2^{2n-2} - g_{2n-1}.2^{2n-1}
where
g = f              if -(2^m - 1) < f < (2^m - 1)
g = -(2^m - 1)     if f <= -(2^m - 1)
g = (2^m - 1)      if f >= (2^m - 1)
Then Z is defined as follows:
Z = z_0.2^0 + z_1.2^1 + ... + z_{n-2}.2^{n-2} - z_{n-1}.2^{n-1}
where z_j = g_{j+k}
and
latency = 1 bit 
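The windowing operation can be followed more easily at the word level. The sketch below is an illustration under stated assumptions, not the FIRST simulator code: the function name fformat2to1_word is hypothetical, the double precision word is assumed to be assembled as f = 2^n.Y + X, the limit is taken as 2^m - 1, and the word length is restricted to 31 bits so that the result fits a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Word-level sketch of Fformat2to1 for word length n.
 * lsw: LSWin, read as an unsigned n-bit value (X)
 * msw: MSWin, read as a two's complement value (Y)
 * m:   limit pointer, k: shift pointer */
int32_t fformat2to1_word(uint32_t lsw, int32_t msw, int n, int m, int k)
{
    assert(n >= 1 && n <= 31 && m >= 0 && m <= 62);
    assert(k >= 0 && k + n <= 64);
    uint64_t word_mask = ((uint64_t)1 << n) - 1;

    /* Assemble the double precision word (assumed f = 2^n.Y + X). */
    int64_t f = (int64_t)msw * ((int64_t)1 << n) + (int64_t)(lsw & word_mask);

    /* Hard limit to the symmetric range +/-(2^m - 1). */
    int64_t limit = ((int64_t)1 << m) - 1;
    int64_t g = f > limit ? limit : (f < -limit ? -limit : f);

    /* Select bits k .. k+n-1 of g (z_j = g_{j+k}) and read them back
     * as an n-bit two's complement number. */
    uint64_t z = ((uint64_t)g >> k) & word_mask;
    int64_t value = (int64_t)z;
    if (z & ((uint64_t)1 << (n - 1)))
        value -= (int64_t)1 << n;
    return (int32_t)value;
}
```

For example, with n = 8, X = 0x34 and Y = 1 the assembled word is 308; a shift pointer of 4 with a generous limit selects bits 4 to 11, giving 19, while a limit pointer of 6 with no shift clamps the output to 63.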
Restrictions: 
For correct function parameter 1 must be the same as the sys-
tem word-length. 
ctrl must be c1 control.
Syntax: 
Fformat2to1[swl,lim,shift,del1,del2] (ctrl) LSWin, MSWin -> output
Parameters:
Let n=swl, m=lim, k=shift
param  min  max  constraint  meaning
1      1    32   integer     system word-length
2      0    63   integer     limit pointer
3      0    63   integer     shift pointer
4      0    1    integer     LSWin predelay
5      0    1    integer     MSWin predelay
Note 
Parameter 2, the limit pointer, points to the bit position of the 
assembled double precision word which is to be the MSB of the lim-
ited single precision output. This defines hard limiting and over-
flow. 
Parameter 3, the shift pointer, points to the bit position of the 
assembled double precision word which is to be the LSB of the sin-
gle precision output. This defines the power of two shift value. 
A value of 0 points to LSB of LSWin 
A value of n points to LSB of MSWin 
Function: 
For notational convenience let 
Fformat2to1[n,m,k,0,0] (C) X, Y -> Z
Let 
Fformat1to1[swl,lim,shift,del] (ctrl) LSWin -> output
Fformat2to1[swl,lim,shift,del1,del2] (ctrl) LSWin, MSWin -> output
Fformat3to1[swl,lim,shift,del1,del2,del3] (ctrl) -
LSWin, MSWin, HSWin -> output
Fformat2tol is now defined in the same way as the previous primi- 
tives to illustrate this in detail. 
Fformat2to1
Description
Fformat2to1 accepts two inputs on LSWin and MSWin staggered by one
word cycle. It interprets the first as magnitude only and the
second as two's complement and assembles them into a double precision
word. It then hard limits any overflow or underflow condition
defined by the user chosen value of parameter 2. Finally it outputs
a single precision word whose start bit is defined by the user
chosen value of parameter 3.






1.6.1. Fformat1to1, Fformat2to1, Fformat3to1
Description 
Fformat is a fixed format primitive which accepts and assembles a 
multiple precision input and then performs a windowing operation to 
extract a single precision output word from this input. The windowing
consists of overflow detection (as determined by parameter 2, limit),
clamping if overflow has occurred, and then extracting
the appropriate n bits of the multiple precision word (as
determined by parameter 3, shift). In the syntax swl=n is the system
word length; lim is the pointer to the bit position of the multiple
precision word which is to be the MSB of the limited single
precision word; and shift is the pointer to the bit position of the
multiple precision word which is to be the LSB of the single precision
output. The output selects n consecutive bits from the bit
defined by parameter 3. A pointer value of zero points to LSB of
LSWin; a pointer value of n points to LSB of MSWin; a pointer value
of 2n points to LSB of HSWin, etc.
The latency of the primitive is
latency = (p*n + 1) bits
where n=swl and p=1,2,3 for Fformat1to1, Fformat2to1, Fformat3to1
respectively.
Syntax 
1.6. FORMAT TRANSFORMATION PRIMITIVES 
Fixed-point, bit-serial arithmetic imposes constraints on the data 
input formats to hardware. In particular, for example, the inputs 
to adders must have the same scaling; inputs to accumulation may 
only require single precision, but the accumulation may need to be 
double or triple precision; multiple precision output may have to 
be scaled, limited, or reduced to single precision, etc. The prim-
itives described in this section are designed to meet these 
requirements. They comprise the following. There is a range of
Fformat primitives (standing for Fixed format), which reduce multiple
precision input to single precision, with scaling and overflow
detection and clamping. There is a range of format primitives, which
take single precision input and give multiple precision output 
(using MSB sign repetition). Finally there is a limiter, flimit, 
for detecting overflow and clamping. Additions to this range of 
primitives would include Pformat primitives (programmable format) 
and another version of the limiter. 
signals, can be shown to constitute guinea delay. 
Restrictions: 
A latency of one guinea is a latency of one word plus one bit. 
The maximum value of parameter 1 is determined only by floor 
plan considerations, typically values of up to 32 might be used. 
LSBctrl input must be c1 control; ctrlin should be c2 or a
higher level of control. It is unnecessary and inefficient to propagate
c1 through worddelay.
Because of the internal architecture of cworddelay the input 
pulses must be of the form produced by the control generator (see 
description). 
Do not connect the input of cbitdelay to VDD or GND. 
Simulator Checks and Warnings 
No checks are made. 
Circuit Design 
As has been mentioned the basic circuit function block senses the 
MSB from the previous word and latches this, outputting it one bit 
time later for the duration of one word. This can be implemented
by sixteen transistors per guinea delay compared to 6(n+1) transistors,
where n = swl. Where w is the value of the first parameter, w such
blocks are cascaded.
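The behaviour of the cascaded blocks can be modelled at the word level. The following C sketch is illustrative only: the structure cworddelay, the bound MAX_STAGES and the function names are assumptions, and each stage is reduced to the single latched MSB that the text describes.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STAGES 32   /* illustrative bound on parameter 1 */

/* One latched MSB per cascaded guinea delay block. */
struct cworddelay {
    int w;
    int latched_msb[MAX_STAGES];
};

void cworddelay_init(struct cworddelay *d, int w)
{
    assert(d != 0 && w >= 1 && w <= MAX_STAGES);
    d->w = w;
    for (int i = 0; i < w; i++)
        d->latched_msb[i] = 0;
}

/* Advance one word time: take control word x (n bits) and return the
 * output word, which is all ones or all zeros according to the MSB of
 * the w-th previous input word. */
uint32_t cworddelay_step(struct cworddelay *d, uint32_t x, int n)
{
    assert(d != 0 && n >= 1 && n <= 32);
    uint32_t all_ones = (n == 32) ? 0xFFFFFFFFu : (((uint32_t)1 << n) - 1);
    int out_msb = d->latched_msb[d->w - 1];
    for (int i = d->w - 1; i > 0; i--)
        d->latched_msb[i] = d->latched_msb[i - 1];
    d->latched_msb[0] = (int)((x >> (n - 1)) & 1u);
    return out_msb ? all_ones : 0;
}
```

Because control words from the control generator are either all zeros or all ones, storing one bit per stage reproduces the full word delay, which is the saving in hardware claimed above.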
	
latency = m(n + 1) bits 
The latency of cworddelay is determined by the user chosen value of 
parameter 1. 
Because it is designed for use with the special periodic waveforms 
generated by the control generator it is not necessary for it to 
store all the bits of each control word. By sensing the MSB of the 
previous control word and latching it for one word time the same 
function can be implemented with much reduced hardware. More formally,
if we have n = swl and
cword[1,0](C,X -> Y)
and if
X = x_0.2^0 + x_1.2^1 + ... + x_{n-1}.2^{n-1}
and Y is given by
Y(t) = 0          if x_{n-1}(t-1) = 0
Y(t) = 2^n - 1    if x_{n-1}(t-1) = 1
In other words the output is zero if the MSB of the previous input 
word was zero and the output is all ones if the MSB of the previous 
input word was one. If the first parameter is w then the output 
word depends on the MSB of the w-th previous word. This bit level 
definition of function, together with the definition of the control 





Cworddelay is a special bit serial FIFO storage primitive. 	It 
gives storage with a latency in "guineas" (multiples of one word
plus one bit). 
Flow Diagram & Simulation Pattern: 
[Figure: CWDLY block with inputs ctrlin and LSBctrl and output ctrlout, with simulation pattern of ctrlin and ctrlout.]
Syntax: 
cworddelay [w,del] (LSBctrl, ctrlin -> ctrlout)
Parameters:
param  min  max  constraint  meaning
1      1    *    integer     latency in "guineas"
2      0    1    integer     predelay on input
Function:
ctrlout = ctrlin
(see restrictions and description).
values of the parameter n. The event control generation is
achieved by use of some detect circuitry, a latch and reset circuitry
associated with the appropriate level of cycle.
Restrictions: 
All systems must have one and only one CONTROLGENERATOR. 
Only INH, the inhibit output and c1, the c1 level control
pulse are mandatory.
A cycle statement at any level must be preceded by cycle 
statements at all previous levels. 
Controlgenerator should not be called in an operator. 
Note that the simulator output on INH does not correspond to
the actual hardware signal. 
Simulator Checks and Warnings 
No checks are made. 
Circuit Design 
The circuit design of controlgenerator is based on a divide-by-n 
counter (where n is the count period) which is used to implement 
each cycle. The counter is implemented as a synchronous binary 
counter, made up of a series of set-reset toggle flip flops. These 
are chained with a combination of parallel and series carry chains, 
a maximum count detect and counter reset circuit. The choice of 
this counter architecture over alternatives was dictated by the 
fact that it gave the best trade-off between low area, high speed 
and modularity of layout composition for the required range of 
asynchronous requests. 
The problem of interfacing FIRST bit-serial systems to the real-
time sample and hold hardware and other systems can be approached 
in two ways. Firstly a single fast system clock can be divided 
down and used as required throughout the system. This is simple 
and reliable but may be inflexible. 	Where such a solution is 
either not possible or inappropriate it may be necessary to provide 
a synchronisation facility between the bit serial system clock and 
that of other systems. 	This is done by allowing the bit-serial 
system to operate in "burst mode" by computing a full sample cycle
and then, on completion, inhibiting its clock until the start of the
next sample. The INH output of the control generator provides a
"cycle complete" signal for use in such clock start-stop circuits.
To illustrate the syntax of controlgenerator, consider the follow-
ing instance. 






The simulation pattern shows a timing diagram of the outputs of 
this control generator and the effect of an asynchronous request on 
the input evr. 
dividers. The lowest level of hardware control is the two phase 
bit-clock which is invisible to the system designer. The next 
level of control is c1, the LSB marker pulse, which is high for the
duration of one bit time per word, in coincidence with LSB, and low
for the rest of the time. The next level of control is c2, which
is normally used to control multiplexing operations. The cycle
pattern of c2 is high for the first word of the cycle and low for
the remainder. Thus c2 marks and counts units of c1. Similar,
higher order control cycles can be generated above c2. 	Non-cyclic 
control pulses which are high for an initial interval corresponding 
to any of the levels of cyclic control and then low indefinitely 
thereafter are required for reset, initialisation etc. Such pulses 
need to be available as a result of an external, asynchronous 
request, ER1, ER2, ... etc. This type of control is defined as an 
event. 
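The two lowest levels of cyclic control can be modelled as a pair of cascaded counters. The C sketch below is a behavioural illustration only, with hypothetical names (control_generator, cg_init, cg_tick); it covers c1 and c2 and omits events, the INH output and higher cycle levels.

```c
#include <assert.h>

/* Two cascaded counters: bits within a word, words within a c2 cycle. */
struct control_generator {
    int per1, per2;   /* bits per word, words per c2 cycle */
    int bit, word;    /* current counter values            */
};

void cg_init(struct control_generator *g, int per1, int per2)
{
    assert(g != 0 && per1 >= 2 && per2 >= 2);
    g->per1 = per1;
    g->per2 = per2;
    g->bit = 0;
    g->word = 0;
}

/* Produce c1 and c2 for the current bit time, then advance the counters.
 * c1 is high during the LSB of every word; c2 is high for the whole of
 * the first word of each c2 cycle. */
void cg_tick(struct control_generator *g, int *c1, int *c2)
{
    assert(g != 0 && c1 != 0 && c2 != 0);
    *c1 = (g->bit == 0);
    *c2 = (g->word == 0);
    if (++g->bit == g->per1) {
        g->bit = 0;
        if (++g->word == g->per2)
            g->word = 0;
    }
}
```

With a word length of 5 and a c2 period of 4 words, 20 bit times contain four c1 pulses of one bit time each and one c2 pulse lasting five bit times.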
In the syntax for specifying CONTROLGENERATOR, the body of the 
definition consists of an arbitrary number of cycle statements, 
each of which can be optionally followed by an event statement. 
Each cycle statement has a single user chosen parameter which 
defines the period of that control cycle. The period counts units 
of the previous cycle length. 	Each cycle has one output, each 
event has an event request input and an event output. 	The first 
level of control is c1. This defines the system wordlength and is
high during the LSB and low until the next LSB. All outputs from 
the controlgenerator are synchronous. 	The event inputs can be 
Parameters: 
param   min  max  constraint
period  2    256  integer





Controlgenerator is a primitive with a variable number of inputs 
and outputs. Event request inputs ER1, ER2,... their associated
outputs E1, E2,... and their associated event statement are
optional. (To each ER1, ER2, ER3,... there corresponds respectively
E1, E2, E3,... and an event statement at the appropriate
level. To each c1, c2,... there corresponds a cycle(per1),
cycle(per2), ... statement.)
Function: 
c1 = 1            for all values of T
c2 = 2^n - 1      if T = 0 mod (per2)
   = 0            otherwise
c3 = 2^n - 1      if T = r mod (per1.per2)
                  where r = 0, 1,..., (per2 - 1)
   = 0            otherwise
where time reference starts at T=0 when the first word of each con-
trol is produced at the controlgenerator output. 
Because of the cyclic nature of the control pulses the control gen-
erator can be implemented as a bank of counters, or frequency 
1.5.2. Controlgenerator
Description: 
Special purpose hardware is needed to generate the control pulses 
required to keep bits, words, multiplexing and initialisation 
operations etc, in synchronisation. This is the function of the 
CONTROLGENERATOR primitive, which produces c1, c2, ... and event
levels of control.
Flow Diagram & Simulation Pattern: 
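[Figure: control generator with CYCLE[5] producing c1 and CYCLE[4] producing c2, with event request input evr; simulation pattern of c1 and c2.]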










Function: 
output = input 
latency = user chosen value of parameter 1 
Restrictions: 
Cbitdelay does not alter the value of the input in transferring
it to the output; it acts as a storage FIFO.
The maximum value of parameter 1 is determined only by floor
plan considerations; typically a value of less than or equal to 24
is appropriate if the largest primitive is a 16 coefficient bit
multiplier.
Simulator Checks and Warnings 
No checks are made. 
Circuit Design 
Cbitdelay is implemented as a shift register using a six transistor 
bit leaf cell. 	The input predelay is used as first cell in all 
instances of bit delay with latency greater than 1. 	The output 
buffer forms the last half cycle of delay. 
1.5. CONTROL PRIMITIVES
1.5.1. Cbitdelay
Description: 
Cbitdelay is a storage primitive for control. It is a first in 
first out bit serial memory. The main system use of cbitdelay is 
for implementing compensating delay to synchronise control streams. 
In particular it is used to make delay lines from which appropri-
ately delayed versions of control can be tapped. 




cbitdelay [latency] (input -> output)
Parameters:
param  min  max  constraint  meaning
1      1    32   integer     latency of output
Simulator Checks and Warnings 
Checks are made for coefficient (out of range) and for input syn-
chronisation timing. 
Circuit Design 
Worddelay is based on the three transistor RAM cell. 	The memory 
block architecture consists of a serial to parallel input register, 
a word organised block of RAM and a parallel to serial output 
register. 	The memory is organised as a FIFO with data items 
traversing the memory locations. This obviates address generation 
and the control associated with addressing. The input and output 
registers are clocked by c0, moving n bits serially each word time.
The internal RAM is clocked by c1, moving blocks of words in RAM
each bit time in such a way as to move all once per word time. 
Parameters:
param  min  max  constraint  meaning
1      8    *    integer     number of words storage
2      1    swl  integer     number of significant bits
3      0    1    integer     predelay on input
Function: 
Let swl=n. 
If bit m of input = 0 then:
output = input AND (2^m - 1)
If bit m of input = 1 then:
output = input AND ((2fl1)(21fl1)) 
latency = (m*n + 1) bits 
The latency of worddelay is determined by the user chosen value of 
parameter 1. 
Restrictions: 
The maximum value of parameter 1 is determined only by floor
plan considerations; typically values of between 12 and 64 might be
used.
The value of parameter 2 must be less than or equal to the
system word-length.
Do not connect the input of worddelay to VDD or GND.
Ctrl must be LSB control c1, for correct function of the worddelay
primitive.
1.4.2. Worddelay 
Description: 
Worddelay is a word oriented FIFO storage primitive. 	It is spe- 
cially organised for efficient storage of signals which utilise 
less than the full dynamic range of the system word-length, for 
example coefficients which do not require full precision quantisation.
Worddelay can be configured to store only the m least significant
bits of each input word. It assumes that bit m of the
input word is the sign bit and that the remaining bits are identical
sign bit repetitions. It uses bit m on output to replicate
these sign bit repetitions.





worddelay [w,m,del] (ctrl) input -> output
Function: 
output = input 
latency = user chosen value of parameter 1
Restrictions:
The output and input have the same significance.
Bitdelay does not alter the value of the input in transferring
it to the output. It acts as a storage FIFO.
The maximum value of parameter 1 is determined only by floor
plan considerations; typically a value of less than or equal to 24
is appropriate if the largest primitive is a 16 coefficient bit
multiplier.
Do not connect the input of bitdelay to VDD or GND.
Simulator Checks and Warnings 
No checks are made. 
Circuit Design 
Bitdelay is implemented as a shift register with a six transistor 
per bit leaf cell. 	The input predelay is used as first cell in 
all instances of bit delay with latency greater than 1. The output 
buffer forms the last half cycle of delay. 
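The FIFO behaviour of bitdelay can be sketched at the bit level as follows. This is only an illustrative software model, not the FIRST simulator: the function name `bitdelay`, the `fill` argument and the assumption that the register starts out holding zeros are all hypothetical, since the text does not state a power-up value.

```python
from collections import deque

def bitdelay(bits, latency, fill=0):
    """Model a bit-serial delay line: every output bit is the input bit
    seen `latency` bit times earlier. `fill` is an assumed initial
    register content; the source does not specify one."""
    if not 1 <= latency <= 32:
        raise ValueError("latency must be an integer in 1..32")
    register = deque([fill] * latency)   # shift register contents
    out = []
    for b in bits:
        register.append(b)               # shift the new bit in
        out.append(register.popleft())   # oldest bit leaves the FIFO
    return out

stream = [1, 0, 1, 1, 0, 0, 1, 0]
delayed = bitdelay(stream, 3)
```

The value of the stream is unchanged; only its position in time moves, which is why the primitive is suited to compensating delay.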
1.4. STORAGE PRIMITIVES 
1.4.1. Bitdelay
Description: 
Bitdelay is a storage primitive. It is a first in first out (FIFO) 
bit serial memory, implemented as a shift register. The main sys-
tem use of bitdelay is for implementing compensating delay to syn-
chronise signal data streams (cp. cbitdelay). 





bitdelay [latency] input -> output 
Parameters: 
param  min  max  constraint  meaning
1      1    32   integer     latency of output
The LSB of the difference will have this significance also. 
If the minuend and the subtrahend have one guard bit (i.e., 
the two MSBs are identical) then no overflow can occur. 
Two word subtraction can cause arithmetic growth of one bit. 
Overflow in the previous difference is flagged by the single bit 
value of carryout at LSB time, with value 0 - no overflow and 
value 1 - overflow. 
For single precision subtraction the borrow input is connected
to ground.
For correct function the ctrlin input must be LSB or c1 control.
Simulator Checks and Warnings
Checks are made for overflow and underflow and for input
synchronisation timing.
Circuit Design 
As for adder primitive (see section 4.5.2 circuit design). 
The latency of the difference can be user chosen by means of the 
first parameter. 
F is the word of bitwise borrow bits generated from the subtraction,
and has a fixed latency of only one bit.
More formally, if
A = a_0.2^0 + a_1.2^1 + ... + a_{n-2}.2^{n-2} - a_{n-1}.2^{n-1}
B = b_0.2^0 + b_1.2^1 + ... + b_{n-2}.2^{n-2} - b_{n-1}.2^{n-1}
C = c_0.2^0 + c_1.2^1 + ... + c_{n-2}.2^{n-2} - c_{n-1}.2^{n-1}
then F is given by
F = f_0.2^0 + f_1.2^1 + ... + f_{n-2}.2^{n-2} - f_{n-1}.2^{n-1}
where
f_1 = c_0 AND (a_0 XNOR b_0) OR ((NOT a_0) AND b_0)
f_j = f_{j-1} AND (a_{j-1} XNOR b_{j-1}) OR ((NOT a_{j-1}) AND b_{j-1})
where 2<=j<=n-1 and
f_0 = f_{n-1} AND (a_{n-1} XNOR b_{n-1}) OR ((NOT a_{n-1}) AND b_{n-1})
Note that, in this final expression for f_0, the values a_{n-1} and
b_{n-1} refer to the previous input words and f_{n-1} to the
corresponding previous word borrow bit. Thus f_0 is an overflow flag
for the previous subtraction.
Restrictions: 
1. 	For subtraction to function correctly the LSB of the minuend, 
the subtrahend and of the borrowin must have the same significance. 
param  min  max  constraint  meaning
1      1    32   integer     latency of difference
2      0    1    integer     predelay on minuend
3      0    1    integer     predelay on subtrahend
4      0    1    integer     predelay on borrowin
Function: 
difference = minuend - subtrahend - (borrowin & 1) 
latency = user chosen value of parameter 1 
borrowout: 	see below 
latency = 1 bit 
For notational convenience, suppose we have: 
subtract[m,0,0,0](cLSB) A,B,C -> D,F
and let swl=n. Then the minuend, A, and subtrahend, B, are assumed
to have the same interpretation, i.e., LSBs have the same significance.
D is the difference of A and B with only the LSB of C subtracted.
In normal use C will be replaced by GND so that this LSB
is zero. Subtracters can, however, be cascaded to form multiple 
precision subtraction, in the same way as for addition. This is 
achieved by connecting the borrowout from the previous subtracter 
to the borrowin of the next (cp. description of the add primitive). 
In this case the C input is used to propagate any borrow generated 
from the subtraction of less significant bytes into the subtraction 
of the more significant bytes. The first subtracter has its C 
input replaced by GND. 
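The cascading of subtracters can be sketched in software as below. This is a minimal bit-level model under stated assumptions, not the FIRST implementation: the helper names (`to_bits`, `from_bits`, `serial_subtract`, `double_precision_subtract`) are hypothetical, latency and the timing of the borrow word are ignored, and the final borrow of the lower word is passed directly to the upper word.

```python
def to_bits(value, n):
    """Two's complement bits of `value`, LSB first, n bits long."""
    return [(value >> j) & 1 for j in range(n)]

def from_bits(bits):
    """Interpret LSB-first bits as a two's complement integer."""
    unsigned = sum(b << j for j, b in enumerate(bits))
    return unsigned - (1 << len(bits)) if bits[-1] else unsigned

def serial_subtract(a_bits, b_bits, borrow_in=0):
    """Subtract b from a one bit per step, LSB first.
    Returns (difference_bits, final_borrow)."""
    borrow = borrow_in & 1               # only the LSB of borrowin is used
    diff = []
    for a, b in zip(a_bits, b_bits):
        diff.append(a ^ b ^ borrow)
        # next borrow = borrow AND (a XNOR b) OR ((NOT a) AND b)
        borrow = (borrow & (1 ^ a ^ b)) | ((1 ^ a) & b)
    return diff, borrow

def double_precision_subtract(x, y, n):
    """Two cascaded n-bit subtracters forming a 2n-bit difference."""
    xb, yb = to_bits(x, 2 * n), to_bits(y, 2 * n)
    low, borrow = serial_subtract(xb[:n], yb[:n], 0)    # C tied to GND
    high, _ = serial_subtract(xb[n:], yb[n:], borrow)   # borrow propagates
    return from_bits(low + high)
```

For example, with n = 8 the call `double_precision_subtract(256, 1, 8)` borrows from the upper word and yields 255.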
1.3.10. Subtract 
Description 
Subtract forms the bit-serial difference of the first two inputs. 
The third input is an external borrow, which is strobed at LSB time 
to initialise the subtracter. This input is normally connected to 
GND, but can be used to construct multiple precision subtracters 
(see function). The second output, the word of borrow bits, is 
provided for the same purpose. The latency of the difference out-
put can have any user chosen value in the allowed range. 










subtract [latency,del1,del2,del3] (ctrlin) -
minuend,subtrahend,borrowin -> difference,borrowout
Parameters: 
Function: 
max = max(in1,in2)
min = min(in1,in2)
latency = (n + 3) bits
Restrictions:
The value of parameter 1 must be the same as the system word-length
for correct function.
For correct function the inputs must both be positive, MSB=0.
An alternative version of order could be designed to cope with
negative and positive inputs but this requires greater hardware
area (see section on circuit design).
For correct operation ctrl must be c1.
Simulator Checks and Warnings
Checks are made for illegal values of parameter 1, input out of
range (underflow) and for input synchronisation timing.
Circuit Design 
The circuit design of order requires two storage shift registers, 
output select circuitry and a detect and control latch circuit. 
Detection is achieved by a small finite state machine or serial 
decoder which determines which input is the larger. This is imple-
mented in random logic using some fourteen transistors. 
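One way to picture the serial detection step is sketched below. It is an assumption about how such a decoder could work, not a description of the actual fourteen transistor circuit: since words arrive LSB first, a single flag updated at every bit position where the inputs differ ends up recording which positive input is larger. The names `order` and `to_unsigned` are hypothetical.

```python
def to_unsigned(bits):
    """Value of an LSB-first bit list, read as an unsigned integer."""
    return sum(b << j for j, b in enumerate(bits))

def order(in1_bits, in2_bits):
    """Return (max_bits, min_bits) for two positive LSB-first words.
    The flag is overwritten at each differing bit, so the most
    significant difference decides the final ordering."""
    in2_is_larger = False
    for a, b in zip(in1_bits, in2_bits):
        if a != b:
            in2_is_larger = b > a
    if in2_is_larger:
        return in2_bits, in1_bits
    return in1_bits, in2_bits

five = [1, 0, 1, 0, 0, 0]   # 5, LSB first, MSB = 0
nine = [1, 0, 0, 1, 0, 0]   # 9, LSB first, MSB = 0
mx, mn = order(five, nine)
```

Because the decision is only known after the MSB has arrived, both words have to be stored before they can be routed, which is consistent with the two storage shift registers mentioned above.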





The order primitive takes two input signals which are assumed to be
positive and orders them so that the larger is output on max and
the smaller is output on min. The original cell designs were done
for a specific application not requiring the more general form
capable of ordering both positive and negative inputs.




[Flow diagram not reproduced: order with outputs max and min.]
Syntax:
order[swl,del](ctrl) in1,in2 -> max,min
Parameters:
Let swl=n.
param  min  max  constraint  meaning
1      4    32   integer     system word-length
2      0    1    integer     predelay on input
repetitions. 
4. For correct function ctrlin must be the correctly synchronised
LSB pulse or c1 control.
Simulator Checks and Warnings 
Checks are made for invalid parameter values, incorrect data for-
mat, incorrect coefficient format and for input timing synchronisa-
tion. 
Circuit Design 
The multiplier design is a variant of the modified Booth serial-
parallel multiplier. It differs from the Lyon multiplier in
respect of recoding and output scaling. Recoding is not performed 
in an explicit block at the input but is distributed throughout the 
multiplier cells. The scaling is adapted to preserve the arith-
metic output format as explained in the previous section. The gen-
eration of the rounding signal is achieved using a compact shift 
register in the I/O channel and is thus made invisible to the sys-
tem designer. The basic leaf cells are two bit multiplier cells, 
composed from a recoder cell, a programmable add-subtract cell and 
some shift register cells. A block diagram is shown in Figure 3.9 
and an instance of circuit layout in Figure 3.10 (both in Chapter 
3). 
The consequence of this is that if the data is interpreted as being 
fractional 2's complement with 2 guard bits (sign repetition) and 
the coefficient is interpreted as being m bit fractional 2's complement
with the m-th bit as sign bit then the product will be
fractional 2's complement with 2 guard bits. This corresponds to 
the form in which the multiplier is often most conveniently config-
ured. 
In systems where the product scaling is not as required either the 
data or coefficient must be scaled prior to multiplication or the 
product must be scaled subsequently. Alternatively the double pre-
cision product might be used in conjunction with a format primi-
tive. 
Restrictions: 
1. Due to the nature of the multiplication algorithm employed
there must be two bits of sign extension on the data word to
prevent internal overflow, i.e., the three most significant bits of
the data word must be identical.
2. The parameter coeffbits should lie in the range:
4 <= coeffbits <= swl - 2
3. Because only the first (coeffbits) bits of the coeff word are
used as the multiplier coefficient, it is assumed that the MSB of
these is the sign bit. The remaining more significant bits making
up the full coefficient word of (swl) bits are assumed to be MSB
interpretation of data, coefficient and product. Other interpreta-
tions are now dealt with. 
If it is assumed that data and product have the same scaling
(arbitrary, not necessarily integer, i.e., the LSB of each has
the same significance) and if it is assumed that the coefficient
has integer interpretation then the truncated product
is:
product = X*Y*2^{-(m-1)}
If it is assumed that the data and product have the same scaling
and that the coefficient is not integer but integer scaled
by a factor of 2^k (where k is integer) then:
product = X*Y*2^{-k-(m-1)}
If it is assumed that the coefficient is integer and that the
product is interpreted as having the significance of the data
scaled by a factor of 2^p (where p is integer) then:
product = X*Y*2^{p-(m-1)}
If it is assumed that the coefficient is interpreted as
integer scaled by a factor of 2^k and the product is interpreted
as having the significance of the data scaled by a factor
of 2^p then:
product = X*Y*2^{p-k-(m-1)}
Rounding
Figure 111.2 shows how the rounded product is formed when the mul-






Rounding is performed by adding 2^{m-2} to the product before truncating.
This is achieved by the conventional minimal hardware
rounding scheme as used by Lyon. This rounding algorithm is inaccurate,
having a small positive mean error. This is because the
exact mid values are consistently rounded to the higher value. In
system structures with feed-back this effect may be unacceptable.
Where this is found to be the case different rounding hardware is
needed. This requires more complex circuitry and correspondingly
more chip area per multiplier. Rounding with zero mean error is
selected by setting the type parameter to 2.
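The positive mean error of this rounding scheme can be checked numerically. The sketch below assumes the integer interpretation, in which the full product X*Y loses its (m-1) least significant bits, and models truncation as an arithmetic right shift. The function names are hypothetical, and the zero mean error alternative selected by type 2 is not modelled because the text does not describe its algorithm.

```python
def truncated_product(x, y, m):
    """Drop the (m-1) least significant bits of the full product."""
    return (x * y) >> (m - 1)

def rounded_product(x, y, m):
    """Add 2**(m-2), i.e. half of the retained LSB, then truncate."""
    return (x * y + (1 << (m - 2))) >> (m - 1)

def mean_rounding_error(m):
    """Mean of (rounded - exact) over every possible pattern of the
    discarded bits, in units of the discarded LSB."""
    drop = m - 1
    errors = [(((v + (1 << (m - 2))) >> drop) << drop) - v
              for v in range(1 << drop)]
    return sum(errors) / len(errors)
```

With m = 4 the exact mid value 12/8 = 1.5 is rounded up to 2, and `mean_rounding_error(4)` is 0.5 rather than zero, which is the bias the text refers to.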
Interpretational Scaling 
The value of the product has been expressed in terms of an integer 
	
double precision product (cp. the dpmultiply primitive). 	The way 
in which the multiplier forms the truncated product is illustrated 
in Figure 111.1. 
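[Figure not reproduced: the data word X (n bits, with sign bit repetitions) and the coefficient word Y (m bits) from which the truncated product is formed.]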







param  min  max  constraint    meaning
1      0    2
2      4    24   even integer  coefficient word-length
3      0    1    integer       predelay on data
4      0    1    integer       predelay on coefficient
Note: See description for details on rounding.
Function: 
product = data * coeff * 2^{-(coeffbits-1)}
delayeddata = data 
where data, coefficient & product are interpreted as integers. 
latency = (1.5 * coeffbits + 2) 
where this latency is that of ctrlout, product & delayeddata. 
For notational convenience, suppose we have: 
multiply[0,m,0,0](clin -> clout) X,Y -> Z,W
and let swl=n. Then the data X (or multiplicand) consists of n
bits with the three most significant being identical (two sign bit
repetitions and a sign bit). The coefficient (or multiplier) Y
consists of n bits. Only the least significant m of these are used
by the multiplier; the most significant n-m bits are assumed to be
sign bit repetitions. The product of an n bit word with an m bit
word requires (n+m-1) bits to represent it. Since the bit serial
system word-length is fixed at n bits, this product can either be 
truncated or rounded to give n bits, or it can be split into two n 
bit words, a most significant and a least significant, giving a 
Multiply 
Description 
Multiply forms the truncated or rounded product of its two inputs.
The data (multiplicand) is assumed to have its top three bits 
identical. Coefficient (multiplier) quantisation is assumed to be 
less than or equal to that of the data. The significant bits of 
the coefficient are assumed to be right justified and the remaining 
bits are assumed to be sign bit repetitions. The input data is 
available, synchronously with the product, as a second output. 
Flow Diagram & Simulation Pattern: 
[Flow diagram not reproduced: multiply with inputs data and coeff and outputs delayeddata and product.]
Syntax: 
multiply [type,coeffbits,del1,del2] (ctrlin -> ctrlout) -
data,coeff -> product,delayeddata
Parameters: 
Restrictions: 
The output and input have the same significance. 
Multiplex does not alter the value of the input in transferring
it to the output but only selects between the two inputs
according to the control word.
Ctrlin must be either 0 or (2^n - 1) for correct functioning of
multiplex. For any other control input word the output of multiplex
is not a valid word; the function breaks down into bit-wise selection.
Thus ctrlin must be a level of control above c1.
Simulator Checks and Warnings 
Checks are made for input synchronisation timing. 
Circuit Design 
The multiplex circuit is a straightforward combinational logic 
design, with no special features. 
latency = user chosen value of parameter 1 
For notational convenience, suppose we have: 
multiplex [m,0,0] (C) X,Y -> Z
and let swl=n. The multiplex primitive is a data selector of one
from two input streams. More formally the bit-wise function of
multiplex can be defined as follows. Define the input words X, Y
as:
X = x_0.2^0 + x_1.2^1 + ... + x_{n-2}.2^{n-2} - x_{n-1}.2^{n-1}
Y = y_0.2^0 + y_1.2^1 + ... + y_{n-2}.2^{n-2} - y_{n-1}.2^{n-1}
Define the control word C as:
C = c_0.2^0 + c_1.2^1 + ... + c_{n-2}.2^{n-2} - c_{n-1}.2^{n-1}
For each word either c_j = 0 for all values of j or c_j = 1 for all
values of j.
Let the output Z be denoted by:
Z = z_0.2^0 + z_1.2^1 + ... + z_{n-2}.2^{n-2} - z_{n-1}.2^{n-1}
Then Z is defined as follows:
z_j = ((x_j) AND (NOT(c_j))) OR ((y_j) AND (c_j))
In other words if c_j is 0 then z_j is x_j, or else if c_j is 1 then
z_j is y_j.
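The bit-wise selection rule translates directly into word level bit operations. The following sketch works on unsigned n-bit patterns; the function names `multiplex_word` and `is_valid_select` are hypothetical and latency is ignored.

```python
def multiplex_word(x, y, c, n):
    """Apply z_j = (x_j AND NOT c_j) OR (y_j AND c_j) to every bit of
    the n-bit patterns x, y and c."""
    mask = (1 << n) - 1
    return ((x & ~c) | (y & c)) & mask

def is_valid_select(c, n):
    """True if the control word is all zeros or all ones, the only two
    cases in which a whole input word is selected."""
    mask = (1 << n) - 1
    return (c & mask) in (0, mask)

z_first = multiplex_word(0b1010, 0b0101, 0b0000, 4)   # selects X
z_second = multiplex_word(0b1010, 0b0101, 0b1111, 4)  # selects Y
z_mixed = multiplex_word(0b1010, 0b0101, 0b0011, 4)   # bit-wise mixture
```

The last call shows the breakdown into bit-wise selection when the control word is neither all zeros nor all ones: the result takes its two upper bits from X and its two lower bits from Y.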
1.3.2. Multiplex 
Description 
Multiplex is a one of two input select switch controlled by c2 or a 
higher level of control. 	The output latency is a user chosen 
value. 





multiplex [latency,del1,del2] (ctrlin) in1,in2 -> output
Parameters:
param  min  max  constraint  meaning
1      1    32   integer     latency of output
2      0    1    integer     predelay on in1
3      0    1    integer     predelay on in2
Function:
output = in1 	if ctrlin = 0
output = in2 	if ctrlin = (1 << swl) - 1
hardware can be used. This however gives rise to primitives which 
have zero or negative latency. The concept of a primitive with 
zero or negative latency is problematic for the implementation of 
an event driven word level simulator. Thus for FIRST mshift has 
been implemented as a primitive with one bit latency and no 
hardware overflow detect, though the simulator gives warnings of 
overflow. The circuit can then be implemented as a multiplexer 
with one input tied to ground and the other connected to the input 
signal. A control delay line and distributed nor gate with 
inverter then make up the control function. 
Function: 
output = input * (2^p)
latency = 1 bit
Restrictions:
The parameter p should be less than (n-1), and must be less
than n.
For correct operation ctrl must be c1.
For any value of p the result may be an overflow. This will
occur if the (p+1) most significant bits of the input are not
identical.
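The overflow condition can be illustrated with a small word level model. This is a sketch under assumptions: the function name `mshift` is reused for a hypothetical Python helper, the result is wrapped to n bits as a hardware shifter without overflow detection would produce it, and the overflow flag mirrors the kind of warning the simulator is said to give.

```python
def mshift(value, p, n):
    """Return (result, overflow) for an arithmetic left shift by p bits
    of an n-bit two's complement word."""
    if not 1 <= p < n:
        raise ValueError("p must satisfy 1 <= p < n")
    mask = (1 << n) - 1
    pattern = value & mask
    top = pattern >> (n - p - 1)                 # the (p+1) MSBs
    overflow = top not in (0, (1 << (p + 1)) - 1)
    shifted = (pattern << p) & mask              # bits shifted out are lost
    result = shifted - (1 << n) if shifted >> (n - 1) else shifted
    return result, overflow
```

With n = 8 and p = 2, the input 5 gives (20, False), while the input 40 has differing top three bits and gives (-96, True) because 160 cannot be represented in 8 bits.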
Simulator Checks and Warnings 
Checks are made for out of range parameter values, for overflow and 
for input synchronisation timing. 
Circuit Design 
Several versions of hardware are possible, including overflow 
detect and a variety of latencies. Two problems arise in deciding 
how to implement mshift. The first problem concerns overflow and 
the second latency. If overflow detect is omitted then a consider- 
able reduction in circuitry is achievable. 	If additionally the 
output bits are made available with minimum latency then minimum 





The mshift primitive functions as repeated arithmetic left shift,
the number of repetitions being defined by the user chosen value of
parameter 1. In other terminology the effect is the same as multiplying
the input signal by 2^p, where p is positive and integer.
Flow Diagram & Simulation Pattern: 
[Flow diagram not reproduced: mshift with one input and one output.]
Syntax: 
mshift [p,del] (ctrl) input -> output
Parameters:
Let swl = n
param  min  max
1      1    n
2      0    1
Function: 
output = input / (2^p)
latency = (p + 3) bits
Restrictions:
The parameter p should be less than (n-1), and must be less
than n.
For correct operation ctrl must be c1.
If a value of p > n-1 were to be chosen, then the division
would underflow.
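A word level sketch of the dshift function and its latency follows. It assumes that truncation behaves as an arithmetic right shift, so that negative values round towards minus infinity; the helper names are hypothetical.

```python
def dshift(value, p, n):
    """Divide an n-bit two's complement value by 2**p using an
    arithmetic right shift (the sign bit is replicated)."""
    if not 1 <= p <= n - 1:
        raise ValueError("p must satisfy 1 <= p <= n-1")
    return value >> p

def dshift_latency(p):
    """Latency of dshift in bits, as given in the function definition."""
    return p + 3
```

For example `dshift(7, 1, 8)` gives 3 and `dshift(-7, 1, 8)` gives -4, so the truncation error is always in the same direction.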
Simulator Checks and Warnings 
Checks are made for out of range parameter values and input syn-
chronisation timing. 
Circuit Design 
The circuit for dshift is implemented by using a one bit shift 
register latch in the signal path. This either passes the input 
signal to the output or latches a previous input. 	Circuitry for 
controlling this latch function consists of a cascadeable control 
delay line together with an associated distributed nor gate and 
inverter. 	Finally there is an initial input delay of one bit and 
an output stage. 





The dshift primitive functions as repeated arithmetic right shift,
the number of repetitions being defined by the user chosen value of
parameter 1. In other terminology the effect is the same as dividing
the input signal by 2^p, where p is positive and integer, and
then truncating to n bits.





dshift[p,del](ctrl) input -> output 
Parameters: 
Let swl = n
param  min  max  constraint  meaning
1      1    n-1  integer     power of 2 divide
2      0    1    integer     predelay on input
4 <= coeffbits <= swl - 2
Due to the fact that only the first (coeffbits) bits of the
coeff word are used as the multiplier coefficient, it is assumed
that the MSB of these is the sign bit. The remaining more significant
bits making up the full coefficient word of (swl) bits are
assumed to be MSB repetitions.
For correct function ctrlin must be the correctly synchronised
LSB pulse or c1 control.
The MSWprod of a product always emerges one word later than 
the LSWprod. 
Simulator Checks and Warnings 
Checks are made for incorrect data format, incorrect coefficient  
format and for input synchronisation timing. 
Circuit Design 
The multiplier design is derived from the modified Booth serial-
parallel multiplier used for the multiply primitive. The hardware
architecture of the multiply primitive forms all (n+m-1) bits of
the double precision product internally, so that dpmultiply 
includes some additional hardware to extract these bits, together 
with the added control and sign extend circuitry. 
consists of n bits. Only the least significant m of these are used
by the multiplier; the most significant n-m bits are assumed to be
sign bit repetitions. The product of an n bit word with an m bit
word requires (n+m-1) bits to represent it. Since the bit serial
system word-length is fixed at n bits, this product can either be 
truncated or rounded to give n bits (cp. multiply primitive) or it 
can be split into two n bit words, a most significant and a least 
significant, giving a double precision product. The double preci-
sion product is formed by packing the n least significant bits of 
the product in LSWprod and the remaining bits contiguously in
MSWprod starting at LSB. The remaining bits of MSWprod are sign 
bit extensions. 
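The packing of the double precision product into two words can be sketched as follows. This is an illustrative integer model only: the function name `dpmultiply_words` is hypothetical, latencies are ignored, LSWprod is returned as an unsigned n-bit pattern, and MSWprod is obtained with an arithmetic right shift so that its unused upper bits are sign extensions.

```python
def dpmultiply_words(data, coeff, n, addend=0):
    """Split data*coeff + addend into (LSWprod, MSWprod) for an n-bit
    system word-length."""
    product = data * coeff + addend
    lsw = product & ((1 << n) - 1)   # n least significant bits
    msw = product >> n               # remaining bits, sign extended
    return lsw, msw

lsw, msw = dpmultiply_words(100, -7, 8)
```

Recombining the two words as `(msw << 8) + lsw` recovers the full product, here -700.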
Addend 
The product output can incorporate a restricted offset added into
it in the process of its formation. This is a consequence of the 
hardware implementation of the multiplier (cp section on circuit 
design). 
Restrictions: 
1. Due to the nature of the multiplication algorithm employed
there must be two bits of sign extension on the data word to 
prevent internal overflow, i.e., the three most significant bits of 
the data word must be identical. 
2. 	The parameter coeffbits should lie in the range: 
param  min  max  constraint    meaning
1      4    32   even integer  coefficient word-length
2      0    1    integer       predelay on data
3      0    1    integer       predelay on coeff
4      0    1    integer       predelay on addend
Function: 
Let n=swl and m=coeffbits then: 
LSWprod = (data * coeff + addend) AND (2^n - 1)
MSWprod = (data * coeff + addend) AND ((2 -l)(2"1)) 
deldata = data 
where data, coeff, addend & product are interpreted as integers. 
Latencies are given as follows:
ctrlout latency = (1.5 * m + 1) bits
deldata latency = (1.5 * m + 1) bits
If swl >= (1.5 * m + 1) then
LSWprod latency = 1 bit
MSWprod latency = (n + 1) bits
otherwise:
LSWprod latency = (1 + (1.5 * m + 1) - n) bits
MSWprod latency = (1.5 * m + 2) bits
For notational convenience, suppose we have:
dpmultiply[m,0,0,0](clin -> clout) X,Y,A -> P,Q,Z
and let swl=n. Then the data X (or multiplicand) consists of n
bits with the three most significant being identical (two sign bit
repetitions and a sign bit). The coefficient (or multiplier) Y
1.3.4. Dpmultiply 
Description: 
Dpmultiply is a bit-serial multiplier which outputs a double preci-
sion product on two wires. The lower precision bits are packed, 
right justified, on one output and the remaining bits are packed, 
again right justified, on the other output, whose remaining bits 
are sign extensions. The two parts of the product emerge separated 









[Flow diagram not reproduced: dpmultiply with outputs deldata, MSWprod and LSWprod.]
Syntax: 
dpmultiply[coeffbits,del1,del2,del3](ctrlin -> ctrlout) -
data,coeff,addend -> LSWprod,MSWprod,deldata
Parameters: 
Circuit Design 
The circuit implementation of constgen consists of a parallel load
serial output m bit shift register, and control circuitry to govern 
the loading. The shift register has its parallel load inputs 
hardwired to correspond with the values of constspec and the bit u 
is latched to generate the (n-m) bit repetitions.
* see below 
Function: 
Let the base two expression for k be as follows: 
k = k_0.2^0 + k_1.2^1 + ... + k_{m-2}.2^{m-2} + k_{m-1}.2^{m-1}
Note that m <= n and that this is unsigned integer representation,
not two's complement representation. Define
x = x_0.2^0 + x_1.2^1 + ... + x_{n-2}.2^{n-2} - x_{n-1}.2^{n-1}
where x_j = k_j for j<m and x_j = k_{m-1} for j>=m. Then x is composed of
the m bits of k right justified with n-m MSB repetitions left justified
to make up an n bit word. The interpretation of x is as
two's complement integer.
output = x 
latency = 1 bit 
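The construction of the output word from the coded constant can be sketched as below. The function name `constgen_word` is hypothetical; it takes the unsigned code k, the significant word-length m and the system word-length n, and returns both the LSB-first bit list and its two's complement value.

```python
def constgen_word(k, m, n):
    """Build the n-bit constant: the m bits of k, right justified,
    followed by (n - m) repetitions of the bit k_{m-1}."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    if not 0 <= k < (1 << m):
        raise ValueError("k must fit in m bits as an unsigned integer")
    bits = [(k >> j) & 1 for j in range(m)]   # x_j = k_j for j < m
    bits += [bits[-1]] * (n - m)              # x_j = k_{m-1} for j >= m
    unsigned = sum(b << j for j, b in enumerate(bits))
    value = unsigned - (1 << n) if bits[-1] else unsigned
    return bits, value
```

For instance the 3-bit code 0b101 with n = 8 expands to the pattern 11111101, which reads as -3, whereas 0b011 expands to +3.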
Restrictions: 
For correct function parameter 1 must be the same as the system
word-length.
ctrl must be c1 control.
Simulator Checks and Warnings 
Checks are made for input synchronisation timing. 




Each c1 cycle, constgen outputs, LSB first, the m bits defined by
the user chosen value of parameter 2 followed by (n-m) MSB repetitions
to make up the full n bit constant word. Constgen is, in
effect, a one word bit-serial ROM. The use of constgen in fixed 
coefficient designs saves generating required constant values off 
chip. 





constgen [sigwl,constspec] (ctrl) -> output
Parameters: 
Let n=swl, m=sigwl, k=constspec 
param min max constraint meaning 
1 	1 	32 	integer 	significant word-length 
2 * * integer coded constant 
Circuit Design 
The adder circuit design was adapted from the programmable add-
subtract circuit used in the multiplier. It uses data select pass 
transistor EXOR networks, giving minimum transistor count and power
dissipation. This style of design can, however, give rise to speed 
and testability problems. 




For addition to function correctly the LSB of each addend and 
of the carryin must have the same significance. The LSB of the sum 
will have this significance also. 
If each addend has one guard bit (i.e., the two MSBs are
identical) then no overflow or underflow can occur.
Two word addition causes arithmetic growth of one bit. Overflow
in the previous sum is flagged by the single bit value of carryout
at LSB time, with value 0 signifying no overflow/underflow
and value 1 signifying overflow/underflow.
For single precision addition the carryin input is connected
to ground.
The carryin/carryout configuration is designed for easy construction
of multiple precision addition.
For correct function the ctrlin input must be LSB or c1 control.
Simulator Checks and Warnings
Checks are made for overflow and underflow and for input
synchronisation timing.
[Figure not reproduced: two add primitives cascaded for double precision addition, with inputs a1, a2, b1, b2 and outputs y1, y2.]
F is the word of bitwise addition carry bits generated from the 
addition, and has a fixed latency of only one bit. More formally, 
if 
A = a_0.2^0 + a_1.2^1 + ... + a_{n-2}.2^{n-2} - a_{n-1}.2^{n-1}
B = b_0.2^0 + b_1.2^1 + ... + b_{n-2}.2^{n-2} - b_{n-1}.2^{n-1}
C = c_0.2^0 + c_1.2^1 + ... + c_{n-2}.2^{n-2} - c_{n-1}.2^{n-1}
then F is given by
F = f_0.2^0 + f_1.2^1 + ... + f_{n-2}.2^{n-2} - f_{n-1}.2^{n-1}
where
f_1 = c_0 AND (a_0 XOR b_0) OR (a_0 AND b_0)
f_j = f_{j-1} AND (a_{j-1} XOR b_{j-1}) OR (a_{j-1} AND b_{j-1})
where 2<=j<=n-1 and
f_0 = f_{n-1} AND (a_{n-1} XOR b_{n-1}) OR (a_{n-1} AND b_{n-1})
Note that, in this final expression for f_0, the values a_{n-1} and
b_{n-1} refer to the previous input words and f_{n-1} to the
corresponding previous word carry bit. Thus f_0 is an
overflow/underflow flag for the previous addition.
param  min  max  constraint  meaning
1      1    32   integer     latency of sum
2      0    1    integer     predelay on addend1
3      0    1    integer     predelay on addend2
4      0    1    integer     predelay on carryin
Function:
sum = addend1 + addend2 + (carryin & 1)
latency = user chosen value of parameter 1 
carryout : 	see below 
latency = 1 bit 
For notational convenience, suppose we have: 
add[m,0,0,0](cLSB) A,B,C -> S,F 
and let swl=n. Then the addends A, B are assumed to have the same 
interpretation i.e., LSBs have the same significance. S is the sum 
of A and B with only the LSB of C added in. In normal use C will 
be replaced by GND so that this LSB is zero. Adders can, however, 
be cascaded to form multiple precision accumulation. 	This is 
achieved by connecting the carryout from the previous adder to the 
carryin of the next as shown in the figure below, which illustrates 
adders configured for double precision addition. In this case the 
C input is used to propagate any carry generated from the addition 
of less significant bytes into the addition of the more significant 
bytes. The first adder has its C input replaced by GND. 
The latency of the sum can be user chosen by means of the first 
parameter. 
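The cascading of adders for double precision can be sketched at the bit level as follows. This is a simplified software model under stated assumptions, not the FIRST cell: the helper names are hypothetical, latency and the timing of the carry word F are ignored, and the final carry of the lower word is handed directly to the upper word.

```python
def to_bits(value, n):
    """Two's complement bits of `value`, LSB first, n bits long."""
    return [(value >> j) & 1 for j in range(n)]

def from_bits(bits):
    """Interpret LSB-first bits as a two's complement integer."""
    unsigned = sum(b << j for j, b in enumerate(bits))
    return unsigned - (1 << len(bits)) if bits[-1] else unsigned

def serial_add(a_bits, b_bits, carry_in=0):
    """Add two LSB-first words one bit per step.
    Returns (sum_bits, final_carry)."""
    carry = carry_in & 1                 # only the LSB of carryin is used
    total = []
    for a, b in zip(a_bits, b_bits):
        total.append(a ^ b ^ carry)
        # next carry = carry AND (a XOR b) OR (a AND b)
        carry = (carry & (a ^ b)) | (a & b)
    return total, carry

def double_precision_add(x, y, n):
    """Two cascaded n-bit adders forming a 2n-bit sum."""
    xb, yb = to_bits(x, 2 * n), to_bits(y, 2 * n)
    low, carry = serial_add(xb[:n], yb[:n], 0)     # C tied to GND
    high, _ = serial_add(xb[n:], yb[n:], carry)    # carry propagates
    return from_bits(low + high)
```

With n = 8, `double_precision_add(255, 1, 8)` generates a carry out of the lower word and yields 256.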
1.3.2. Add
Description 
Add forms the bit-serial sum of the first two inputs. 	The third 
input is an external carry, which is strobed at LSB time to ini-
tialise the adder. This input is normally connected to GND, but 
can be used to construct multiple precision adders (see function). 
The second output, the word of carry bits, is provided for the same 
purpose. 	The latency of the sum output can have any user chosen 
value in the allowed range. 
Flow Diagram & Simulation Pattern: 
[Flow diagram not reproduced.]


















add [latency,del1,del2,del3] (ctrlin) -
addend1,addend2,carryin -> sum,carryout
Parameters: 
complement) rather than inversion and addition of one bit at LSB 
significance, (two's complement) was made in order to achieve a 
hardware saving of a half adder. This saving is not significant 
but the added accuracy was not needed for the applications for 
which the primitive was originally designed. In a comprehensive 
cell library another variant of absolute might be made available, 
giving the accurate two's complement version. 
Function: 
output = input 	if MSB=0
= NOT(input) 	if MSB=1
latency = (swl + 3) bits
Restrictions:
The value of parameter 1 must be the same as the system word-length
for correct function.
If the sign bit of the input is negative (MSB=1) then the
operation of absolute is to invert the input. This one's complement
function gives rise to a one bit error at LSB significance in
the output, but results in smaller hardware area (see section on
circuit design). In principle there is no reason why a two's complement
version could not also be available.
For correct operation ctrl must be c1.
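The one bit error of the one's complement scheme is easy to see in a word level sketch. The function name `absolute_ones_complement` is hypothetical and the model ignores latency; it operates on the n-bit two's complement pattern of the input.

```python
def absolute_ones_complement(value, n):
    """Pass non-negative inputs through unchanged and bitwise invert
    negative inputs, as the absolute primitive is described to do."""
    mask = (1 << n) - 1
    pattern = value & mask
    if pattern >> (n - 1):            # MSB = 1: negative input
        pattern = ~pattern & mask     # invert all n bits
    return pattern
```

For n = 8 the input -5 produces 4 rather than 5, which is the error of one LSB mentioned in the restrictions. The most negative input, -128, produces 127 and so remains representable, which a true two's complement negation would not.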
Simulator Checks and Warnings 
Checks are made for illegal values of parameter 1 and for input 
synchronisation timing. 
Circuit Design 
The circuit is implemented as a shift register to store the input 
word, a sign detect and programmable output inversion circuit fol-




1.3. ARITHMETIC PRIMITIVES 
1.3.1. Absolute
Description: 
The absolute primitive performs the modulus or full-wave rectify 
function on two's complement data. It outputs the input signal if 
this is non-negative or the inverted signal if it is negative. 





absolute [swl,del] (ctrl) input -> output 
Parameters: 
param  min  max  constraint  meaning
1      6    32   integer     system word-length
2      0    1    integer     predelay on input
-6- 
where w is a number of words and b is a number of bits with 0<=b<n,
then Y(T) is defined when (T mod n)=b.
Usually the values are just interpreted as words X and Y, with 
latency d. 	However, if reference to a specific bit at a specific 
input time is needed then this can easily be done, using the above 
notation. 	For simplicity, values are interpreted throughout as 
being integer. Other interpretations are explained where appropriate.
Note that the abbreviation swl, for system word-length, is
used frequently.
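As a small illustration of the latency notation, the following Python sketch splits a latency d into whole words w and residual bits b, and lists the bit times at which the LSB of successive output words would appear. The function names are hypothetical, and the example latency of swl + 3 bits is the figure quoted for the absolute primitive.

```python
def split_latency(d, n):
    """Decompose a latency d (in bits) as d = n*w + b with 0 <= b < n."""
    w, b = divmod(d, n)
    return w, b


def output_lsb_times(d, n, count):
    """Bit times T at which the LSB of the first `count` output words appears,
    given that input words start at T = 0, n, 2n, ..."""
    return [d + k * n for k in range(count)]


n = 8                      # system word-length (swl)
d = n + 3                  # latency of (swl + 3) bits
w, b = split_latency(d, n)
times = output_lsb_times(d, n, 4)
```

Every listed time satisfies (T mod n) = b, which is the condition under which the output word is defined.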
-5- 
functioning. 	In order to simplify the notation, the node and the 
value on it are given the same name. As a value this is inter-
preted as an arbitrary word value at an arbitrary time. The values 
of the inputs will all occur simultaneously (i.e., each primitive
has time-aligned inputs). Each resulting output value is defined 
with an associated latency, which is the delay from the input time 
to the time when the corresponding output occurs (the outputs of a 
primitive may not be time aligned). This notation can be directly 
derived from a more rigorous notation specifying distinct node 
names, bit and word values at specific times. The latter notation 
is required when more complicated reference to bit values or inter- 
nal state values is needed in the definition of function. 	This 
latter notation is as follows. 
Let T=0,1,2,... denote bit time. Suppose that this time is
local to the inputs of the primitive in question, so that at time
T=0 the LSB appears at the inputs. Let the system word-length be n
and suppose that X is an input node. Further let T=t where (t mod
n)=0; then the value on X at time t is:
\[ X = x(t) \cdot 2^{0} + x(t+1) \cdot 2^{1} + \dots + x(t+n-2) \cdot 2^{n-2} - x(t+n-1) \cdot 2^{n-1} \]
where x(j) = 0,1. This is often abbreviated to:
\[ X = x_{0} \cdot 2^{0} + x_{1} \cdot 2^{1} + \dots + x_{n-2} \cdot 2^{n-2} - x_{n-1} \cdot 2^{n-1} \]
These expressions represent the 2's complement integer interpretation
of the bit stream which occurs at X from T=t till T=t+n-1.
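As a worked instance of this interpretation, take an illustrative word-length of n=4 (chosen only for brevity) and the bit stream 1, 0, 1, 1 arriving LSB first, so that the last bit carries the negative weight:

```latex
\[
X = 1 \cdot 2^{0} + 0 \cdot 2^{1} + 1 \cdot 2^{2} - 1 \cdot 2^{3}
  = 1 + 0 + 4 - 8 = -3
\]
```

The same stream with the final bit equal to 0 would be read as 1 + 0 + 4 = 5.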
If the latency of an output node Y relative to X is d=nw+b 
-4- 
information would enable a skilled circuit designer to enter new 
primitives into the compiler, given the system requirement for 
such. The contents of this section are not of primary concern to 
the system designer who wishes only to use the compiler. 
The primitives fall into the following categories: arithmetic 
function, storage, control, formatting and miscellaneous. Within 
these functional groupings they are presented alphabetically for 
convenience of reference. For each primitive a brief description 
is given. A flow diagram representation is shown with a simulation 
pattern. The FIRST HDL syntax is then listed. A short explanation 
of the parameters, a definition of function and notes on restric-
tions for valid use are given. The checks that the simulator car-
ries out are listed. These greatly aid system description debug-
ging and are not, in general, duplicated by the corresponding 
hardware. In addition to valid node connectedness checks the most 
frequently made check is for input timing synchronisation, which 
identifies latency inconsistencies in a design. 	Finally some 
relevant notes on circuit design are added. 
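The input timing synchronisation check mentioned above can be sketched in Python as follows. This is an assumed, simplified model and not the simulator's actual algorithm: the netlist format, the function name and the stage names are all hypothetical, instances are listed in signal-flow order, and arrival times are compared exactly in bit times.

```python
def check_input_synchronisation(netlist, source_times):
    """Report instances whose inputs do not all arrive at the same bit time.

    netlist: list of (name, input_nodes, {output_node: latency}) tuples,
             ordered so that every node is driven before it is used.
    source_times: {node: arrival_time} for the primary inputs.
    """
    arrival = dict(source_times)
    errors = []
    for name, inputs, outputs in netlist:
        times = {node: arrival[node] for node in inputs}
        if len(set(times.values())) > 1:
            errors.append((name, times))      # latency inconsistency found
        latest = max(times.values())
        for node, latency in outputs.items():
            arrival[node] = latest + latency
    return errors


# stage_b receives x directly and y after 11 bits of latency: misaligned.
unbalanced = [
    ("stage_a", ["x"], {"y": 11}),
    ("stage_b", ["x", "y"], {"z": 1}),
]
# Delaying x by the same 11 bits restores alignment.
balanced = [
    ("stage_a", ["x"], {"y": 11}),
    ("delay", ["x"], {"xd": 11}),
    ("stage_b", ["xd", "y"], {"z": 1}),
]
unbalanced_report = check_input_synchronisation(unbalanced, {"x": 0})
balanced_report = check_input_synchronisation(balanced, {"x": 0})
```

The second netlist shows the usual remedy, which is to insert a matching delay on the early path.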
1.2. Notation 
Some notation is required for defining primitive function. 
This can be more or less formal. In particular, each primitive 
input or output node will have a name and at each of these named 
nodes bit streams, grouped into words as marked by the associated 
LSB or ci control, will appear throughout the time the primitive is 
-3- 
List of Primitives 
Arithmetic Storage 
Absolute 	Bitdelay 



















Substrate bias generator 
Table 111.1 
for its use. As this text is an introduction to the design metho-
dology underlying such a compiler and to the system design tech-
niques that can be applied with it, the technical circuit and pro-
gramming issues are not dealt with in detail. 
In the first sub-section the notation is explained which is 
used to describe the function and operation of the primitives. 
Next, in the main numbered sections of the chapter, a specific 
primitive library is defined and described. Finally, the chapter 
ends with a short section which indicates how primitives are 
designed and defines their formal interfaces to the compiler. This 
section is intended to give a flavour of what is involved in build-
ing the primitive library as part of a silicon compiler. Such 
-2- 
1. PRIMITIVE LIBRARY
1.1. Introduction 
In this chapter the elements of a bit-serial primitive library 
are described. 	Primitives are the lowest functional elements in 
the system hierarchy. They exist in hardware and all systems will 
be constructed from them. In the abstract case it may be possible 
to define a canonic or covering set of such primitives, which will 
deal with all problems. However, practical requirements for 
hardware and computational efficiency often demand that special 
solutions be developed for particular applications. The approach 
adopted has been to build a library of primitives to cover the
applications of most interest in signal processing. Where no
primitive exists to meet a requirement, it is necessary to
custom design and verify one and then
add it to the library. Our experiments in system design using the 
techniques outlined in this text demonstrate that a wide range of 
applications can be satisfied using a set of a dozen or so primi-
tives. These are listed in Table 111.1. 
It is a primary feature of this library that all of the com-
posed primitives obey the geometric, electrical and signaling con-
ventions set out in chapter 3. The details of the design of each 
primitive are technology specific and each primitive library 
requires a separate accompanying technical document. This is
needed for documentation and maintenance of the compiler, but not
APPENDIX II : THE FIRST PRIMITIVE LIBRARY 
ABSTRACT 
This appendix describes the FIRST primitive hardware. 
The primitive hardware was specified by P. B. Denyer and 
D. Renshaw and was designed by D. Renshaw, U. S. C. Wal-
lace, D. J. Talbot and N. Henderson. Hardware was 
tested by D. Renshaw and N. Henderson. The contents of
this appendix were written by D. Renshaw; they appear in a
reduced form in "VLSI Signal Processing: A Bit-Serial 
Approach" by P. B. Denyer and D. Renshaw (Addison-
Wesley). 
- 1.31 - 
[Figure 4: diagram not reproduced]
[Figure 5: diagram not reproduced]
- 1.30 - 
OPERATOR F7 di1,di2,di3,gi,fi,bi,tfi,tbi -> -
do1,do2,do3,fo,bo,tfi1,tfi2,tfi3,tbo1,tbo2,tbo3
SIGNAL di, do, tfo, tbo
Object [1] di,gi,fi,bi,tfi,tbi -> do,fo,bo,tfo,tbo -
TIMES 3 WITH -
fo -> fi -
tfo => tfi = tfi1, tfi2, tfi3 -
bi <- bo -
tbi <= tbo = tbo1, tbo2, tbo3
di = di1, di2, di3
do = do1, do2, do3
END
OPERATOR F8 (ci1, ci2 -> cc11 THROUGH 13, cc21 THROUGH 23) -
i1, i2 -> t11 THROUGH 13, t21 THROUGH 23
SIGNAL o1, o2
CONTROL co1, co2
Thing [1] (ci1,ci2->co1,co2) i1,i2 -> o1,o2 TIMES 3 WITH -
(ci1, ci2 => co1 = cc11 THROUGH 13 : co2 = cc21 THROUGH 23) -
o1, o2 => i1 = t11 THROUGH 13 : i2 = t21 THROUGH 23
END
Note that in the first of these examples the signal names have been 
chosen with a mnemonic letter code as follows. 
gi   global input
di   distinct inputs
do   distinct outputs
fi   forward cascade input
fo   forward cascade output
bi   backward cascade input
bo   backward cascade output
tfi  tapped forward cascade input
tfo  tapped forward cascade outputs
tbi  tapped backward cascade input
tbo  tapped backward cascade outputs
In the second example there is more than one signal connected in 
the same cascade node. All such signals must be named in the same 
phrase, as lists of nodes followed by corresponding lists of tap 
names. The lists of tap names are separated by the : symbol. 
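The idea of a tapped forward cascade can be illustrated with a short Python sketch. This is an assumed reading of the TIMES ... WITH phrases rather than a statement of compiler behaviour: each instance's forward output is taken to drive the next instance's forward input, and each intermediate node is exposed as a named tap. The function name and the toy stage are hypothetical.

```python
def tapped_forward_cascade(stage, times, forward_in):
    """Chain `times` copies of `stage`, feeding each output to the next input.

    Returns (final_output, taps), where taps[k] is the forward output of
    instance k+1, i.e. the node that a tap name would be attached to.
    """
    taps = []
    value = forward_in
    for _ in range(times):
        value = stage(value)
        taps.append(value)
    return value, taps


# A toy stage that adds one to whatever it receives.
final_output, taps = tapped_forward_cascade(lambda v: v + 1, 3, 0)
```

Under the same assumption, a backward cascade would be the mirror image, with the value passed from the last instance towards the first.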
- 1.29 - 
[Figure 3: diagram not reproduced]
nodes must be named for each of the tapped cascade configurations. 
This occurs through the phrase starting with the = symbol and fol-
lowed by the list of nodes which are to be used for each of the tap 
points. 
Figures 4 and 5 show two further examples which combine 
several different types of connection in one structure and two cas- 
cades of connection of the same type, respectively. The associated 
syntax follows. 
