Design and simulation of a latched array processor / by Bonsen, Georg Alexander Zur
Lehigh University
Lehigh Preserve
Theses and Dissertations
1988
Recommended Citation
Bonsen, Georg Alexander Zur, "Design and simulation of a latched array processor /" (1988). Theses and Dissertations. 4907.
https://preserve.lehigh.edu/etd/4907
DESIGN AND SIMULATION OF A 
LATCHED ARRAY PROCESSOR 
by 
Georg Alexander zur Bonsen 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in candidacy for the degree of 
Master of Science in Electrical Engineering. 
Lehigh University 
Bethlehem, Pennsylvania 
1988 
This thesis is accepted and approved in partial fulfillment of the requirements
for the degree of Master of Science in Electrical Engineering.
May 18, 1988
Date
Advisor in Charge
CSEE Department Chairperson
For Doris and Rudolf, 
Corinna, Matthias Gabriele 
and my Grandfather 
Acknowledgments 
I would like to give my special thanks to Professor Meghanad D. Wagh
who always supported and directed me during my research. It has been a great
pleasure to work with him as he was always available for help and exchanging
ideas.
I want to thank many friends, especially M. Khan Dhodhi, Isil Canzaboglu,
Junhua Zhu, Irwan Siddhartha and Raphael Richards for assisting me
with my research and the writing of my thesis, and for the great times we had.
I would also like to thank Professor Carl Holzinger for whom I have
worked as a teaching assistant and acquired a significant amount of experience
such as teaching laboratory seminars. I also want to express my thanks to
Professor Frank Hielscher, Professor Clarence Joh and Professor Kenneth
Tzeng for providing me with the necessary background knowledge which I was
able to apply in my thesis research.
Finally, I would like to thank the most wonderful and closest people to
me, my father, mother, brother and sister who were very supportive during my
three undergraduate and two graduate years at Lehigh University.
Table of Contents 
Abstract
1. Introduction
1.1 Developments in Numeric Processing
1.2 The LICAM Processor
1.3 Symbolic Logic Simulator Language
1.4 Overview of the Thesis
2. Digital Logic Simulators
2.1 Hardware Simulation - An Introduction
2.2 Levels of Abstractions
2.3 Structural and Functional Properties of CHDLS
2.4 Procedural Characteristics of CHDLS
2.5 Textual and Graphical CHDLS
2.6 Examples of Classic Computer Hardware Description Languages
2.6.1 A Hardware Programming Language (AHPL)
2.6.2 Computer Design (Description) Language (CDL)
2.6.3 Digital (System) Design Language (DDL)
2.6.4 Instruction Set Processor ISP
2.7 Drawbacks of Classical Register Transfer Languages
2.8 Alternative Solution: High Level Languages
2.9 Too many Different Languages
2.9.1 Consensus Languages - A Possible Solution?
2.9.2 The DoD VHSIC Project
2.10 Symbolic Logic Simulator Language
3. Symbolic Logic Simulator SLS
3.1 Overview of the Simulator
3.2 The Simulation Language SLSL
3.3 The Compiler for SLSL
3.4 The SLS Simulator
3.4.1 Internal Data Structures and Simulation Setup
3.4.2 Internal Building Blocks and Simulator Execution
3.5 The Output Generator
3.6 The Preprocessor
3.6.1 The Preprocessor Language
3.6.2 Preprocessing Procedure
4. Background of High-Speed Multipliers
4.1 Introduction
4.2 Parallel Processing, an Introduction
4.3 Parallel Architectures for High-Speed Multiplication
4.4 Iterative Cellular Array Multipliers
4.5 Pipelining Iterative Cellular Multipliers
4.6 Latched Iterative Array Multiplier Based Processor
5. Architecture of the LICAM Arithmetic Processor
5.1 Introduction
5.2 Arithmetic Unit
5.2.1 The Latched Iterative Cellular Multiplier
5.2.2 Intermediate Carry-Save Adder
5.2.3 Carry Propagation Adder
5.2.4 Overflow Detection Logic
5.2.5 Output Selector Module
5.3 Scratchpad RAM and Look-up Table ROM
5.4 Input Interface
5.5 The Control Unit
5.5.1 Instruction Set and Format
5.5.2 Memory Addresses
5.5.3 Data Selectors
5.5.4 Control Signals
5.5.5 Initiating Simple Multiplications
5.5.6 Initiating Programmed Processes
5.5.7 Description of the Subsystems
5.6 Data Routing Network
5.7 Output Interface
5.8 Summary
6. Applications of the LICAM Processor
6.1 Introduction
6.2 Pseudo-code Representation
6.3 Inner Product of Vectors
6.4 Squaring a Matrix
6.5 Determinant Calculation
6.6 Polynomials
6.7 Multiple Overlapped Multiplication A*B*C*D
6.8 Concluding Remarks about Applications
7. Conclusion and Future Work
7.1 Design Summary
7.2 Future Work
7.2.1 Future Work on SLSL and Design Automation Software
7.2.2 Improvements to the LICAM Processor
7.2.3 Interfacing the LICAM Processor to the Outside
References
Appendix A. Summary of Symbolic Logic Simulation Language SLSL
A.1 Declaration of Identifiers
A.2 Statements
A.3 Constants
A.4 Operators
A.5 Splicing and Concatenation
A.6 Structural Reserved Functions
A.7 High-Level Functional Reserved Functions
A.8 Additional Support Functions
A.9 Compiler Parsing Syntax
Appendix B. Summary of Preprocessor Commands
B.1 Preprocessor Control Fields
B.2 Identifiers and Operators
B.3 Preprocessor Statements
B.4 Preprocessor Parsing Syntax
Vita
List of Figures
Figure 3-1: SLS Data Flow Diagram
Figure 3-2: Nonrestoring square root hardware
Figure 3-3: SLSL Code to simulate hardware of Figure 3-2
Figure 3-4: Simulator Execution Model
Figure 3-5: Format Specification for Square Root Program
Figure 3-6: The Conversion file used with Square Root Circuit
Figure 3-7: Output from a Simulation of logic of Figure 3-3
Figure 3-8: A SLSL Program using Macro Blocks
Figure 5-1: Layout of LICAM Processor
Figure 5-2: Layout of Arithmetic Unit
Figure 5-3: 8x8 bit Modified Latched Iterative Cellular Array Multiplier.
The LICAM processor uses a 16x16 bit array with 258 full adder cells
Figure 5-4: Carry-Save and Carry-Propagation Adders
Figure 5-5: Overflow Detection
Figure 5-6: Memory Subsystem
Figure 5-7: Input Interface
Figure 5-8: Control Unit
Figure 5-9: Instruction Format
Figure 5-10: Control Unit Subsystems
Figure 5-11: SLSL Listing of LICAM Processor
Figure 6-1: Pipeline Scheduling Table for 4-Operand Multiplication
List of Tables 
Table 5-1: Control Unit Control Signals
Table 5-2: Selectable Sources
Abstract 
This thesis discusses the development of a parallel digital simulator and a
parallel processor based on a latched array multiplier.
The digital simulator processes hardware descriptions written in 
"Symbolic Logic Simulation Language" (SLSL). SLSL was designed to permit 
simple and compact descriptions of complex hardware. Primitives are provided 
to describe structural and hierarchically organized prototypes at both gate and 
functional levels. This simulator proved the integrity of the design described 
below. 
An arithmetic processor based on a pipelined multiplier was developed to 
provide a set of arithmetic functions. The processor employs a modified Latched 
Iterative Cellular Array Multiplier (LICAM) to provide basic operations such as
additions, subtractions and multiplications. The pipeline is interfaced with a 
control unit to take full advantage of its potential capabilities and achieve max-
imum performance. This control unit is organized to interpret low-level 
microinstructions which transfer data among the LICAM, scratchpad memory, 
look-up tables and input and output interfaces. It is also divided into four sub-
systems operating in parallel in order to permit overlapped processing and 
therefore a very high throughput. 
Chapter 1 
Introduction 
Modern scientific computer applications require large amounts of numeric 
processing at a rapid pace. This continuous demand for ever increasing computer
performance has led to the development of increasingly powerful, complex
and compact arithmetic architectures employing temporal and spatial paral-
lelism. In the post-war period, the first-generation machines were introduced. 
Their vacuum tube circuits were too large to implement parallel arithmetic 
hardware. The second-generation computers employed transistors whose size 
permitted more sophisticated architectures including hard-wired floating-point 
arithmetic [1]. Machines of later generations used integrated circuits where 
hardware size is no longer the critical issue. The newer architectures have 
departed from the classic von Neumann machine and support parallel and
pipelined processing. Some innovative solutions to improve processing include 
vector processing, reduced instruction set computers (RISC) and latched array 
arithmetic hardware. 
This thesis presents an auxiliary numerical processor which uses a 
latched array multiplier to execute one addition and one multiplication per clock 
cycle. The processor exhibits the characteristics of reduced instruction set 
processing and can serve as either a coprocessor for a host CPU or a processing 
element for a vector processor. The design of this processor is verified with a 
logic simulator developed as a part of this project. 
1.1 Developments in Numeric Processing
A significant amount of hardware is necessary to calculate more complex 
and time-consuming arithmetic operations such as multiplications, divisions and
roots, and transcendental functions such as logarithms and trigonometric func-
tions at high speeds. Since many scientific applications use these operations 
frequently, their processing speed is intimately related to the arithmetic 
hardware implementation. 
Two distinct choices are available to optimize arithmetic hardware. One 
choice is to minimize the delay to process one set of operands. For example, the 
primitive shift-add multiplier can be improved by examining and shifting mul-
tiple bits at every clock cycle. The other choice is to pipeline the data path 
through arithmetic hardware and optimize throughput allowing overlapped 
processing of multiple operands. Thus, in the case of a multiplier, the repetitive 
cycles in the shift-add multiplier can be expanded into pipelined cellular 
hardware. 
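The first choice can be illustrated with a minimal Python sketch, which is not part of the original design. It models an unsigned shift-add multiplier that examines a configurable number of multiplier bits per clock cycle. The function name, the parameter names and the returned cycle count are assumptions made purely for illustration.

```python
def shift_add_multiply(a, b, width=16, bits_per_step=1):
    """Unsigned shift-add multiplication of two width-bit operands.

    Returns (product, cycles). Each loop iteration stands for one clock
    cycle in which bits_per_step multiplier bits are examined.
    """
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    mask = (1 << bits_per_step) - 1
    product = 0
    cycles = 0
    for shift in range(0, width, bits_per_step):
        digit = (b >> shift) & mask      # next group of multiplier bits
        product += (a * digit) << shift  # add the selected multiple of a
        cycles += 1
    return product, cycles
```

With width=16, examining one bit per step takes 16 iterations and examining two bits per step takes 8, at the cost of selecting among more multiples of the multiplicand in every cycle.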
A wide variety of parallel arithmetic architectures to enhance throughput 
already exist. However, these architectures are not always fully utilized. This
underutilization is typically due to slow or inefficiently designed control units.
Ordinary processors require too much time to fetch instructions and operands 
and store results in order to execute small repeatedly used algorithms. A com-
pact auxiliary arithmetic processor should remove the burden from the host 
CPU to calculate small formulae and transcendental functions quickly. Such an
auxiliary processor would allow one to exploit the complete potential of an arith-
metic pipeline. 
1.2 The LICAM Processor 
This thesis proposes an arithmetic processor based on a modified Latched 
Iterative Cellular Array Multiplier (LICAM), called the LICAM Processor. This
pipelined array processes one addition, subtraction and/or multiplication per 
clock cycle. The built-in controller executes a set of microinstructions which 
reside in a ROM (or may be downloaded from the host). A look-up table is avail-
able for numeric constants. A small scratchpad RAM serves as temporary data 
storage. The controller is capable of calculating small formulae and polynomials
such as Taylor series. It is also capable of supervising overlapped preprogrammed
processes and simple multiplications in order to take advantage of
the potential performance of the pipeline.
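A hardware unit that delivers one multiplication and one addition per clock cycle maps naturally onto Horner's rule for polynomial evaluation. The Python sketch below shows that rule only as an illustration. This part of the thesis does not state which evaluation scheme the controller uses, so the choice of Horner's rule and the function name are assumptions.

```python
def horner(coefficients, x):
    """Evaluate c[0] + c[1]*x + ... + c[n]*x**n by Horner's rule.

    Each loop iteration uses exactly one multiplication and one addition,
    the pair of operations the text says the pipeline provides per cycle.
    """
    result = 0
    for c in reversed(coefficients):
        result = result * x + c
    return result
```

For a truncated Taylor series the coefficient list would hold the series constants, which in the processor described here could come from the look-up table.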
The LICAM processor is intended to be a slave to a host or a supervisory 
interface. It is suitable as a single auxiliary co-processor for one host processor 
or in a vector processor system. In the first case, the host deposits operands into 
I/O ports and expects results to return some time later. In the latter case, an
array of operands is deposited in a dedicated memory bank. An interface unit
transfers the operands into one or multiple LICAM processors and accepts the 
returned results. 
For simplicity of architecture, the proposed LICAM processor is based on
16 and 32-bit fixed point arithmetic. However, modifications in the arithmetic
unit permit floating-point operations to take place at almost the same through-
put. 
1.3 Symbolic Logic Simulator Language
No formal or mathematical approaches are available to design and implement
fast parallel hardware. Formal approaches such as using boolean equations
and finite state diagrams are limited to combinational and primitive sequential
logic. More complex problem solutions cannot be clearly defined and
require a substantial amount of designer's experience and intuition [1] to come
up with a design which utilizes hardware to provide maximum performance
(see footnote 1). The only means to verify the completeness and integrity of
complex computer designs is extensive simulation and actual implementation.
Thus, a logic simulator is an essential component in the toolbox of a computer
designer. The simulator provides a formal Computer Hardware Description
Language (CHDL) consisting of a set of rules and conventions to describe
the prototype. To support the proposed processor, a new digital logic simulator 
was developed as a part of this project. This simulator supports a highly struc-
tured, simple and yet powerful language called Symbolic Logic Simulation Lan-
guage (SLSL). The SLSL describes logic elements and their interconnections 
with a set of "C"-like statements. Different from other languages, the SLSL al-
lows a very compact, though fully defined and easily readable means to describe 
structured and complex system designs. This language also permits easy inter-
facing with external data files to obtain and deposit numeric records. The 
simulator uses an event-driven strategy to execute the SLSL statements to 
mimic the parallel nature of hardware. 
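The event-driven strategy mentioned above can be sketched in a few lines of Python. This is a generic illustration of the idea and not the SLS implementation: the netlist format, the function name simulate and the half adder example are assumptions, gate delays are ignored, and the netlist is assumed to be free of combinational loops.

```python
from collections import deque


def simulate(gates, signals, changed):
    """Event-driven evaluation of a combinational netlist.

    gates   : dict mapping an output name to (function, input names)
    signals : dict of current signal values, updated in place
    changed : iterable of signal names whose values were just changed
    Only gates that read a changed signal are re-evaluated.
    """
    # Build the fan-out list: which gate outputs depend on each signal.
    fanout = {}
    for out, (_, inputs) in gates.items():
        for name in inputs:
            fanout.setdefault(name, []).append(out)

    events = deque(changed)
    evaluations = 0
    while events:
        name = events.popleft()
        for out in fanout.get(name, []):
            func, inputs = gates[out]
            new_value = func(*(signals[i] for i in inputs))
            evaluations += 1
            if new_value != signals.get(out):
                signals[out] = new_value
                events.append(out)   # propagate the change further
    return evaluations


# Hypothetical example netlist, a half adder: s = a XOR b, c = a AND b.
half_adder = {
    "s": (lambda a, b: a ^ b, ("a", "b")),
    "c": (lambda a, b: a & b, ("a", "b")),
}
```

Because work is only scheduled where a value actually changes, the order in which the gates are listed has no influence on the final signal values.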
Footnote 1: Computer architecture is also known as Black Art. Cited by a professor in a
computer architecture and parallel processing class at Lehigh University.
1.4 Overview of the Thesis 
This thesis thus concentrates both on the simulation and on architecture. 
Chapter 2 of this thesis summarizes the concept of hardware simulation, some 
popular languages, their advantages and drawbacks. Chapter 3 introduces the 
SLSL and its software support. Chapter 4 discusses the recent progress on fast 
arithmetic hardware, specifically on multipliers. Chapter 5 gives a detailed 
description of the proposed LICAM processor. The SLSL code of the processor is
also given and is simulated to verify its operation. Chapter 6 shows a series of 
algorithms the processor is capable of executing. Finally, Chapter 7 summarizes
the implications of this processor for a computer environment and
lists avenues for future developments.
Chapter 2
Digital Logic Simulators 
2.1 Hardware Simulation - An Introduction
With the ever increasing complexity of digital systems, conventional 
verification methods to ensure their integrity have become more difficult and 
time-consuming. One conventional method would be to implement the
prototype with commercially available TTL chips on protoboards. However,
large-scale designs require excessive amounts of chips and interconnections and
unacceptably long construction times [2], and construction of large projects, for
example processors for parallel applications, with discrete devices is virtually impossible.
Formal Computer Hardware Description Languages (CHDLs) alleviate 
this problem. Simulations detect design errors before actual implementation
starts, thereby significantly shortening the design phase and avoiding costly
repetitive manufacturing cycles [3, 4]. CHDLs generally consist of a set of
conventions which are used to describe the interconnection of logic elements such
as combinational logic, memory devices and I/O interfaces. Modern CHDLs allow
easy description of large hierarchically structured or repetitive systems.
Presently available fast computers and efficient simulation software are
capable of verifying the functionality and timing behavior of complex implementations
such as multiple processor models operating in parallel reasonably well.
For some CHDLs, additional design automation software such as silicon compilers
is already available or currently under development. Besides, CHDLs
provide a common communication medium for designers to express their ideas
more concisely. Easily readable CHDLs can be incorporated in design documents [5].
This chapter summarizes some of the more popular CHDLs. It also
provides the reasoning for the development of a new design language, SLSL, as
part of this project.
2.2 Levels of Abstractions 
Stephen Y. H. Su [6] has stated that digital hardware can be described in 
any of the following levels of abstraction. Various CHDLs are designed to 
primarily support one or more levels. 
1. Algorithmic Level 
2. Block (or PMS: Processor/Memory/Switch) Level
3. Instruction Set Level 
4. Register Transfer / Microcode Level 
5. Gate Logic Level 
6. Circuit Level 
The Algorithmic Level specifies only functional properties of a prototype
and ignores any structural architectural descriptions. For example, a 16 bit adder
unit is described as a simple addition function. In this level, the designer
need not provide the architecture of the adder. Several CHDLs provide a large
library of such high-level functions.
The second level describes the global block structure of the system model.
Typical blocks include processors, memory arrays, switches, I/O interfaces and
blocks performing operations on data. The Processor/Memory/Switch language
(PMS) developed by Bell and Newell [7] provides means to describe such blocks,
their properties and their interconnections.
The instruction set level describes digital hardware from the perspective 
of machine language instructions. These descriptions provide necessary information
to define the instruction format and describe the processor as an instruction
interpreter. The Instruction Set Processor (ISP) is a common CHDL which
emphasizes this level [6, 7, 8].
The register transfer level describes the data flow among registers, 
memory and I/O devices. In most CHDLs, the control unit is described independently
in the form of control flow statements and state assignments. This level is
supported by most CHDLs including "A Hardware Programming Language" 
(AHPL) [9], "Computer Design Language" (CDL) [10], and the 
"Registertransfersprachen" (RTS I,II,III) [11]. 
The logic level describes the exact hardware layout in terms of intercon-
nected gates. For example, a 16 bit adder may be described as a ripple carry 
adder (for an area-optimized implementation) or as a carry look-ahead adder 
(for a speed-optimized solution). A suitable language for this level is the 
"Digital Design Language" (DDL) [3, 6, 12]. 
In the circuit level, logic gates are described in terms of electronic elements
such as MOSFETs. Unlike the above levels, these electronic circuits
are verified with analog circuit simulators such as SPICE [13, 14].
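To make the difference between the algorithmic level and the gate logic level concrete, the following Python sketch describes the 16 bit adder mentioned above twice: once as a plain addition function and once as a chain of full adder cells forming a ripple carry adder. The sketch is not written in any of the cited CHDLs, and all function names are assumptions chosen for illustration.

```python
def add_algorithmic(a, b, width=16):
    """Algorithmic level: the adder is just an addition function."""
    return (a + b) & ((1 << width) - 1)


def full_adder(a, b, carry_in):
    """Gate level: one full adder cell built from XOR, AND and OR gates."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out


def add_ripple_carry(a, b, width=16):
    """Gate level: a chain of full adders, the carry ripples bit by bit."""
    result = 0
    carry = 0
    for i in range(width):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= bit << i
    return result
```

Both descriptions compute the same sum modulo 2^16, but only the second one exposes the structure (and therefore the carry delay) of a particular implementation.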
Modern VLSI technologies require attention to a seventh level of abstrac-
tion which describes the physical characteristics and the layout of individual 
transistors and standard cells. This information becomes increasingly important
for establishing design rules and the CMOS VLSI layouts.
2.3 Structural and Functional Properties of CHDLS
Presently available CHDLs belong to one of the following categories
[15, 16]:
1. Functional CHDLs 
2. Structural CHDLs and 
3. Hybrid CHDLs. 
Functional CHDLs describe the functional behavior and algorithms which
do not depend on specific logic structures. These languages are most suitable for
hardware descriptions in the first three levels of abstraction. On the other
hand, structural CHDLs specify the exact logic layout and are best suited for the
register transfer and gate logic levels. Hybrid CHDLs adopt the advantages
of both structural and functional languages. These languages permit flexible
access to both high-level arithmetic units and low-level gate logic. Most CHDLs,
including AHPL and DDL, belong to this category.
2.4 Procedural Characteristics of CHDLS 
Computer Hardware Description Languages can also be characterized ac-
cording to their manner of execution as [15, 16]: 
1. Procedural 
2. Nonprocedural or 
3. Semiprocedural, a combination of both. 
Procedural CHDLs require their statements to be arranged in a specific 
sequential order. Similar to modern high level programming languages, these 
statements are executed in a sequential order. Occasionally, some hardware 
descriptions are written in regular programming languages in order to achieve
high simulation speeds. In this case the programmer must ensure that data
referenced in one statement have been assigned in a previous statement.
Nonprocedural languages require no textual ordering at all. These languages
permit a closer hardware description since the components are expected
to operate in parallel and every statement describes such a component. The order
of the statements does not affect the described hardware and the consequent
simulation.
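The practical difference between procedural and nonprocedural execution can be sketched in Python. The statement format, the two function names and the tiny example circuit below are assumptions made for illustration and do not correspond to any particular CHDL. The nonprocedural variant simply re-evaluates all statements until nothing changes, which is one simple way to make the result independent of textual order.

```python
def run_procedural(statements, signals):
    """Execute each statement once, in textual order."""
    values = dict(signals)
    for target, func, sources in statements:
        values[target] = func(*(values[s] for s in sources))
    return values


def run_nonprocedural(statements, signals):
    """Re-evaluate all statements until no value changes.

    The result does not depend on statement order, provided the
    described logic is free of combinational loops.
    """
    values = dict(signals)
    changed = True
    while changed:
        changed = False
        for target, func, sources in statements:
            new_value = func(*(values[s] for s in sources))
            if values.get(target) != new_value:
                values[target] = new_value
                changed = True
    return values


# y = NOT x, z = y AND w, deliberately listed in the "wrong" order.
statements = [
    ("z", lambda y, w: y & w, ("y", "w")),
    ("y", lambda x: 1 - x, ("x",)),
]
inputs = {"x": 0, "w": 1, "y": 0, "z": 0}
```

With these inputs the procedural run computes z from a stale value of y, while the nonprocedural run settles on the values the hardware would produce, whichever way the two statements are ordered.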
CHDLs of the third category require a limited degree of sequential order. 
Programs written in register transfer languages are divided into a finite set of
segments related to states. Each segment contains one or more register transfer
instructions, and the segments may be arranged in any order. The control unit is
described separately by a set of sequencing operators found in high level programming
languages.
2.5 Textual and Graphical CHDLS 
Textual CHDLs do not require any graphical support devices. It is true
that a large number of convenient graphical design automation programs exist.
However, in most cases, they need expensive design tools, nonstandard image
files and additional data storage to describe graphical shapes and placements of
digital elements. Lipovski stated that textual languages
"express variations and subtleties better than graphical languages can express
them by shapes and sizes of figures" [17].
Textual CHDLs eliminate machine dependency by constraining the 
character set to a common ASCII standard. Unfortunately, the original versions 
of several CHDLs including AHPL and DDL use special symbols found in APL 
fonts. Some CHDLs use overstrike character combinations which make source
code look awkward and difficult to read.
2.6 Examples of Classic Computer Hardware Description Languages
The next sections will give a summary of classic computer hardware
description languages. Specifically, the characteristics and features of AHPL,
CDL, DDL, and ISP are discussed.
2.6.1 A Hardware Programming Language (AHPL) 
AHPL was introduced in 1971 [17] by F. J. Hill and Gerald
R. Peterson as an educational tool to describe and simulate digital hardware
without depending on specific hardware. The design goal was to create a semiprocedural
register transfer language whose syntax is based on APL.
AHPL files consist of a set of interconnected modules. Each module contains
its own finite state automaton, combinational logic and declared busses
and storage elements. Modules are divided into a set of states, each of which
contains one or more register transfer statements. A set of sequencing
operators (jump statements) are available to determine which state(s) to ac-
tivate next. Since AHPL assumes control units based on the one-flip-flop-per-
state [18] scheme, multiple states may be excited simultaneously and their 
register transfers are executed together [9, 19]. 
The major drawback of AHPL and most register transfer languages is
that the control unit is described separately and not in sufficient detail.
Representing the hardware control by a set of states is advantageous in designs
involving complex control algorithms. However, AHPL does not allow
straightforward implementation of alternative controllers such as using n-bit
control sequencing registers to provide 2^n different states.
Presently, AHPL is widely used in academic institutions since F. J. Hill
introduced it in his text book [9] in conjunction with an introduction into the
architecture of his Simple Instructional Computer (SIC), but it has not found its
way into the toolbox of a practicing engineer.
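The two controller styles contrasted in this section can be compared with a short Python sketch. It is only a model of the general idea, not of AHPL itself: the function names and the state names are assumptions. The first function advances a one-flip-flop-per-state controller, in which several states may be active together, and the second compares the number of flip-flops needed by that scheme with an encoded n-bit sequencing register.

```python
def one_hot_step(active_states, successors):
    """Advance a one-flip-flop-per-state controller by one clock.

    active_states : set of states whose flip-flops currently hold 1
    successors    : dict mapping a state to the states it activates next
    Several states may be active at once, e.g. after a parallel branch.
    """
    next_states = set()
    for state in active_states:
        next_states.update(successors.get(state, ()))
    return next_states


def flip_flops_needed(num_states):
    """Return (one_hot, encoded) flip-flop counts for num_states states."""
    # Smallest n with 2**n >= num_states, at least one flip-flop.
    encoded = max(1, (num_states - 1).bit_length())
    return num_states, encoded
```

For 16 states the one-hot scheme needs 16 flip-flops where an encoded register needs 4, but the encoded register can represent only one active state at a time.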
2.6.2 Computer Design (Description) Language (CDL) 
CDL was introduced by Chu in 1965 [10] and is widely used in Europe
[17]. This language exhibits characteristics similar to those of AHPL, but uses a
different, more hardware-related control structure.
CDL programs are partitioned into storage, control and processor segments.
The storage segment declares identifiers as registers, memory elements
and switches. The control structure includes declaration of the clock signal and
the state sequencing register. The actual hardware description is located in the
processor structure. This structure consists of a set of conditions and their corresponding
transfer statements. A condition is defined as a combinational logic
function which depends on the state sequencing register. If the condition is ac-
tive, the corresponding transfers are excited. These condition blocks do not re-
quire any specific sequential order. The only method to change the state is to 
modify the state sequencing register with such a register transfer statement. 
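The condition and transfer mechanism described above can be modeled in Python as follows. The sketch does not use CDL syntax. The data layout, the register names T, A and ACC, and the function name clock_cycle are assumptions introduced for illustration only.

```python
def clock_cycle(registers, blocks):
    """Advance a CDL-style description by one clock cycle.

    registers : dict of register values, including the state register "T"
    blocks    : list of (condition, transfers); condition(registers) is a
                Boolean function, transfers maps a destination register
                to a function computing its next value.
    All active transfers read the old values and commit together.
    """
    next_values = {}
    for condition, transfers in blocks:
        if condition(registers):
            for dest, func in transfers.items():
                next_values[dest] = func(registers)
    updated = dict(registers)
    updated.update(next_values)
    return updated


# Hypothetical two-state example: add A to ACC in state 0, then stop.
# The block for state 1 is listed first to show that order is irrelevant.
blocks = [
    (lambda r: r["T"] == 1, {"T": lambda r: 2}),
    (lambda r: r["T"] == 0, {"ACC": lambda r: r["ACC"] + r["A"],
                             "T": lambda r: 1}),
]
```

Here the state changes only because a transfer writes to the state sequencing register T, which mirrors the statement that modifying this register is the only way to change the state.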
Generally, CDL describes hardware at a slightly lower level of abstraction 
than AHPL. The control unit is no longer mutually separated from the data 
flow and various control schemes can be implemented without difficulty. 
2.6.3 Digital (System) Design Language (DDL) 
DDL, first introduced by J. R. Duley in 1967 [20], is a hardware description
language suitable for both low-level gate logic and register transfer
specifications and high level descriptions of arithmetic and relational functions.
The language is flexible enough to allow descriptions of structured and hierarchically
organized hardware [3, 12].
DDL is organized to describe systems which contain a set of automata.
Each automaton contains its own control unit and accesses public facilities.
Public facilities refer to globally defined wires, registers, memory arrays and
combinational logic. In addition, each automaton contains internal circuitry
referred to as private facilities. Repetitive logic can be defined once with an
operator declaration and referenced throughout the automaton wherever
needed. The only significant drawback in this language structure is the fixed
predefined bit width of automata and operator definitions.
Similar to AHPL, a set of states can be defined and dedicated state tran-
sition statements are used to define the control algorithm. In addition, DDL 
allows partitioning the finite state set into segments. Additional sequencing
statements permit actions similar to subroutine calls. If the last statement in
the called segment does not contain a sequencing operation, control is returned
to the calling state. Further activation operators specified in one automaton are
capable of influencing the control of a subsequent automaton.
DDL permits a limited amount of timing simulation if the frequencies of
the clock signals and discrete delay elements are predefined. Wherever
propagation delay is necessary, predefined delay elements must be inserted.
Despite its flexibility, DDL has some drawbacks. For accurate timing
verification, more complex delay elements must be incorporated, for example by
automatically appending fixed or variable delays on all logic gates. Furthermore,
DDL requires APL characters where some of the overstrike combinations
are difficult to read. Small words such as increment are easier to understand
than an overwritten "c" and an arrow-up symbol. Furthermore, the bit widths of
the interfaces of automata and user-defined operators are fixed. Multiple definitions
of a particular regular structure are required if the same structure is to be
applied for different bit widths.
Overall, DDL represents a well organized hardware description language 
for both functional and structural applications and has proved to be a popular 
language in the U.S. industry [17]. 
2.6.4 Instruction Set Processor (ISP)
The Instruction Set Processor, introduced by Bell and Newell [7], is
specifically designed to describe the functional behavior of program-driven 
processors as a symbolic instruction interpreter and calculator. In ISP, one has 
to supply information to define all register and memory arrays. Typically, 
defined registers include the program counter, accumulator, stack pointer, one 
or more index registers and various flags. At least a primary memory array 
must be available to supply machine instructions. The instruction formats and 
the statements are similar to register transfers. Both floating-point and integer 
operations are available. A large variety of both low-level logic and high-level
arithmetic functions such as averaging, transcendental functions and number
conversions are available.
Generally, ISP allows simple verification and optimization of instruction 
sets without considering a specific processor architecture. Not much effort is 
necessary to make improvements on the projected instruction set and format. 
Describing the same processor with a lower-level CHDL requires much more 
code and design time. Once the definition of the instruction set and simulation 
with ISP is completed, one may consider the design of the processor architecture.
2.7 Drawbacks of Classical Register Transfer Languages
A hardware description program should relate very closely to the 
projected architecture. Ideally, every statement should add a discrete element 
to the prototype. However, efforts have been made to add features into register 
transfer languages which widen the gap between hardware description and the 
actual logic layout. These features include state partitioning and constructs 
found in high-level programming languages. 
Commonly, the same register transfer statement must be referenced mul-
tiple times throughout the program to specify the same hardwired interconnec-
tion. On the other hand, a set of transfer statements which differ very little 
may require an unexpectedly large amount of hardware to implement. Thus, it 
is impossible to judge the hardware complexity from their register transfer 
programs. 
The authors of hardware description languages tended to adopt many 
primitives from high level languages, especially from APL and ALGOL. Thus, 
partitioning of description files into individual states and excitation of the next 
state after a clock cycle is analogous to a set of programming statements written 
in a high level programming language executed sequentially. A set of control 
instructions literally permits jumping to specified states and calling a set of
states just as subroutines are accessed. These sophisticated control statements
have no direct hardware relationship. This, again, leads to difficulties in
judging the final size and structure of the control unit.
Furthermore, some register transfer languages permit conditional IF and
CASE statements to regulate the data flow. During compilation, all these high
level statements are replaced by appropriate multiplexer interconnections. It
is, however, better to describe this hardware directly since it requires only a
small amount of additional code.
Overall, register transfer languages serve as a useful tool to verify the 
functionality of systems which use complex control algorithms. However, in 
most cases the description of the control unit remains constrained to the avail-
able sequencing operators. Ironically, DDL provides, on one hand, a large 
variety of high-level control statements while, on the other hand, permitting 
detailed low-level structural descriptions. During the final design phase,
however, the logic circuitry is laid out and a concise gate-level hardware
verification is recommended.
2.8 Alternative Solution: High Level Languages 
Proper use of modern high level programming languages with a versatile 
set of arithmetic and logic operations allows very fast hardware simulations. 
The application provides an environment for hardware description and support
subroutines to run the simulation. A set of assignments and use of available
operators describe the system model. Since programming statements are
executed sequentially, arranging the statements in a particular sequence is
necessary. Advanced programming environments are available in "C", Lisp and
Prolog [2].
A different approach to using high level languages for hardware simulation
is SAKURA [2]. The SAKURA preprocessor translates a hardware description
file into a source code program. Subsequently, this program is compiled and
executed. The major disadvantage is that the compiler for the programming
language SAKURA uses to generate output files is machine dependent and
therefore has a limited use.
2.9 Too Many Different Languages
During the last two decades, a large number of CHDLs were designed to
describe hardware in all levels of abstractions. Some of them gained widespread
acceptance in the industrial and educational sectors. Other languages were not 
sufficiently defined to serve as suitable input for design automation software. 
Presently, different institutions work with mutually incompatible CHDLs which 
makes cooperative design projects difficult to accomplish. Two more dis-
advantages specifically affect the private industry. R. Piloty pointed out that 
CAD-based design automation systems present hardware designs more
attractively, and that "there exists no comprehensive hardware and firmware
design methodology telling how to use CHDLs effectively" [21]. Many computer
engineers would agree with Lipovski's argument that they were presently
"confounded" in their own "Tower of Babel" [17].
The possible solutions to achieve a common CHDL are the consensus
language concept, whose aim is to integrate existing languages into one common
domain, and the intention of the U.S. Department of Defense to introduce a new
and flexible CHDL. Corporations would be required to use this language to
document and simulate their hardware related to defense projects. A common
CHDL will provide better communication, but may inhibit further research in
CHDLs.
2.9.1 Consensus Languages - A Possible Solution?
In 1973, Piloty et al. [21] launched the consensus language (CONLAN) 
project to establish a common base for existing hardware description languages 
and enable the integration of existing CHDLs into one family. Using this
concept, new CHDLs can be defined upon existing ones without changing or
expanding the supporting software. The project began with the establishment of
a Base Consensus Language (BCL). BCL is built upon a more fundamental
language called Primitive Set Consensus Language (PSCL) in order to verify its
integrity and run demonstrations. PSCL provides a very fundamental but
powerful set of conventions which serve as the basis for BCL and any further
defined CHDL. Consensus languages refer to six principal entities listed below:
1. Objects 
2. Operations 
3. Types 
4. Classes 
5. Hardware descriptions 
6. Language definitions 
Objects refer to values or to carriers of varying values. Operations refer to
code which processes objects. Types refer to the set of values assigned to objects
and are closed under a set of operations. One type is the integer, closed under
arithmetic, logic and relational operators. Classes refer to a group of types to be
used in defined languages. The last two entities obviously refer to descriptions
of digital hardware and definition of sibling languages.
How would someone go about minting a new language? Every defined
consensus language can act as a reference language (REFLAN). A set of
constructs is available to define the necessary syntax and semantics for a new
language. This building-block concept proves to be suitable for many CHDLs
[22] and allows easy implementation of highly specialized simulation languages
in any level of abstraction.
In 1985, the compiler package called CONLIT [22] was developed to use 
language-specific parsing information tables to translate hardware descriptions 
written in any consensus language into an intermediate code called IREEN. 
IREEN uses a set of conventions which is common to all consensus
languages. In addition, the compiler can establish parsing information tables for
new languages from tables made for their reference languages. The simulator
called REGLAN reads intermediate IREEN files and verifies the described hardware.
Features of REGLAN include but are not limited to advanced data levels such 
as high-impedance states, propagation delays and bidirectional busses based on 
open-collector or tri-state circuitry. 
Future work is aimed toward a fully integrated CONLAN-based design 
automation system. Such a complete system requires a sufficiently large library 
of module descriptions written in IREEN, necessary software for transferring 
hardware descriptions into chip layouts (i.e. routers, silicon compilers), and
support for graphic workstations [22].
2.9.2 The DoD VHSIC Project
In March, 1980, the U.S. Department of Defense launched the Very High
Speed Integrated Circuits (VHSIC) Program to develop a flexible and
expandable language called VHSIC Hardware Description Language (VHDL).
During the summer of 1981, the Institute for Defense Analyses (IDA) workshop
[23] took place to establish standardized language primitives. In 1983,
contracts were awarded to develop supporting software. As of August 1985,
VHDL has gained widespread acceptance by the private sector and IEEE is
considering declaring it the de jure standard [24]. The Department of Defense
expects all contracted companies to use VHDL for their designs and
documentation.
The prime reason for developing a new language which should gain
widespread acceptance in both private and public sectors was to provide a
single, fully-defined hardware description language. Widespread acceptability
was achieved by letting user companies participate in the establishment of
language requirements and the development of the syntactic and semantic
structure [24]. Some of the design objectives for VHDL are listed below:
1. Easy to learn and maximum degree of flexibility 
2. Structured and hierarchical descriptions 
3. Easily adaptable to new device technologies 
4. Flexible support on timing and propagation delays 
5. Supporting all levels of abstractions 
6. Diverse logic levels 
Since VHDL allows design descriptions in all levels of abstractions, a 
hardware system can be described through either structural descriptions or 
functional algorithms. Thus, a project partner can disclose or exchange the 
functional properties of their design while keeping the architecture classified. 
Different model representations are available to describe the control unit 
separately, if necessary. 
VHDL programs are divided into user-defined logic blocks called entities. 
Further code permits structural and hierarchical interconnections. An out-
standing feature of VHDL is the access to external libraries of predesigned logic 
blocks and complete systems. In addition, VHDL provides technology insertion.
Available elements are bound to low-level (i.e. circuit) descriptions which
reference a specific technology file. Such technology-dependent information
generally includes fan-outs, propagation delays, power delays and noise margins
[24]. With the release of a new device technology, only the technology-related 
information is updated while the logic designs remain the same. 
With the available technology data, VHDL takes propagation timing and
delays into account. Two timing measures, namely the clock period and the unit
propagation delays imposed by combinational logic, are recorded. A tradeoff
between accurate timing verification and fast logic simulation is possible.
Similar to high level programming languages, VHDL provides user-defined
data types besides the predefined data types of Unknown, Ambiguous,
High Impedance, 0 and 1. VHDL also provides scalar types which can be used
to pass parameters such as bit widths, voltage levels and delays into predefined
logic blocks. Composite types such as arrays and records can be defined, too.
Assertions can be inserted in VHDL programs to detect high fan-outs or
unknown states in wires. An assertion includes a condition, and if the condition
is true, a message will be reported and the simulator may be instructed to
terminate execution.
Shahdad explains that "the underlying dynamic execution model defined
by the language is event driven" [24]. A change on an outgoing line schedules
the dependent statements for future execution. This process is repeated until
all signal changes passing through combinational logic have settled. This
execution model is closely related to the model used to simulate SLSL files.
A large variety of flexible features are incorporated in VHDL, but it would 
be impossible to list all of them. The only significant drawback of VHDL is that
some companies must modify their software tools to conform to the
requirements of the DoD. The worst case would be a setback inside companies,
especially when they must abandon their advanced proprietary simulation
facilities [22].
Since 1985, the work on VHDL has focused on new simulation packages,
interface (translation) programs for existing simulators, design capture tools
and design automation tools such as silicon compilers for the layout of silicon
chips [24].
2.10 Symbolic Logic Simulator Language
Despite the availability of many CHDLs, I have designed my own CHDL 
called Symbolic Logic Simulator Language (SLSL) and implemented necessary 
software to run simulations and generate formal output listings [25]. The main 
reasons are listed below: 
1. Very compact, though easily readable language. 
2. Statements must correspond to logic blocks directly and control 
and data flow units must be treated equally. No run-time sequenc-
ing operators or constructs are allowed. 
3. A preprocessor provides compile-time statements to permit struc-
tural and hierarchical designs. 
4. Passing parameters into macro blocks should not be limited to 
dedicated entities such as identifiers and numeric parameters. 
5. Supports communication channels with random noise. 
A small amount of code must be sufficient to describe complete system
models for demonstration purposes. The language provides a flexible
instruction set to cover both low-level structural and high-level functional
hardware description. High-level functions include 2's complement arithmetic.
SLSL is a nonprocedural language where every statement describes a
discrete hardware component. It does not separate the control unit from the data
flow network and requires the control unit to be described in the same manner
as the data flow is described. No constructs are provided to describe any unit
alternatively, i.e. with state segmentation and sequencing operators, because
control over the actual hardware layout would be lost.
SLSL provides a set of preprocessor commands to allow structured
designs. These commands permit using integer expressions rather than fixed
numeric constants, loops to define repetitive structures, and macro blocks with
parameter passing. Different from ordinary CHDLs, only one universal type of
parameter exists: character strings. Virtually anything, whether it is an
identifier, operator, numeric constant or expression, whole statements or macro
calls, may be passed into macro blocks. These features provide a high degree of
design flexibility.
Finally, SLSL does support binary communication channels with adjust-
able random noise. This feature permits verifications of telecommunication 
hardware such as error control coders. 
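The idea of a binary channel with adjustable random noise can be sketched in a few lines of Python. This is only an illustration of the concept, assuming independent bit flips with a fixed probability; the function name `noisy_channel` and its parameters are hypothetical and do not reflect how SLSL actually declares or implements such channels.

```python
import random


def noisy_channel(bits, error_rate, rng):
    """Return a copy of `bits` in which each bit is flipped independently
    with probability `error_rate` (an assumed noise model)."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    return [bit ^ 1 if rng.random() < error_rate else bit for bit in bits]


# Send one 8-bit word through a channel with a 10 % bit error rate.
rng = random.Random(1988)          # fixed seed so a run can be repeated
word = [1, 0, 1, 1, 0, 0, 1, 0]
received = noisy_channel(word, 0.1, rng)
errors = sum(sent != got for sent, got in zip(word, received))
```

An error control coder under test would sit on either side of such a channel, and `errors` gives the number of flipped bits the decoder has to cope with.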
SLSL does not outperform modern hardware description languages, but
provides easy means to simulate hardware very quickly. Details about the
Symbolic Logic Simulator Language, its supporting software packages and
examples are summarized in the next chapter.
Chapter 3
Symbolic Logic Simulator SLS
3.1 Overview of the Simulator
Chapter 2 summarized several digital logic simulators currently available. 
The Symbolic Logic Simulator (SLS), developed at Lehigh University, is capable
of simulating complex digital hardware at a high speed at both logic and
functional levels. A specially developed programming language called Symbolic
Logic Simulator Language (SLSL) is provided to describe the hardware. Every 
statement of this language describes a discrete logic element or a function block 
and its interconnections. 
Using the compiler developed as part of this project, hardware description 
files written in SLSL are compiled into object codes the simulator can under-
stand. If the hardware to be simulated is complex and contains repetitive ele-
ments, a preprocessor can be invoked to convert a highly hierarchical and struc-
tured code into a 'flat' hardware description file. The object code generated by 
the compiler may be simulated using SLS in both interactive and batch mode. 
At the end of the simulation, an intermediate trace list is generated. A
document generator has been provided to convert this trace list to a predefined,
formatted and easily readable document file. The data flow diagram is shown in
Figure 3-1 on page 26.
Common features of SLS include a flexible library of reserved logic
elements, use of busses with a maximum of 32 bits, sequential access to data
files, telecommunication channels with random noise and the capability to
detect possible glitches at the clock inputs of clocked devices. In order to
maintain high simulation throughput, some features which require too much
processing time have been omitted. They include taking propagation delays into
account and use of tri-state elements, wired-ORs with open-collector gates and
transmission gates.

[Figure 3-1: SLS Data Flow Diagram. Source Code (.SAC) -> Preprocessor
(Review Listing .REV) -> Flat Logic (.LOG) -> Compiler (Comp. Listing .LST)
-> Generated Code (.COD), Symbol Table (.SYM), File Info Table (.FIL)
-> Simulator (Input Files .INP, Output Files .OTP, Format Spec. .FMT)
-> Trace List (.TRC) -> Documenter (Format Spec. .FMT) -> Final Output (.OUT)]
3.2 The Simulation Language SLSL 
The simulation language SLSL is designed to act as a simple but powerful 
interface to describe a digital architecture. It uses strong type-checking and a 
syntax structure adopted from modern high-level languages such as Pascal and 
"C", is not case-sensitive and allows embedded comments. 
A typical hardware description file begins with the name of the design, 
followed by a block defining all identifiers. The main body of the file consists of 
descriptive logic statements. The declaration block identifies wires, registers, 
random access and read-only memories. Parameters related to all identifiers 
include bit widths and initial values. 
The statements in the main block describe the hardware through
logic-boolean statements ranging in complexity from simple gates to arithmetic
functions. Most statements assign a source identifier or a logical combination
of them to a destination identifier. Each destination may be assigned only once,
since otherwise it would correspond to tying the outputs of multiple logic blocks
together. Unassigned identifiers similarly mean illegal open connections. The
statements in a SLSL program are not required to be written in any specific
sequence, since they are executed in parallel. In other words, this corresponds
to simultaneous activation of all logic elements.
Most combinational logic is represented by a mathematical expression
containing unary and binary operators such as "-" (inverters), "*" (AND) and
"+" (OR), and a library of reserved functions. The number of gates or the size
of the logic block is determined by the bit width of the destination identifier.
Similar to arithmetic expressions found in modern programming languages, all
operators are assigned to specific precedence groups. For example,
AND-operators have higher precedence than OR-operators. Parentheses are
available to override the precedence rules.
All available reserved functions are designed to be as flexible as possible.
For example, the single function MUX can install multiplexers of any size and
configuration. The bit width of the destination determines the number of
parallel multiplexers and the number of selector bits determines the number of
input lines to each multiplexer.
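The sizing rule for MUX can be made concrete with a small Python model. This is a sketch under assumptions: the helper `mux` and its argument order (selector first, then the inputs in order of increasing selector value, as the statements of Figure 3-3 suggest) are illustrative and are not the simulator's implementation.

```python
def mux(select, inputs, width):
    """Model a bank of `width` parallel multiplexers.

    `inputs` lists the candidate words in order of increasing selector
    value, so n selector bits require exactly 2**n inputs.
    """
    count = len(inputs)
    if count == 0 or count & (count - 1):
        raise ValueError("number of inputs must be a power of two")
    if not 0 <= select < count:
        raise ValueError("selector out of range")
    return inputs[select] & ((1 << width) - 1)   # truncate to the bus width


def next_state(state):
    """Modulo-9 counter in the style of Figure 3-3: count up, wrap at 8."""
    at_eight = int(state == 8)                   # 1-bit selector
    return mux(at_eight, [(state + 1) & 0xF, 0], 4)


state = 0
trace = []
for _ in range(10):
    trace.append(state)
    state = next_state(state)
```

A one-bit selector chooses between two 4-bit words here; a two-bit selector would require four input words, and so on.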
A master clock signal is supplied by a 1-bit wire identifier called CLOCK. A 
small set of statements permits the generation of multi-phase clock signals. The 
clock signal is necessary to latch data into registers and memory arrays. 
The hardware description is complete when all identifiers are assigned. 
For complex designs which involve repetition and hierarchical properties, dedi-
cated preprocessor control fields can be inserted into the source code to invoke 
repetition and define macro blocks. The preprocessor will convert the highly 
structured design into a flat layout the compiler and simulator can understand. 
A typical SLSL program describing the hardware of the 16 bit square root
calculator of Figure 3-2 is shown in Figure 3-3, page 29.

[Figure 3-2: Nonrestoring square root hardware. Blocks: Reg. A, Reg. B,
Reg. R, XOR-Gate array, Adder, and the Sub signal derived from the inverted
MSB of R. Initial contents: Reg A: Operand[13..0]; Reg B: 0;
Reg R: {0:14, Operand[15..14]} - 1]

    SYSTEM SQUARE_ROOT;
    /* Implementation of a 8 bit square root logic
       Written by Georg A. zur Bonsen, Nov 9, 1987 */

    WIRE Sub, Sum[16], Cout, Operand[16], Zero, Load, Right[16];
    REGISTER A[14], B[8], R[16], State[4];
    BEGIN
      Trace := Clock;
      /* Definition of finite state control unit:
         Modulo 9 counter */
      State, Clock := Mux( Andsum(State=8:4), Inc(State), 0:4 );
      Load := -Orsum(State!1:4);   /* If State = 1 */
      Zero := -Orsum(State);       /* If State = 0 */
      Read("INPUT", Load, Operand);
      Right := Xorgate(Sub, {0:5,B,Sub,-Sub,1});
      CPA( {R[13..0],A[13..12]}, Right, Sub, Sum, Cout );
      Sub := -R[15];               /* Complemented Sign Bit */
      R, Clock := Mux( Zero, Sum, Dec({0:14,Operand[15..14]}) );
      A, Clock := Mux( Zero, <<A, Operand[13..0] );
      B, Clock := Mux( Zero, {B[6..0],Sub}, 0:8 );
    End.

Figure 3-3: SLSL Code to simulate hardware of Figure 3-2

This nonrestoring square rooting is based on the following formulae:
Let the given
\[ A = A_n = a_0 2^{2n-1} + a_1 2^{2n-2} + \cdots + a_{2n-1} \tag{3.1} \]
and the desired square root
\[ B = B_n = b_0 2^{n-1} + b_1 2^{n-2} + \cdots + b_{n-1}. \tag{3.2} \]
Defining $R_i$ as
\[ R_i = A_i - B_i^2 \tag{3.3} \]
where $A_i$ and $B_i$ denote the values of the leading $2i$ bits of $A$ and the
leading $i$ bits of $B$, one gets
\[ R_{i+1} = A_{i+1} - B_{i+1}^2
           = 4R_i + 2a_{2i} + a_{2i+1} - b_i(4B_i + 1). \tag{3.4} \]
The process of calculating the square root consists of choosing approximate
$b_i$ (0 or 1) such that $|R_{i+1}|$ is minimal at each stage. Further details
about this implementation are given in [References].
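The step from (3.3) to (3.4) can be spelled out as follows. The derivation assumes, as the recurrence itself implies, that $A_i$ and $B_i$ are the integers formed by the leading $2i$ bits of $A$ and the leading $i$ bits of $B$, with $a_0$ and $b_0$ as the most significant bits.

```latex
\begin{align*}
A_{i+1} &= 4A_i + 2a_{2i} + a_{2i+1}, \qquad B_{i+1} = 2B_i + b_i, \\
B_{i+1}^2 &= 4B_i^2 + 4B_i b_i + b_i^2
           = 4B_i^2 + b_i(4B_i + 1)
           \quad\text{since } b_i^2 = b_i \text{ for } b_i \in \{0, 1\}, \\
R_{i+1} &= A_{i+1} - B_{i+1}^2
         = 4\left(A_i - B_i^2\right) + 2a_{2i} + a_{2i+1} - b_i(4B_i + 1) \\
        &= 4R_i + 2a_{2i} + a_{2i+1} - b_i(4B_i + 1).
\end{align*}
```

Each stage therefore needs only the previous remainder, the next two operand bits and the root bits found so far, which is what allows one adder and three registers to carry out the whole computation.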
Note that one clock cycle is required to load a parameter and initialize the
setup, and eight cycles are required to complete the square root, so a total of
9 clock cycles is required in this example. The ZERO and LOAD signals are
active at states 0 and 1 respectively. The statement TRACE:=CLOCK; selects time
instances to be included in the trace list. The READ statement obtains an
operand from the data file INPUT at the rising edge of LOAD. The remaining six
statements describe the actual square root module. To appreciate the compact
nature of expressions in SLSL, examine the statement
Right:= Xorgate(Sub,{0:5,B,Sub,-Sub,1}).
This statement describes an array of sixteen XOR-gates with one common
input called SUB. Sixteen gates are required because the identifier RIGHT
represents a 16-bit bus. The second argument must be a 16-bit word. In this
case, the word constitutes a concatenation of five smaller words. The most
significant five bits are tied to ground. The next eight bits are those of register
B and the least significant three bits are connected to SUB, its complement and
to a logic 1. Similarly, a carry-propagation adder is described as
CPA( {R[13..0],A[13..12]}, Right, Sub, Sum, Cout).
The five parameters of the adder are two summands of any bit width, a
single-bit carry-in, the sum and a carry-out signal. The first summand is a
concatenation of the 14 lower bits of register R with bits 13 and 12 of register A.
A brief summary of all SLSL statements and the syntax diagrams are listed in
Appendix A.
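The data path of the square root example can be mirrored in Python to check how the XOR array, the concatenations and the adder cooperate. This is a behavioural sketch, not the SLS simulator: it assumes the initial register contents given in Figure 3-2, models the shift of register A as a two-bit left shift, and ignores the control counter and file access. The function name `nonrestoring_sqrt` is hypothetical.

```python
MASK16 = 0xFFFF
MASK14 = 0x3FFF


def nonrestoring_sqrt(operand):
    """Behavioural model of the 16-bit data path of Figure 3-3."""
    # Initial contents (state 0): A = Operand[13..0], B = 0,
    # R = {0:14, Operand[15..14]} - 1 in 16-bit two's complement.
    a = operand & MASK14
    b = 0
    r = ((operand >> 14) - 1) & MASK16
    for _ in range(8):                              # eight compute cycles
        sub = 1 - ((r >> 15) & 1)                   # Sub := -R[15]
        left = ((r & MASK14) << 2) | (a >> 12)      # {R[13..0], A[13..12]}
        word = (b << 3) | (sub << 2) | ((1 - sub) << 1) | 1
        right = word ^ (MASK16 if sub else 0)       # Xorgate(Sub, {...})
        total = (left + right + sub) & MASK16       # CPA with carry-in Sub
        # Register updates at the clock edge.
        r = total
        a = (a << 2) & MASK14
        b = ((b << 1) | sub) & 0xFF                 # {B[6..0], Sub}
    return b


roots = {n: nonrestoring_sqrt(n) for n in (0, 2, 4, 1000, 65535)}
```

When SUB is 1 the XOR array complements its input word and the carry-in adds 1, so the adder subtracts 8B + 5; when SUB is 0 it adds 8B + 3. Register B collects one root bit per cycle.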
3.3 The Compiler for SLSL 
The SLSL compiler employs the one-pass top-down LL(1) parsing method to
parse the SLSL code and generate the files necessary for the simulator. The 
parser verifies the syntax and assures that all bit widths do match. Memory 
addresses are assigned for all declared identifiers. Compilation stops at the first 
occurrence of a syntax or semantical error. The compiler generates a listing, an 
object code file, a symbol table and a table listing referenced sequential data 
files. 
The compiler listing contains a copy of the source code with numbered
lines, a table of all declared identifiers and their cross-references, and a
summary indicating the sizes of generated codes and memory requirements for the
simulator. The cross reference listing for every identifier lists the start
addresses of all statements where it is referenced. This listing is not used by
the simulator but is provided as an aid to the designer.
The object code file consists of a set of compiled instruction blocks. Each 
block refers to a compiled statement. A simple instruction set has been 
designed to express the code. Details of this are discussed in section 3.4. 
The symbol table transfers all declared identifiers, their bit widths,
assigned addresses, initial values, cross reference tables and other necessary
parameters to the simulator. Additional information, such as the memory
requirements for code, data and cross reference memory spaces, and a set of
starting addresses pointing to statements which must be executed first, are
appended at the end of the symbol table file.
Finally, the file information table is used to transfer referenced file
names, flags to indicate whether they are input or output files, and addresses
pointing to their corresponding compiled READ and WRITE statements and
assigned buffer memories, to the simulator.
3.4 The SLS Simulator
SLS is a parallel simulator. Therefore, the first action it takes after
loading the files described in section 3.3 is to establish elaborate combinations
of data structures in order to imitate the parallel nature of digital hardware.
The most trivial solution to achieve this would be to repeat executing all
statements sequentially until all signal changes have propagated via
combinational logic to their storage devices. This solution is very wasteful
since many statements would be executed where no signals propagate at all. This
project therefore made use of an event driven strategy to determine which
statements are to be executed.
If an execution results in a change in the logic state of a variable, then
all statements which reference this variable are queued up for execution. This
process is repeated until all signal changes have propagated to their destination
storage elements. Details of this are provided in section 3.4.2.
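The event driven strategy can be illustrated with a short Python sketch. It is only a model of the idea described above: the representation of a statement as a `(destination, sources, function)` tuple and the names `build_xref` and `settle` are assumptions made for this example, not the internal format used by SLS.

```python
from collections import deque


def build_xref(statements):
    """Map every identifier to the indices of the statements that read it."""
    xref = {}
    for index, (_dest, sources, _fn) in enumerate(statements):
        for name in sources:
            xref.setdefault(name, []).append(index)
    return xref


def settle(statements, xref, values, changed_names):
    """Execute only statements affected by changes until nothing changes.

    Returns the number of statement executions. Combinational feedback
    loops are not handled and could keep the queue from emptying.
    """
    queue = deque()
    for name in changed_names:
        queue.extend(xref.get(name, []))
    executed = 0
    while queue:
        dest, sources, fn = statements[queue.popleft()]
        new_value = fn(*(values[name] for name in sources))
        executed += 1
        if new_value != values[dest]:          # a signal change is an event
            values[dest] = new_value
            queue.extend(xref.get(dest, []))
    return executed


# Three gates: n1 = a AND b, n2 = NOT n1, out = n2 OR c.
statements = [
    ("n1", ("a", "b"), lambda a, b: a & b),
    ("n2", ("n1",), lambda n1: n1 ^ 1),
    ("out", ("n2", "c"), lambda n2, c: n2 | c),
]
xref = build_xref(statements)
values = {"a": 0, "b": 1, "c": 0, "n1": 0, "n2": 1, "out": 1}
values["a"] = 1                                # input a rises
executed = settle(statements, xref, values, ["a"])
```

Changing `a` triggers all three gates in turn, whereas a later change of `c` alone would execute only the OR statement.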
3.4.1 Internal Data Structures and Simulation Setup 
Before simulation can begin, the following interconnected data structures
are created by SLS:
1. Code Memory 
2. Data Memory 
3. Symbol Table 
4. Cross Reference Space 
5. File Table Space 
6. Clock Status Pointer List 
7. Trace List Index 
In the beginning, the compiled object code and the file information tables
are loaded directly into the Code Memory and File Table Space respectively (see
Figure 3-4 on page 34). Next, the symbol table file is accessed to establish the
internal symbol table and initialize Data Memory space for wires, storage
elements and file buffers. Most of the information found in the symbol table
file is transferred directly into the internal symbol table.
[Figure 3-4: Simulator Execution Model. Building blocks shown:
  Output Table: list of variables to be kept for the output trace list,
    obtained from the .FMT file.
  Code Memory: dynamically allocated memory for compiled simulation
    instructions.
  FIFO Queue: holds start addresses of statements scheduled for execution.
  Control Unit: supervises simulation and provides the user interface (console).
  Symbol Table: stores identifiers and their parameters.
  Output Interface: generates the intermediate trace list.
  Data Memory: dynamically allocated memory for wires, registers, memories
    and file buffers.
  Execution Unit: fetches and executes instructions located in statements.
  File Access Unit: interface to access sequential input and output data files.
  Crossref. Tables: dynamically allocated list to provide code addresses for
    activating new statements.
  Crossref. Unit: fetches start addresses of statements which depend on
    modified variables.
  LIFO Stack: temporary data storage; capacity 512 words.
  Clk State Table: list of pointers pointing to clock states of registers,
    memories and file buffers.
  Error Handling: invoked upon occurrences of fatal errors.]

For every identifier being loaded, appropriate Data Memory space is
allocated to store the contents and additional information such as the amount of
bytes required per data word. Obviously, one byte is sufficient to hold the logic
levels of 1-bit strands up to 8-bit wide busses. The contents will be initialized
if this is specified in the source code. Furthermore, the addresses pointing to
their cross reference tables are provided. A separate cross reference memory is
allocated to hold the lists of statement start addresses for every identifier. For
clocked elements and file buffers, a clock status byte to store the previous and
current clock state and buffer space to accept incoming data are also provided.
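The bookkeeping kept per identifier can be pictured with a small Python record. The field names and the class `DataEntry` are hypothetical; the sketch only shows the byte-count rule for bus widths of up to 32 bits and the extra fields that clocked elements carry, not the actual memory layout of SLS.

```python
from dataclasses import dataclass, field


def bytes_per_word(bit_width):
    """Whole bytes needed to hold one data word of the given width."""
    if not 1 <= bit_width <= 32:
        raise ValueError("SLS busses are limited to 1..32 bits")
    return (bit_width + 7) // 8


@dataclass
class DataEntry:
    name: str
    bit_width: int
    value: int = 0                 # contents, initialized if specified
    clocked: bool = False          # registers, memories and file buffers
    xref: list = field(default_factory=list)   # statement start addresses
    clock_status: int = 0          # previous and current clock state
    pending: int = 0               # buffer for incoming data

    @property
    def word_bytes(self):
        return bytes_per_word(self.bit_width)


r_entry = DataEntry("R", 16, clocked=True)
sub_entry = DataEntry("Sub", 1)
```

With this rule a single wire and an 8-bit bus both occupy one byte, while the 16-bit register R of the square root example occupies two.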
Before simulation starts, a Clock Status Pointer List, shown in Figure 3-4,
is installed to permit easy sequential access to the clock status bytes of all 
clocked elements located in the data memory. This table is also useful to check 
for rising edges and glitches. 
In most simulations, a trace list is desired. The names of all identifiers 
must be supplied in order to capture the required data. The names are obtained 
from the Format Specification File (see section 3.5). The simulator looks up
every name and keeps the corresponding data memory addresses. During
simulation, after all signal changes have settled, the clock state and the contents
of selected wires and storage elements are passed to the intermediate trace file.
As explained in section 3.5, the Output Generator converts this trace file into a 
tabulated output listing. 
3.4.2 Internal Building Blocks and Simulator Execution 
The simulator contains the following major building blocks: 
1. Execution Unit
2. Cross Reference Unit
3. FIFO Queue
4. File Access Facilities
5. Control Unit
The Execution Unit, located in the center of Figure 3-4, is based on a
stack-oriented symbolic calculator. It is invoked whenever a statement must be
executed. The instruction set is based on Reverse Polish Notation. Instructions
are executed sequentially. Available instructions include numeric constants
being pushed onto the stack, access to wires, registers, memory arrays and file
buffers, all necessary arithmetic and logic operations and delimiters indicating
end of statements.
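A stack-oriented evaluator of this kind can be sketched in Python as follows. The opcode names (`CONST`, `LOAD`, `NOT`, `AND`, `OR`, `STORE`, `END`) and the tuple encoding are invented for this illustration; the actual SLS instruction set is not reproduced here.

```python
def execute(code, data):
    """Run one compiled statement given in Reverse Polish Notation.

    Returns True if the STORE changed the destination, which is the
    condition under which dependent statements would be scheduled.
    """
    stack = []
    changed = False
    for opcode, operand in code:
        if opcode == "CONST":
            stack.append(operand)
        elif opcode == "LOAD":
            stack.append(data[operand])
        elif opcode == "NOT":
            stack.append(~stack.pop() & operand)   # operand is the bus mask
        elif opcode in ("AND", "OR"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left & right if opcode == "AND" else left | right)
        elif opcode == "STORE":
            value = stack.pop()
            changed = data[operand] != value
            data[operand] = value
        elif opcode == "END":                      # end-of-statement delimiter
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return changed


# Y := A*B + -C on 4-bit busses, i.e. (A AND B) OR (NOT C).
code = [
    ("LOAD", "A"), ("LOAD", "B"), ("AND", None),
    ("LOAD", "C"), ("NOT", 0xF), ("OR", None),
    ("STORE", "Y"), ("END", None),
]
data = {"A": 0b1100, "B": 0b1010, "C": 0b0111, "Y": 0}
first_run_changed = execute(code, data)
second_run_changed = execute(code, data)
```

Because operands precede their operator, no parentheses or precedence rules are needed at run time; the compiler resolves them once when it emits the code.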
Whenever data is written into memory and the contents are changed, the
Cross Reference Unit is accessed to place start addresses of all statements
affected into the FIFO Queue. Whenever the execution unit completes a
statement, the start address of another statement is fetched from the front of
the queue and the execution continues.
The File Access Facility is in charge of reading and writing data files. A
warning is sent to the control unit if the file cannot be opened or if the end of
file has been reached.
The Control Unit supervises the entire simulation process. Both batch
mode and interactive simulation methods are possible. Batch mode is invoked
by passing command line parameters to the simulator. Available parameters
instruct the simulator to simulate a given number of cycles or continue until a
warning condition, i.e. end of an input file, has occurred. There are commands
to determine whether to exit after simulation is complete, suppress status
displays and generate intermediate trace files. In interactive mode, a limited
set of commands is available to inspect and alter the contents of wires,
registers and memory arrays, and perform stepwise or fast simulation.
Before simulation can begin, all statements are executed once in order to
flood initial values into the system and assure consistency between inputs and
outputs of logic elements. Therefore, an initial list of statements to be executed
first must be placed in the FIFO queue. This list contains the cross reference
listings of the clock signal and of all storage elements. Statements assigning
logical constants to variables are also included. To accomplish this, these
statements are detected by the compiler and their start addresses are provided
at the end of the symbol table.
At the beginning of every simulation cycle, the clock signal is complemented
and all statements which depend directly on the clock signal are entered in the
queue. After all signal propagation triggered by this has been completed, the
clock status bytes of all clocked elements are checked for rising edges and
possible glitches. If a rising edge is detected, the data pending in the input
buffer of a latch (a simulator convenience) is transferred to the latch or to
the selected memory location. Since the outputs from such elements may change,
time must be given to allow them to propagate to further storage devices. This
procedure is repeated until no more signal changes occur, and then a copy of
the signals is sent to the trace file.
3.5 The Output Generator 
The output generator converts the trace files into properly formatted docu-
ments. The major features of this generator include displaying numerals in 
common radices (binary, base-4, octal, decimal and hexadecimal), displaying 
negative numbers (sign-magnitude, 1's and 2's complement numbers) and
accessing conversion files to represent data in symbolic form.
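The radix and negative-number options can be illustrated with a short Python sketch. The function names `to_radix` and `encode`, the mode strings and the fixed bit width are assumptions chosen for this example; the output generator's own conventions are defined by the Format Specification File described below.

```python
DIGITS = "0123456789ABCDEF"

def to_radix(n, base, width):
    """Render a non-negative integer in the given base, zero-padded to `width`."""
    out = ""
    while n > 0:
        out = DIGITS[n % base] + out
        n //= base
    return out.rjust(width, "0")

def encode(value, bits, mode):
    """Return the bit pattern (as a non-negative int) representing `value`."""
    if value >= 0:
        return value
    if mode == "sign-magnitude":
        return (1 << (bits - 1)) | -value      # sign bit plus magnitude
    if mode == "ones":
        return ~(-value) & ((1 << bits) - 1)   # invert every magnitude bit
    if mode == "twos":
        return value & ((1 << bits) - 1)       # wrap modulo 2**bits
    raise ValueError(mode)

# The value -5 in an 8-bit field, shown in hexadecimal for each convention
patterns = {m: to_radix(encode(-5, 8, m), 16, 2)
            for m in ("sign-magnitude", "ones", "twos")}
```

The same bit pattern can then be printed in any of the supported radices by changing the `base` argument.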
Every page of the final output file starts with a header and follows with a 
listing of the contents of requested records. Timing information is supplied in 
the left hand column. A maximum of 10 lines is allowed for every record. The
Format Specification File supplies the header and specifies the desired output 
format including a list of identifiers with desired formats and, optionally, con-
version file names. 
The SLSL square root program of Figure 3-2 uses the format specification
file shown in Figure 3-5. The first line of this file shows the number of lines to be
reserved for the original header, the number of conversion files and their names.
This is followed by the actual header.
The list of identifiers includes the identifier name, the row and column position
at which to print it, the number of digits and a representation indicator containing
either a letter specifying the radix (optionally preceded by a sign to determine the
type of negative numbers), or a number referring to a conversion file.
5 1 STATES
Square Root Logic, Written by Georg A. zur Bonsen
Nov 1, 1987, Solution is in B-Reg
                 A      R                         B
State   Operand  Reg    Reg    Right  Sum    Sub  Reg
-----------------------------------------------------
State    1 1 6  0
Operand  9 1 5 -D
A       18 1 5  D
R       25 1 5 -D
Right   32 1 5 -D
Sum     39 1 5 -D
Sub     46 1 1  B
B       51 1 3  D
Figure 3-5: Format Specification for Square Root Program 
Conversion files contain a list of symbols. Using this file, the number i is
converted to the (i+1)th symbol. The STATES file referred to in Figure 3-5 is shown
in Figure 3-6.
ZERO LOAD SECOND THIRD FOURTH FIFTH SIXTH 
SEVENTH EIGHTH 
Figure 3-6: The Conversion file used with Square Root Circuit 
The contents of the identifier STATE are represented by the symbols shown
in this file. For example, FIFTH is printed in place of 5. If no symbol is found 
corresponding to a value, question marks are printed instead. 
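The conversion rule can be written down in a few lines of Python. The helper names and the number of question marks printed for an unknown value are assumptions; only the symbol list itself is taken from Figure 3-6.

```python
STATES_FILE = """ZERO LOAD SECOND THIRD FOURTH FIFTH SIXTH
SEVENTH EIGHTH"""

def load_symbols(text):
    """Split the contents of a conversion file into its list of symbols."""
    return text.split()

def to_symbol(value, symbols, width=6):
    """Number i selects the (i+1)th symbol; unknown values print as '?'."""
    if 0 <= value < len(symbols):
        return symbols[value]
    return "?" * width     # assumed field width for the question marks

symbols = load_symbols(STATES_FILE)
```

With this table, `to_symbol(5, symbols)` yields FIFTH, in agreement with the example given in the text.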
A typical output from the simulator due to the SLSL program of Figure 
3-3 using format and conversion files of Figure 3-5 and 3-6 is shown in Figure 
3-7. 
3.6 The Preprocessor 
Due to the structured, specifically repetitive and hierarchical nature of
logic designs, the SLSL code may turn out to be very large if no control
sequences are available to allow automatic repetition and macro blocks. This
facility is provided in SLSL by a preprocessor which reads the source code,
recognizes dedicated control fields, interprets them and converts the highly
structured design into a flat logic design where every individual element is
shown.
SQARE_ROOT                        02/14/1988 18:22:26    Page 1
         Square Root Logic, Written by Georg A. zur Bonsen
         Nov 1, 1987, Solution is in B-Reg
                            A      R                     B
         State   Operand   Reg    Reg  Right    Sum Sub Reg
-----------------------------------------------------------
000000 1 ZERO      10000  00000  00000 -00005 -00005 -1 000
000001 1 LOAD      00200  10000 -00001  00003  00001  0 000
000002 1 SECOND    00200  07232  00001 -00005  00000  1 000
000003 1 THIRD     00200  12544  00000 -00013 -00010  1 001
000004 1 FOURTH    00200  01024 -00010  00027 -00013  0 003
000005 1 FIFTH     00200  04096 -00013  00051  00000  0 006
000006 1 SIXTH     00200  00000  00000 -00101 -00101  1 012
000007 1 SEVENTH   00200  00000 -00101  00203 -00201  0 025
000008 1 EIGHTH    00200  00000 -00201  00403 -00401  0 050
000009 1 ZERO      00200  00000 -00401  00803 -00801  0 100
000010 1 LOAD      00000  00200 -00001  00003 -00001  0 000
000011 1 SECOND    00000  00800 -00001  00003 -00001  0 000
000012 1 THIRD     00000  03200 -00001  00003 -00001  0 000
000013 1 FOURTH    00000  12800 -00001  00003  00002  0 000
000014 1 FIFTH     00000  02048  00002 -00005  00003  1 000
000015 1 SIXTH     00000  08192  00003 -00013  00001  1 001
000016 1 SEVENTH   00000  00000  00001 -00029 -00025  1 003
000017 1 EIGHTH    00000  00000 -00025  00059 -00041  0 007
000018 1 ZERO      00000  00000 -00041  00115 -00049  0 014
Warning: End of file reached in input data file: INPUT
000019 1 LOAD      00000  00000 -00001  00003 -00001  0 000
Figure 3-7: Output from a Simulation of logic of Figure 3-3 
3.6.1 The Preprocessor Language 
All preprocessor statements must be located inside preprocessor control 
fields. These fields are initiated and terminated with specific control characters 
and can be inserted anywhere in the source code. 
The preprocessor language is comparable to a very simple subset of Pascal and
provides constructs such as FOR, WHILE and REPEAT loops, conditional statements,
constants, integer math with a wide variety of arithmetic, logic and relational
operators, and definition of macro blocks with parameter passing. The
preprocessor concept allows one to create multiple copies of SLSL code located
inside a REPEAT-UNTIL loop such as
... &[ a:=0; repeat ] ......... &[ a:=a+1; until a>=10 ] ...
Further, insertion of calculated numeric expressions may also be ac-
complished by control fields %[ expr ]. For example, in the above loop, %[a] in-
serts the value of integer a into the code. Similarly, a statement like 
WIRE &[ for i:=0 to 7 do]
Bus%[i*4] [32],&[end] Solution[32];
is flattened by the preprocessor into the conventional SLSL hardware 
description. Note that the preprocessor language is independent of
the simulator source code. The identifiers defined in the preprocessor code do 
not affect any simulator identifiers or vice versa. 
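The flattening of such a FOR loop can be imitated with the following Python sketch. It is a simplified stand-in rather than the preprocessor's algorithm: the function `expand_for`, the use of a regular expression and the evaluation of `%[expr]` fields with Python's `eval` are all assumptions of this example, and the loop body is written without embedded blanks.

```python
import re

def expand_for(body, var, start, stop):
    """Repeat `body` for var = start..stop, replacing each %[expr] by its value."""
    pieces = []
    for value in range(start, stop + 1):
        def substitute(match):
            # Evaluate the integer expression with only the loop variable visible.
            return str(eval(match.group(1), {"__builtins__": {}}, {var: value}))
        pieces.append(re.sub(r"%\[(.*?)\]", substitute, body))
    return "".join(pieces)

# Loop body of the WIRE example: one bus declaration per iteration
flat = "WIRE " + expand_for("Bus%[i*4][32],", "i", 0, 7) + "Solution[32];"
```

The eight iterations produce the identifiers Bus0, Bus4, and so on up to Bus28, each declared 32 bits wide, followed by the unchanged declaration of Solution.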
Macro blocks can be used to avoid coding a subcircuit repeatedly when it
is used in several places. The subcircuit is defined once and referenced wherever
needed. Similar to functions found in Pascal and "C", parameters can be
passed and local variables can be defined. Note that parameters are not
limited to identifiers or expressions. Anything, including whole statements or
just simple operators or function names, can be passed.
The program of Figure 3-8 illustrates the use of Macro blocks, parameter 
passing and local variables. The definition of NAND latch references three 
parameters. The first two parameters may contain any expressions for one-bit 
set and reset signals. Inside the macro block, these parameters are referenced 
with a % symbol. Other identifiers which start with a % are considered to be 
local wires, registers and memory elements. In the example,
%QBAR is a local wire. All other identifiers without % are considered global and
assume the same value during every macro call. In the example, CLOCK is a 
global wire. Macro blocks can be called up by specifying the name and its 
quoted parameter strings inside a control field. During preprocessing, the con-
tents of the macro block are copied into the code. 
&[ Macro Nand_Latch(%Set, %Reset, %Q) ]
   WIRE %Qbar;
   %Qbar := ~(%Set * %Q * Clock);
   %Q := ~(%Reset * %Qbar * Clock);
&[ Endm ]
   .
   .
&[ nand_latch('Flag8','Res*(Enable+Intr)','Indicator') ]
   .
   .
Figure 3-8: A SLSL Program using Macro Blocks 
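The textual nature of macro expansion may be sketched in Python as follows. The function `expand_macro` and, in particular, the way local identifiers are made unique (a suffix derived from a call counter) are assumptions of this sketch; the thesis does not state how the preprocessor names local wires in the flat output.

```python
import re

def expand_macro(body, params, args, call_id):
    """Textually expand one macro call.
    params:  formal parameter names, e.g. ['Set', 'Reset', 'Q']
    args:    the quoted parameter strings supplied by the call
    call_id: number used to make local identifiers unique (assumption)."""
    binding = dict(zip(params, args))

    def substitute(match):
        name = match.group(1)
        if name in binding:
            return binding[name]        # parameter: paste the argument text
        return f"{name}_{call_id}"      # local identifier: rename per call

    return re.sub(r"%(\w+)", substitute, body)

body = ("WIRE %Qbar;\n"
        "%Qbar := ~(%Set * %Q * Clock);\n"
        "%Q := ~(%Reset * %Qbar * Clock);")
flat = expand_macro(body, ["Set", "Reset", "Q"],
                    ["Flag8", "Res*(Enable+Intr)", "Indicator"], 1)
```

Because arguments are pasted as text, an entire expression such as Res*(Enable+Intr) can be passed, while the global wire Clock is left untouched.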
3.6.2 Preprocessing Procedure 
Preprocessing consists of two steps: recognizing and compiling control
fields, and using them to generate the flat output code. 
The compiler scans through the structured hardware description file to 
detect and compile preprocessor control fields. A table of identifiers is built to 
keep track of integers, macro names, local and parameter identifiers, and an in-
ternal object code is generated for the stack-oriented symbolic calculator to 
evaluate all expressions and control preprocessing.
The object code is based on Reverse Polish Notation and provides instructions
which include numeric constants being pushed onto the stack, access to random
access memory, arithmetic, logic and relational operators, jump statements
and a set of instructions to handle macro blocks. Since the control fields may be
distributed throughout the source code, an associative reference table is es-
tablished to find the object code starts of all such fields. 
After compilation is complete, actual preprocessing begins immediately. 
The preprocessor rewinds the source code file and starts scanning the code. 
During this process, a copy of the code is written to the output file. Whenever 
the beginning of a control field is reached, the associative reference table is
accessed to obtain the address of the corresponding compiled code block. The code
is executed until a delimiting instruction requesting either continuation of
scanning or end of execution is detected.
Chapter 4
Background of High-Speed Multipliers 
4.1 Introduction 
Many present computer applications such as scientific, medical,
meteorological, economic, defense and signal processing require extensive
numerical computations. A wide range of parallel architectures has been
introduced in [26] to satisfy these computational demands. Section 4.2 of this
chapter summarizes the fundamental concepts of parallel processing. Current
research results in cellular array multipliers, and in particular in the
implementation and acceleration of parallel multipliers, are described in
Sections 4.3 to 4.5. Finally, the design objectives of the proposed arithmetic
processor are explained in Section 4.6.
4.2 Parallel Processing, an Introduction 
Parallelism is categorized into two principal types called spatial and 
temporal parallelism. Spatial parallelism is a result of multiplicity of proces-
sors, arithmetic units, memory arrays and other subsystems. In principle, an 
n-fold duplication should increase the speed of a spatially parallel system n 
times, but bottlenecks in shared hardware, e.g. data busses and memory
banks, as well as the structure of application software do not allow this
theoretical gain to be realized [26].
Temporal parallelism, on the other hand, is achieved by dividing long data 
paths through combinational logic elements into multiple latched segments 
[1, 26]. This implementation, known as pipelining, reduces the propagation 
delay between any two latched stages and allows multiple processes to run 
through the pipeline simultaneously. Use of temporal parallelism requires a set 
of pipeline registers and a controller in charge of data flow. No duplication of 
original hardware is necessary for temporal parallelism. A common application 
of this technique is found in CPUs to accelerate throughput by pipelining the 
individual processing steps such as instruction fetch, operand access, execution 
and storage of results. 
Generally, a parallel processor executes algorithms more rapidly. However,
duplication of hardware does not always result in a proportional increase of
performance. There are two principal factors which inhibit the full
exploitation of parallelism. Firstly, the algorithms used cannot utilize all of
the parallel resources. Secondly, hardware limitations such as mismatched
bandwidths cause data flow bottlenecks. Solutions to this problem have been
proposed by balancing bandwidths, e.g. by interleaving memory blocks and
introducing hierarchically organized memory subsystems [26], but they are not
comprehensive.
Loss of performance in a parallel configuration is also caused by multiple 
interacting processors using common shared devices such as busses and memory 
units. Specifically, processors must wait until they get their turn to access 
shared busses and memory arrays. Similarly, when multiple processors 
cooperate to compute an algorithm, the processor which finishes its task first 
must wait until the other processors have completed their tasks, thus lowering 
the efficiency. 
Efficiency of a code executed by a parallel processor also depends on the
locality of the code. The code locality is categorized into three types: temporal,
spatial and sequential. Small code blocks which are executed repetitively (for
example loops) exhibit a high degree of temporal locality. Spatial locality is
defined as predominantly accessing the same data within a given time span. 
Sequential locality refers to sequential execution of an instruction string with-
out many conditional jump statements. 
Parallel hardware should be tailored to exploit the localities in the
algorithms. If the intended code exhibits a high degree of temporal locality, the
hardware design should consider loading small blocks of instructions into fast
memory (cache, scratchpad registers, etc.) and accessing them more rapidly.
Parallel hardware can be optimized to take advantage of spatial locality by
providing a larger on-chip scratchpad memory or a cache memory. Sequential
locality can be exploited by pipelining the processor into several separate
instruction fetch and execution stages.
4.3 Parallel Architectures for High-Speed Multiplication 
This thesis derives architectures for high-speed numerical computations, 
in particular those involving multiplications. This section therefore summarizes 
various existing multiplier architectures, in particular, the parallel ones. 
One of the earliest multiplier architectures is the Shift-Add multiplier.
This simple algorithm examines consecutive multiplier bits and adds the
appropriately shifted multiplicand to the partial product register whenever the
examined bit is set [1, 27, 28, 29, 30, 31, 32]. The performance of shift-add
multipliers is limited by two factors. Firstly, n clock cycles are required to
examine all n multiplier bits, and secondly, the clock period must be long
enough to allow worst-case carry propagation. Some of the common remedies
proposed for this include use of carry-save adders, skip-zero multiplications
and multiple shifts.
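The shift-add scheme is easily expressed in software. The following Python sketch models unsigned operands only and also counts the additions performed; the function name and the returned tuple are choices made for this illustration.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit integers; returns (product, additions)."""
    product = 0
    additions = 0
    for i in range(n):                        # one step per multiplier bit
        if (multiplier >> i) & 1:             # examined bit is set
            product += multiplicand << i      # add the shifted multiplicand
            additions += 1
    return product, additions

product, additions = shift_add_multiply(13, 11, 4)   # 11 = 1011 in binary
```

The loop always runs n times, which corresponds to the n clock cycles mentioned above, while the number of additions equals the number of 1's in the multiplier.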
The carry propagation may be removed by replacing the carry propagation
adder with two carry-save adders. These adders generate the final product in
two components (partial sum and carry) which are added together by a carry
propagation adder at the end [1, 31].
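A bit-level Python model may help to show why carry-save addition avoids carry propagation inside the loop. The sketch below is a simplification that uses a single carry-save adder per step rather than the two-adder arrangement mentioned above; the function names and the choice of a 2n-bit working width are assumptions.

```python
def carry_save_add(x, y, z, n):
    """Reduce three n-bit vectors to a partial sum and a carry vector.
    Bit i of the carry vector has weight 2**(i+1)."""
    mask = (1 << n) - 1
    partial_sum = (x ^ y ^ z) & mask                    # per-bit sum
    carry = ((x & y) | (x & z) | (y & z)) & mask        # per-bit majority
    return partial_sum, carry

def multiply_carry_save(a, b, n):
    """Unsigned n-bit multiplication keeping the product in carry-save form."""
    width = 2 * n
    ps, pc = 0, 0
    for i in range(n):
        addend = (a << i) if (b >> i) & 1 else 0
        # No carry ripples here: every bit position is handled independently.
        ps, pc = carry_save_add(ps, pc << 1, addend, width)
    return ps + (pc << 1)     # single carry propagation addition at the end

s, c = carry_save_add(5, 3, 6, 4)
```

Each full adder only looks at its own three input bits, so the delay per step is one full adder delay regardless of the word size.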
One other method to increase throughput is to replace additions of a zero 
vector to the product register when the currently examined multiplier bit is zero 
by shifts. This reduces the number of additions from n to the weight of the
multiplier, i.e. its number of nonzero bits [1, 27, 28, 31, 32].
The number of additions may be further reduced by recoding the multiplier
bits into signed binary digits in order to reduce the number of nonzero
digits [28]. For example, using the signed digit form 000100000T0000 (where T
denotes the digit -1) instead of 00001111110000 reduces the multiplication from
six additions to one addition and one subtraction.
Two algorithms, known as the Canonical Multiplier Recoding Algorithm
[28] and Booth's Algorithm [1, 31, 33], have been developed to recode the
multiplier. Booth's Algorithm examines two bits at a time to detect nonzero
strings. This algorithm is not an optimum one because it interprets all
isolated zeros inside nonzero strings as endings of one string and beginnings
of another string. The Canonical Algorithm recodes better than Booth's
Algorithm since it examines two bits plus the high bit of the lower pair
previously examined. It detects one third of all isolated zeros. Optimal
recoding may be achieved with more complex combinational logic to detect all
isolated zeros and thus cut down on additions.
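To make the effect of signed-digit recoding concrete, the following Python sketch computes the standard non-adjacent form of an integer, in which no two neighbouring digits are nonzero. This is a textbook routine used here for illustration only; it is not claimed to be the bit-pair scanning procedure of [28] or Booth's Algorithm as described above.

```python
def signed_digit_recode(value):
    """Return signed digits (least significant first), each in {-1, 0, 1},
    such that no two adjacent digits are both nonzero."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 if the low bits are 01, -1 if 11
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits

digits = signed_digit_recode(0b00001111110000)
nonzero = sum(1 for d in digits if d != 0)
restored = sum(d << i for i, d in enumerate(digits))
```

For the bit string with six consecutive 1's used in the example above, the recoded form contains only two nonzero digits, a +1 and a -1.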
One may also examine two or more multiplier bits in a cycle and add an
appropriate multiple of the multiplicand to the product register. This
nonoverlapped scanning may be combined with recoding to get overlapped scanning
where the current bit group is appended with the LSB of the next higher group.
This method detects k-length strings of 1's and reduces them to a single
addition and subtraction [1, 28, 31, 32].
A completely different approach to multiplication was proposed by Wal-
lace [34]. He used multiple carry-save adders in a tree configuration to improve 
throughput significantly. The Wallace Tree adds shifted copies of the
multiplicand. The copies to be added are determined by the 1's in the multiplier.
The Wallace tree is one of the first known parallel multipliers which is built 
solely from combinational logic. However, the main disadvantage is the large 
and irregularly structured hardware of the tree. Hallin [35] has proposed a 
smaller Wallace tree (k-pass tree) which accepts partitions of the multiplicand 
in multiple subsequent passes. However, additional circuitry is required to add 
the intermediate products generated by the small tree to obtain the final 
product. 
4.4 Iterative Cellular Array Multipliers 
The introduction of Iterative Cellular Array Multipliers (ICAMs) brought
about a significant increase in multiplier performance. ICAMs are an extension
of sequential shift-add algorithms to iterative hardware
[1, 28, 31, 36, 37, 38, 39, 40, 41, 42, 43]. They use Carry-Save Adders (see
footnote 2) to avoid horizontal carry propagation. The cellular structure
consists of an (n-1) x (n-1) array of full adder cells and n x n AND-gates,
where n denotes the word size.
Footnote 2: N-bit Carry-Save Adders (CSAs) are arrays of N independent full adder
blocks. CSAs accept three N-bit vectors and generate a partial sum and a carry
vector, both N bits wide.
Pezaris [37] modified the iterative cellular array multiplier to support
2's-complement arithmetic by interpreting the most significant bit with a
negative weight. But this requires special adders which interpret data arriving
at one or more inputs with negative weights. In order to maintain numeric
consistency, the sum, carry-out or both may return negative weight bits. The
following table lists the four possible full adder types with the proper
interpretations of SUM and COUT.
Type 0:   a + b + c =   sum + 2*cout
Type 1:   a + b - c = - sum + 2*cout
Type 2:   a - b - c =   sum - 2*cout
Type 3: - a - b - c = - sum - 2*cout
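The consistency of the four adder types can be checked exhaustively. In the Python sketch below, each type is described by the signs of its inputs and outputs, read as Type 0: a+b+c = sum+2*cout, Type 1: a+b-c = -sum+2*cout, Type 2: a-b-c = sum-2*cout and Type 3: -a-b-c = -sum-2*cout. The table layout and the function name are assumptions made for this illustration.

```python
from itertools import product

# cell type -> ((sign of a, b, c), (sign of sum, sign of cout))
ADDER_TYPES = {
    0: ((+1, +1, +1), (+1, +1)),
    1: ((+1, +1, -1), (-1, +1)),
    2: ((+1, -1, -1), (+1, -1)),
    3: ((-1, -1, -1), (-1, -1)),
}

def generalized_full_adder(cell_type, a, b, c):
    """Return the (sum, cout) bits satisfying the weighted equation."""
    (sa, sb, sc), (ss, sco) = ADDER_TYPES[cell_type]
    value = sa * a + sb * b + sc * c
    for s, cout in product((0, 1), repeat=2):
        if ss * s + 2 * sco * cout == value:
            return s, cout
    raise ValueError("no consistent output bits")

# Every type yields valid output bits for all eight input combinations.
table = {t: [generalized_full_adder(t, a, b, c)
             for a, b, c in product((0, 1), repeat=3)]
         for t in ADDER_TYPES}
```

That a solution exists for every input combination confirms that the sign assignments on the left and right hand sides cover the same range of values.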
The types of the adder cells to build an array must be selected correctly to 
obtain a positively weighted product with a single negative weighted sign bit 
(bit 2n-1). Hwang has derived two variations of Pezaris' 2's-complement mul-
tiplier [28]. The tri-section multiplier adopts Pezaris' interconnection pattern 
but assigns the types of full adders differently, whereas the bi-section array is
structurally different and uses only two different adder types. 
Baugh and Wooley also developed an alternative cellular array multiplier
which supports 2's-complement arithmetic but requires only the common type 0
full adder. The algorithm, which is necessary to interconnect the full adders
properly, is explained in detail in [28, 38] and [40]. Guild [44] has given a
procedure to fold the rectangular array into a triangular pattern to perform
all logic operations as early as possible, thus accelerating multiplication.
Takagi [45] has designed the architecture of an iterative cellular array
multiplier which is based on signed-digit arithmetic. These arrays use recoded 
operands and logic to reconvert the product into binary notation. 
The performance of ICAMs is constrained by the propagation delays of the
individual adder cells. Improved throughput is achieved by reducing the signal
path through combinational logic. Instead of using CSAs to add the product
terms generated in every row together, Stenzel [46] and Nakamura [41] have
proposed generalized counters to reduce the signal path through combinational
logic. The resultant architecture is a compromise between the high performance
of Wallace Trees [34] and the well structured Guild Arrays [44]. Another
approach to reduce the combinational delays is to replace each 4x4 full adder
array square by a single high-radix, say radix-4, Nonadditive Multiplier Module
(NMM) [28, 41]. Throughput improvement is achieved if the worst-case delay
inside one NMM is less than the equivalent delay through two CSA rows.
Suitable implementations for NMMs are two-level combinational logic and
ROM-based look-up tables. NMMs generate a series of product terms which
must be reduced into a single product word. This reduction can be accomplished
with bit-slice adders (Wallace trees), generalized counters [46], carry-save
adders and/or propagation adders.
The principal drawback of high-radix NMM based array multipliers is the
necessity of hardware to reduce the product terms into the final product. If one
instead uses an Additive Multiply Module (AMM), which accepts four k-bit inputs
A, B, C and D and generates the 2k-bit result AB+C+D, one avoids the final
addition [28, 41].
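That the result AB+C+D always fits into 2k bits follows from (2^k - 1)^2 + 2(2^k - 1) = 2^(2k) - 1, and can be confirmed with a short Python check. The function names are chosen for this illustration and unsigned operands are assumed.

```python
def amm(a, b, c, d, k):
    """Additive multiply module: A*B + C + D for unsigned k-bit inputs."""
    limit = 1 << k
    if not all(0 <= x < limit for x in (a, b, c, d)):
        raise ValueError("operand does not fit into k bits")
    return a * b + c + d

def worst_case_fits(k):
    """True if the largest possible result is exactly the largest 2k-bit number."""
    top = (1 << k) - 1
    return amm(top, top, top, top, k) == (1 << (2 * k)) - 1

checks = [worst_case_fits(k) for k in range(1, 17)]
```

The two additive inputs therefore use up exactly the headroom that a plain k x k multiplication leaves in a 2k-bit word.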
Kai Hwang [40] combined the advantages of AMMs and the 2's-
complementation technique derived by Baugh and Wooley [38] to develop the 
Programmable Additive Multiplier (PAM) for a modular array multiplier. PAM 
provides two control lines, which are to be hardwired, to configure every module 
to adapt to the irregularities found in Baugh-Wooley's cellular array multiplier. 
4.5 Pipelining Iterative Cellular Multipliers 
The principal drawback of the various multipliers summarized above is
the large propagation delay caused by very long data paths through combinational
logic elements. This problem can be alleviated by dividing the entire data path
into multiple latched segments. Pipelining does not reduce the delay between
submitting operands and retrieving their product; however, multiple
processes can be executed simultaneously in the same hardware. The resulting
throughput is one multiplication per clock cycle where the clock period is the
worst propagation delay between any two consecutive pipeline stages. Deverell
[47] and Hallin [35] considered inserting latches after every row of the Guild
[44] Multiplier Array to optimize the throughput.
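The distinction between latency and throughput can be quantified with a small timing model. The Python sketch below assumes a linear pipeline with ideal latches (no setup or clock-skew overhead) and arbitrary delay units; the stage delays in the example are invented for illustration.

```python
def pipeline_timing(stage_delays, operations):
    """Return (clock_period, latency, total_time) for a linear pipeline."""
    clock_period = max(stage_delays)        # slowest stage sets the clock
    stages = len(stage_delays)
    latency = stages * clock_period         # time for one operand pair
    total_time = (stages + operations - 1) * clock_period
    return clock_period, latency, total_time

# Hypothetical three-stage array processing ten multiplications
clock, latency, total = pipeline_timing([3, 5, 4], 10)
unpipelined_total = 10 * sum([3, 5, 4])
```

In this example a single multiplication takes 15 units instead of 12, yet ten multiplications complete in 60 units instead of 120, which also shows why the stage delays should be as even as possible.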
Pipelining Generalized Arithmetic Arrays which integrate operations of 
multiplication, square, division, remainder and square root has also been tried 
[28, 39, 47, 48]. 
In addition to these multiplication schemes based upon repeated shifts and
adds, one can also multiply numbers through quarter-square algorithms [49, 50]
and logarithmic look-up tables and linear interpolations [51].
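The quarter-square method rests on the identity ab = ((a+b)^2 - (a-b)^2)/4. A minimal Python sketch using a table of floor(x^2/4) is given below; it illustrates the general identity for non-negative integers and is not meant to reproduce the specific schemes of [49, 50].

```python
def quarter_square_table(max_arg):
    """Table of floor(x*x / 4) for x = 0..max_arg."""
    return [(x * x) // 4 for x in range(max_arg + 1)]

def quarter_square_multiply(a, b, table):
    """a*b = floor((a+b)^2 / 4) - floor((a-b)^2 / 4) for integers a, b >= 0."""
    return table[a + b] - table[abs(a - b)]

table = quarter_square_table(30)          # sufficient for 4-bit operands
product = quarter_square_multiply(13, 11, table)
```

Since a+b and a-b are either both even or both odd, the fractions discarded by the two table entries cancel, so the result is exact. The multiplication is thus replaced by one addition, two subtractions and two table look-ups.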
4.6 Latched Iterative Array Multiplier Based Processor 
The cellular array multipliers, which have been reviewed in previous
sections, support a very limited number of arithmetic features, namely simple
additions and multiplications. A small amount of additional circuitry may expand
this functionality to a complete processing environment. This thesis develops
one such processor module. It is called the LICAM Processor; it accepts single or
multiple operand pairs and takes a small number of clock cycles to complete most
tasks and return results.
It is desirable that the central element of such a processor be a carefully
optimized pipelined cellular array. Too few pipeline stages do not reduce the
worst-case propagation delay sufficiently. On the other hand, too many stages
result in too large a number of clock cycles to multiply a single pair of
operands. In addition, the distribution of latches plays an important role in
minimizing propagation delay. The latches should be distributed to divide the
total propagation delay into approximately even fractions [52].
The algorithms supported by the processor should be described with a
small number of microprogram-like statements which select the data to be sent
next into the cellular array. The array should be able to accumulate one
multiplication result per clock cycle. Additional logic circuitry should be
installed to permit subtractions. A local scratchpad memory should be available
to save intermediate values.
Since a variety of operations, e.g. Taylor Series approximations of
elementary functions, reference a series of fixed constants, the processor
should provide its own internal look-up table. Other applications frequently
reference simple constants such as small integers, e and π. Without a look-up
table, the computational throughput is inhibited since more data must be
transferred to the processor, which keeps shared busses and external control
processors busier.
For optimal exploitation of such a pipelined array, the processor architec-
ture, the control and the program structure should be organized to allow over-
lapped execution of the same algorithm on multiple data without using a mul-
tiport microcode memory. Further, whenever such overlapped operations do not 
fully exploit the entire capacity of the cellular array, the processor should utilize 
empty time slots for simple multiplications. 
From the perspective of the outer world, the LICAM processor should act
as a slave unit. All data flow should be governed by a host processor or other 
external control logic which is expected to be aware of the LICAM time delays. 
The proposed LICAM processor with these properties is described in Chapter 5.
The hardware description language SLSL is used to verify its
functionality. Chapter 6 summarizes some application algorithms for such a 
processor and their simulations. It also describes various methods to incor-
porate the LICAM processor into an external system environment. 
Chapter 5
Architecture of the LICAM Arithmetic
Processor
5.1 Introduction 
Some of the desirable properties of a processor based upon Latched Iterative
Cellular Array Multipliers (LICAM) were discussed in Chapter 4. This chapter
describes the design of a processor with these properties. It provides high-speed
computations with minimum logic circuitry. A throughput of at least one
addition or multiplication per clock cycle is obtained. To alleviate the loss of speed
due to frequent use of external data busses, this processor loads the instructions
corresponding to a repetitive algorithm into the processor and executes them for
a large amount of data in a pipelined fashion.
Given arbitrary 16-bit numbers a, b and c, the processor calculates the
following functions in one clock cycle.
f = ab + c        (5.1)
f = c - ab        (5.2)
f = ab - c        (5.3)
f = -(ab + c)     (5.4)
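A purely behavioural Python model of functions (5.1) to (5.4) is sketched below. It assumes 16-bit 2's complement operands and a result wrapped to 32 bits; the function names, the mode strings and the wrapping behaviour are assumptions of this illustration and say nothing about how the hardware forms the result.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def licam_function(a, b, c, mode):
    """Evaluate one of the functions (5.1)-(5.4) on 16-bit operands."""
    a, b, c = (to_signed(x, 16) for x in (a, b, c))
    results = {
        "5.1": a * b + c,
        "5.2": c - a * b,
        "5.3": a * b - c,
        "5.4": -(a * b + c),
    }
    return to_signed(results[mode], 32)    # assumed 32-bit result word

# a = 3, b = -2 (0xFFFE as a 16-bit pattern), c = 5
values = [licam_function(3, 0xFFFE, 5, m) for m in ("5.1", "5.2", "5.3", "5.4")]
```

The four functions differ only in the signs applied to the product and to c, which is why a single array with selectable negation can serve all of them.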
In order to achieve high throughput without duplicating hardware, the
Arithmetic Unit is based on a Latched Iterative Cellular Array Multiplier. To
further improve the throughput, any free time slots are scheduled for
overlapped execution of two or more processes or for independent multiplications.
The processor is functionally separated into a latched multiplier and a Control
Unit to schedule preprogrammed algorithms.
Figure 5-1 shows the block diagram of the processor. The
Arithmetic Unit, located at the bottom right corner, contains the modified
Latched Iterative Cellular Array Multiplier (LICAM). This unit permits a maximum
throughput of two additions or subtractions and one multiplication per
clock cycle. Peripheral components are required to supply the data for the
Arithmetic Unit. These components include a Scratchpad RAM, a ROM serving
as a look-up table for fixed constants, an Input and Output Interface and a Data
Routing Network. Each of these components is controlled by the
microprogram-driven Control Unit. The following sections describe these units in detail.
5.2 Arithmetic Unit
The Arithmetic Unit is the workhorse of the processor and performs all 
additions, subtractions and multiplications on 16-bit operands to create 32-bit 
numbers. 2's complement integer number representation was used, but this 
may be modified to any fixed point representation. Since the data busses, which 
interconnect the Arithmetic Unit with peripheral units, are 16 bits wide, the
least significant 16 bits are returned as results unless the upper 16 bits are
explicitly requested. As shown in Figure 5-2, the Arithmetic Unit is
divided into the following five modules: 
1. Latched Iterative Cellular Array Multiplier (LICAM) 
2. Intermediate Carry-Save Adder 
3. Carry Propagation Adder 
4. Overflow Detector 
5. Output Selector 
[Figure 5-1: block diagram, not recoverable from the scan. Legible labels:
Sequential Inputs X and Y; Input Interface (interaction between Control Unit
and Input Interface); Memory Unit with Scratchpad RAM and Look-up Table ROM
(memory addresses; ports W, M0, M1, MR); Data Routing Network (operands for
programmed process X, Y; operands for simple multiplication; P, P-1); Control
Unit with Control Word and arithmetic control; Arithmetic Unit with Subsystems
1 to 3, Overflow Detection, CSA, CPA and Output Selector, operands A, B, C and
output P (final result to Output Interface). The following bits of the Control
Word are used: 23..20 assign to A, 19..16 assign to B, 15..12 assign to C,
11..08 assign to W (memory).]
Figure 5-1: Layout of LICAM Processor
56 
\) 
..... .,.. 
Figure 5-2: Layout of Arithmetic Unit
(Block diagram: the operands A, B and C, their overflow bits and the control signals NC, NP, SP, HI, FI from C_Reg enter the Latched Iterative Cellular Multiplier. The intermediate product passes, together with NP, SP, HI and FI, to the Carry-Save Adders and Complementation Logic, which negates the product and/or adds the previous product to the current product (Next_PS, Next_PC), and to the Overflow Detection, which provides two overflow bits, one for 16-bit and the other for 32-bit results. The Carry Propagation Adder generates a single-word 32-bit product. The Output Selector selects between the lower 16 bits of the current product and the upper 16 bits of the previous product. The result P with its overflow bit goes to the Output Interface and the Data Routing Network; FI is the control signal for the Output Interface.)
The LICAM accepts three 16-bit operands A, B and C from the Data Routing
Network and provides the product AB+C or AB-C split into two partial
products. These are then added together (optionally with the previous result)
in the intermediate carry-save adder. The carry propagation adder produces the
final 32-bit result. The overflow detector checks for possible overflows at
different stages of computation. Finally, the output selector selects the
desired 16 bits of the result and returns them to the Routing Network.
In order to perform different arithmetic operations, the following control 
signals must be passed along with the operands: 
• NC: Negate operand C 
• NP: Negate intermediate product 
• SP: Sum with previous result 
• HI: Recall upper half word of previous final product 
• FI: Final Result Indicator 
The first four control signals are referenced by the different modules of
the Arithmetic Unit. FI indicates that the result obtained is to be sent out. The
total delay of the five modules of the Arithmetic Unit is seven clock cycles. The
following subsections summarize the architecture of these five modules:
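The effect of the arithmetic control signals can be summarized by a small behavioral model. The following Python sketch is only an illustration of the description above, not the SLSL implementation: the function names are hypothetical, the seven-cycle pipeline delay is ignored, and HI and FI are left out because they only affect output selection.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def arithmetic_unit(a, b, c, prev=0, nc=0, np_=0, sp=0):
    """Behavioral result of one operation on signed 16-bit operands a, b, c.

    prev is the previous 32-bit result (assumed to be what Next_PS and
    Next_PC represent together).
    """
    product = a * b + (-c if nc else c)   # LICAM: AB+C, or AB-C when NC is set
    if np_:
        product = -product                # NP: negate the intermediate product
    if sp:
        product += prev                   # SP: sum with the previous result
    return to_signed(product & MASK32, 32)  # wrap to a 32-bit word
```

For instance, `arithmetic_unit(3, 4, 5, nc=1)` models AB-C and gives 7, while setting `np_=1` as well gives -7.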
5.2.1 The Latched Iterative Cellular Multiplier 
The LICAM, shown in Figure 5-3 on page 60, is a matrix arrangement of 
16x16+2 interconnected full adder cells. This array uses Baugh-Wooley's 
method to handle 2's-complement arithmetic. Details about this method are 
described in [38] and [40]. This cellular array has been expanded to add a third
operand C to the product AB. Before addition, C is passed through an array of
XOR gates in order to allow negation with the control signal NC. The
implementation of 2's complementation is accomplished by utilizing the unused
carry-in input of the full adder cell located in the top right corner.
Pipelining in the LICAM is achieved by inserting latches at the beginning,
at every fourth row and at the end of the array. This keeps the propagation
delay through combinational logic at acceptable levels. The LICAM produces an
intermediate product in the form of two 32-bit summands. Since the operand C is
added to a 32-bit number, negating the 16-bit C requires 1s to be supplied at
the 16 more significant bit positions. This is accomplished by utilizing the
unused inputs of the last full adder row.
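The negation of C described above can be checked with a short Python sketch. It assumes that C is sign-extended to 32 bits before the XOR stage, so that a non-negative C receives 1s in its upper 16 bits when NC is set; the function names are hypothetical and the multiplier array itself is replaced by a plain 32-bit addition.

```python
MASK32 = 0xFFFFFFFF

def sign_extend_16_to_32(c16):
    """Extend a 16-bit 2's-complement word to 32 bits."""
    c16 &= 0xFFFF
    return (c16 | 0xFFFF0000) if (c16 & 0x8000) else c16

def add_or_subtract_c(product32, c16, nc):
    """Return product + C (nc=0) or product - C (nc=1), modulo 2**32."""
    c32 = sign_extend_16_to_32(c16)
    xor_mask = MASK32 if nc else 0        # row of XOR gates controlled by NC
    # The +nc term plays the role of the otherwise unused carry-in input.
    return (product32 + (c32 ^ xor_mask) + nc) & MASK32
```

With `product32 = 0`, `c16 = 5` and `nc = 1` the result is `0xFFFFFFFB`, i.e. -5 as a 32-bit word, whose upper 16 bits are all 1.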
5.2.2 Intermediate Carry-Save Adder 
Figure 5-4 on page 61 shows the Intermediate Carry-Save Adders, the
Carry Propagation Adder, the Overflow Detection and the Output Selector. The
Intermediate Carry-Save Adder block is responsible for negating the intermediate
product and/or adding it to the previous result.
The intermediate product can be negated with the control signal NP.
Negation of a value represented by a partial sum and carry is achieved by
bit-complementing both words and incrementing the combined value by three. The
initial row of XOR gates takes care of the bit-complements and the increment is
accomplished by utilizing unused least significant bits of partial carry words
entering the CSAs.
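The increment by three can be verified numerically. The sketch below assumes that the partial carry word carries twice the weight of the partial sum word, i.e. the represented value is PS + 2*PC modulo 2^32; under that assumption complementing PS contributes -PS - 1 and complementing PC contributes -2*PC - 2, so adding 3 yields the exact negative. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def represented_value(ps, pc):
    """Value of a carry-save pair, with the carry word weighted by 2 (assumed)."""
    return (ps + (pc << 1)) & MASK32

def negate_pair(ps, pc):
    """Negate PS + 2*PC by complementing both words and adding 3."""
    ps_n = ps ^ MASK32                    # XOR row on the partial sum
    pc_n = pc ^ MASK32                    # XOR row on the partial carry
    return (ps_n + ((pc_n << 1) & MASK32) + 3) & MASK32
```

In the hardware the constant 3 is fed into free least significant carry positions rather than added by a separate adder; the sketch only confirms the arithmetic identity.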
Two 32-bit Carry-Save Adders are necessary to add the partial sum and
carry components of the current product and those of the previous result located
in the registers "Next_PS" and "Next_PC". The control signal SP enables the current
contents of these registers to be added to the intermediate product. In the
following clock cycle, these registers are updated with the new result.

Figure 5-3: 8x8 bit Modified Latched Iterative Cellular Array Multiplier.
The LICAM processor uses a 16x16 bit array with 258 full adder cells.
(Diagram: the operands A, B, C and the control signal NC arrive from the Data Routing Network and are held in registers; the cell array forms the partial products a_i b_j row by row, with the bits c' of the conditioned operand C entering the array; the Partial Sum Register delivers the intermediate product to the next Arithmetic Unit stages.)

Figure 5-4: Carry-Save and Carry-Propagation Adders
(Diagram: the intermediate product from the LICAM arrives as a 32-bit Partial Sum and a 32-bit Partial Carry together with HI, OV, NP, SP and FI. Rows of XOR gates precede the first Carry-Save Adder; AND gates feed the previous product from the Next_PS and Next_PC registers back into the second Carry-Save Adder. The Carry-Propagation Adder output (bits 31..0) goes to the Overflow Detection, which provides the 32-bit and 16-bit overflow signals. Bits 31..16 are held in the High-Order Latch; a 17 x 2->1 multiplexer selects between them and bits 15..0 and loads the Result Register P. FI passes through 2 delay latches and accompanies the 16-bit result with its overflow bit.)
5.2.3 Carry Propagation Adder 
The Carry Propagation Adder adds the final partial sum and carry stored
in "Next_PS" and "Next_PC" together and produces a single 32-bit word. Since
the slowest stage in the pipeline limits the pipeline throughput, it is necessary
to use either a carry look-ahead or a conditional sum adder [40], instead of
a ripple-carry adder at this stage.
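The division of work between the carry-save adders and the carry propagation adder can be illustrated as follows. This Python sketch is a functional model only: the function names are hypothetical, the carry word is assumed to be stored unshifted and weighted by 2, and Python's integer addition stands in for the carry look-ahead or conditional sum adder, whose speed advantage cannot be seen in such a model.

```python
MASK32 = 0xFFFFFFFF

def carry_save_add(x, y, z):
    """3:2 compression: reduce three words to a partial sum and partial carry.

    No carry ripples between bit positions, so the delay is one full adder.
    """
    partial_sum = (x ^ y ^ z) & MASK32
    partial_carry = ((x & y) | (x & z) | (y & z)) & MASK32
    return partial_sum, partial_carry

def carry_propagate_add(partial_sum, partial_carry):
    """Resolve a carry-save pair into a single 32-bit word."""
    return (partial_sum + (partial_carry << 1)) & MASK32
```

For any three 32-bit words, `carry_propagate_add(*carry_save_add(x, y, z))` equals `(x + y + z)` modulo 2^32.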
5.2.4 Overflow Detection Logic 
In the LICAM processor, all data words are associated with an overflow 
bit. This bit indicates that the number is no longer valid because of an overflow 
in an earlier operation. The invalidity of operands implies invalidity of results. 
The Overflow bit is transmitted through pipeline stages by "C_OV". 
The Overflow Generation Module shown in Figure 5-5 on page 63
produces two overflow bits indicating an overflow for a 32-bit or a 16-bit
result. The overflow bit for the result is derived from the following signals:
1. Overflow bits associated with operands A, B and C.
2. Most significant bits of the partial sum and partial carry outputs
of the upper CSA.
3. Most significant bit of the partial carry word fed from the
Next_PC register into the lower CSA.
4. Most significant bit of partial carry in NEXT_PC.
5. Carry-out bit of the carry propagation adder.
6. Bits 31..15 of the 32-bit product.
The overflow logic is based upon extending all words in an arithmetic
operation to 33 bits and using the most significant bit as an overflow bit.
Figure 5-5: Overflow Detection
(Diagram: the inputs are the overflow signals of operands A, B and C, the intermediate product (PS, PC) from the LICAM, and the previous NEXT_PS and NEXT_PC. The operands are extended by 1 bit by duplicating sign bits; the upper CSA uses a simplified implementation that duplicates the partial sum MSB. All carry-out bits (position 32) must be 'added', which is done with XOR gates that form the effective carry-out; this is XORed with the product sign bit of the 32-bit result. The remaining blocks, namely OR gates, an AND gate, the SP input and a latch, produce the 32-bit and 16-bit overflow outputs. Function V, derived from the 32-bit product, is V = NOT (bits 31..15 = all 0 OR bits 31..15 = all 1).)
The initial operands, namely both partial words originating from the 
LICAM, and the ANDed "NEXT_PS" are extended by duplicating their sign bits. 
The bit width of the upper CSA shown in Figure 5-5 remains the same since the
extension is possible by duplicating the sign bit of its partial sum. The effective 
carry-out bit is generated by XORing this extended bit with the most significant 
bits of all partial carry signals entering the lower CSA and the CPA and with 
the Carry-out signal originating from the CPA. Note that the most significant 
bits of all partial carry words have been left-shifted to position 32. The overflow 
signal for this set of adders is finally generated by comparing the effective carry-
out bit with the most significant bit of the final result. The overflow bit is
augmented with overflow signals originating from the operands. A latch holds the
overflow condition if more than two products are added together. Since this
overflow bit is valid for 32-bit results only, additional circuitry called Function V 
is installed to sense overflows if the result is truncated to a 16-bit word. 
Function V checks if all bits in the upper 17 bit positions (bits 31 .. 15) are equal. 
If not, or if the 32-bit overflow is active, then the 16-bit overflow indicator is set 
to declare the result invalid. 
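Function V and the 16-bit overflow indicator can be written down directly from this description. The following Python sketch uses hypothetical function names and treats the 32-bit overflow bit as an already computed input.

```python
def function_v(result32):
    """True if bits 31..15 of the 32-bit result are not all equal."""
    upper17 = (result32 >> 15) & 0x1FFFF
    return upper17 not in (0x00000, 0x1FFFF)

def overflow16(result32, overflow32):
    """16-bit overflow: set by Function V or by an active 32-bit overflow."""
    return bool(overflow32) or function_v(result32)
```

As a check, `0x00007FFF` (32767) and `0xFFFF8000` (-32768) still fit into a 16-bit word and leave V inactive, whereas `0x00008000` (32768) activates it.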
5.2.5 Output Selector Module 
The Output Selector Module selects either the lower 16-bit half-word of 
the current result or the upper half-word of the previous result. Normally, the 
lower half-word is referenced. For high-precision calculations, the entire 32-bit 
word can be fetched by recalling the lower half-word immediately followed by 
the upper half word. In this case, the control signal "HI" must be raised one 
clock cycle after the operands have been sent in. The Output Selector also 
selects the appropriate overflow condition. 
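The two-step recall of a 32-bit word can be sketched as a simple selection function. The Python model below is stateless and uses hypothetical names; the previous result is passed in explicitly instead of being held in the High-Order Latch, and the selection of the overflow condition is omitted.

```python
def output_select(current32, previous32, hi):
    """Return the 16-bit half-word chosen by the Output Selector.

    hi = 0: lower half-word of the current result.
    hi = 1: upper half-word of the previous result.
    """
    if hi:
        return (previous32 >> 16) & 0xFFFF
    return current32 & 0xFFFF

# High-precision read-out of one result over two consecutive cycles:
result = 0x12345678
low = output_select(result, 0, hi=0)      # cycle n: lower half-word
high = output_select(0, result, hi=1)     # cycle n+1: upper half-word, HI raised
full_word = (high << 16) | low
```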
5.3 Scratchpad RAM and Look-up Table ROM
The scratchpad memory shown in Figure 5-6 on page 66 is a medium for
temporary data storage for sixteen 17-bit words3. The architecture of the
memory block allows two simultaneous read accesses and one write access. This
memory is addressed by the 4-bit words "M0_Address" and "M1_Address" which are
supplied from the Control Unit by the microcode instructions. Both M0 and M1 can be
used to retrieve data. However, only M0 is available to write data to memory.
In order to write data to memory, the control word "C_Reg" must supply
information to route data from the source unit to memory. Numeric data including
the associated overflow indicator will be latched into register "M_Mdrw" before
being stored in the scratchpad RAM. In order to retrieve data, the address
must be supplied in the previous clock cycle to make the data available (see
Figure 5-1) in "M0" and "M1" for the Data Routing Network.
A ROM look-up table is provided to store sixteen 16-bit numeric constants.
Data retrieval from the ROM is identical to that described above. The
word recalled at "M1_Address" is available in "MR" for the Data Routing
Network.
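The access timing of the scratchpad can be modeled cycle by cycle. The Python sketch below is a rough approximation under stated assumptions: the class and method names are hypothetical, read addresses are latched one cycle before the data appears in M0 and M1, and a write is latched in the first cycle and committed in the second, which is one possible reading of the two-cycle write. The ROM is omitted.

```python
class Scratchpad:
    """Sixteen 17-bit words, two read ports (M0, M1), one write port (via M0)."""

    def __init__(self):
        self.ram = [0] * 16
        self.mar0 = 0            # read address latched for port M0
        self.mar1 = 0            # read address latched for port M1
        self.pending = None      # (address, word) waiting in the write latch

    def clock(self, m0_addr, m1_addr, write_word=None):
        """Advance one clock cycle and return the words visible in (M0, M1)."""
        # Data visible now belongs to the addresses supplied one cycle ago.
        m0, m1 = self.ram[self.mar0], self.ram[self.mar1]
        # A word latched in the previous cycle reaches the RAM now.
        if self.pending is not None:
            address, word = self.pending
            self.ram[address] = word
        # Latch this cycle's write request (assumed addressed by M0) and addresses.
        if write_word is None:
            self.pending = None
        else:
            self.pending = (m0_addr & 0xF, write_word & 0x1FFFF)
        self.mar0, self.mar1 = m0_addr & 0xF, m1_addr & 0xF
        return m0, m1
```

In this model a word written in cycle n can be read back through M0 or M1 in cycle n+2 if its address is supplied in cycle n+1.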
3The 17th bit is used to store the overflow condition.

Figure 5-6: Memory Subsystem
(Diagram: bits 31..28 and 27..24 of the C_Reg control word load the address registers M_Mar0 and M_Mar1, and bits 11..8 drive the write instruction generator (M_Wrt). Data from the Routing Network is latched in M_Mdrw and written to the Scratchpad RAM, 16 x 17 bits (16 data + 1 overflow bit), which is read through M0 and M1, each with data and overflow. The ROM-based Look-up Table, 16 x 16 bits, is read through MR (ROM data). The loaded data is sent to the Data Routing Network.)

5.4 Input Interface
The Input Interface accepts operands and a two-bit instruction from the
host. This instruction determines whether to multiply the operands or to
initiate a preprogrammed process. As shown in Figure 5-7, the Input Interface
sorts the operands into two queues based upon the instruction. The two signals
"Q_A1_Avai" and "Q_AP_Avai" report the presence of operand pairs for these
processes. The Control Unit recognizes these signals and replies with
"C_Adv_A1" or "C_Adv_AP" in order to forward the operand pair through the
multiplexer array into registers "X" and "Y" which are accessible to the Data
Routing Network.
5.5 The Control Unit 
The Control Unit of the processor is its brain and has been carefully
designed to optimize its throughput without using too much sophisticated
hardware. The program- and event-driven Control Unit controls the data flow
between the Arithmetic Unit, memory arrays and input/output devices in order
to calculate simple arithmetic functions. Whenever an operand pair waits in the
Input Interface queue, the controller recognizes whether to do a multiplication
or to initiate a programmed process. In the first case, the operands are directly
passed to the Arithmetic Unit and the product will be sent out immediately. In
the latter case, a preprogrammed process will be initiated and internal
(micro)code4 memory will be referenced to calculate the implemented operation.
The controller also determines in this latter case when to access new operands.
The unique architecture of the controller permits parallel execution of four
overlapped processes. Without parallel control, many implemented algorithms
would not fully utilize the Arithmetic Unit, especially when the controller is
waiting for results to be fed back. Overlapped processing utilizes the open
time slots. Operands waiting for simple multiplications fill the remaining open
slots.
4The code used in the controller differs from microcode in the ordinary sense. Since no dedicated jump instructions are included, the sequencing will be provided by a counter rather than a branch address extension.

Figure 5-7: Input Interface
(Diagram: the interface to the host or other CPUs delivers the external X and Y, which are sorted into the Operand Queue for simple multiplications, where X and Y are factors, and the Operand Queue for programmed processes, where X and Y are operands. 2 x 16 2->1 multiplexers load Registers X and Y, which lead to the Data Routing Network. Control signals to interact with the Control Unit: C_Adv_A1, load operands for simple multiplication; the A1 availability signal, operands for simple multiplication available; C_Adv_AP, load operands for programmed operation; the AP availability signal, operands for programmed operation available.)
A microcoded controller architecture allowing multiple processes running 
simultaneously is made possible by dividing the microcode memory into mul-
tiple banks to permit parallel access. The amount of parallelism is selected 
carefully to obtain a reasonable throughput within a limited amount of 
hardware. 
As shown in Figure 5-8 on page 70, the Control Unit consists of four
Subsystems, an Input/Controller Interaction Logic Block, a Multiplication Opcode
Generator and a Latched OR-gate Array. Each subsystem contains its own
microcode memory with a capacity of sixteen instructions and can run at most
one process. The entire controller has a capacity of 64 instructions.
A process is initiated by signals "Q_A1_Avai" or "Q_AP_Avai" indicating
availability of new operands. If "Q_AP_Avai" is active and the Input/Controller
Interaction Logic approves it, a new process is initiated. Every process starts with
the instruction located in the first memory location of Subsystem 0. After the
first 16 instructions are executed, the process is passed via the "Ck_Pass" line to
the next Subsystem in order to execute the next set of 16 instructions. Simple
multiplications are initiated if "Q_A1_Avai" is active and the Input/Controller
Interaction Logic approves it.
Parallel processing is achieved by utilizing two or more subsystems
simultaneously. All active subsystems generate data flow control words which are
combined in the Latched OR-gate Array (bottom right corner of Figure 5-8) to
produce the external control signals. Usually, while two or more subsystems are
active, only one subsystem sends an active control word. In some cases, however,
one subsystem manages the data flow into the two inputs A and B of the
Arithmetic Unit while another subsystem routes data into the third input C.
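The combination of control words in the Latched OR-gate Array can be pictured with a few lines of Python. The function name is hypothetical and the latch is not modeled; the sketch assumes that an inactive subsystem contributes an all-zero word and relies on the convention that unused selector fields are 0, so that a bitwise OR merges the fields without conflict.

```python
def combine_control_words(words, active):
    """Bitwise OR of the control words of all active subsystems."""
    c_reg = 0
    for word, is_active in zip(words, active):
        if is_active:
            c_reg |= word
    return c_reg & 0xFFFFFFFF

# Subsystem 0 routes X and Y into A and B, subsystem 1 routes P into C,
# subsystem 2 is active but idle, subsystem 3 is inactive.
words = [0x00890000, 0x00006000, 0x00000000, 0x12345678]
active = [True, True, True, False]
c_reg = combine_control_words(words, active)   # 0x00896000
```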
Figure 5-8: Control Unit
(Diagram: Subsystems 0 to 3, each with its own microcode memory, hand processes to the next subsystem over the pass lines: C0_Pass initiates a process in Subsystem 0, C1_Pass to C3_Pass pass processes on, and C4_Pass is a dead end. The Input/Controller Interaction Logic Block generates C_Skip and is connected to the subsystems by the Interaction Control Lines: Ck_XY, attempt to read data from input; Ck_A1I, permit initiation of simple multiplication; Ck_API, permit initiation of programmed operation (allows skip). Its external lines are (a) C_Adv_A1, load operands for simple multiplication; (b) C_Adv_AP, load operands for programmed operation; (c) Q_A1_Avai, operands for simple multiplication are available; (d) Q_AP_Avai, operands for programmed operation are available. The intermediate external control words generated by the subsystems and the output of the Multiplication Opcode Generator enter the Latched OR-gate Array, which generates the common set of external control lines, C_Reg.)
The following subsections describe the controller instruction formats and 
execution, initiation of a process and the data flow control. 
5.5.1 Instruction Set and Format 
A microcode instruction consists of 32 bits and establishes the data flow 
among the arithmetic and peripheral units during a clock cycle. The instruction 
format is shown in Figure 5-9 and the acronyms are defined in Table 5-1 on 
page 72. 
  M0 Address | M1 Address | A Select | B Select
  31      28 | 27      24 | 23    20 | 19    16

  C Select | Wrt Select | A1 | AP | DN | NC | NP | SP | FI | HI
  15    12 | 11       8 |  7 |  6 |  5 |  4 |  3 |  2 |  1 |  0

Figure 5-9: Instruction Format
5.5.2 Memory Addresses 
The scratchpad and look-up memories can be addressed with M0 and M1.
Due to the access delays, the memory address must be supplied in the previous
instruction (previous clock cycle) in order to have the requested data available.
Writing data to scratchpad memory takes two clock cycles and is accomplished
by sending both the address from the instruction and the data available in the
Data Routing Network.
5.5.3 Data Selectors 
The four vertically-coded selector words establish the data flow to the
inputs A, B and C of the arithmetic unit and to the scratchpad memory. Table 5-2
lists the codes and their corresponding data sources:
Code          Explanation
M0 Address    Address for bi-directional access to Scratchpad Memory
M1 Address    Address for read-access from both Scratchpad Memory and ROM
A             Source Unit Selector for input A of Arithmetic Unit
B             Source Unit Selector for input B of Arithmetic Unit
C             Source Unit Selector for input C of Arithmetic Unit
Wrt           Source Unit Selector for data to be sent to Scratchpad Memory
A1            Permits initiation of a simple multiplication
AP            Permits initiation of a new programmed process
DN            Terminates current process (Done)
NC            Negates operand C before it is added to product AB
NP            Negates the product AB+C to -(AB+C)
SP            Adds the previous product to the current product
FI            Declares the outgoing result as a final value which will be sent out
HI            Selects the upper 16 bits of the previous product

Table 5-1: Control Unit Control Signals
Select     Symbol   Explanation
0          0        Constant 0
1          1        Constant 1
2          M0       Scratchpad memory addressed by "M0 Address"
3          M1       Scratchpad memory addressed by "M1 Address"
4          R        ROM Look-up table addressed by "M1 Address"
5          W        Data which is sent to scratchpad memory in previous instruction
6          P        Current output of Arithmetic Unit
7          P-1      Previous output of Arithmetic Unit, delayed by one clock cycle
8          X        Input Queue X
9          Y        Input Queue Y
10 ... 15           Reserved for future use

Table 5-2: Selectable Sources
For example, if "A Select"=8, "B Select"=9, "C Select"=6, then the data
pending in inputs "X" and "Y" will be multiplied and augmented by the result
which is currently leaving the Arithmetic Unit. Due to the delay involved in the
latched Arithmetic Unit, the product given by the example shown above will be
available after seven clock cycles.
"W Select" routes data to the scratchpad memory. Writing takes place if
this select word is nonzero. In other words, the reserved constant 0 cannot be
written into memory directly. If a particular instruction is not to perform any
operations, then all selectors must be set to 0.
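The selector codes of Table 5-2 amount to a multiplexer in front of each destination. The Python sketch below reproduces the example above; the function name and the dictionary of source values are hypothetical, and the numbers are made-up sample data.

```python
# Symbols in the order of their select codes 0..9 (codes 10..15 are reserved).
SYMBOLS = ["0", "1", "M0", "M1", "R", "W", "P", "P-1", "X", "Y"]

def route(select, sources):
    """Return the word chosen by a 4-bit select code."""
    if not 0 <= select <= 15:
        raise ValueError("select codes are 4 bits wide")
    if select >= len(SYMBOLS):
        raise ValueError("codes 10..15 are reserved for future use")
    if select < 2:
        return select                      # constants 0 and 1
    return sources[SYMBOLS[select]]

# "A Select"=8, "B Select"=9, "C Select"=6: multiply X by Y and add P.
sources = {"X": 7, "Y": 9, "P": 100}
a, b, c = route(8, sources), route(9, sources), route(6, sources)
result = a * b + c                         # 163, available seven cycles later
```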
5.5.4 Control Signals 
Arithmetic control bits include "NC", "NP", "SP" and "HI". These control
signals must be supplied with the operands sent to the Arithmetic Unit. The
effects of these signals have been explained in Section 5.5.1.
The only interface-related control signal is "FI". It is sent with the last
operands through the Arithmetic Unit to the Output Interface, which is thereby
instructed to send the result out.
"A1", "AP" and "DN" are process-related control bits. "DN" executes the
currently loaded instruction and terminates the process immediately thereafter
by deactivating the hosting subsystem. "A1" is sent to the Input/Controller
Interaction Logic block to permit the initiation of a single multiplication. The
Input/Controller Logic block receives these signals from all subsystems and
makes an initiation if all subsystems give permission and at least two operands
are waiting in the Input Interface queue. "AP" permits the initiation of a
programmed process and the initiation can take place if operands are waiting in
the input queue and if the other subsystems give the same permission.
The following example summarizes the instruction set and format. The
instruction 00890006 (expressed as eight hexadecimal digits) describes the
following: The third (8) and fourth (9) digits correspond to routing the contents
of the left "X" and right "Y" registers of the Input Interface to registers "A"
and "B" in the Arithmetic Unit. The fifth digit (0) indicates that input "C" in the
Arithmetic Unit is held at zero. The last digit (6) corresponds to setting "SP"
and "FI" to 1.
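The field layout of Figure 5-9 can be unpacked mechanically. The following Python sketch decodes the example instruction; the function and field names are hypothetical spellings of the labels in the instruction format.

```python
# Single-bit flags from bit 0 upward, following the instruction format.
FLAGS = ["HI", "FI", "SP", "NP", "NC", "DN", "AP", "A1"]

def decode(word):
    """Split a 32-bit microcode instruction into its fields."""
    fields = {
        "M0_Address": (word >> 28) & 0xF,
        "M1_Address": (word >> 24) & 0xF,
        "A_Select": (word >> 20) & 0xF,
        "B_Select": (word >> 16) & 0xF,
        "C_Select": (word >> 12) & 0xF,
        "Wrt_Select": (word >> 8) & 0xF,
    }
    for bit, name in enumerate(FLAGS):
        fields[name] = (word >> bit) & 1
    return fields

example = decode(0x00890006)
# A_Select = 8 (X), B_Select = 9 (Y), C_Select = 0, SP = 1, FI = 1
```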
5.5.5 Initiating Simple Multiplications 
If the initiation of a simple multiplication is approved, the
Input/Controller Interaction Logic sets the signal "C_ADV_A1" to instruct the
Input Interface to forward the operands to the Data Routing Network, to
instruct the Multiplication Opcode Generator to dispatch a control signal that
pipes the operands directly into the Arithmetic Unit, and to instruct the Output
Interface to send the result out. This complex set of actions is accomplished by a
single 32-bit instruction.
5.5.6 Initiating Programmed Processes 
If "Q_AP_AVAI" indicates the presence of two operands for a
programmed process and the Input/Controller Interaction Logic approves the
initiation, a new process will be started in subsystem 0. As explained earlier, the
initiation of a programmed process has a higher priority than a simple
multiplication.
If an existing process is still running in subsystem 0, all active processes
must skip the yet unexecuted statements ahead and proceed with the first
instruction in the successive subsystems. Such a skip takes place whenever all
subsystems permit the initiation of a programmed process, regardless of whether
data is present or not. This requirement makes microprogramming easier. The new
process reads data from the Input Interface by activating "C_Adv_AP". This
signal is active whenever any subsystem attempts to read data from the Input
Interface.
5.5.7 Description of the Subsystems
As shown in Figure 5-10, every subsystem consists of a microcode memory
with a capacity of sixteen instructions, a program counter, an Active Status
Latch with Process Control logic and combinational logic blocks to generate
control signals for the Input/Controller Interaction Logic block and the next
subsystem.

Figure 5-10: Control Unit Subsystems.
(Diagram: the program counter Ck_PC, with its clear conditions, addresses the 16x32-bit microcode memory and signals Ck_PC=15. The control word read is the intermediate external control word generated by this subsystem. One block generates the control signals that interact with other subsystems over the Interaction Control Lines: Ck_A1I, allow initiation of multiplication; Ck_API, allow initiation of programmed process; Ck_XY, active if the instruction reads data from the Input Interface. The Active Status Latch and Process Control, driven by Ck_Pass, C_Skip and DN, provides Ck_Active. A further module passes processes to the next subsystem through Ck+1_Pass.)

The program counter consists of a 4-bit program counter register "Ck_PC"
(k refers to subsystem 0 through 3), an incrementer, reset gates and
logic to produce a signal which is high while the last (15th) instruction is
executed. The output of "Ck_PC" addresses the microcode memory and returns
the instruction to the outside if the subsystem is active. The counter is reset if
one of the following conditions is met:
1. The subsystem is inactive,
2. The "DN" signal in the current instruction is active,
3. The skip signal forces all processes to continue in the next subsystem
with the first instruction.
The Active Status latch "Ck_ACTIVE" indicates the presence of a process.
A subsystem is activated by the incoming "Ck_PASS" signal. As explained
before, "Ck_PASS" passes processes from one subsystem to another. The
subsystem remains active until one of the following conditions is met:
1. The last (15th) instruction is executed and control will be passed
down
2. The "DN" signal in the current instruction is active
3. The skip signal forces all processes to continue with the next
subsystem.
The module shown at the bottom right corner of Figure 5-10 contains the
combinational logic to generate the outgoing "Ck+1_PASS" signal. The
subsystem hands processes to the next one if "DN" is inactive and if at least one
of the following conditions is met:
1. The last (15th) instruction is executed
2. The skip signal forces the current process to continue with the next
subsystem.
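The reset, deactivation and pass conditions listed above can be collected into one next-state function. The Python sketch below is a simplified reading of those lists, not the SLSL code: the function name is hypothetical, and it assumes that an inactive subsystem never raises its outgoing pass signal.

```python
def subsystem_step(active, pc, dn, skip, pass_in):
    """One clock cycle of a Control Unit subsystem.

    active  : state of the Active Status latch (Ck_ACTIVE)
    pc      : program counter value 0..15 (Ck_PC)
    dn      : DN bit of the current instruction
    skip    : C_SKIP
    pass_in : incoming Ck_PASS from the previous subsystem
    Returns (next_active, next_pc, pass_out), pass_out being Ck+1_PASS.
    """
    last = active and pc == 15
    # Hand the process on unless DN terminates it.
    pass_out = active and not dn and (last or skip)
    # The current process stays here only if nothing ends or moves it.
    stays = active and not (last or dn or skip)
    next_active = stays or pass_in
    # The counter is held at 0 unless a process continues in this subsystem.
    next_pc = pc + 1 if stays else 0
    return next_active, next_pc, pass_out
```

For example, `subsystem_step(True, 15, False, False, False)` returns `(False, 0, True)`: the subsystem falls inactive after its last instruction and passes the process down.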
A set of control signals is generated to communicate with the
Input/Controller Interaction Logic shown in Figure 5-8.
"Ck_XY" is active whenever the current instruction contains information
to access data from the Input Interface. The Input/Controller Interaction Logic
activates "C_Adv_AP" whenever any subsystem raises "Ck_XY".
The outgoing control signals "Ck_A1I" and "Ck_API" give permission to
initiate straightforward multiplications and programmed operations
respectively. In any subsystem, if a process is active, these signals are directly
controlled by "A1" and "AP". Inactive subsystems automatically hold them active in
order to give continuous permission.
"C_SKIP" is generated by subsystem 0 and is used by all subsystems. As
explained previously, this signal instructs all subsystems to advance all
processes to the successive subsystems. Both of the conditions listed below must
be met to activate "C_SKIP":
1. All subsystems permit programmed initiation
2. A process is currently running in subsystem 0
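The decisions of the Input/Controller Interaction Logic described in this section can be condensed into a few Boolean expressions. The Python sketch below uses hypothetical function and key names; it assumes that a programmed process takes priority over a simple multiplication when both could start in the same cycle, as stated in Section 5.5.6.

```python
def permission(active, bit):
    """Ck_A1I or Ck_API of one subsystem: inactive subsystems always permit."""
    return (not active) or bool(bit)

def interaction_logic(a1i, api, xy, sub0_active, q_a1_avai, q_ap_avai):
    """Combine the per-subsystem lines into the controller decisions.

    a1i, api, xy : lists with the Ck_A1I, Ck_API and Ck_XY lines of subsystems 0..3
    """
    allow_prog = all(api)
    allow_mult = all(a1i)
    start_process = allow_prog and q_ap_avai
    start_mult = allow_mult and q_a1_avai and not start_process
    return {
        "C_SKIP": allow_prog and sub0_active,   # independent of waiting data
        "start_process": start_process,          # new process in subsystem 0
        "C_ADV_A1": start_mult,                  # forward operands for a multiply
        "C_ADV_AP": any(xy),                     # some subsystem reads X and Y
    }
```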
5.6 Data Routing Network 
The Data Routing Network establishes the connections among the Arithmetic
Unit, the Input Interface and both scratchpad and lookup memories.
All routing instructions come from microinstructions executed in the Control
Unit. They specify the source for every destination register A, B, C and
W. A, B and C are the inputs of the Arithmetic Unit and W is the data word sent
to the scratchpad memory. Source registers include the memory outputs M0, M1 and
MR, the Input Interface latches X and Y, the result P from the Arithmetic Unit,
its previous result P-1 delayed by one clock cycle, a copy W of the word sent to
the scratchpad, and the constants 0 and 1.
The delay register P-1 is used to feed two consecutively generated products
back into the Arithmetic Unit. A typical application of this is in the
computation of A*B*C*D where A*B and C*D are calculated consecutively and the
products are fed back to obtain the final result. Chapter 6 describes the
algorithm for this four-operand multiplication in detail.
Since writing data into the scratchpad memory requires two clock cycles,
Register W compensates for the temporary unavailability.
The Data Routing Network is potentially a fairly slow device with a great
hardware complexity. Its design needs to be thought out in great detail to
maintain the processor throughput.
5.7 Output Interface
The Output Interface is added at the end of the Arithmetic Unit and is
responsible for sending final results along with their overflow bits back to the
host. Output is activated by "FI" which must be submitted with the last set of
operands.
5.8 Summary
The components which constitute the processor have been carefully
selected to provide a harmonious interaction among them and achieve a balanced
bandwidth between the Arithmetic Unit, the busses connecting peripheral units
and the connection of the processor to external circuitry. This prototype
represents just a very simple example to demonstrate this kind of architecture
which uses pipelined arithmetic units optimally. This architecture can be
expanded to floating point arithmetic and to multiple or different arithmetic
pipelines to provide high throughput in sophisticated mathematical operations.
Despite achieving high throughput, the major drawback of the processor is
the lack of support for conditional jump statements. If conditional jumps are
included, a more sophisticated Control Unit is necessary for pipeline
management. However, this disadvantage is alleviated by the fact that the results of
every operation sent to the processor will be returned after a fixed delay. This
feature becomes useful for systems which employ multiple processors
interconnected in parallel or in cascade.
The second drawback of this processor is its passive nature. For example,
it cannot address external data. However, host architectures can be designed
easily to utilize this processor optimally.
Figure 5-11 shows the SLSL description of the entire processor including
the LICAM, Controller, Input and Output Interfaces. To simulate this
processor, one should preprocess, compile and then simulate the provided program.
Note that, to be able to simulate host interaction, we read operands from a file
and write the results back to a file. Since SLS is a logic simulator, it does not
take propagation delays into account. In a real implementation, however, one
should be careful about implementing certain elements of the architecture.
Typical examples are the CPA, which in practice needs a fast conditional-carry
or carry look-ahead adder, and the Data Routing Network, which is modeled in
the program as an assembly of 16-to-1 multiplexers but must be a much
more sophisticated network in an actual implementation.
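The division of labor between the carry-save adders (CSA) and the final carry propagation adder (CPA) mentioned above can be illustrated with a minimal Python sketch. The function names and the 32-bit word mask are assumptions made for this illustration; they are not taken from the SLSL program.

```python
MASK32 = (1 << 32) - 1  # assumed 32-bit datapath width


def carry_save_add(a, b, c):
    """Reduce three words to a (sum, carry) pair without propagating carries."""
    partial_sum = (a ^ b ^ c) & MASK32
    # Majority function gives the carry out of each bit; it is shifted left by one.
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return partial_sum, carry


def carry_propagate_add(x, y):
    """Final addition with full carry propagation, modulo 2**32."""
    return (x + y) & MASK32


ps, pc = carry_save_add(1000, 2345, 77)
total = carry_propagate_add(ps, pc)
```

Only the last step needs a carry chain across the whole word, which is why the speed of the CPA matters in hardware while the CSA stages stay fast.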
SYSTEM LICAM_SYSTEM;
/*
   L I C A M - B A S E D   A R I T H M E T I C   P R O C E S S O R
   Written by Georg A. zur Bonsen, March 6, 1988
   Master Thesis Research and Development
   Language used: SLSL - Symbolic Logic Simulation Language
   Department of Computer Science and Electrical Engineering
   Lehigh University, Bethlehem, PA 18015, USA
   Advisor: Prof. M. Wagh
   Input File:  OPERANDS.INP.  Records: a b c d
      a   - 1 if c and d are factors for single multiplication, else 0
      b   - 1 if c and d are operands for microprogrammed operation, else 0
      c,d - operands
   Output File: Overflow Indicator, Result
*/
/* External Signals generated by the Control Unit */
WIRE C_ADV_A1, C_ADV_AP;
REGISTER C_REG[32];
/* External Signals generated by the I/O Queue */
WIRE Q_A1_AVAI, Q_AP_AVAI;
REGISTER X[16], Y[16];
/* External Signals generated by the Network */
WIRE A[17], B[17], C[17], W[17]; /* Bit 16 = Overflow Indicator */
/* External Signals generated by the RAM Subsystem */
REGISTER M0[17], M1[17], MR[16]; /* Bit 16 = Overflow Indicator */
/* External Signals generated by the Arithmetic Unit */
REGISTER P[17];
/* Data is written to trace list only if clock signal is high */
Begin
   Trace := Clock;
End;
/*
   I N P U T   I N T E R F A C E   A N D   Q U E U E        March 4, 1988

   The simplified model describes the input interface to obtain operands
   from the input file called "OPERANDS.INP".

   Inputs:  C_ADV_A1       Advance operands for single multiplications
            C_ADV_AP       Advance operands for microprogrammed operation
   Outputs: Q_A1_AVAI      Operands for single multiplication available
            Q_AP_AVAI      Operands for microprogrammed operation available
            X              Operand X
            Y              Operand Y
            Q_AP_PRES      Indicates availability of both first and
                           successive operands for a prog. operation
   Regs:    Q_XM, Q_YM     Buffer for operands for single multipl.
            Q_XP, Q_YP     Buffer for operands for programmed oper.
            Q_A1_Avai_T    Operands for single multiplication waiting
            Q_AP_Avai_T    Operands for programmed operation waiting
   File-
   Inputs:  Q_For_Mult     Indicates operands for simple multiplic.
            Q_For_Prog     Indicates operands for prog. operation
            Q_X, Q_Y       Operands X and Y obtained from outside
   Wires:   Q_Allow_Input  Permits reading next record from file
            Q_Adv_Mult     Advances operands for mult. to X and Y
            Q_Adv_Prog     Advances operands for prog. to X and Y
*/
WIRE Q_For_Mult, Q_For_Prog,
     Q_X[16], Q_Y[16],
     Q_Adv_Mult, Q_Adv_Prog, Q_Allow_Input;
REGISTER Q_XM[16], Q_YM[16], Q_XP[16], Q_YP[16], Q_A1_Avai_T, Q_AP_Avai_T;
/* Other identifiers are defined at the top of the program */
Begin
READ("OPERANDS.INP", Q_Allow_Input*-Clock,
     Q_For_Mult, Q_For_Prog, Q_X, Q_Y);
Q_Allow_Input := Q_Adv_Mult*Q_For_Mult + Q_Adv_Prog*Q_For_Prog
               + -Q_For_Prog*-Q_For_Mult;
Q_A1_Avai_T, Clock := Mux( Q_Adv_Mult, Q_A1_Avai_T, Q_For_Mult );
Q_AP_Avai_T, Clock := Mux( Q_Adv_Prog, Q_AP_Avai_T, Q_For_Prog );
Q_A1_Avai := Mux( C_Adv_A1, Q_A1_Avai_T, Q_For_Mult ); /* Bypass */
Q_AP_Avai := Mux( C_Adv_AP, Q_AP_Avai_T, Q_For_Prog );
Q_XM, Q_Adv_Prog*-Clock := Q_X; /* Queue Stages */
Q_YM, Q_Adv_Prog*-Clock := Q_Y;
Q_XP, Q_Adv_Prog*-Clock := Q_X;
Q_YP, Q_Adv_Prog*-Clock := Q_Y;
Q_Adv_Mult := C_Adv_A1 + -Q_A1_Avai_T;
Q_Adv_Prog := C_Adv_AP + -Q_AP_Avai_T;
X, Clock := Mux( C_Adv_AP, Q_XM, Q_XP );
Y, Clock := Mux( C_Adv_AP, Q_YM, Q_YP );
End;
/*
   C O N T R O L   U N I T                                  March 3, 1988

   The control unit supervises the operation of the Latched Iterative Cellular
   Array Multiplier and the data flow among memory modules.

   Inputs:  Q_A1_AVAI                 Operands for single multiplication available
            Q_AP_AVAI                 Operands for microprogrammed operation available
   Outputs: C_ADV_A1                  Advance operands for single multiplications
            C_ADV_AP                  Advance operands for microprogrammed operation
            C_REG                     Control register supplying ctrl info to outside
   Regs:    C0_PC .. C3_PC            Microcode Program Counter
            C0_ACTIVE .. C3_ACTIVE    Indicates subsystem executing microcode
   Wires:   C0_INT_BUS .. C3_INT_BUS  Internal Bus (Next microcode word)
            C0_EXT_BUS .. C3_EXT_BUS  External Bus (gated with C*_ACTIVE)
            C0_INHIBIT .. C3_INHIBIT  Forces subsystem to become inactive unless
                                      process is activated from upper subsystem
            C0_PASS .. C3_PASS        Passes process to next subsystem
            C0_A1I .. C3_A1I          Subsystems permitting single multipl.
            C0_API .. C3_API          Subsystems permitting prog. initiation
            C0_XY .. C3_XY            Active if microcode refers to inputs X or Y
            C_SKIP                    Forces all processes to discontinue and
                                      pass control to subsequent control unit
            C_APi                     AND-gated signals C0_APi .. C3_APi
   Memory:  C0_MICRO_MEM .. C3_MICRO_MEM   ROM for microcode

   Bit Pattern of microcode control word
      31 30 29 28   27 26 25 24   23 22 21 20   19 18 17 16
      M0 Address    M1/ROM Addr   Select A      Select B
      15 14 13 12   11 10 09 08   07 06 05 04   03 02 01 00
      Select C      Select Wrt    A1 AP DN NC   NP SP FI HI

   A1 - Allow initiation of a simple multiplication
   AP - Allow new microprogrammed initiation
   DN - Done Indicator, process is finished
   NC - Negate addend C before addition
   NP - Negate product AB+C before summation
   SP - Sum with previous product
   FI - Final result indicator - activate a flag to the outside
   HI - Access upper 16 bits of current LICAM result

   Pre-definition of control signals */
&[ A1 := 7; AP := 6; DN := 5; NC := 4;
   NP := 3; SP := 2; FI := 1; HI := 0; ]
/* ROM with Microcode
   Desired algorithms must be programmed into this ROM */
/* Contents: Application 2c: Product := A * B * C * D, 32 bits
   4 Overlapped Processes are possible */
ROM C0_Micro_Mem(4) [32] = { 00890000H, 00890040H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H,
                             00670002H, 000000E3H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H },
    C1_Micro_Mem(4) [32] = { 00000000H, 00000040H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00670002H, 00000023H,
                             00000000H, 00000000H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H },
    C2_Micro_Mem(4) [32] = { 00000000H, 00000040H, 00000000H, 00000000H,
                             00670002H, 00000023H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H },
    C3_Micro_Mem(4) [32] = { 00000000H, 00000000H, 00670002H, 00000023H,
                             00000000H, 00000000H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H,
                             00000000H, 00000000H, 00000000H, 00000000H };
WIRE &[ For Query:=0 to 3 do ]
        C?_Int_Bus[32], C?_A1i, C?_Pass, C?_Inhibit,
        C?_Ext_Bus[32], C?_Api, C?_XY,
     &[ End ]
        C_Skip, C4_Pass, C_Api;
REGISTER C0_PC[4], C1_PC[4], C2_PC[4], C3_PC[4],
         C0_Active, C1_Active, C2_Active, C3_Active;
Begin
/* Description of each of the four control subsystems
   Note: The question mark will be replaced by the contents of the integer
   variable 'query'. This feature is provided by the preprocessor. The
   preprocessor is an application program of the Symbolic Logic Simulator
   software package.
*/
C_Skip := C0_Active * C0_Int_Bus[%[AP]] * C_Api;
&[ For Query:=0 to 3 do Atsign:=Query+1 ]
/* Description of ?. control subsystem */
C?_Active,Clock := C?_Pass + C?_Active*-C?_Inhibit;
C?_Api := -C?_Active + C?_Int_Bus[%[AP]];
C?_A1i := -C?_Active + C?_Int_Bus[%[A1]];
C?_Int_Bus := C?_Micro_Mem(C?_PC);
C?_Ext_Bus := ANDGATE( C?_Active, C?_Int_Bus );
C?_PC, Clock := ANDGATE(-C?_Int_Bus[%[DN]]*-C_Skip*C?_Active, INC(C?_PC));
C?_Inhibit := ANDSUM(C?_PC) + C?_Int_Bus[%[DN]] + C_Skip;
C?_XY := ( C?_Int_Bus[23] * -C?_Int_Bus[22] * -C?_Int_Bus[21]
         + C?_Int_Bus[19] * -C?_Int_Bus[18] * -C?_Int_Bus[17]
         + C?_Int_Bus[15] * -C?_Int_Bus[14] * -C?_Int_Bus[13]
         + C?_Int_Bus[11] * -C?_Int_Bus[10] * -C?_Int_Bus[ 9] )
         * C?_Active; /* Finds all references to X and Y */
C@_Pass := (ANDSUM(C?_PC) + C_Skip * C?_Active) * -C?_Int_Bus[%[DN]];
&[ End ]
/* Generation of incoming and outgoing control signals */
C_Adv_AP := C0_XY + C1_XY + C2_XY + C3_XY;
C_Adv_A1 := C0_A1i * C1_A1i * C2_A1i * C3_A1i * Q_A1_Avai * -C_Adv_AP;
C_Api := C0_Api * C1_Api * C2_Api * C3_Api;
C0_Pass := C_Api * Q_AP_Avai;
C_Reg,Clock := C0_Ext_Bus + C1_Ext_Bus + C2_Ext_Bus + C3_Ext_Bus
             + ANDGATE(C_Adv_A1,{0:8,1000B:4,1001B:4,0:8,00000010B:8});
             /* Instructs a multiplication of operands X and Y */
End;
/*
   D A T A   R O U T I N G   N E T W O R K                  March 5, 1988

   This network is modeled with multiplexers; many other implementation
   methods exist.

   Inputs:  C_Reg     The Control Register
            X,Y       Inputs X and Y from I/O Queue
            P         Product from LICAM
   Outputs: A, B, C   Operands for LICAM

   Notice: All 17 bit busses contain an overflow indicator and the 16bit data.
           The overflow indicator is generated by the arithmetic unit and is
           carried along until the result is sent out.
*/
REGISTER W_Dlyd[17], /* Data being written to memory remains accessible */
         P_Dlyd[17];
Begin
/* Preprocessor Macro Definition of one set of multiplexers in the
   Data Routing Network */
&[ Macro Network(%Dest,%Range) ]
%Dest := Mux ( C_Reg[%Range], 0:17,   1:17,   M0,   M1,
                              {0,MR}, W_Dlyd, P,    P_Dlyd,
                              {0,X},  {0,Y},  0:17, 0:17,
                              0:17,   0:17,   0:17, 0:17 );
&[ Endm ]
W_Dlyd,Clock := W; /* To bypass scratchpad memory */
P_Dlyd,Clock := P; /* Previous P */
%Network('A','23..20')
%Network('B','19..16')
%Network('C','15..12')
%Network('W','11..8')
End;
/*
   T W O - P O R T   M E M O R Y                            March 6, 1988

   Model of a 2-port memory. Two READ and one WRITE operation may be
   performed simultaneously, assume one delay cycle.

   Inputs:  C_Reg    The Control Register to supply addresses
            W        Data word W
   Outputs: M1, M2   Data words M1 and M2
*/
REGISTER M_Mar0[4], M_Mar1[4], M_Mdrw[17], M_Wrt;
MEMORY M_Ram(4) [17];
/* The ROM is available to supply predefined data */
ROM M_Rom(4) [16]
    = { 0000H, 0000H, 0000H, 0000H, 0000H, 0000H, 0000H, 0000H,
        0000H, 0000H, 0000H, 0000H, 0000H, 0000H, 0000H, 0000H };
Begin
M_Mar0,-Clock := C_Reg[31..28];
M_Mar1,-Clock := C_Reg[27..24];
M_Mdrw,-Clock := W;
M_Wrt ,-Clock := ORSUM(C_Reg[11..8]);
M_Ram(M_Mar0), M_Wrt * Clock := M_Mdrw; /* Writing to RAM */
M0, Clock := M_Ram(M_Mar0); /* Reading RAM0 */
M1, Clock := M_Ram(M_Mar1); /* Reading RAM1 */
MR, Clock := M_Rom(M_Mar1); /* Reading ROM */
End;
/*
   L A T C H E D   C E L L U L A R   A R R A Y              March 5, 1988

   This part of the program describes a 16bit latched iterative cellular array
   multiplier for signed integers. Enhanced features include simultaneous
   addition of a third operand, subtraction and summing a sequential series
   of products together. Additional circuitry detects overflows.

   Baugh-Wooley's method is used to handle twos-complement arithmetic.

   Inputs:  A, B, C   Operands incl overflow bits. A*B+C will be computed
            C_REG     Control signals from control unit
   Outputs: P         Product incl overflow bit
*/
/* Control Signals and Overflow Indicators:
   These signals pass through LICAM stages before actually used */
REGISTER C_SP, C_SP1, C_SP2, C_SP3, C_SP4, C_SP5,
         C_FI, C_FI1, C_FI2, C_FI3, C_FI4, C_FI5, C_FI6,
         C_NP, C_NP1, C_NP2, C_NP3, C_NP4, C_NC,
         C_HI, C_HI1, C_HI2, C_HI3, C_HI4, C_HI5,
         P_OV, P_OV0, P_OV1, P_OV2, P_OV3, P_OV4, P_OV5;
WIRE P_OV6, P_OvLo;
/* The Input terminals A, B and C are directly connected to stage 0 wires.
   Since the array multiplier is latched, the contents of A, B and the sign
   bit of C must be latched into further stages */
REGISTER A_Stg0[16], B_Stg0[16], C_Stg0[16],
         A_Stg1[16], B_Stg1[16], C_Sgn1,
         A_Stg2[16], B_Stg2[16], C_Sgn2,
         A_Stg3[16], B_Stg3[16], C_Sgn3;
/* The following wires represent contents of operand "A" AND-gated with
   the nth bits of operand "B". */
WIRE A_w_B0 [15], A_w_B1 [15], A_w_B2 [15], A_w_B3 [15],
     A_w_B4 [15], A_w_B5 [15], A_w_B6 [15], A_w_B7 [15],
     A_w_B8 [15], A_w_B9 [15], A_w_B10[15], A_w_B11[15],
     A_w_B12[15], A_w_B13[15], A_w_B14[15], A_w_B15[15],
/* Partial Sums generated by every adder inside the cellular array */
     Sum_1 [16], Sum_2 [16], Sum_3 [16], Sum_4 [16],
     Sum_5 [16], Sum_6 [16], Sum_7 [16], Sum_8 [16],
     Sum_9 [16], Sum_10[16], Sum_11[16], Sum_12[16],
     Sum_13[16], Sum_14[16], Sum_15[17], Sum_16[17],
/* Partial Carrys generated by every adder inside the cellular array */
     Carry_1 [16], Carry_2 [16], Carry_3 [16], Carry_4 [16],
     Carry_5 [16], Carry_6 [16], Carry_7 [16], Carry_8 [16],
     Carry_9 [16], Carry_10[16], Carry_11[16], Carry_12[16],
     Carry_13[16], Carry_14[16], Carry_15[17], Carry_16[17];
/* Partial products generated in intermediate stages and latched partial
   sums and carries are passed on to the next stage */
REGISTER P_Prod1[ 4], Sum_4L [16], Carry_4L [16],
         P_Prod2[ 8], Sum_8L [16], Carry_8L [16],
         P_Prod3[12], Sum_12L[16], Carry_12L[16],
         P_HiOrder[17]; /* Upper 16 bits of Product */
REGISTER Prod_PS[32], Prod_PC[32], Next_PS[32], Next_PC[32];
WIRE Intm_PS[32], Intm_PC[32],
     Finl_PS[32], Finl_PC[32], P_Prelim[32], P_Carry;
REGISTER MSB_Intm_PC;
/*
   Arithmetic Subunit: Latched Iterative Cellular Array Multiplier

   Inputs:  A,B,C     Operands
            C_NC      Negate Operand C before addition
            C_NP      Negate intermediate product before summing up
            C_SP      Control Signal Passed to summer: Add with prev. product
            C_FI      Final Result Indicator
   Outputs: Prod_PS   Partial Sum component of AB+C
            Prod_PC   Partial Carry component of AB+C
            C_NP4     C_NP forwarded
            C_SP4     C_SP forwarded
            C_FI4     C_FI forwarded
*/
/* Preparation of Inputs, Operand C is complemented if necessary. Since 2s
   complementation is used, C is incremented by 1 at the 1st CSA module */
Begin
/* Latch before using */
B_Stg0,Clock := B[15..0];
A_Stg0,Clock := A[15..0];
C_Stg0,Clock := XORGATE(C_Reg[%[NC]],C[15..0]);
P_OV0 ,Clock := A[16]+B[16]+C[16]; /* Overflow indicators */
C_NC  ,Clock := C_Reg[%[NC]];
C_NP  ,Clock := C_Reg[%[NP]];
C_SP  ,Clock := C_Reg[%[SP]];
C_FI  ,Clock := C_Reg[%[FI]];
C_HI  ,Clock := C_Reg[%[HI]];
/* Begin of Array -- Stage 0 */
A_w_B0 := ANDGATE( B_Stg0[0], A_Stg0[14..0] );
A_w_B1 := ANDGATE( B_Stg0[1], A_Stg0[14..0] );
CSA( {A_Stg0[15]*-B_Stg0[0],A_w_B0}, {A_w_B1,C_NC}, C_Stg0,
     Sum_1, Carry_1);
A_w_B2 := ANDGATE( B_Stg0[2], A_Stg0[14..0]);
CSA( {A_Stg0[15]*-B_Stg0[1],Sum_1[15..1]}, {A_w_B2,0}, Carry_1,
     Sum_2, Carry_2);
A_w_B3 := ANDGATE( B_Stg0[3], A_Stg0[14..0]);
CSA( {A_Stg0[15]*-B_Stg0[2],Sum_2[15..1]}, {A_w_B3,0}, Carry_2,
     Sum_3, Carry_3);
A_w_B4 := ANDGATE( B_Stg0[4], A_Stg0[14..0]);
CSA( {A_Stg0[15]*-B_Stg0[3],Sum_3[15..1]}, {A_w_B4,0}, Carry_3,
     Sum_4, Carry_4);
/* 1st Latch */
A_Stg1,   Clock := A_Stg0;     /* Operand A passed down   */
B_Stg1,   Clock := B_Stg0;     /* Operand B               */
C_Sgn1,   Clock := C_Stg0[15]; /* Sign bit of Operand C   */
Sum_4L,   Clock := Sum_4;      /* Partial sum             */
Carry_4L, Clock := Carry_4;    /* Partial carry           */
P_Prod1,  Clock := {Sum_4[0],Sum_3[0],Sum_2[0],Sum_1[0]};
                               /* Partial Product         */
C_NP1,    Clock := C_NP;       /* Negate product          */
C_SP1,    Clock := C_SP;       /* Add w/ previous sum     */
C_FI1,    Clock := C_FI;       /* Final Result Indicator  */
C_HI1,    Clock := C_HI;       /* Use upper 16 bits of P  */
P_OV1,    Clock := P_OV0;      /* Overflow signal         */
/* Stage 1 */
A_w_B5 := ANDGATE( B_Stg1[5], A_Stg1[14..0]);
CSA( {A_Stg1[15]*-B_Stg1[4],Sum_4L[15..1]}, {A_w_B5,0}, Carry_4L,
     Sum_5, Carry_5);
A_w_B6 := ANDGATE( B_Stg1[6], A_Stg1[14..0]);
CSA( {A_Stg1[15]*-B_Stg1[5],Sum_5[15..1]}, {A_w_B6,0}, Carry_5,
     Sum_6, Carry_6);
A_w_B7 := ANDGATE( B_Stg1[7], A_Stg1[14..0]);
CSA( {A_Stg1[15]*-B_Stg1[6],Sum_6[15..1]}, {A_w_B7,0}, Carry_6,
     Sum_7, Carry_7);
A_w_B8 := ANDGATE( B_Stg1[8], A_Stg1[14..0]);
CSA( {A_Stg1[15]*-B_Stg1[7],Sum_7[15..1]}, {A_w_B8,0}, Carry_7,
     Sum_8, Carry_8);
/* 2nd Latch */
A_Stg2,   Clock := A_Stg1; /* See Notes in 1st Latch */
B_Stg2,   Clock := B_Stg1;
C_Sgn2,   Clock := C_Sgn1;
Sum_8L,   Clock := Sum_8;
Carry_8L, Clock := Carry_8;
P_Prod2,  Clock := {Sum_8[0],Sum_7[0],Sum_6[0],Sum_5[0],P_Prod1};
C_SP2,    Clock := C_SP1;
C_NP2,    Clock := C_NP1;
C_FI2,    Clock := C_FI1;
C_HI2,    Clock := C_HI1;
P_OV2,    Clock := P_OV1;
/* Stage 2 */
A_w_B9 := ANDGATE( B_Stg2[9], A_Stg2[14..0]);
CSA( {A_Stg2[15]*-B_Stg2[8],Sum_8L[15..1]}, {A_w_B9,0}, Carry_8L,
     Sum_9, Carry_9);
A_w_B10 := ANDGATE( B_Stg2[10], A_Stg2[14..0]);
CSA( {A_Stg2[15]*-B_Stg2[9],Sum_9[15..1]}, {A_w_B10,0}, Carry_9,
     Sum_10, Carry_10);
A_w_B11 := ANDGATE( B_Stg2[11], A_Stg2[14..0]);
CSA( {A_Stg2[15]*-B_Stg2[10],Sum_10[15..1]}, {A_w_B11,0}, Carry_10,
     Sum_11, Carry_11);
A_w_B12 := ANDGATE( B_Stg2[12], A_Stg2[14..0]);
CSA( {A_Stg2[15]*-B_Stg2[11],Sum_11[15..1]}, {A_w_B12,0}, Carry_11,
     Sum_12, Carry_12);
/* 3rd Latch */
A_Stg3,    Clock := A_Stg2; /* See Notes in 1st Latch */
B_Stg3,    Clock := B_Stg2;
C_Sgn3,    Clock := C_Sgn2;
Sum_12L,   Clock := Sum_12;
Carry_12L, Clock := Carry_12;
P_Prod3,   Clock := {Sum_12[0],Sum_11[0],Sum_10[0],Sum_9[0],P_Prod2};
C_SP3,     Clock := C_SP2;
C_NP3,     Clock := C_NP2;
C_FI3,     Clock := C_FI2;
C_HI3,     Clock := C_HI2;
P_OV3,     Clock := P_OV2;
/* Stage 3 */
A_w_B13 := ANDGATE( B_Stg3[13], A_Stg3[14..0]);
CSA( {A_Stg3[15]*-B_Stg3[12],Sum_12L[15..1]}, {A_w_B13,0}, Carry_12L,
     Sum_13, Carry_13);
A_w_B14 := ANDGATE( B_Stg3[14], A_Stg3[14..0]);
CSA( {A_Stg3[15]*-B_Stg3[13],Sum_13[15..1]}, {A_w_B14,0}, Carry_13,
     Sum_14, Carry_14);
A_w_B15 := ANDGATE( B_Stg3[15],-A_Stg3[14..0]);
CSA( {A_Stg3[15]*B_Stg3[15],A_Stg3[15]*-B_Stg3[14],Sum_14[15..1]},
     {-A_Stg3[15],A_w_B15,0}, {-B_Stg3[15],Carry_14}, Sum_15, Carry_15);
CSA( {1,Sum_15[16..1]}, Carry_15, {XORGATE(C_Sgn3,0:16),A_Stg3[15]},
     Sum_16, Carry_16);
/* 4th Latch, End of Array */
Prod_PS, Clock := {Sum_16,Sum_15[0],Sum_14[0],Sum_13[0],P_Prod3};
Prod_PC, Clock := {Carry_16[15..0],B_Stg3[15],0:15}; /* Incl. LSB */
C_SP4,Clock := C_SP3;
C_NP4,Clock := C_NP3;
C_FI4,Clock := C_FI3;
C_HI4,Clock := C_HI3;
P_OV4,Clock := P_OV3;
/*
   Arithmetic Subunit: Carry Save Adders, Carry Propagation Adders,
                       Overflow Detection

   Inputs:  Prod_PS   Partial Sum component of AB+C
            Prod_PC   Partial Carry component of AB+C, already shifted left
            C_NP4     C_NP forwarded
            C_SP4     C_SP forwarded
            C_FI4     C_FI forwarded
   Outputs: P         Final Result of AB+C + Previous Products if requested
            P_DLYD    Low-order result delayed by one additional clock cycle
*/
CSA(XORGATE(C_NP4,Prod_PS),XORGATE(C_NP4,Prod_PC),
    ANDGATE(C_SP4,Next_PS), Intm_PS, Intm_PC);
CSA(Intm_PS, <Intm_PC+{0:31,C_NP4}, <ANDGATE(C_SP4,Next_PC)+{0:31,C_NP4},
    Finl_PS, Finl_PC);
/* 5th Latch after CSA for summing */
Next_PS    ,Clock := Finl_PS; /* Partial Sum passed on */
Next_PC    ,Clock := Finl_PC; /* Partial Carry passed on */
MSB_Intm_PC,Clock := Next_PC[31]*C_SP4 ! Intm_PS[31] ! Intm_PC[31];
                     /* Pass on a signal for overflow detection */
C_SP5, Clock := C_SP4;
C_FI5, Clock := C_FI4;
C_HI5, Clock := C_HI4;
P_OV5, Clock := P_OV4;
/* Final Carry Propagation Addition */
CPA( Next_PS,<Next_PC,0,P_Prelim,P_Carry);
/* Overflow Detection:
   Necessary MSBs are XOR-gated together. The selection of these signals
   is based on extending the carry-save adders beyond their 32 bits.
   Overflow does accumulate from previously affected values and
   previous products
   P_Ov   - Overflow occurred in 32 bit word
   P_OvLo - Overflow occurred in 16 bit word
*/
P_OV6 := MSB_Intm_PC ! Next_PC[31] ! P_Carry ! P_Prelim[31]
       + P_Ov*C_SP5 + P_Ov5;
P_Ov  ,Clock := P_Ov6;
P_OvLo := P_Ov6 + ORSUM(P_Prelim[31..16]) * -ANDSUM(P_Prelim[31..16]);
C_FI6 ,Clock := C_FI5; /* 6th Stage to pass Final Indicator on */
P_HiOrder,Clock := { P_Ov6,P_Prelim[31..16] };
P ,Clock := Mux( C_HI5, {P_OvLo,P_Prelim[15..0]}, P_HiOrder );
End;
/*
   O U T P U T   I N T E R F A C E                          March 6, 1988

   All results will be copied into the output file called "RESULTS.OTP".

   Inputs:  P      Product with overflow indicator
            C_FI6  Final Result Indicator, activates output to file
*/
Begin
Write ("RESULTS.OTP", C_FI6*-Clock, P[16], P[15..0]);
/* That's all you need to have the LICAM-based processor be working! */
End. /* The End */
Figure 5-11: SLSL Listing of LICAM Processor
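As a reading aid for the microcode ROM contents in the listing, the following Python sketch decodes a 32-bit control word into the fields of the bit pattern given in the Control Unit comment. The source names follow the multiplexer order of the Data Routing Network macro; the function and dictionary names are assumptions made for this illustration and do not appear in the SLSL program.

```python
# Bit positions of the single-bit control signals (from the listing).
FLAG_BITS = {"A1": 7, "AP": 6, "DN": 5, "NC": 4, "NP": 3, "SP": 2, "FI": 1, "HI": 0}

# Multiplexer inputs 0..9 of the routing network; codes 10..15 route constant 0.
SOURCES = ["0", "1", "M0", "M1", "ROM", "W_Dlyd", "P", "P_Dlyd", "X", "Y"]


def field(word, high, low):
    """Extract bits high..low (inclusive) of a control word."""
    return (word >> low) & ((1 << (high - low + 1)) - 1)


def source_name(code):
    return SOURCES[code] if code < len(SOURCES) else "0"


def decode(word):
    """Split a microcode word into addresses, operand selects and flags."""
    return {
        "m0_addr": field(word, 31, 28),
        "m1_addr": field(word, 27, 24),
        "sel_a": source_name(field(word, 23, 20)),
        "sel_b": source_name(field(word, 19, 16)),
        "sel_c": source_name(field(word, 15, 12)),
        "sel_w": source_name(field(word, 11, 8)),  # "0" means no write
        "flags": [name for name, bit in FLAG_BITS.items() if (word >> bit) & 1],
    }


first_word = decode(0x00890000)  # first entry of C0_Micro_Mem
```

For example, the first ROM word selects X for operand A and Y for operand B with no flags set, which corresponds to a plain multiplication of the two input operands.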
Chapter 6
Applications of the LICAM Processor
6.1 Introduction 
This chapter describes a few programming examples for the LICAM
processor. The examples include simple algorithms such as computing the inner
product of vectors, squaring a 2x2 matrix, calculating the determinant of a 3x3
matrix, and overlapped algorithms such as computing an eighth-degree
polynomial and demonstrating quadruple-overlapped multiplications of four operands.
These examples are chosen to demonstrate both the strengths and drawbacks of
the processor.
After a brief introduction to the pseudo-code representation of the
algorithms, the listed examples are presented. For each example, a general
approach is proposed which best suits the processor. Following this, a listing of
the pseudo-code and the actual microcode is given.
6.2 Pseudo-code Representation 
A very simple pseudo-code representation can be used to describe the
implementations clearly. The code consists of three principal primitives: Label,
SEND-instruction and write-access to scratchpad memory. The label indicates
the current subsystem (m, 0≤m≤3) and microcode memory location number (n,
0≤n≤F[hex]):
m,n:
At every label, one SEND and one Write-access instruction are allowed. The
SEND-instruction submits operands into the Arithmetic Unit and specifies
optional commands. The eight optional commands directly represent the eight
single-bit control signals. Note that the Arithmetic Unit has a delay of seven
clock cycles until the results become available in output register "P". A
typical Send instruction has the following format:
Send( a, b, c, [optional commands])
where a = Operand A source
      b = Operand B source
      c = Operand C source to calculate AB+C
Optional Commands:
NC = Negate Operand C before adding to AB
NP = Negate Product AB
SP = Add previous result with AB
HI = Instruct the array to return the upper 16-bit
     halfword of the previous result
FI = The returned result is a final value and will
     be sent to the outside
A1 = Permit one straightforward multiplication.
     For example, A1 is used when the next statement
     does not involve the arithmetic unit.
AP = Permit initiation of a new process. AP is used
     to enable overlapped processing.
DN = Done! This process will relinquish after this
     line is executed.
The Write-access instruction specifies the address of the scratchpad
memory location and the source w from which the data is obtained as
M[addr] := w
The operands a, b, c for the Arithmetic Unit and w for the Memory Block may be
assigned to one of the following registers:
0 = Constant 0 
1 = Constant 1 
X = Input Interface X 
Y = Input Interface Y 
M[addr] = Scratchpad Memory 
P = Output of Arithmetic Unit (Product) 
P-1 = Output delayed by one clock cycle (Delay Register)
W = Intermediate Buffer for data being sent to memory.
R[addr] = ROM Look-up table
Since the microcode format has been designed to be compact, the two 
available memory addresses are shared by read and write access to the 
scratchpad memory and the ROM look-up table. Therefore, accessing memory 
blocks may cause a few problems. 
Since two clock cycles are required to access memory, any data sent to
memory will not be accessible in the next clock cycle. Therefore, the register
"W" has been introduced to maintain access to data at all times. In the actual
microcode, please note that the corresponding memory addresses must be
specified in the PREVIOUS microcode locations in order to allow sufficient time
to access data.
All pseudo-code examples incorporate comments in order to make the
algorithms easy to understand. Charts have been attached to facilitate
understanding the pipelined operation of the Arithmetic Unit. Each chart is divided
into a series of rows, which relate to interconnected registers, and a series of
columns referring to clock cycles.
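The translation from a pseudo-code line to a microcode word can be sketched in Python as follows. This is a minimal illustration which assumes the field layout and select codes of the SLSL listing in Figure 5-11; the function name `assemble` and its parameters are hypothetical and not part of the thesis software.

```python
# Select codes assumed from the multiplexer order of the Data Routing Network.
SELECT = {"0": 0, "1": 1, "M0": 2, "M1": 3, "R": 4,
          "W": 5, "P": 6, "P-1": 7, "X": 8, "Y": 9}
FLAG_BITS = {"A1": 7, "AP": 6, "DN": 5, "NC": 4, "NP": 3, "SP": 2, "FI": 1, "HI": 0}


def assemble(a="0", b="0", c="0", w=None, m0_addr=0, m1_addr=0, flags=()):
    """Build one 32-bit microcode word from operand sources and commands.

    m0_addr and m1_addr fill the two shared address fields; w names the
    source written to the scratchpad, or None for no write access.
    """
    word = (m0_addr & 0xF) << 28 | (m1_addr & 0xF) << 24
    word |= SELECT[a] << 20 | SELECT[b] << 16 | SELECT[c] << 12
    if w is not None:
        word |= SELECT[w] << 8
    for name in flags:
        word |= 1 << FLAG_BITS[name]
    return word


plain_multiply = assemble("X", "Y")                # Send (X, Y, 0)
accumulate = assemble("X", "Y", flags=("SP",))     # Send (X, Y, 0, SP)
```

Printing a word with `format(word, "08X")` gives the eight hexadecimal digits in the form used by the microcode tables of this chapter.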
6.3 Inner Product of Vectors
The computation of inner products of vectors maps ideally onto the
processor by means of the SP signal. Given the vectors
X = (x[0], x[1], x[2], ... x[n]) and Y = (y[0], y[1], y[2], ... y[n])
whose successive components are available to the processor at a rate of one per
clock cycle, the following pseudo-code shows how the inner product is obtained:
0,0: Send (X, Y, 0) to compute x[0]y[0];
0,1 to 0,n-1: Send (X, Y, 0, SP) to obtain x[0]y[0]+x[1]y[1];
0,n: Send (X, Y, 0, SP, FI, DN) to complete
     x[0]y[0] + x[1]y[1] + ... + x[n]y[n];
     Finished.
Corresponding Microcode:
Subsystem  Addr  Contents (hex)
    0       0    0 0 8 9 0 0 0 0
    0       1    0 0 8 9 0 0 0 4
    .       .    .
    .       .    .
    0       n    0 0 8 9 0 0 4 6
If n is less than seven, a new inner product will be started automatically
at the top of the Arithmetic Unit while the previous sum of products is not yet
available. If n≥16, then the repeated sequence continues with the first
microcode location in subsystem 1.
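The following Python sketch illustrates the scheme for a vector that fits into one subsystem. It generates the three kinds of microcode words from the table above and models the effect of the SP command as a running sum. The function names are assumptions for this illustration, and pipeline timing is deliberately ignored.

```python
FIRST_WORD = 0x00890000   # Send (X, Y, 0)
MIDDLE_WORD = 0x00890004  # Send (X, Y, 0, SP)
LAST_WORD = 0x00890046    # final word as given in the microcode table


def inner_product_microcode(n):
    """Microcode words for x[0]y[0] + ... + x[n]y[n], with 1 <= n <= 15."""
    if not 1 <= n <= 15:
        raise ValueError("this sketch only covers a single subsystem")
    return [FIRST_WORD] + [MIDDLE_WORD] * (n - 1) + [LAST_WORD]


def run_inner_product(xs, ys):
    """Behavioral model: SP adds the previous result to the new product."""
    accumulator = 0
    for index, (x, y) in enumerate(zip(xs, ys)):
        sum_with_previous = index > 0  # SP is absent in the first Send
        accumulator = x * y + (accumulator if sum_with_previous else 0)
    return accumulator


words = inner_product_microcode(3)
result = run_inner_product([1, 2, 3], [4, 5, 6])
```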
6.4 Squaring a Matrix 
The next example shows how the processor uses the scratchpad memory
to square a 2x2 matrix.
| a  b | 2     | aa + bc   ab + bd |
|      |    =  |                   |
| c  d |       | ac + dc   bc + dd |
Temporary storage in the scratchpad memory becomes necessary because
each operand is referenced more than once. Some extra delays are also needed
since only one operand can be written to scratchpad memory at a time. The
following pseudo-code describes the approach in detail:
0,0: Send (X, X, 0) to calculate a^2;
     M[0] := X to store a;
     Assume a is available to the processor at this time.
0,1: Send (0, 0, 0, SP) to hold aa;
     M[1] := X to store b;
     Assume b is available at clock cycle 1.
0,2: Send (W, X, 0, SP, FI) to calculate aa+bc;
     M[2] := X to store c;
     Assume c is available at clock cycle 2.
0,3: Send (X, M[1], 0) to calculate bd;
     M[3] := X to store d;
     Assume d is available at clock cycle 3.
0,4: Send (0, 0, 0, SP) to hold bd because the next
     instruction accesses two memory locations
     simultaneously, where the addresses are specified
     in this microcode line.
0,5: Send (M[0], M[1], 0, SP, FI) to calculate ab+bd;
0,6: Send (M[0], M[2], 0) to calculate ac;
0,7: Send (M[3], M[2], 0, SP, FI) to calculate ac+dc;
0,8: Send (M[1], M[2], 0) to calculate bc;
0,9: Send (M[3], M[3], 0, SP, FI, DN) to calculate bc+dd;
     Finished.
Microcode Listing:
Subsystem  Addr  Contents (hex)
    0       0    0 0 8 8 0 8 0 0
    0       1    1 0 0 0 0 8 0 4
    0       2    2 1 8 5 0 8 0 6
    0       3    3 0 8 3 0 8 0 0
    0       4    0 1 0 0 0 0 8 4
    0       5    0 2 2 3 0 0 0 6
    0       6    3 2 2 3 0 0 0 0
    0       7    1 2 2 3 0 0 0 6
    0       8    3 0 2 3 0 0 0 0
    0       9    0 0 2 2 0 0 2 6
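The sequence of Send instructions can be checked with a small behavioral model in Python. The sketch below replays the ten Sends as multiply-accumulate steps, where SP adds the previous result and FI emits a final value. It is an illustration only: the function name is hypothetical, pipeline latency is ignored, and the operand pairs are written out directly instead of being fetched through X, W and the scratchpad.

```python
def square_2x2(a, b, c, d):
    """Replay the Sends of the matrix-squaring example; return the FI results."""
    # Each entry: (operand A, operand B, SP flag, FI flag)
    sends = [
        (a, a, False, False),  # 0,0: aa
        (0, 0, True, False),   # 0,1: hold aa
        (b, c, True, True),    # 0,2: aa + bc
        (d, b, False, False),  # 0,3: bd
        (0, 0, True, False),   # 0,4: hold bd
        (a, b, True, True),    # 0,5: ab + bd
        (a, c, False, False),  # 0,6: ac
        (d, c, True, True),    # 0,7: ac + dc
        (b, c, False, False),  # 0,8: bc
        (d, d, True, True),    # 0,9: bc + dd
    ]
    results, accumulator = [], 0
    for operand_a, operand_b, sp, fi in sends:
        accumulator = operand_a * operand_b + (accumulator if sp else 0)
        if fi:
            results.append(accumulator)
    return results


squared = square_2x2(1, 2, 3, 4)
```

The four values are returned in row-major order, so they can be compared directly with the entries of the squared matrix.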
6.5 Determinant Calculation
This simple application demonstrates the capability of the processor to
feed intermediate products back and to perform subtractions.
Given the matrix
| a b c |
| d e f |
| g h i |
we obtain the determinant
D = i(ae-bd) + h(cd-af) + g(bf-ce)
as shown below. Variables a through f are stored in memory locations 1 through
6 respectively.
0,0: M[1] := X to store a;
0,1: M[5] := X to store e;
0,2: Send (M[1], W, 0) to calculate ae;
     M[2] := X to store b;
0,3: Send (X, W, 0, NP, SP) to calculate ae-bd;
     M[4] := X to store d;
0,4: Send (X, W, 0) to calculate cd;
     M[3] := X to store c;
0,5: Send (X, M[1], 0, NP, SP) to calculate cd-af;
     M[6] := X to store f;
0,6: Send (W, M[2], 0) to calculate bf;
0,7: Send (M[3], M[5], 0, NP, SP) to calculate bf-ce;
0,8 to 0,A: No operation. Six multiplications are still
     in progress.
0,B: ae-bd is now available at P-1.
     Send (P-1, X, 0) to calculate i(ae-bd);
0,C: cd-af is now available at P.
     Send (P, X, 0, SP) to calculate i(ae-bd) + h(cd-af);
0,D: bf-ce is not yet available.
     Send (0, 0, 0, SP) to hold i(ae-bd) + h(cd-af);
0,E: bf-ce is now available in P.
     Send (P, X, 0, SP, FI, DN) to calculate
     i(ae-bd) + h(cd-af) + g(bf-ce); Finished.
The corresponding microcode is listed below:
Sub-                              Sub-
system Addr  Contents (hex)       system Addr  Contents (hex)
  0     0    1 0 0 0 0 8 0 0        0     8    0 0 0 0 0 0 0 0
  0     1    5 1 0 0 0 8 0 0        0     9    0 0 0 0 0 0 0 0
  0     2    2 0 3 5 0 8 0 0        0     A    0 0 0 0 0 0 0 0
  0     3    4 0 8 5 0 8 0 C        0     B    0 0 7 8 0 0 0 0
  0     4    3 1 8 5 0 8 0 0        0     C    0 0 6 8 0 0 0 4
  0     5    6 2 8 3 0 8 0 C        0     D    0 0 0 0 0 0 0 4
  0     6    3 5 3 5 0 0 0 0        0     E    0 0 6 8 0 0 2 6
  0     7    0 0 2 3 0 0 0 C
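That the grouping used by the processor yields the ordinary determinant can be confirmed with a short Python check. The sketch compares the three fed-back differences with the usual cofactor expansion along the first row; both function names are assumptions made for this illustration.

```python
def det3_licam(a, b, c, d, e, f, g, h, i):
    """Determinant in the grouping used by the pseudo-code above."""
    first = a * e - b * d    # ae-bd, later multiplied by i
    second = c * d - a * f   # cd-af, later multiplied by h
    third = b * f - c * e    # bf-ce, later multiplied by g
    return i * first + h * second + g * third


def det3_cofactor(a, b, c, d, e, f, g, h, i):
    """Reference: cofactor expansion along the first row."""
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


matrix = (1, 2, 3, 4, 5, 6, 7, 8, 10)
value = det3_licam(*matrix)
```

Each of the three differences needs only one multiplication with NP and SP on top of a plain multiplication, which is why this grouping suits the feedback path of the Arithmetic Unit.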
6.6 Polynomials
The following example demonstrates the initiation of a second process
while the first one is not yet complete. The coefficients a0 through a8 of an
8th degree polynomial are assumed to be in the ROM look-up table.
The simple approach of computing the polynomial f(x) by calculating all powers
of x before multiplying them with the coefficients is inefficient since it does not
permit efficient overlapping. We evaluate this polynomial through the four
partitions of f(x) [52]:
f(x) = (a8x^2 + a7x)x^6 + (a6x^2 + a5x)x^4 + (a4x^2 + a3x)x^2 + (a2x^2 + a1x) + a0
This approach allows five multiplications to be initiated right away: the
four products with the odd-indexed coefficients and the squaring of x. The
following table shows that the major workload is in the first half of the
process. Further, the last multiplication may be delayed in order to pipeline
additional evaluations.
Pass 1:   x*x        a7*x      a5*x      a3*x      a1*x
Pass 2:   x^2*x^2    a8*x^2    a6*x^2    a4*x^2    a2*x^2
Pass 3:                        * x^4     * x^2     + a0
Pass 4:              * x^6
The following pseudo-code initiates the multiplications in the sequence
shown above and performs the necessary additions to assemble the solution. The
algorithm is designed in such a way that data leaving the Arithmetic Unit
will be used immediately, so no temporary storage in the scratchpad memory is
necessary.
Note: Coefficients aO through a8 in the pseudo-code refer to 
ROM 1ocations O through 8. 
0,0: Send (X, X, 0) to ca1cu1ate x 2 ; 
M[O] := X fetch from the input interface and store it; 
0,1: Send (X, a7, 0) to ca1cu1ate a7x; 
0,2: Send (X, aS, 0) to ca1cu1ate aSx; 
98 
. ·, 
.... --- - . 
"-''-' 
0,3: Send (X, a3, 0) to calculate a3x; 
0,4: Send (X, al, 0) to calculate alx; 
0,5 to 0,6: No operations. The five multiplications are still 
• in progress. 
0,7: x 2 is now available in P 
Send (P, P, 0) to calculate x 4 ; 
M[l] := P to store x 2 ; 
0,8: a7x is now available in P 
Send (W, a8, P) to calculate a8x2 +a7x; 
0,9: aSx is now a~ailable in P 
Send (M[l], a6, P) to calculate a6x2+a5x; 
0,A: a3x is now available in P 
Send (M[l], a4, P) to calculate a4x2+a3x; 
0,B: alx is now available in P 
Send (M[l], a2, P) to calculate a2x2+alx; 
0,C-0,D: No operations. The five multiplicati~ns are still 
• in progress. 
0,E: x 4 is now available in P 
M[2] := P to store x 4 ; 
0,F: (a8x2+a7x) is now available in P 
M[3] := P; 
Exceeded microcode memory in subsystem 0. 
continue in subsystem 1. 
Execution will 
1,0: (a6x2+a5x) is now available in P 
Send (M[2], P, 0) to calculate a6x6+aSx5 ; 
1,1: (a4x2+a3x) is now available in P 
Send (M[l], P, 0, SP) to obtain a6x6+a5x5+a4x4+a3x3 
1,2: (a2x2+alx) is pow available in P 
Send (aO, 1, P, SP) to.calculate 
a6x6+a5x5+a4x4+a3x3+a2x2+a1x+a0; 
1,3: Send (M[l], M[2], 0, AP) to calculate x 6 ; 
r 
Note: If operands are available, initiate a new process 
1,4 to 1,9: No operations. Four multiplications are still 
99 
; 
\ 
.. 
• in progress. 
1,A: a6x6+a5x5+a4x4+a3x3+a2x2+a1x+a0 is now in p-l 
x 6 is now available in P 
Send {P, M[3], p-l, DN) into array to get final result 
Finished. 
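The arithmetic of the pseudo-code can be checked in software. The following Python sketch mirrors the four passes and compares the result with Horner's rule. The function names and the coefficient list are illustrative assumptions, and the sketch models only the arithmetic, not the latches, microcode fields or timing of the processor.

```python
def eval_partitioned(a, x):
    """Evaluate a0 + a1*x + ... + a8*x^8 with the four-partition grouping.

    a[i] is the coefficient a_i; exactly nine coefficients are expected.
    """
    if len(a) != 9:
        raise ValueError("expected nine coefficients a0..a8")
    # Pass 1: five independent multiplications.
    x2 = x * x
    p7, p5, p3, p1 = a[7] * x, a[5] * x, a[3] * x, a[1] * x
    # Pass 2: x^4 plus four multiply-accumulate operations.
    x4 = x2 * x2
    g3 = a[8] * x2 + p7
    g2 = a[6] * x2 + p5
    g1 = a[4] * x2 + p3
    g0 = a[2] * x2 + p1
    # Pass 3: scale the middle groups, add a0, and form x^6.
    acc = x4 * g2
    acc += x2 * g1
    acc += a[0] * 1 + g0
    x6 = x2 * x4
    # Pass 4: the last multiplication merges the top group.
    return x6 * g3 + acc


def eval_horner(a, x):
    """Reference evaluation by Horner's rule."""
    result = 0
    for coefficient in reversed(a):
        result = result * x + coefficient
    return result


coefficients = [3, 1, 4, 1, 5, 9, 2, 6, 5]  # hypothetical a0..a8
print(eval_partitioned(coefficients, 2) == eval_horner(coefficients, 2))
```

With integer inputs both evaluations are exact, so any mismatch would point to an error in the grouping rather than to rounding.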
The corresponding microcode is shown below: 
Sub-
system Addr Contents (hex)
0      0    0 7 8 8 0 8 0 0
0      1    0 5 5 4 0 0 0 0
0      2    0 3 2 4 0 0 0 0
0      3    0 1 2 4 0 0 0 0
0      4    0 0 2 4 0 0 0 0
0      5    0 0 0 0 0 0 0 0
0      6    0 0 0 0 0 0 0 0
0      7    1 8 6 6 0 6 0 0
0      8    1 6 5 4 6 0 0 0
0      9    1 4 2 4 6 0 0 0
0      A    1 2 2 4 6 0 0 0
0      B    0 0 2 4 6 0 0 0
0      C    0 0 0 0 0 0 0 0
0      D    0 0 0 0 0 0 0 0
0      E    2 0 0 0 0 6 0 0
0      F    3 2 0 0 0 6 0 0
1      0    0 1 3 6 0 0 0 0
1      1    0 0 3 6 0 0 0 4
1      2    1 2 4 1 6 0 0 4
1      3    0 0 2 3 0 0 4 0
1      4    0 0 0 0 0 0 0 0
1      5    0 0 0 0 0 0 0 0
1      6    0 0 0 0 0 0 0 0
1      7    0 0 0 0 0 0 0 0
1      8    0 0 0 0 0 0 0 0
1      9    3 0 0 0 0 0 0 0
1      A    0 0 6 2 7 0 2 2
6.7 Multiple Overlapped Multiplication A*B*C*D
The rather simple multiplication algorithm a*b*c*d to produce a 32-bit
product demonstrates that four overlapped processes can be run simultaneously
if it is written as:
f := (ab)(cd)
In the beginning, the operand pairs (a, b) and (c, d) are sent into the
array. The intermediate products x and y are available after seven clock cycles
and are fed back to produce f. The pseudo-code is listed below:
0,0: Send (X, Y, 0) to calculate x=ab;
0,1: Send (X, Y, 0, AP) to calculate y=cd;
     Since the next slots are not used by this process, AP
     lets subsystem 0 allow the initiation of another
     four-operand multiplication. If the other subsystems are
     inactive or give the same permission, the process will
     be passed immediately into subsystem 1 so a new process
     can start in subsystem 0. Else, the current process
     continues in this subsystem.
0,2 to 0,7: No operations. The two multiplications are still
     in progress.
0,8: x is now available in P^-1 and y is now available in P
     Send (P, P^-1, 0, FI) to calculate f=xy;
0,9: Send (BI, FI, ~AP, DN) into array to send out the
     upper 16-bit halfword to the outside. Finished.
1,0: No operation. The multiplications are still in progress.
1,1: Send (AP);
     Allow initiation of an overlapped process. If not all
     subsystems allow it, execution does not skip to subsystem
     2 but continues with 1,2.
1,2 to 1,5: No operations. The two multiplications are still
     in progress.
1,6: x is now available in P^-1 and y is now available in P
     Send (P, P^-1, 0, FI) to calculate f=xy;
1,7: Send (BI, FI, DN) into array to send out the upper
     16-bit halfword to the outside. Finished.
2,0: No operation. The multiplications are still in progress.
2,1: Send (AP); Read note in 1,1.
2,2 to 2,3: No operations. The two multiplications are still
     in progress.
2,4: x is now available in P^-1 and y is now available in P
     Send (P, P^-1, 0, FI) to calculate f=xy;
2,5: Send (BI, FI, DN) into array to send out the upper
     16-bit halfword to the outside. Finished.
3,0 to 3,1: No operations. The two multiplications are still
     in progress.
3,2: x is now available in P^-1 and y is now available in P
     Send (P, P^-1, 0, FI) to calculate f=xy;
3,3: Send (BI, FI, DN) into array to send out the upper
     16-bit halfword to the outside. Finished.
The corresponding microcode is listed below: 
Sub-                            Sub-
system Addr Contents (hex)      system Addr Contents (hex)
0      0    0 0 8 9 0 8 0 0     2      0    0 0 0 0 0 0 0 0
0      1    0 0 8 9 0 0 4 0     2      1    0 0 0 0 0 0 4 0
0      8    0 0 6 7 0 0 0 2     2      4    0 0 6 7 0 0 0 2
0      9    0 0 0 0 0 0 E 3     2      5    0 0 0 0 0 0 2 3
1      0    0 0 0 0 0 0 0 0     3      0    0 0 0 0 0 0 0 0
1      1    0 0 0 0 0 0 4 0     3      1    0 0 0 0 0 0 0 0
1      6    0 0 6 7 0 0 0 2     3      4    0 0 6 7 0 0 0 2
1      7    0 0 0 0 0 0 2 3     3      5    0 0 0 0 0 0 2 3
The remaining codes are zero.
The pipeline scheduling table shown in Fig 6-1 demonstrates how overlapped
multiplications are initiated and how they proceed through the arithmetic
unit. This table shows how eight groups of four operands each are
processed. The groups are referenced with numbers 1 through 8 and the
operand pairs are referenced with a and b. In the beginning, four processes are
initiated, one at every other clock cycle. After the fourth process is
initiated, the intermediate products of the first group are already available
and are multiplied together. After the intermediate products of all four groups
have been fed back, four new processes will be initiated. The arithmetic unit is
busy 75% of the time and the throughput for four-operand multiplications is one
every four clock cycles.
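The 75% figure follows from a simple count, sketched here in Python. The function and constant names are illustrative assumptions. The count assumes that each four-operand product occupies three entry slots of the arithmetic unit (ab, cd and their product) and uses the steady-state rate of one result every four clock cycles stated above.

```python
from fractions import Fraction


def au_utilization(mults_per_job, cycles_per_job):
    """Fraction of clock cycles in which an operation enters the arithmetic unit."""
    if mults_per_job < 0 or cycles_per_job <= 0:
        raise ValueError("counts must be positive")
    return Fraction(mults_per_job, cycles_per_job)


MULTS_PER_JOB = 3   # ab, cd, and (ab)(cd)
CYCLES_PER_JOB = 4  # steady-state rate taken from the text

utilization = au_utilization(MULTS_PER_JOB, CYCLES_PER_JOB)
print(f"{float(utilization):.0%}")
```

The remaining quarter of the entry slots is idle in this schedule.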
Fig 6-1: Pipeline scheduling table. LICAM Processor Application: Quadruple
overlapped A*B*C*D.
Rows: Input Interface; Latches X, Y; AU Stages 0 through 5; Stage 6: Product P;
Delayed P^-1; Controller Subsystems 0 through 3. Columns are numbered 0
through 31.
Comments: The numbers 1 .. 8 refer to 8 sets of 4 operands. The letters a and b
found after digits refer to the operand pairs ab and cd respectively. Boxed
numbers: lower halfword of the result is sent out. Boxed "HI": upper halfword
of the result is sent out.
6.8 Concluding Remarks about Applications
The five examples of the previous sections demonstrate most features of the
LICAM processor. All microprograms listed in this chapter have been
implemented and simulated using the simulator of Chapter 3, and they produce
the correct results.
These few examples are not enough to demonstrate the entire spectrum of
capabilities of the processor of Chapter 5. For example, the upper 16 bits of the
product, accessed with the HI-signal, can be used to provide 32-bit and higher
range arithmetic. However, the examples of this chapter show that various
important problems using additions, subtractions and multiplications can be
programmed to run at a very high throughput on the proposed LICAM processor.
Chapter 7 
Conclusion and Future Work 
This project involved designing and verifying a LICAM processor.
Automated tools to express and simulate large parallel digital designs on a PC
were developed. The design was shown to be flexible, running a variety of
algorithms, and yet simple and modular, which makes it suitable for VLSI
implementation.
7.1 Design Summary
The simulator software developed here consists of a preprocessor and a
compiler, which support the proposed hardware description language SLSL, and
a document generator. Besides the simulator itself, the combination of
preprocessor and compiler permits compact and simple descriptions of large
digital systems. The simulator is designed for both interactive and batch use
and exhibits high performance even for large hardware descriptions. Finally,
the document generator may be used to customize output listings and numeric
representations in different bases.
The proposed LICAM processor provides a functional environment for a
modified latched iterative cellular array multiplier in order to compute small
microcoded algorithms which consist of additions, subtractions and
multiplications. The auxiliary components in the processor include a scratchpad
memory, look-up tables and a controller. The controller was designed to exploit
the full capacity of the cellular array and achieves very high throughput by
permitting multiple simultaneous operations. A variety of microcoded algorithms
for this processor were designed and tested successfully.
7.2 Future Work 
The work reported in this thesis is open to further improvement and
expansion. The following sections list some likely candidates for further
enriching the hardware description language SLSL, the simulation package and the
LICAM processor.
7.2.1 Future Work on SLSL and Design Automation Software 
SLSL was designed mainly for a dedicated simulator to prove the
functionality of the LICAM processor. SLSL is implemented with TURBO-C on
PC compatible machines and has proven to be capable of simulating large
hardware descriptions at high speed. However, the present limitations of SLSL
can be alleviated by future improvements. A few suggestions for improvements
are described below.
Currently, SLSL accepts compact descriptions of hierarchically organized
and structured hardware. The preprocessor converts the structured design
into a flat, non-hierarchical layout, and the compiler generates the necessary
object files for the simulator. A significant drawback of this is that the
hardware description consists of two separate kinds of text: preprocessor
insertions, which are in charge of hierarchical structures and repetitive logic,
and compiler statements, which describe discrete hardware elements. Too
frequent use of preprocessor inserts leads to a loss of readability.
Improvements can be made to reduce this syntax gap as shown below:
Current syntax:
&[ Macro BCD_Adder( %a, %b, %c, %sum, %carry) ]
Begin
•
•
End;
&[ End ];
Proposed improvement:
Macro BCD_Adder(a, b, c, sum, carry);
Begin
•
•
End;
A timing verifier to support SLSL should be written. This program would
measure the signal propagation delays through combinational logic and return
estimated worst-case delays. Gate delays, fan-ins and other physically related
parameters should be obtained from technology files. Some problems may occur
with high-level reserved functions such as adders, where the architecture is not
fully defined. An additional file may be needed to supply the missing
architectures. Besides the timing verifier, the simulator should be expanded to
provide an option to incorporate propagation delays.
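One way such a verifier could estimate worst-case delays is a longest-path search through the combinational netlist. The Python sketch below illustrates the idea only: the netlist format, the signal names and the delay values are hypothetical stand-ins for what a technology file would supply, and the logic is assumed to be free of combinational loops.

```python
def worst_case_delays(netlist, gate_delay):
    """Return the latest arrival time of every gate output.

    netlist maps an output signal to (gate_type, [input signals]).
    Signals that are not keys of netlist are primary inputs (arrival 0).
    """
    arrival = {}

    def arrival_time(signal):
        if signal not in netlist:
            return 0  # primary input
        if signal not in arrival:
            gate_type, inputs = netlist[signal]
            latest_input = max((arrival_time(s) for s in inputs), default=0)
            arrival[signal] = gate_delay[gate_type] + latest_input
        return arrival[signal]

    for signal in netlist:
        arrival_time(signal)
    return arrival


# Hypothetical one-bit full adder and hypothetical delays in arbitrary units.
full_adder = {
    "p": ("XOR", ["a", "b"]),
    "sum": ("XOR", ["p", "cin"]),
    "g": ("AND", ["a", "b"]),
    "t": ("AND", ["p", "cin"]),
    "cout": ("OR", ["g", "t"]),
}
delays = {"XOR": 3, "AND": 2, "OR": 2}
print(worst_case_delays(full_adder, delays))
```

Fan-in and loading effects are ignored here; a real verifier would fold them into the per-gate delay.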
Presently, the simulator supports two discrete data levels, 0 and 1.
However, many digital systems incorporate tri-state buffers and bi-directional
busses. Advanced data levels such as high-impedance, unknown and ambiguous
should be added. The introduction of these levels may reduce simulation speed
significantly, but provides a better understanding of signal propagation.
Given the new data levels, new constructs and reserved functions such as
tri-state buffers should become available to support high-impedance.
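A minimal Python sketch of such an extension is given below. It is an illustration under stated assumptions, not a proposal for the SLSL implementation: the level names '0', '1', 'Z' and 'X' are chosen here, and the unknown and ambiguous levels are merged into a single 'X' for brevity.

```python
def tristate(enable, data):
    """Tri-state buffer: drive data when enabled, float when disabled."""
    if enable == "1":
        return data
    if enable == "0":
        return "Z"
    return "X"  # enable itself is not a clean logic level


def resolve(drivers):
    """Resolve all values driven onto one bus strand."""
    active = [d for d in drivers if d != "Z"]
    if not active:
        return "Z"  # nobody drives the strand
    if "X" in active:
        return "X"
    if all(d == active[0] for d in active):
        return active[0]
    return "X"  # conflicting drivers


def and_gate(a, b):
    """AND over four levels: a 0 dominates, otherwise both inputs must be 1."""
    if a == "0" or b == "0":
        return "0"
    if a == "1" and b == "1":
        return "1"
    return "X"


bus = resolve([tristate("1", "1"), tristate("0", "0")])
print(bus)
```

Every gate evaluation now needs a table look-up or a chain of comparisons instead of a single machine-level Boolean operation, which is where the expected loss of simulation speed comes from.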
Currently, SLSL allows both low-level structural descriptions of discrete
logic elements and high-level functional descriptions of larger blocks such as
adders and memory arrays. Yet the middle levels, including the capability to
separate the controller(s) and use a different description methodology, are not
supported at all. Some improvements are suggested to support one alternative
approach to describe the controller, where the added language constructs do not
deviate too far from direct hardware descriptions.
Further design automation software such as silicon compilers, placement
and routing software and other programs which support a specific VLSI
technology can be incorporated in order to obtain the final chip from an SLSL
description.
The approach of the simulator to imitate hardware parallelism can be
improved with computer systems which use parallel processing. Multiple
execution machines can execute statements simultaneously. Ideally, a computer
architecture can be designed specifically for running simulations.
An already existing system is the MARS system [53], which is a slide-in circuit
board for Sun Microsystems computers.
In conclusion, it may be noted that the Symbolic Logic Simulator (SLS)
and its language (SLSL) can be expanded widely. However, at a certain point,
adding more features, such as taking propagation delays into account, may
result in loss of performance.
7.2.2 Improvements to the LICAM Processor 
The proposed LICAM processor is designed to demonstrate a simple
architectural structure which can be expanded in order to make it ready for
commercial use. Some suggestions for improvements and expansions are listed
below.
The current design employs one dual-access scratchpad memory and one
look-up table which share only two microcode-supplied addresses. The examples
shown in Chapter 6 still show bottleneck situations where three different
addresses are required, so that two clock cycles are necessary to complete all
data transfers before the next arithmetic operation can be scheduled. A more
flexible addressing scheme is necessary to achieve better throughput.
Modifications to the LICAM processor can be made to incorporate
divisions and square rooting. One method would be to add an independent and
throughput-optimized division/square rooting array or to replace the present
cellular array by a generalized array. Generalized arrays support all
fundamental arithmetic operations including square rooting. With this
modification, the processor will be able to handle virtually any fundamental
mathematical operation.
A modified LICAM processor which supports floating-point rather than
integer arithmetic becomes very suitable for running Taylor series to obtain
approximate solutions of elementary functions. The look-up table can be expanded
to hold sets of Taylor series coefficients so no time is wasted transferring
coefficients and the processor may directly compute the desired function, for
example a trigonometric function.
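The idea can be sketched in Python for the sine function. The table `SIN_COEFFS` plays the role of the expanded look-up table; its length (eight terms) and the Horner-style evaluation in x^2 are assumptions made for this illustration, not part of the proposed hardware.

```python
import math

# Taylor coefficients of sin(x): x - x^3/3! + x^5/5! - ...
# Stored once, as they would be in a look-up table.
SIN_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(8)]


def sin_taylor(x):
    """Approximate sin(x) from the stored coefficients.

    Uses only multiplications and additions: a Horner scheme in x^2
    followed by one final multiplication by x.
    """
    x2 = x * x
    acc = 0.0
    for coefficient in reversed(SIN_COEFFS):
        acc = acc * x2 + coefficient
    return x * acc


print(sin_taylor(0.5), math.sin(0.5))
```

Eight terms are enough for small arguments; larger arguments would first be reduced to a small interval, which is outside the scope of this sketch.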
Prospective floating-point processors must take both exponents and
mantissas into account. Multiplications are handled very easily, but a
significant amount of hardware is required to accept one incoming
floating-point summand per clock cycle and still deliver the final sum very
quickly. The challenge lies in aligning the mantissas.
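The alignment step can be illustrated with a short Python sketch. It is deliberately simplified: summands are (mantissa, exponent) pairs with integer mantissas, all mantissas are shifted to the smallest exponent so that no bits are lost, and normalization and rounding are left out. The function name and data format are assumptions made for the illustration.

```python
def align_and_add(terms):
    """Add summands given as (mantissa, exponent), meaning mantissa * 2**exponent.

    Every mantissa is shifted left to the smallest exponent before the
    integer addition, so the result is exact.
    """
    if not terms:
        raise ValueError("at least one summand is required")
    min_exp = min(exponent for _, exponent in terms)
    total = sum(mantissa << (exponent - min_exp) for mantissa, exponent in terms)
    return total, min_exp


mantissa, exponent = align_and_add([(3, 2), (5, 0), (1, 4)])
print(mantissa * 2.0 ** exponent)  # 3*4 + 5 + 16
```

Hardware with a fixed mantissa width would instead shift the smaller summands right, towards the largest exponent, and would have to deal with the bits that fall off; that shifter is the costly part.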
The input and output interfaces must be designed to allow smooth
interaction with external processors and peripheral hardware. Possible
communication methods range from clocked data transfers and classic handshaking
methods up to high-speed packet transmissions. In packet transmissions, host
processors would place the requested operation, the operands, and the return
address into a packet and expect the LICAM processor to return the results in a
different packet.
7.2.3 Interfacing the LICAM Processor to the Outside
The following paragraphs suggest some methods to integrate the LICAM
processor into existing computing equipment. This processor is intended to
serve the entire domain of computers, which ranges from PCs to mainframes.
The most primitive application would be to introduce an interface module
which connects one LICAM processor to one host processor. The interface
module reserves memory or I/O port locations to specify the type of operation
and the operands and to read the returned results. After a set of operands has
been sent out, the host processor can work on other tasks until the results
become available.
Typically, a single host processor does not fully exploit the LICAM
processor since mathematical operations are not requested all the time. In
order to allow multiple host processors to share the same 'co-processor', a more
advanced interfacing unit becomes necessary. The main problem concerns designing
optimum interface logic and protocols to permit maximum performance. One
example, which was mentioned in the previous section, recommends fast
packet-switched communications.
The assumption for the previous suggestions was that only single
computations are performed. However, precious host-processor time is wasted if
whole arrays of operands are sent to the LICAM processor. In this case, a
different interface unit should be designed. This unit occupies one memory bank,
i.e. 64K bytes, which is used to deposit operands. The available DMA controller
can take care of the massive data transfers. After all operands are downloaded
to the reserved memory space, the interface will be in charge of sending
operands into one or multiple LICAM processors running in parallel and
retrieving the results. This architecture would shift the burden of scheduling a
task from the host processor to the interface unit to improve the system
performance significantly.
References 
[1] Hayes, John P., Computer Architecture and Organization, McGraw-Hill, 
New York, 1978. 
[2] Suzuki, Norihisa, "Concurrent Prolog as an Efficient VLSI Design 
Language", Computer, pp. 33-39, Vol. 18, No. 2, February 1985. 
[3] Maruyama, Fumihiro, "Hardware Verification", Computer, pp.
22-32, Vol. 18, No. 2, February 1985.
[4] Stewart, J. H., "LOGAL: A CHDL for Logic Design and Synthesis",
Computer, pp. 18-26, Vol. 10, No. 6, June 1977.
[5] Su, Stephen Y. H., "Hardware Description Language Applications -- An 
Introduction and Prognosis", Computer, pp. 10-13, Vol. 10, No. 
6, June 1977. 
[6] Su, Stephen Y. H., "A Survey of Computer Hardware Description Lan-
guages in the U.S.A.", Computer, pp. 45-51, Vol. 7, No. 
12, December 1974. 
[7] Bell, C. Gordon; Newell, Allen, Computer Structures: Readings and 
Examples, McGraw-Hill, New York, 1981. 
[8] Siewiorek, Dan, "Introducing ISP", Computer, pp. 39-44, Vol. 7, No. 
12, December 1974. 
[9] Hill, Frederick J.; Peterson, Gerald R., Digital Systems: Hardware Or-
ganization and Design, John Wiley & Sons, New York, 1978. 
[10] Chu, Yaohan, "Introducing CDL", Computer, pp. 31-33, Vol. 7, No.
12, December 1974.
[11] Piloty, Robert, "Hardware Description Languages in the Federal Republic
of Germany", Computer, pp. 57-59, Vol. 7, No. 12, December 1974.
[12] Dietmeyer, D. L., "Introducing DDL", Computer, pp. 34-38, Vol. 7, No.
12, December 1974.
[13] Microsim Corporation, PSPICE Manual, 23175 La Cadena Drive, 
Laguna Hills, CA, 1986. 
[14] Vladimirescu, A., Department of Electrical Engineering and Computer 
Science, University of California, SPICE Version 2G User's Guide, 
Berkeley, CA, 1986.
[15] German, Steven M; Liebherr, Karl J., "Zeus: A Language for Expressing 
Algorithms in Hardware", Computer, pp. 55-65, Vol. 18, No. 
2, February 1985. 
[16] Dasgupta, Subrata, "Hardware Description Languages in Microprogramming
Systems", Computer, pp. 67-76, Vol. 18, No. 2, February 1985.
[17] Lipovski, G. J., "Hardware Description Languages: Voices from the
Tower of Babel", Computer, pp. 14-17, Vol. 10, No. 6, June 1977.
[18] Mano, M. Morris, Digital Logic and Computer Design, Prentice-Hall,
Englewood Cliffs, NJ, 1979.
[19] Hill, Frederick J., "Introducing AHPL", Computer, pp. 27-30, Vol. 7, No.
12, December 1974.
[20] Duley, James R., "A Digital System Design Language (DDL)", IEEE
Transactions on Computers, pp. 850-861, Vol. C-17, No.
9, September 1968.
[21] Piloty, Robert, et al, CONLAN Report, Lecture Notes in Computer
Science, Springer Verlag, Berlin (West), 1983.
[22] Piloty, Robert, "The CONLAN Project: Concepts, Implementations and
Applications", Computer, pp. 81-92, Vol. 18, No. 2, February 1985.
[23] Shahdad, Moe, et al, "VHSIC Hardware Description Language",
Computer, pp. 96-103, Vol. 18, No. 2, February 1985.
[24] Proceedings of the 23rd ACM/IEEE Design Automation Conference,
1986, Paper 17.1
[25] zur Bonsen, Georg A., Lehigh University, Symbolic Logic Simulator, an 
Introduction, Bethlehem, PA, 1988. 
[26] Hwang, Kai, Computer Architecture and Parallel Processing, McGraw-
Hill, New York, 1984. 
[27] Hellerman, Herbert, Digital Computer System Principles, McGraw-Hill,
New York, 1967. 
[28] Hwang, Kai, Computer Arithmetic, Principles, Architecture, and Design, 
John Wiley & Sons, New York, 1979. 
[29] Sloan, Martha E., Computer Hardware and Organization, SRA (Science 
Research Associates), Chicago, 1976. 
[30] Ahmad, S. Imtiaz, Fung, Kwok T., Introduction to Computer Design and 
Implementation, Computer Science Press of University of Windsor, 
Windsor, U.K, 1981. 
[31] Kline, Raymond M., Structured Digital Design including MSI/LSI
Components and Microprocessors, Prentice-Hall, Englewood Cliffs, NJ, 1983.
[32] Abd-Alla, Abd-Elfattah, Principles of Computer Design, Prentice Hall,
Englewood Cliffs, NJ, 1976.
[33] Booth, Andrew D., "A Signed Binary Multiplication Technique",
Quarterly Journal of Mechanics and Applied Mathematics, pp.
235-240, Vol. IV, Part 2, 1951.
[34] Wallace, C. S., "A Suggestion for a Fast Multiplier", IEEE Transactions
on Computers, pp. 14-17, Vol. EC-13, No. 2, February 1964.
[35] Hallin, Thomas G., Flynn, Michael J., "Pipelining of Arithmetic
Functions", IEEE Transactions on Computers, pp. 880-886, Vol.
C-21, No. 8, August 1972.
[36] Habibi, A., Wintz, P. A., "Fast Multipliers", IEEE Transactions on
Computers, pp. 153-157, Vol. C-19, No. 2, February 1970.
[37] Pezaris, Stylianos D., "A 40ns 17-Bit by 17-Bit Array Multiplier", IEEE
Transactions on Computers, pp. 442-448, Vol. C-20, No. 4, April 1971.
[38] Baugh, Charles R., Wooley, Bruce A., "A Two's Complement Parallel
Array Multiplication Algorithm", IEEE Transactions on Computers, pp.
1045-1047, Vol. C-22, No. 12, December 1973.
[39] Agrawal, Dharma P., "High-Speed Arithmetic Arrays", IEEE
Transactions on Computers, pp. 215-224, Vol. C-28, No. 3, March 1979.
[40] Hwang, Kai, "Global and Modular Two's Complement Cellular Array
Multipliers", IEEE Transactions on Computers, pp. 300-306, Vol.
C-28, No. 4, April 1979.
[41] Nakamura, Shinji, "Algorithms for Iterative Array Multiplication", IEEE
Transactions on Computers, pp. 713-719, Vol. C-35, No. 8, August 1986.
[42] Davio, M., et al, Digital Systems with Algorithm Implementation, John
Wiley & Sons, Chichester, 1983.
[43] Denyer, Peter; Renshaw, Daniel, VLSI Signal Processing - A Bit-serial
Approach, Addison-Wesley, Reading, MA, 1985.
[44] Guild, H. H., "Fully Iterative Fast Array for Binary Multiplication and
Addition", Electronics Letters, pp. 263, Vol. 5, No. 122, June 1969.
[45] Takagi, Naofumi; Yasuura, Hiroto, "High-Speed VLSI Multiplication
Algorithm with a Redundant Binary Addition Tree", IEEE Transactions on
Computers, pp. 789-796, Vol. C-34, No. 9, September 1985.
[46] Stenzel, William J., et al, "A Compact High-Speed Parallel Multiplication
Scheme", IEEE Transactions on Computers, pp. 948-957, Vol. C-26, No.
10, October 1977.
[47] Deverell, John, "Pipeline Iterative Arithmetic Arrays", IEEE
Transactions on Computers, pp. 317-322, Vol. C-24, No. 3, March 1975.
[48] Kamal, A. K., et al, "A Generalized Pipeline Array", IEEE Transactions
on Computers, pp. 533-536, Vol. C-23, No. 5, May 1974.
[49] Chen, Tien Chi, "A Binary Multiplication Scheme Based on Squaring",
IEEE Transactions on Computers, pp. 678-680, Vol. C-20, No.
6, June 1971.
[50] Johnson, Everett L., "A Digital Quarter Square Multiplier", IEEE
Transactions on Computers, pp. 258-261, Vol. C-29, No. 3, March 1980.
[51] Hall, Ernest L., et al, "Generation of Products and Quotients Using
Approximate Binary Logarithms for Digital Filtering Applications", IEEE
Transactions on Computers, pp. 97-105, Vol. C-19, No. 2, February 1970.
[52] zur Bonsen, Georg A., Lehigh University, Optimum Use of Latched Cel-
lular Array Multipliers to Compute Polynomials, Bethlehem, PA, 1987, 
Term Paper for ECE 415 - Numeric Processing 
[53] Agrawal, Prathima et al, "MARS: A Multiprocessor-Based Programmable
Accelerator", IEEE Design and Test of Computers, pp. 28-36, Vol. 4, No. 5,
October 1987.
[54] Aho, Alfred V., et al, Compilers - Principles, Techniques and Tools,
Addison Wesley, Reading, Massachusetts, 1986.
Appendix A 
Summary of Symbolic Logic Simulation 
Language SLSL 
A.1 Declaration of Identifiers
Type of Identifiers                  Associated Information
WIRE (Single strands, busses)        Bit width, initial value
    Example: WIRE A, Data_bus[16], Addr_bus[24]=1;
REGISTER (Pos. edge triggered)       Bit width, initial value
    Example: REGISTER Stack_Ptr[16]=0FFFFH;
MEMORY (Pos. edge triggered)         Addr & data bit widths,
                                     initial contents
    Example: MEMORY Ram64KByte(16)[8];
ROM (Read-only memory)               Addr & data bit widths,
                                     initial contents
    Example: ROM Prime_Num(3)[16] = {2,3,5,7,11,13,17};
Memory elements are referenced by their identifiers and expressions to
specify their addresses:
    MEM1(Addr)
Expressions for the clock inputs must be supplied when data is to be written
to registers and memory arrays.
    REG1, Clkexpr
    MEM1(Addr), Clkexpr
A.2 Statements
• := expression; 
• 
:= expression; 
SLSL statements assign combinations of source identifiers to one or more
destination identifiers. The compiler assures that the bit widths of the
destination identifiers are consistent with the bit widths of the logic
combinations.
A.3 Constants 
dst := nA:w;
where dst = destination identifier
      n   = value
      A   = radix (binary, base-4, octal, decimal, hex)
      w   = bit width (defaults to 1 if not specified)
Example: 10100B = 110N = 24O = 20D = 20 = 14H
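The notation can be exercised with a small Python sketch. It assumes that the radix letters are B, N, O, D and H as listed in the Radix rule of Section A.9, that a constant without a radix letter is decimal, and that the last character decides the radix (so a hexadecimal constant always ends in H). The function name `parse_constant` is hypothetical and not part of the SLSL tools.

```python
RADIX = {"B": 2, "N": 4, "O": 8, "D": 10, "H": 16}


def parse_constant(text):
    """Parse an SLSL-style constant 'nA' or 'nA:w' into (value, bit_width)."""
    body, separator, width_text = text.partition(":")
    width = int(width_text) if separator else 1  # width defaults to 1
    body = body.strip().upper()
    if not body:
        raise ValueError("empty constant")
    if body[-1] in RADIX:
        digits, base = body[:-1], RADIX[body[-1]]
    else:
        digits, base = body, 10
    return int(digits, base), width  # int() rejects digits invalid for the base


for constant in ["10100B", "110N", "24O", "20D", "20", "14H"]:
    print(constant, parse_constant(constant)[0])
```

All six spellings from the example line yield the same value, twenty.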
A.4 Operators
Binary Operators:
dst := a * b;        AND gate
dst := a + b;        OR gate
dst := a ! b;        XOR gate
dst := a = b;        XNOR gate
Unary Operators:
Symbol    Description
-         1's complement
~         2's complement
<         Shift left. Most significant bit will be discarded
>         Shift right. Most significant bit becomes zero
<<        Shift left without losing most significant bit.
          Word size increases by one bit
>>        Shift right. Word size decreases by 1 bit.
List of Operator Precedences:
Highest:  1: -, ~, <, >, <<, >>     (unary operators)
          2: *                      (AND gate)
          3: !, =                   (XOR and XNOR gates)
Lowest:   4: +                      (OR gate)
A.5 Splicing and Concatenation
For an n-bit word, bit n-1 is the most significant bit and bit 0 is the least
significant bit. Assume the variables src, src1 and src2 are defined as eight bit
busses.
Statement                            Width   Description
dst := src[5];                       1       Use bit 5 only
dst := src[3..0];                    4       Lowest four bits
dst := src[3,5,7];                   3       Resplice 3 bits
dst := src[6..0,7];                  8       Boolean rotate left
dst := (src1 + src2)[6..0,7];        8       Combination with OR gate
dst := { src1, src2, src3 };         24      Concatenation
dst := { src1[7], src2[6..0] };      8       Combination of the above
A.6 Structural Reserved Functions
Array of two-input gates with one common input:
dst := ANDGATE(common, src)
dst := ORGATE(common, src)
dst := XORGATE(common, src)
dst := XNORGATE(common, src)
Single horizontal gates connecting all strands of specified busses:
dst := ANDSUM(src)
dst := ORSUM(src)
Miscellaneous Functions:
dst := EVENPARITY(src)
dst := ODDPARITY(src)
dst := DECODE(src)
dst := MUX( select, src0, src1 ... )
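The splice notation of Section A.5 can be modelled in a few lines of Python. The sketch assumes that the listed bit positions are taken from left to right as most significant to least significant bits of the result, which is consistent with src[6..0,7] being described as a rotate left. The helper names `splice` and `bit_range` are hypothetical.

```python
def bit_range(high, low):
    """Expand the SLSL-style range high..low into a list of bit indices."""
    return list(range(high, low - 1, -1))


def splice(value, bits):
    """Build a new word from the listed bit positions of value, MSB first."""
    result = 0
    for bit in bits:
        result = (result << 1) | ((value >> bit) & 1)
    return result


src = 0b10010110
rotated = splice(src, bit_range(6, 0) + [7])  # src[6..0,7]
print(f"{rotated:08b}")
```

The width of the result is simply the number of listed positions, which matches the Width column of the table.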
A.7 High-Level Functional Reserved Functions
Carry-propagation and Carry-save adders: 
CPA( a, b, carry_in, sum, carry_out) 
CSA( a, b, c, partial_sum, partial_carry) 
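The behaviour of the two adder types can be illustrated in Python. The sketch assumes the usual textbook definitions: the carry-save adder produces a bitwise sum and a bitwise majority carry, where carry bit i has the weight of bit i+1. Whether SLSL stores the partial carry shifted or unshifted is not stated here, so the unshifted convention below is an assumption, as are the function names and the explicit width parameter.

```python
def csa(a, b, c, width):
    """Carry-save adder: returns (partial_sum, partial_carry) for three words."""
    mask = (1 << width) - 1
    partial_sum = (a ^ b ^ c) & mask
    partial_carry = ((a & b) | (a & c) | (b & c)) & mask
    return partial_sum, partial_carry


def cpa(a, b, carry_in, width):
    """Carry-propagation adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), (total >> width) & 1


a, b, c = 0b1011, 0b0110, 0b1101
partial_sum, partial_carry = csa(a, b, c, 4)
# The three inputs add up to the sum word plus twice the carry word.
print(a + b + c == partial_sum + 2 * partial_carry)
```

No carry ripples inside the carry-save adder, which is why arrays of such cells can be latched and pipelined row by row.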
Generalized Adders: 
Generalized CPAs and CSAs accept both positively and negatively weighted
inputs. The input type_sel selects between type 0/3 and type 1/2 adders.
Detailed information is available in [28].
CPAC( a, b, carry_in, type_sel, sum, carry_out)
CSAC( a, b, c, type_sel, partial_sum, partial_carry)
Miscellaneous Functional Blocks:
dst := WEIGHT(src)       Hamming weight, counts active bits
dst := INC(src)          Incrementer
dst := DEC(src)          Decrementer
A.8 Additional Support Functions 
Bus Function emulating bi-directional busses: 
dst := BUSFUNC( oe1, src1, oe2, src2, ... )
Binary Communication Channel with Random Noise:
dst := CHANNEL( src, t01, t10)
src = data being transmitted
t01 = transition probability index from 0 to 1
t10 = transition probability index from 1 to 0
File Access Facility: 
READ( file name, clock, dst, dst, ... ) 
WRITE(file name, clock, src, src, ... ) 
A.9 Coinpiler Parsing Syntax 
The modified Backus-Naur explained below is used to describe the parsing • 
syntax of SLSL [54]: 
Program 
··-.. 
I 
{ } 
[ ] 
( \ 
Var Dec1 
-
Mem Dec 
-
.. -
.. -
. ·-. 
.. -
.. -
Assignment symbol: 'is defined as ... ' 
Parse left OR right item 
) 
Enclosed items may repeat zero or more times 
Enclosed items are optional 
Used to enclose multiple items ·as one item 
"SYSTEM" Id Unknown";" Var Deel Block 
- -{ " ; " Block } " . " 
(Note: Identifier is name of system) 
{ Mem Deel I Reg Deel I Wire Deel " . " } I 
-
- -
[ ''MEMORY'' I "ROM" ] Mem Element { " " ' I 
-Mem Element 
-
} 
119 
; 
) 
I 
y 
Mem E1ement 
-
Reg Dec1 
-
Wire Deel 
-
Reg E1ement 
-
Block 
Assignment 
C1ock Spec 
-
Bin Mem 
-
Bin Expr 
-
Merged Expr 
-
Simp1e Expr 
-
Summand 
Product 
.. -
.. - :Cd Unknown " ( " Number " ) " [ " [ " Number " ] "· ] 
-[ "=" " { " Number { " , " Number } " } " 
.. -
.. - "REGISTER" Reg E1ement { 
-
" " 
' 
Reg Element} 
-
: : = "WIRE" Reg_ E1ement { ", " Reg_ Element } 
.. -
.. - :Cd Unknown [ " [" Number "] '' ] [ "=" Number ] 
-
::= "BEGIN" { Read Op I Write pP I Cpa Op I 
- - -Csa Op I Cpac Op I Csac pP I Assignment 
- - -
" ; " } "END" 
: : = Id Wire I ( Bin Mem C1ock Spec ) I 
- - -( Id Register Clock Spec ) ":=" Bin Expr 
-
- -
: : = " , " Bin Expr 
-
: : = Id Memory " (" Bin Expr ") " 
- -
: : = Simple Expr I Merged Expr 
- -
.. -
.. -
.. -
.. -
" {" Simple Expr { ", " Simple Expr } 
-
-
Swmnand { "+"Summand} 
: : = Product { "I" I "=" Product } 
: := · Signed Factor { "*" Signed Factor } · 
-
-
"}" 
Signed Factor::= 
-
{ "-" I 
Factor 
"N" I "<" I ">" I "<<" . I ">>'' } 
Factor ::= Num Expr I Id Wire I Id Register 
- - -1· Bin Mem I " (" Bin Expr ") " I Res Funes 
- - -I Merged Expr [Subscripts] 
-
Num Expr : : = Number [ " : " Number ] 
-
Subscripts : : = " [" Subrange { ", " Subrange } "] " 
( 
Subrange : : = Number [ " . . " Number ] 
-
::= Gating Funes I Basics I Mux I Busfunc I 
-
Res Funes 
Channel I Weight 
Gating Funes::::; "ANDGA'l'E" I "ORGATE" l."XORGATE" I "XN'ORGATE" . -
"(" Bin Expr "," Bin Expr ")" 
- -
: 120 
.. 
/ 
,· 
'. ( 
' . 
. 
. ,ii 
Basics        ::= "ANDSUM" | "ORSUM" | "ODDPARITY"
                  | "EVENPARITY" | "DECODE" | "INC" | "DEC"
                  "(" Bin_Expr ")"
Mux           ::= "MUX" "(" Bin_Expr
                  { "," Bin_Expr } ")"
Busfunc       ::= "BUSFUNC" "(" Bin_Expr { "," Bin_Expr } ")"
Channel       ::= "CHANNEL" "(" Bin_Expr "," Bin_Expr ","
                  Bin_Expr ")"
Weight        ::= "WEIGHT" "(" Bin_Expr ")"
Read_Op       ::= "READ" "(" Quote Clock_Spec "," Id_Wire
                  { "," Id_Wire } ")"
Write_Op      ::= "WRITE" "(" Quote Clock_Spec "," Bin_Expr
                  { "," Bin_Expr } ")"
Cpa_Op        ::= "CPA" Adder_Args1
Csa_Op        ::= "CSA" Adder_Args1
Adder_Args1   ::= "(" Bin_Expr "," Bin_Expr "," Bin_Expr ","
                  Id_Wire "," Id_Wire ")"
Cpac_Op       ::= "CPAC" Adder_Args2
Csac_Op       ::= "CSAC" Adder_Args2
Adder_Args2   ::= "(" Bin_Expr "," Bin_Expr "," Bin_Expr ","
                  Bin_Expr "," Id_Wire "," Id_Wire ")"
Number        ::= Hexdigits { Hexdigits } [ Radix ]
Id_Unknown    ::= Identifier
                  (Note: Must be undefined at that point)
Id_Wire       ::= Identifier
                  (Note: Must be previously declared as WIRE)
Id_Register   ::= Identifier
                  (Note: Must be declared as REGISTER)
Id_Memory     ::= Identifier
                  (Note: Must be declared as MEMORY or ROM)
Identifier    ::= Letters { Letters | Digits }
Quote         ::= """ Chars { Chars } """
Chars         ::= All visible ASCII characters except """
Letters       ::= "A" | "B" | ... | "Z" | "$" | "_"
Digits        ::= "0" | "1" | ... | "9"
Hexdigits     ::= "0" | "1" | ... | "9" | "A" | "B" | ... | "F"
Radix         ::= "B" | "N" | "O" | "D" | "H"
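The Identifier and Subrange rules above are simple enough to check mechanically. The following Python sketch is only an illustration and is not part of the SLSL implementation: the function names `is_identifier` and `parse_subrange` are hypothetical, folding lower-case letters to upper case is an assumption, and Subrange numbers are read as plain decimals because the radix handling is not spelled out here.

```python
import re

# Identifier ::= Letters { Letters | Digits }
# Letters    ::= "A" .. "Z" | "$" | "_"
_IDENTIFIER = re.compile(r"[A-Z$_][A-Z$_0-9]*")


def is_identifier(text: str) -> bool:
    """Return True if text matches the Identifier rule.

    Lower-case input is folded to upper case first (an assumption).
    """
    return _IDENTIFIER.fullmatch(text.upper()) is not None


def parse_subrange(text: str):
    """Parse Subrange ::= Number [ ".." Number ] into a (low, high) pair.

    Numbers are read as decimals (an assumption); a single number
    is returned as a one-element range.
    """
    first, separator, last = text.partition("..")
    low = int(first)
    high = int(last) if separator else low
    return low, high
```

For instance, `is_identifier("INDEX_1")` holds while `is_identifier("1ABC")` does not, since an identifier must start with a letter.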
Appendix B
Summary of Preprocessor Commands 
B.1 Preprocessor Control Fields 
&[ ... ]        Preprocessor control statements
%[ ... ]        Enclosed expression is calculated and inserted
                into the flat SLSL file
%Identifier     Reference to MACRO, local identifier,
                transfer identifier
?, @, #         Values of "Query", "Atsign" and "Sharp" are
                inserted into the flat SLSL file
B.2 Identifiers and Operators 
The only identifier types used by the preprocessor are unsigned integers,
macro block names, and local and transfer identifiers. The integer and macro
block identifiers are completely independent of any generic SLSL statement
or identifier. For example, "Index" declared as an 8-bit register in SLSL and
as an integer in the preprocessor code are treated as two separate objects.
No separate declaration block for identifiers is required. Integer variables
are automatically defined at their first occurrence on the left side of an
assignment.
The following operators are available for unsigned integers:

Inversions:          -    2's complementation (negation)
(Top Precedence)     ~    1's complementation (bit inversion)

Dot Operations:      *    Multiplication
(2nd Level)          /    Division
                     &    Logical AND

Intermediate Ops:    !    Exclusive OR
(3rd Level)          <<   Binary shift left
                     >>   Binary shift right
                     %    Modulo

Line Operations:     +    Addition
(4th Level)          -    Subtraction
                     |    Inclusive OR

Relational Ops:      <    Smaller than
(Least Precedence)   <=   Smaller than or equal to
(Returns             =    Equal to
1 if true,           >=   Greater than or equal to
0 if false)          >    Greater than
                     <>   Not equal to
Parentheses can be used to circumvent precedence rules.
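To make the precedence levels concrete, the following Python sketch evaluates expressions level by level, from relational operators down to inversions. It is an illustration under stated assumptions, not the preprocessor's actual code: the 32-bit word size (`WIDTH`), the spellings `~` for bit inversion and `!` for exclusive OR, the restriction to decimal literals without variables, and all function and class names are assumptions made for this example.

```python
import re

WIDTH = 32                     # assumed word size; the source does not state one
MASK = (1 << WIDTH) - 1

# Binary operators grouped by precedence level, highest level first.
DOT = {"*": lambda a, b: a * b, "/": lambda a, b: a // b,
       "&": lambda a, b: a & b}
INTERMEDIATE = {"!": lambda a, b: a ^ b, "<<": lambda a, b: a << b,
                ">>": lambda a, b: a >> b, "%": lambda a, b: a % b}
LINE = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
        "|": lambda a, b: a | b}
RELATIONAL = {"<": lambda a, b: int(a < b), "<=": lambda a, b: int(a <= b),
              "=": lambda a, b: int(a == b), ">=": lambda a, b: int(a >= b),
              ">": lambda a, b: int(a > b), "<>": lambda a, b: int(a != b)}

_TOKEN = re.compile(r"\s*(\d+|<<|>>|<=|>=|<>|[-~*/&!%+|<>=()])")


def tokenize(text):
    """Split an expression into number, operator and parenthesis tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        self.pos += 1
        return token

    def binary_level(self, operand, operators):
        """Parse operand { operator operand }, left to right, for one level."""
        value = operand()
        while self.peek() in operators:
            value = operators[self.take()](value, operand()) & MASK
        return value

    def expression(self):      # relational operators (least precedence)
        return self.binary_level(self.line_ops, RELATIONAL)

    def line_ops(self):        # 4th level
        return self.binary_level(self.intermediate, LINE)

    def intermediate(self):    # 3rd level
        return self.binary_level(self.dot_ops, INTERMEDIATE)

    def dot_ops(self):         # 2nd level
        return self.binary_level(self.factor, DOT)

    def factor(self):          # inversions (top precedence)
        token = self.take()
        if token in ("-", "~"):
            value = self.primary(self.take())
            return (-value if token == "-" else ~value) & MASK
        return self.primary(token)

    def primary(self, token):
        if token == "(":
            value = self.expression()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if token.isdigit():
            return int(token) & MASK
        raise ValueError(f"unexpected token {token!r}")


def evaluate(text):
    """Evaluate an unsigned-integer expression and return its value."""
    parser = Parser(tokenize(text))
    value = parser.expression()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return value
```

Note that the shift operators bind more tightly than addition here, so `evaluate("1+2<<3")` groups as 1 + (2 << 3), whereas `evaluate("(1+2)<<3")` uses parentheses to force the addition first.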
B.3 Preprocessor Statements
Assignments:      &[ dest_integer := expr ]
IF clause:        &[ IF expr THEN ... END ]
IF-ELSE clause:   &[ IF expr THEN ... ELSE ... END ]
WHILE-loop:       &[ WHILE expr DO ... END ]
REPEAT-loop:      &[ REPEAT ... UNTIL expr ]
FOR-loop:         &[ FOR assignment TO expr DO ... END ]
FOR-STEP loop:    &[ FOR assignment TO expr STEP expr DO ... END ]
HALT-statement:   &[ HALT ]
MACRO begin:      &[ MACRO Macro_name( %var1, %var2, ... ) ]
MACRO end:        &[ ENDM ]
The dotted fields refer to SLSL statements, or to temporarily closing the
preprocessor control field in order to incorporate SLSL code.
A note on macro blocks:
Declared parameters can be referenced throughout the macro block by
prefixing them with a "%". Undeclared variables with a preceding "%" are
considered local variables.
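The effect of a FOR loop on the flat SLSL file can be pictured as repeated text substitution. The Python sketch below is a simplified model only: the function `expand_for` is hypothetical, the inclusive upper bound is an assumption, and only a bare `%[variable]` field is substituted (the real preprocessor evaluates a full expression inside `%[ ... ]`). The SLSL line used as the loop body is likewise made up for the example.

```python
def expand_for(variable, start, stop, body, step=1):
    """Unroll a FOR-style loop into a list of flat text lines.

    Each pass replaces the field %[variable] in body with the
    current loop value. The bound stop is treated as inclusive,
    which is an assumption of this sketch.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    lines = []
    value = start
    while value <= stop:
        lines.append(body.replace(f"%[{variable}]", str(value)))
        value += step
    return lines


# Hypothetical loop body: one wire assignment per index.
for line in expand_for("i", 0, 3, "W%[i] := R%[i];"):
    print(line)
```

With a step of 2 the same call would emit only the lines for indices 0 and 2.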
B.4 Preprocessor Parsing Syntax
The following parsing syntax applies to preprocessor control fields only.
Note that regular SLSL code is considered a statement separator and is
treated similarly to a ";".
The parser differentiates among four different preprocessor control blocks:
1. Preprocessor statements, enclosed by &[ ... ]
2. Numeric expressions for SLSL code, enclosed by %[ ... ]
3. Macro names and local variables, starting with a %-sign
4. Alternative numeric expressions with "?", "#" and "@".
The parsing syntax contains some references to the list shown above.
These references, given as numbers from 1 to 4 and enclosed in angle
brackets, indicate the type of control block in which the parsed statement
must be located.
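The four kinds of control block can be told apart by their first characters. The following Python sketch shows one way to do so; the function name `control_block_type` and the choice to return `None` for plain SLSL code are assumptions made for this illustration.

```python
def control_block_type(text):
    """Return the control block kind (1 to 4) that text starts with.

    1: preprocessor statement      &[ ... ]
    2: numeric expression          %[ ... ]
    3: macro name or local variable, starting with a %-sign
    4: alternative numeric expression, one of ? # @
    Returns None if text starts with none of these (plain SLSL code).
    """
    if text.startswith("&["):
        return 1
    if text.startswith("%["):
        return 2
    if len(text) > 1 and text[0] == "%" and (text[1].isalpha() or text[1] in "$_"):
        return 3
    if text[:1] in ("?", "#", "@"):
        return 4
    return None
```

The test for `%[` must come before the test for a bare `%`, since both kinds of block start with the same character.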
The modified Backus-Naur notation described in section A.9 on page 119 
is used to describe the parsing syntax of the preprocessor language: 
Sourcecode  ::= { Macro | Statement } EOF
Macro       ::= "MACRO"<1> Id_Unknown "(" Parameter { ","
                Parameter } ")" Statement "ENDM" Separator
Parameter   ::= "%" Id_Unknown
                (Note: These parameters can be referenced inside
                the block with single percent symbols.)
Statement   ::= { Mode1 | Mode2 | Mode3 | Mode4 | Separator }
Mode1       ::= Condition | For_Loop | While_Loop |
                Repeat_Loop
                | Assignment | "HALT" <Separator>
                (Note: Now parse SLSL code)
Mode2       ::= Expression
Mode3       ::= Id_Unknown | Id_Transfer |
                ( Id_Macro "(" [ "'" Statement "'" { ","
                "'" Statement "'" } ] ")" )
                (Note: SLSL code continues now)
Mode4       ::= Id_Integer
                (Note: SLSL code continues now)
Condition   ::= "IF" Expression "THEN" Statements [ "ELSE"
                Statements ] "END"
For_Loop    ::= "FOR" Assignment "TO" Expression [ "STEP"
                Id_Integer ] "DO" Statements "END"
While_Loop  ::= "WHILE" Expression "DO" Statements "END"
Repeat_Loop ::= "REPEAT" Statements "UNTIL" Expression
Assignment  ::= [ Id_Integer | Id_Unknown ] ":=" Expression
Expression  ::= Simple_Expr { "=" | "<" | "<=" | ">" |
                ">=" | "<>" Simple_Expr }
Simple_Expr ::= Summand { "+" | "-" | "|" Summand }
Summand     ::= Product { "!" | "%" | "<<" | ">>" Product }
Product     ::= Factor { "*" | "/" | "&" Factor }
Factor      ::= [ "-" | "~" ] Id_Integer | "(" Expression ")"
                | Number
Id_Unknown  ::= Identifier
                (Note: Identifier must not be declared)
Id_Integer  ::= Identifier
                (Note: Identifier must be declared as Integer)
Id_Macro    ::= Identifier
                (Note: Identifier must be declared as Macro)
Id_Transfer ::= Identifier
                (Note: Identifier must be declared as transfer
                identifier which references transferred
                parameters)
Id_Local    ::= Identifier
                (Note: Identifier must be pre-declared as
                local variable)
Identifier  ::= Letters { Letters | Digits }
Number      ::= Hexdigits { Hexdigits } [ Radix ]
Letters     ::= "A" | "B" | ... | "Z" | "$" | "_"
Digits      ::= "0" | "1" | ... | "9"
Hexdigits   ::= "0" | "1" | ... | "9" | "A" | "B" | ... | "F"
Radix       ::= "B" | "N" | "O" | "D" | "H"
Vita 
The author was born on February 17, 1965, to Doris J. and Rudolf F. zur
Bonsen in Duesseldorf, West Germany.
He received his high-school diploma with honors at the American
International School of Duesseldorf in June of 1983. He matriculated at Lehigh
University in the fall of 1983 and graduated in June of 1986 with a Bachelor of
Science degree with great honors in Computer Engineering and a minor in
Economics. He continued his studies at the same university in the field of
Computer Architecture under the direction of Professor Meghanad D. Wagh. He
expected to receive his Master of Science degree in Electrical Engineering in
June of 1988. The author is a member of Tau Beta Pi, Eta Kappa Nu, and the
Institute of Electrical and Electronics Engineers.
