Research in the design of high-performance reconfigurable systems by Slotnick, D. L.
  
 
 
N O T I C E 
 
THIS DOCUMENT HAS BEEN REPRODUCED FROM 
MICROFICHE. ALTHOUGH IT IS RECOGNIZED THAT 
CERTAIN PORTIONS ARE ILLEGIBLE, IT IS BEING RELEASED 
IN THE INTEREST OF MAKING AVAILABLE AS MUCH 
INFORMATION AS POSSIBLE 
https://ntrs.nasa.gov/search.jsp?R=19860002430 2020-03-20T16:10:23+00:00Z
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801
TITLE
Research in the Design of High-Performance
Reconfigurable Systems
Fourth
Semiannual Status Report
April 1, 1985 -- September 30, 1985
NASA Grant # NAG 5-377
(NASA-CR-176275) RESEARCH IN THE DESIGN OF
HIGH-PERFORMANCE RECONFIGURABLE SYSTEMS
Semiannual Status Report, 1 Apr. - 30 Sep.
1985 (Illinois Univ., Urbana-Champaign.)
38 p HC A03/MF A01 CSCL 09B N86-11897 Unclas G3/61 28461
Project Personnel 
Graduate Research Assistants
Scott D. McEwan
Gregory J. Smith
Andrew J. Spry
Principal Investigator
D. L. Slotnick
Table of Contents
1. Introduction and Summary
2. Control Levels of the RELAPSE
3. Latest Bit Processor
3.1 Bit Processor Layout
3.2 Queue Register and Scratch Pad Memory
4. Richards' Method Update
5. Multiple Input Adders
Appendix: MPP Simulator Programmer's Documentation
1. Introduction and Summary.
The initial control and programming philosophies of the RELAPSE are
discussed in Section 2. A block diagram showing the relationship of the Arithmetic
Units (composed of Stages and Bit Processors) to the Functional Units and other
components of the RELAPSE is used to guide this discussion. The latest version of the
Bit Processor design is presented in Section 3. Included in this section is a detailed
discussion of the Bit Processor's new scratch pad memory component. The section
also clarifies the usage of the Bit Processor's processing registers and Input/Output
functions. The final design phase of the Arithmetic Unit is underway with a study of
the Proposed IEEE Floating Point Standard. The decisions on conformance to this
standard will be used as inputs into the finalization of the designs of the Bit Processor,
Stage, and Arithmetic Units of the RELAPSE.
Section 4 is an update to the previous Semi-Annual Status Report. It
details the failure of the method by Richards to do multiple input addition. Section 5
of the report deals with two practical multiple input adders and compares them to a
two input uniprocessor in both time and cost.
An appendix containing the MPP Simulator Programmer's Documentation
concludes the report. The more general simulation
tools of the ASW simulator, from which the MPP Simulator is derived, will be used to
simulate the Functional Units of the RELAPSE system. The simulations
of the RELAPSE will be constructed using the functions of the ASW described in the
documentation.
2. Control Levels of the RELAPSE.
Figure 2.1 shows a block diagram of the functional decomposition of the
RELAPSE computer system. The diagram will be used to detail the current thinking
and areas of study in the control structure and programming of the RELAPSE system.
The figure shows three levels of control: the system level, the functional unit level,
and the arithmetic unit level. An earlier stage level control has been determined
unnecessary and has been dropped to simplify the overall design.
The system control level is responsible for arbitrating communications and
data flow between functional units, scheduling input and output, and scheduling
usage of the shared memory resources. Although the system controller is shown as a
centralized control in Figure 2.1, it will in practice be a distributed control. The
functional unit control level is responsible for communication within the functional
units. This includes the sequencing and synchronization of operations on the
arithmetic units and the usage of any internal functional unit buses. The
arithmetic unit control level is responsible for controlling and sequencing the
micro-operations of the stages, routing logic, and bit processors to perform the basic
arithmetic of the RELAPSE system.
The existence of three distinct levels of control implies that a single language
implementation would be impractical for the RELAPSE system. Such a single
language implementation would have to provide the ability to program the system
from the level of the single bit operations of the stages and bit processors up to the
high level linear algebra oriented user interface. The three "language type" system
under consideration would parallel the control level structure of the RELAPSE. This
language structure will allow the individual languages of the RELAPSE to be more
easily tailored to specific tasks and to provide control level dependent features and
optimizations that would be impractical in a single language system.
[Figure 2.1: Control Block Diagram of the RELAPSE. Legible labels: I/O Units 0
through n-1; a Control Unit; Functional Units, each with a Functional Unit Controller,
a Primitive Memory (ROM, RAM), and a Loadable Memory (RAM); and Linear Arrays of
Stages with their Stage Memories.]
The arithmetic unit language type is a microcode language that will allow the
expression of the full capabilities of both the horizontal mode operation of stages and
the vertical mode operation of bit processors. This language provides the basic
arithmetic operations of the arithmetic units and the capability to customize this
basic arithmetic. Arithmetic unit level programs consist of ROM based microcode.
The customizations provided will allow multiple word formats, numerous roundoff
techniques, and access to important sub-arithmetic operations such as logical
operations, shifting operations, routing operations, and operations on sub word
formats such as floating point mantissas and exponents. The set of customizations will
allow the programming of arithmetic operations on new and nonstandard word
formats to be done from the functional unit level.
The functional unit language type is a high level assembly language that must
provide the convenient expression of scalar, vector, and matrix operations as well as
the inter-arithmetic unit communication within the functional unit. The language
will provide primitives for synchronization and be extensible to different functional
unit architectures. As shown in Figure 2.1, the functional unit control memory is
divided into two segments: a primitive memory and a loadable memory. The primitive
memory is used to store the machine code for the basic operations of the functional
unit such as bus control protocols and basic vector or matrix arithmetic operations.
This control memory defines the basic functionality of the unit and will be either
stored in ROM or loaded at system boot time. The second segment of the memory is the
loadable memory. This segment allows the extension of a basic functional unit to
some specific task. In a RELAPSE system there may be a number of functional units
that have the same basic underlying architecture and primitives. It is the loadable
memory that will hold the algorithms that distinguish these otherwise identical
functional units. This memory can also be used to reconfigure a single functional
unit to do different processing tasks for different jobs. This memory will usually be
loaded at the start of a processing job.
The system level language type will provide synchronization and control flow
primitives that allow the full use of the multiprocessing capabilities of the functional
units. It will also allow the development of a mathematically oriented user interface
that will contain as its heart a linear algebra oriented programming language for the
RELAPSE system. It is this user interface and linear algebra programming language
that will provide the Linear Algebra processing system of the RELAPSE.
Programming the RELAPSE system will require the generation of programs
for two of the three control levels of the system. For most applications a new system
level program will be required in the linear algebra programming language. A
number of system wide parameters (such as the round off algorithm) will be set via the
user interface. Modification of these parameters will be allowed either during
processing or between jobs without recompiling the application. For most
applications the built in programs of the functional units will be used without
modification. If some function required by an algorithm is not provided by a
functional unit, a new algorithm may be written at the functional unit level.
3. Latest Bit Processor.
3.1 Bit Processor Layout.
The updated bit processor layout is shown in Figure 3.1. The major changes
from the layout presented in the third Semi-Annual Status Report are the reduction of
the number of processing registers by 1, the introduction of a scratch pad register
memory, and a clarification of register inputs. The scratch pad memory is discussed
in detail in the next section. The remainder of the changes are summarized below.
The general purpose registers of the bit processor have been designated as
processing registers. It is assumed that the contents of these registers will be
destroyed or modified by any horizontal or vertical mode algorithm. It is for this
reason that the scratch pad memory was added to the bit processor design as a location
for temporary values. The use of the processing register inputs and outputs has
been clarified. The specific uses are given in Table 3.1. The input lines to these
registers have been divided up as follows: line 1 is for vertical mode arithmetic, line
2 is for horizontal mode arithmetic, line 3 is for input from the a bus, and line 4 is for
input from the b bus. Any register (except the mask register) can be placed on the o
bus. The remaining output uses of the registers are specific to vertical and horizontal
mode arithmetic and are explained in Table 3.1 and Table 3.2.
The remaining clarifications deal with the input and output from the bit
processor itself. The bit processor has four input/output ports. Two of these ports are
devoted to the two bank memory. One port is devoted to I/O directly with the
processing section of the BP. This port makes the registers of the BP directly
accessible from outside the BP. The final I/O port is devoted to I/O with the memory of
the BP's. This port provides an I/O path to the memories without using any BP
processing registers.
[Figure 3.1: Bit Processor Layout. (a) BP Data Connections. (b) BP Processor Section.]
Data Source and Destination of the BP Processing Registers.
Reg. | Sources of Input | Destinations of Output
r0 | The carry bit from the sum carry adder, one bit of the sum from the add ROM, the a bus, and the b bus. | The o bus, one bit of the add and multiply ROM address, and 1 bit of the sum carry adder input.
r1 | The sum bit from the sum carry adder, one bit of the low order byte of the product, the a bus, and the b bus. | The o bus, one bit of the add and multiply ROM address, and the q register input.
r2 | The q register, one bit of the high order byte of the product, the a bus, and the b bus. | The o bus, one bit of the add and multiply ROM address, and 1 bit of the sum carry adder input.
r3 | The routing logic, the a bus, and the b bus. | The o bus, one bit of the add and multiply ROM address, the equivalence function, the routing logic, and 1 bit of the sum carry adder input.
m | The bit and stage level masks, the a bus, and the b bus. | The equivalence function, and the bit processor masks.
Table 3.1: Source and Destination of Processing Register Data.
BP Input and Output Points.
Input/Output Number | Bit is To or From
1 | To sum-or tree, and zero detect logic.
2 | One bit of the high order byte of the add or multiply ROM address (horizontal mode).
3 | One bit of the low order byte of the add or multiply ROM address (horizontal mode).
4 | To the routing logic.
5 | One bit of the sum from the add ROM (horizontal mode).
6 | One bit of the low order byte of the product from the multiply ROM (horizontal mode).
7 | One bit of the high order byte of the product from the multiply ROM (horizontal mode).
8 | From the routing logic.
9 | Currently unused.
10 | Bit Processor bit level masks.
11 | Stage and arithmetic unit level masks.
Table 3.2: Input and Output Points of the BP Shown in Figure 3.1.
3.2 Queue Register and Scratch Pad Memory.
The Queue and Scratch Pad Memory (Q/SP), shown in Figure 3.2, is a
component of the Bit Processor. It is a combination of a fixed length queue register
and a scratch pad memory. The proposed total size of the queue and scratch pad
portions of the unit is 32 bits. The boundary between the two portions is software
reconfigurable by setting the length of the queue portion. The queue's length is
decoded from a 5 bit control value supplied by the control unit.
The queue portion of the unit receives its input from the BP's processing
register r1 and sends its output to the BP's processing register r2 (Figure 3.2(c)). On
each shift cycle of the queue the input is placed in the queue's tail bit and the
queue's head bit is available as the output. The bit output can either be loaded into
the r2 register or neglected (i.e. when the queue is being filled for the first time).
Each internal bit of the queue receives the value of the preceding bit in the queue
and makes its contents available for the succeeding bit in the queue during each shift
cycle (Figure 3.2(a)). The tail of the queue is considered bit 0 and the head is
considered bit $l-1$ where $l$ is the length of the queue. Thus it requires N shift cycles
of the queue to move a bit through a queue of length N.
The scratch pad portion of the unit receives its input from the BP's output bus
(o-bus) and its output can be placed on either of the BP's input buses (a-bus or
b-bus). Each bit of the scratch pad memory is individually addressable. The read and
write addresses of the scratch pad memory are decoded from 5 bit values supplied by
the external control unit. A write control signal is set on any cycle that will perform
a write operation, while the read operation can be done on any cycle. Read and write
operations of the scratch pad memory can be performed simultaneously. If the same
cell of the scratch pad is read from and written into on the same cycle, the value read
will be the previous contents of the cell and the new contents of the cell will be the
value written in that cycle. The scratch pad memory thus acts like a small dual ported
random access memory.
[Figure 3.2: Queue and Scratch Pad Memory (Q/SP). (a) Queue/scratch pad cell.
(b) Queue/scratch pad unit. (c) Data connections to the BP. Panel titles are only
partly legible in the source.]
The software configurable boundary between the queue and scratch pad
memory portions of the Q/SP unit provides hardware protection against writing into
the queue via a scratch pad memory write. In the event of a scratch pad write
operation to the portion of the Q/SP unit that is currently the queue, the write
operation will simply fail and the contents of that bit of the queue will remain
unchanged (provided no queue operation also occurred on that cycle that would
change the value). Figure 3.2(a) shows that no other hardware boundary checking is
done in the Q/SP itself. It is up to the controller to generate an interrupt signal if a
scratch pad write operation is not in the legal address range. There is currently no
hardware protection against reading from the queue portion of the Q/SP unit via a
scratch pad read. This is not a design feature, however, and may not be retained in the
final design of the Q/SP unit.
In summary the Q/SP unit of the Bit Processor (Figure 3.2(b)) is a 32 bit
register that can function as a 1-32 bit "shift register" queue, as a 1-32 bit scratch pad
memory, or as a 1-32 bit combination of both. The register has two data inputs and
two data outputs (black arrows in the Figure), one pair for the queue and one pair for
the scratch pad. Control lines (grey arrows in the Figure) are provided for
designating the read address, designating the write address, setting the queue's length,
performing a shift operation, performing a write operation, and asynchronously
clearing the entire unit. To reduce the overall number of control lines to the Q/SP
the 32 bit control signals (the read and write addresses, and the queue boundary) will
be decoded from 5 bit input control signals.
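The behavior of the Q/SP unit described in this section can be sketched in software. The following is a minimal behavioral sketch, not the hardware design: the class name QSPUnit, its method names, and the placement of the queue in the low numbered cells (tail at cell 0) are assumptions made for illustration, and the 5 bit encoding of the control values is not modeled.

```python
class QSPUnit:
    """Behavioral sketch of a 32 bit queue / scratch pad (Q/SP) unit."""

    SIZE = 32

    def __init__(self, queue_len):
        if not 0 <= queue_len <= self.SIZE:
            raise ValueError("queue length must be between 0 and 32")
        # Assumption: the queue occupies cells 0 .. queue_len - 1,
        # and the scratch pad occupies the remaining cells.
        self.queue_len = queue_len
        self.cells = [0] * self.SIZE

    def shift(self, bit_in):
        """One queue shift cycle; returns the head bit after the shift."""
        n = self.queue_len
        if n == 0:
            return None
        # Each bit takes the value of the preceding bit; the tail takes the input.
        self.cells[1:n] = self.cells[0:n - 1]
        self.cells[0] = bit_in & 1
        return self.cells[n - 1]

    def access(self, read_addr=None, write_addr=None, write_bit=0):
        """One scratch pad cycle with optional simultaneous read and write."""
        # The read sees the previous contents of the cell.
        value = None if read_addr is None else self.cells[read_addr]
        # A write that falls inside the queue portion silently fails.
        if write_addr is not None and write_addr >= self.queue_len:
            self.cells[write_addr] = write_bit & 1
        return value
```

With a queue length of 4, a bit shifted in reaches the head after four shift cycles, a write to cell 2 is ignored, and a simultaneous read and write of a scratch pad cell returns the old value while storing the new one.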
4. Richards' Method Update.
The addition method by Richards (Page 3 and Figure 1.1 in the thesis proposal
and SAR-3) does not work in general. Briefly, the method was proposed to allow the
addition of more than two numbers with a single adder circuit using half adds, carry
saves, and a final full addition to propagate any remaining carries. The method relied
on the fact that the carries resulting from a half addition can be added to the "half
sum" to generate an unpropagated carry. The unpropagated carries are then
propagated during a final full addition step. The method does guarantee that two
carries cannot occur in the same digit position on subsequent steps (half adds) but it
does not guarantee that multiple carries into the same digit position will not occur on
alternate steps. An example where the method fails is shown in Figure 4.1.
Form the sum of
A1 = 000111; A2 = 000011; A3 = 000110
using Richards' method.

             000111  <- A1    Half ADD first
             000011  <- A2    operands.
             ------
             000100
             000110           Half ADD carries.
             ------
  (001000)   000010           Half ADD final
             000110  <- A3    operands.
             ------
             000100
             000100           Half ADD carries.
             ------
  (001000)   000000

The values in parentheses are the unpropagated carries to be propagated by the
final full addition!
Figure 4.1: Failure of Richards' Method with Three Inputs.
In the example two unpropagated carries are generated in the 2^3 digit
position. This implies that two carry propagate additions will be needed to propagate
these carries and the method fails. In general on each alternating step (half add of
carries) a new carry can be generated for each digit position, so at most N-1
unpropagated carries can accumulate at any digit position when there are N input
numbers. Therefore, the method will still need N-1 carry propagate additions in
general to propagate the carries across the final "half sum."
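The failure can be reproduced with a few lines of code. This is a minimal sketch of the steps as they are read from Figure 4.1 (half add the operands, then half add the resulting carries, once per new operand); the function names half_add and richards_steps are illustrative assumptions, not part of the original method description.

```python
def half_add(a, b):
    """Half add of two words: bitwise sum without propagation, and the carry word."""
    return a ^ b, (a & b) << 1


def richards_steps(operands):
    """Return (final half sum, list of unpropagated carry words)."""
    half_sum = operands[0]
    unpropagated = []
    for x in operands[1:]:
        half_sum, carries = half_add(half_sum, x)          # half ADD operands
        half_sum, leftover = half_add(half_sum, carries)   # half ADD carries
        unpropagated.append(leftover)                      # left for the final full add
    return half_sum, unpropagated


half_sum, carries = richards_steps([0b000111, 0b000011, 0b000110])
print(format(half_sum, "06b"), [format(c, "06b") for c in carries])
```

For the three operands of Figure 4.1 the sketch yields a half sum of 000000 and two separate carry words of 001000, both in the 2^3 position, so a single carry propagate addition cannot absorb them.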
5. Multiple Input Adders.
A number of different multiple input adders can be built using binary trees of
two input adders. The individual adders in these trees can be bit serial or bit parallel
and they can use any addition speedup techniques such as carry look ahead addition
or ROM assisted addition. In order to distinguish the speedup provided by multiple
input addition from the speedup provided by bit parallel versus bit serial addition, it
is necessary to study how multiple input adders built from both bit serial and bit
parallel adders compare in the solution of a suitable problem. One such suitable
problem can be stated as follows: compute the fixed point sum of $k$ numbers of word
length $l$ in a time less than some constant $T$ and at a cost less than some constant $D$.
A $k$ input adder can be constructed from a set of bit serial full adders with
separate 1 bit carry registers. If the basic word format of the machine is an $l$ bit
parallel word, a parallel to serial conversion register (of length $l$) will be needed to
convert each input operand, and a serial to parallel conversion register (also of
length $l$) will be needed to convert the output. At each level in the addition tree a
single 1 bit register for each adder will also be needed to store the bits of the "partial
sum" for the next level. Since an overflow can result from any of the intermediate
sums, the carry out of each of the bit serial adders must be considered when detecting
overflow of the $k$ input sum. An example of such an adder tree for $k = 4$ inputs is
shown in Figure 5.1.
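The operation of such a bit serial adder tree can be sketched behaviorally. The following is an illustration only, not the RELAPSE design: the function name bit_serial_tree_sum, the restriction to unsigned operands, and the restriction to an operand count that is a power of two are assumptions made to keep the sketch short. Each adder has one carry register and one 1 bit partial sum register, and overflow is flagged when any adder produces a carry out of the top bit position.

```python
def bit_serial_tree_sum(operands, word_len):
    """Simulate a binary tree of bit serial full adders, one bit per cycle.

    Returns (sum modulo 2**word_len, overflow flag, cycles used).
    """
    k = len(operands)
    if k < 2 or k & (k - 1):
        raise ValueError("this sketch assumes k is a power of two, k >= 2")
    levels = k.bit_length() - 1  # log2(k) levels of adders
    # One carry register and one 1 bit partial sum register per adder.
    carry = [[0] * (k >> (lv + 1)) for lv in range(levels)]
    out = [[0] * (k >> (lv + 1)) for lv in range(levels)]
    result, overflow = 0, False
    cycles = word_len - 1 + levels
    for t in range(cycles):
        # Deepest level first, so each adder reads last cycle's registers.
        for lv in reversed(range(levels)):
            bit = t - lv  # bit position handled by this level in cycle t
            if not 0 <= bit < word_len:
                continue
            for i in range(len(out[lv])):
                if lv == 0:
                    a = (operands[2 * i] >> bit) & 1
                    b = (operands[2 * i + 1] >> bit) & 1
                else:
                    a, b = out[lv - 1][2 * i], out[lv - 1][2 * i + 1]
                c = carry[lv][i]
                out[lv][i] = a ^ b ^ c
                carry[lv][i] = (a & b) | (c & (a ^ b))
                # A carry out of the top bit of any adder signals overflow.
                if bit == word_len - 1 and carry[lv][i]:
                    overflow = True
            if lv == levels - 1:
                result |= out[lv][0] << bit  # collect the output bit
    return result, overflow, cycles
```

For four 8 bit operands the sketch takes 9 cycles: two cycles to fill the two levels of the tree and seven more to shift out the remaining sum bits.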
A $k$ input adder can also be constructed from a set of bit parallel adders (of
word length $l$) where each adder has a set of input registers for its input operands.
Thus at each level of the tree there are twice as many $l$ bit registers as there are
parallel adders. Since an overflow can result from any of the intermediate sums, the
carry out of all the parallel adders must be considered when detecting overflow of the
$k$ input sum. Figure 5.2 shows a $k = 4$ input adder constructed from parallel adders.
In order to achieve the fastest addition speed each parallel adder should use some
technique to reduce carry propagation delays such as carry look ahead addition.
[Figure 5.1: Four Input Bit Serial Adder Tree. Legible labels: inputs $X_0$ through $X_3$,
Load/Shift, Overflow Logic, Overflow, and output $Y = \sum_{i=0}^{3} X_i$.]
[Figure 5.2: Four Input Parallel Adder Tree. Legible labels: Overflow and output
$Y = \sum_{i=0}^{3} X_i$.]
The bit serial adder tree shown in Figure 5.1 can be thought of as a bit serial
arithmetic pipeline containing $\lceil \log_2 k \rceil$ steps. After the setup time of $\lceil \log_2 k \rceil$
addition cycles the first bit of the $k$ input sum is loaded into the output register. An
additional $l - 1$ addition cycles are then required to determine the remainder of the
$k$ input sum. Therefore, the total time required to compute the $k$ input sum using the
bit serial adder tree (denoted $T_k(B)$) is
$$T_k(B) = t(\mathrm{ADD}\ A_j) = (l - 1 + \lceil \log_2 k \rceil) \times t(\mathrm{FULL\ ADD})$$
The bit parallel adder tree shown in Figure 5.2 can also be thought of as an addition
pipeline that has a setup time of $\lceil \log_2 k \rceil$ steps. In the parallel adder tree pipeline,
however, the final result is available after the setup time so the pipelining
characteristic is only valuable if there are multiple $k$ input additions to be
performed. Using carry look ahead addition the time required by the bit parallel
adder tree to solve the $k$ input addition problem (denoted $T_k(L)$) is given by
$$T_k(L) = t(\mathrm{ADD}\ A_j) = \lceil \log_2 k \rceil \times (2 \lceil \log_4 l \rceil + t(\mathrm{FULL\ ADD}))$$
The speedup of the bit parallel adder tree over the bit serial adder tree, $S_k(L)$, is found by
dividing the time required to solve the problem on the bit serial adder tree by the time
required to solve the problem on the bit parallel adder tree. The speedup for the carry
look ahead parallel adder tree is
$$S_k(L) = \frac{T_k(B)}{T_k(L)} = \frac{(l - 1 + \lceil \log_2 k \rceil) \times t(\mathrm{FULL\ ADD})}{\lceil \log_2 k \rceil \times (2 \lceil \log_4 l \rceil + t(\mathrm{FULL\ ADD}))}$$
Analyzing this equation shows that the bit serial adder tree is faster only for very
short word lengths ($l < 16$) when there are many inputs to be added ($k > 128$). In all
other cases the bit parallel adder tree will be faster than the bit serial adder tree. For
example, for a 16 input addition of 64 bit operands the parallel adder tree has a
speedup of 6.7 over the bit serial adder tree. The advantage of the bit parallel adder
tree over the bit serial adder tree is directly proportional to the word length $l$ and
inversely proportional to the number of inputs $k$.
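The two timing expressions can be evaluated numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, times are measured in gate delays, and the value of 4 gate delays for t(FULL ADD) is not given in the text; it is assumed here because it reproduces the quoted speedup of 6.7 for 16 inputs of 64 bits.

```python
def ceil_log(x, base):
    """Smallest integer n with base**n >= x, using exact integer arithmetic."""
    n, power = 0, 1
    while power < x:
        power *= base
        n += 1
    return n


def t_serial_tree(k, l, t_full_add):
    """Time for the bit serial adder tree: (l - 1 + ceil(log2 k)) full adds."""
    return (l - 1 + ceil_log(k, 2)) * t_full_add


def t_parallel_tree(k, l, t_full_add):
    """Time for the carry look ahead adder tree with fan in 4 logic."""
    return ceil_log(k, 2) * (2 * ceil_log(l, 4) + t_full_add)


def speedup_parallel_over_serial(k, l, t_full_add=4):
    """Ratio of serial tree time to parallel tree time (t_full_add is assumed)."""
    return t_serial_tree(k, l, t_full_add) / t_parallel_tree(k, l, t_full_add)


print(speedup_parallel_over_serial(16, 64))
```

Under the same assumption, 256 inputs of 8 bits give a ratio below one, which agrees with the statement that the bit serial tree wins only for short words and many inputs.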
In order to determine the speedup of the multiple input adders over two input
adders, the time required to solve the $k$ input addition problem on a two input adder
must be known. Two different two input adders will be used as the base line for the
speedup analysis: a carry propagate adder and a carry look ahead adder. The time
required to do a single addition using a carry propagate adder is $l \times t(\mathrm{FULL\ ADD})$
because, in the worst case, the carry has to be propagated across the entire $l$ bit word.
The time required to do a single addition using a carry look ahead adder is
$2 \lceil \log_4 l \rceil + t(\mathrm{FULL\ ADD})$ assuming logic with a fan in of 4 is used. The 2 input
adders solve the problem in an iterative manner so a total of $k - 1$ additions are
needed to form the $k$ input sum. Therefore, the time needed to solve the addition
problem using the carry propagate adder (denoted $T_1(P)$) and the carry look ahead
adder (denoted $T_1(L)$) are given by
$$T_1(P) = t(\mathrm{ADD}\ A_j) = (k - 1) \times l \times t(\mathrm{FULL\ ADD})$$
and
$$T_1(L) = t(\mathrm{ADD}\ A_j) = (k - 1) \times (2 \lceil \log_4 l \rceil + t(\mathrm{FULL\ ADD}))$$
The 2 input carry look ahead adder will be used in both comparisons. The 2 input
carry propagate adder will be used in the speedup comparison of the bit serial adder
tree because that adder tree suffers from the carry propagation delay. In that adder
tree the carry propagation is done in parallel by all the adders in the tree but still
requires $O(l)$ cycles to propagate.
The speedup of the bit serial adder tree over the carry propagate adder
(denoted $S_1(B)$), determined by dividing $T_1(P)$ by $T_k(B)$, is
$$S_1(B) = \frac{T_1(P)}{T_k(B)} = \frac{(k - 1) \times l}{l - 1 + \lceil \log_2 k \rceil}$$
From this it is easy to see that this adder tree is faster than a single carry propagate
adder whenever there are more than two inputs ($k > 2$) and the word length $l$ exceeds
one bit; for $k = 2$ the two take the same time. The speedup increases
rapidly as both $l$ and $k$ increase because the addition time of the bit serial adder tree
depends on the sum of the word length and number of inputs rather than their
product. The same general results are obtained by analyzing the speedup of the bit
serial adder tree over a 2 input carry look ahead adder. The speedup function does not
increase as rapidly as it does in the carry propagate adder case because the 2 input
carry look ahead adder time depends on the $\log_4$ of the word length instead of
linearly on it.
The speedup of the bit parallel adder tree over the carry look ahead adder
(denoted $S_1(L)$), determined by dividing $T_1(L)$ by $T_k(L)$, is
$$S_1(L) = \frac{T_1(L)}{T_k(L)} = \frac{k - 1}{\lceil \log_2 k \rceil}$$
As with the bit serial adder tree, the bit parallel adder tree is faster than a 2 input
adder once there are four or more inputs. When comparing the carry look ahead adders
the speedup is only a function of the number of inputs. If the speedup comparison is
done with a 2 input carry propagate adder, the speedup will be greater than that shown
by a multiplicative factor on the order of the word length.
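The two speedups over a 2 input adder follow directly from dividing the timing expressions, with t(FULL ADD) cancelling in the first ratio and the per addition time cancelling in the second. The short sketch below evaluates them; the function names are illustrative assumptions.

```python
def ceil_log2(k):
    """Ceiling of log2(k) for an integer k >= 1, using exact integer arithmetic."""
    return (k - 1).bit_length()


def speedup_serial_tree(k, l):
    """Bit serial tree over a 2 input carry propagate adder: (k-1)*l / (l-1+ceil(log2 k))."""
    return (k - 1) * l / (l - 1 + ceil_log2(k))


def speedup_parallel_tree(k):
    """Carry look ahead tree over a 2 input carry look ahead adder: (k-1) / ceil(log2 k)."""
    if k < 2:
        raise ValueError("at least two inputs are required")
    return (k - 1) / ceil_log2(k)


for k in (2, 4, 16, 128):
    print(k, round(speedup_serial_tree(k, 64), 2), round(speedup_parallel_tree(k), 2))
```

The loop shows that both ratios equal one for two inputs and grow with the number of inputs, the bit serial ratio growing much faster because its reference adder pays the full carry propagation delay on every addition.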
The final part of the problem statement deals with a cost analysis of the
different adders. A detailed cost analysis would have to involve a determination of the
hardware necessary to deliver all $k$ operands to the $k$ input adders, as well as a cost
analysis of the adders themselves and any conversion registers needed in the bit
serial adder tree case. To get an order of magnitude estimate of the costs involved,
only the adders, their necessary associated registers, and data connections will be
considered. As a basis for this comparison the cost of a single 1 bit adder, its
associated data lines, a carry register, and two 1 bit input registers will be denoted
D(BIT ADDER). The 2 input carry propagate adder is composed of $l$ of these bit adders
except that only one carry register is needed. Thus the cost of the 2 input carry
propagate adder $D_1(P)$ is $O(l \times D(\mathrm{BIT\ ADDER}))$. The 2 input carry look ahead adder is
more difficult to analyze because of the carry look ahead circuitry. If the number of
gate delays is used as a comparison factor, the carry look ahead circuitry for each bit
position will be roughly as complex as the adder for reasonable word lengths and
logic fan ins. The bit cost of a carry look ahead adder will therefore be about 1.5 times
the bit cost of the carry propagate adder so $D_1(L) = O(l \times D(\mathrm{BIT\ ADDER}))$. The bit
serial adder tree contains $k - 1$ bit serial adders each with an associated 1 bit carry
register, two 1 bit operand registers, and 1 bit data connections. Thus the cost of the
$k$ input bit serial adder tree $D_k(B)$ is $O(k \times D(\mathrm{BIT\ ADDER}))$. The $k$ input carry look
ahead adder tree contains $k - 1$ carry look ahead adders which have a cost of $D_1(L)$ so
the cost of the carry look ahead adder tree $D_k(L)$ is $O(k \times l \times D(\mathrm{BIT\ ADDER}))$.
Comparing these estimates it is easy to see that the $k$ input bit serial adder tree costs
$O(k / l)$ times as much as either of the 2 input adders and that the $k$ input bit parallel
adder tree costs $O(k)$ times as much as the 2 input adders. It can also be seen that the
$k$ input bit parallel adder tree is $O(l)$ times as expensive as the $k$ input bit serial adder tree.
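These order of magnitude estimates can be tabulated for concrete sizes. The sketch below measures every cost in units of D(BIT ADDER); the function names are hypothetical, and the factor of 1.5 per bit for carry look ahead circuitry is taken here as an assumed rough figure rather than a measured one.

```python
LOOK_AHEAD_FACTOR = 1.5  # assumed bit cost of a look ahead adder relative to a plain bit adder


def cost_two_input_propagate(l):
    """2 input carry propagate adder: l bit adders."""
    return l


def cost_two_input_look_ahead(l):
    """2 input carry look ahead adder: l bit adders plus look ahead circuitry."""
    return LOOK_AHEAD_FACTOR * l


def cost_serial_tree(k):
    """k input bit serial adder tree: k - 1 bit serial adders."""
    return k - 1


def cost_parallel_tree(k, l):
    """k input carry look ahead adder tree: k - 1 look ahead adders."""
    return (k - 1) * cost_two_input_look_ahead(l)


k, l = 16, 64
print(cost_serial_tree(k) / cost_two_input_propagate(l))          # on the order of k / l
print(cost_parallel_tree(k, l) / cost_two_input_look_ahead(l))    # on the order of k
print(cost_parallel_tree(k, l) / cost_serial_tree(k))             # on the order of l
```

For 16 inputs of 64 bits the bit serial tree costs less than a single 2 input parallel adder, which is the sense in which its cost ratio is on the order of k / l.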
Appendix: MPP Simulator Programmer's Documentation.
1 Introduction. 
The Architecture Simulation Workbench simulator of the MPP cc:..nsists 
of an MPP emulator, which provides full functional emulation of the Main 
Control Unit, PE Control Unit and the Array Unit, along with a program 
debugger and a set of routines that control the execution of the simulator 
modules and provide communication between modules. This document describes 
the structure, modules and individual files making up the simulator. 
Throughout this document the unix directory path conventions are used, 
where "dir/file" means the file named "file" in directory "dir". The directories 
used by the MPP simulator are: 
mpp, which contains the main routines for the MPP emulator, 
libsim, which contains subroutines used to simulate ARU functions, 
debug, which contains the source for the debugger, and 
libasw, which contains the routines used to simulate parallel operation of the 
simulator components, provide I/O and inter-module communication. These 
routines act like a virtual operating system for the simulation. 
2 Structure. 
The simulator is broken up into several modules, with a module being a 
distinct functional entity which operates in parallel with the other modules of the 
system. There is a module corresponding to each of the MCU, PCU, IOCU 
(currently a null program, since the code to use the IOCU is in place but an 
emulator has not yet been written) and the simulator debugger. Since the C 
programming language does not provide the capability of running multiple 
processes, a set of routines is provided to simulate parallel execution in the 
directory "libasw". Each module acts like a separate program, with calls to the 
routines in libasw for communication, I/O and synchronization. This section will 
describe the libasw routines and how the modules of the simulator interact with 
them. 
2.1 The Simulator Operating System - libasw. 
The directory libasw contains the routines which perform operating 
system functions for the simulator - creation of modules, communication between 
modules and I/O with the Vax file system and the user's terminal. Also included 
in this directory are routines for performing common simulator operations 
involving simulated MPP memories and register sets. These routines provide a 
common interface for all memory or register operations, as well as allowing the 
ASW debugger to transparently transfer data to and from the simulated 
memories. 
The file "libasw/multi.c" contains the heart of the ASW operating system - 
the routines for simulating concurrent execution of the modules. Each module 
has a corresponding module descriptor, a C structure defined in h/multi.h, which 
contains the module's name, its run time stack, the address to which the Vax stack 
pointer is set when the module is run, and information about the module's state. The 
routine "sp_exec" in libasw/multi.c defines a new module. sp_exec creates a new 
module descriptor, allocates stack space, initializes the module's stack, and then 
places the descriptor in the system run queue. The bottom of the stack is set up 
so that if and when the main routine of the module finishes, the subroutine return 
will cause control to jump to the routine sp_die. This routine disposes of the 
module descriptor and returns control to the scheduler. The address of the main 
routine for the module is placed on the stack above sp_die, and the initial value of 
the module stack pointer is set to point to this address, so that the first time the 
module is scheduled the main routine is called. 
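The bookkeeping side of module creation might be sketched in C as follows. This is only an illustration under stated assumptions: the structure layout, the field names, the stack size and every sketch_ name are hypothetical (the real descriptor is the one in h/multi.h), and the Vax-specific stack initialization and stack switching are deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_STACK_BYTES 4096   /* hypothetical per-module stack size */

/* Hypothetical module descriptor; the real one is defined in h/multi.h. */
struct sketch_module {
    char name[16];                /* module name */
    unsigned char *stack;         /* run time stack for the module */
    void (*entry)(void);          /* main routine of the module */
    unsigned state;               /* state information (unused here) */
    struct sketch_module *next;   /* link in the run queue */
};

static struct sketch_module *run_head, *run_tail;

/* Create a descriptor, allocate its stack, and append it to the run
   queue. Returns NULL if memory cannot be allocated. */
struct sketch_module *sketch_exec(const char *name, void (*entry)(void))
{
    struct sketch_module *m = calloc(1, sizeof *m);
    if (m == NULL)
        return NULL;
    m->stack = malloc(SKETCH_STACK_BYTES);
    if (m->stack == NULL) {
        free(m);
        return NULL;
    }
    strncpy(m->name, name, sizeof m->name - 1);  /* calloc left the NUL */
    m->entry = entry;
    if (run_tail != NULL)
        run_tail->next = m;
    else
        run_head = m;
    run_tail = m;
    return m;
}

/* Remove and return the first descriptor in the run queue, or NULL. */
struct sketch_module *sketch_next_runnable(void)
{
    struct sketch_module *m = run_head;
    if (m != NULL) {
        run_head = m->next;
        if (run_head == NULL)
            run_tail = NULL;
        m->next = NULL;
    }
    return m;
}

/* Dispose of a descriptor, as is done when a module's main routine ends. */
void sketch_dispose(struct sketch_module *m)
{
    assert(m != NULL);
    free(m->stack);
    free(m);
}
```

Modules created in this sketch come off the run queue in the order in which they were created.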
The routine "scheduler" (in libasw/multi.c) schedules the execution of 
modules. Two queues are maintained for modules ready to be run and those 
waiting for some event. The scheduler first checks to see if there are any 
modules in the run queue. If there are, the first one is removed from the queue 
and restarted. If not, the wait queue is traversed to find any modules that are 
ready to be transferred to the run queue. A module is ready to run when its state 
matches the system state variable and it is not waiting for elapsed time. If there is 
no runnable module, the system clock is updated, all modules have their wait 
time counters updated and the wait queue is again traversed. 
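The selection policy described above might look roughly like the following C sketch. It assumes a fixed table of modules in place of the two linked queues, reads "state matches" as "every awaited bit is set", and advances the clock one tick at a time; all sk_ names are hypothetical and none of this is taken from libasw/multi.c.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_RUNNABLE, SK_WAITING };

struct sk_module {
    int where;            /* SK_RUNNABLE or SK_WAITING */
    unsigned wait_bits;   /* state bits the module waits for (0 = none) */
    unsigned long ticks;  /* clock ticks still to sleep (0 = none) */
};

unsigned sk_state;        /* system state variable */
unsigned long sk_clock;   /* system clock, in ticks */

/* A waiting module may run once its sleep has expired and every state
   bit it waits for is set. */
static int sk_ready(const struct sk_module *m)
{
    return m->ticks == 0 && (m->wait_bits & sk_state) == m->wait_bits;
}

/* Pick the next module to run from a table of n modules. Returns its
   index, or -1 if no module is ready and none is sleeping, since
   advancing the clock could then never make one ready. */
int sk_schedule(struct sk_module *mods, size_t n)
{
    for (;;) {
        int sleepers = 0;
        size_t i;

        for (i = 0; i < n; i++)            /* 1. run queue first */
            if (mods[i].where == SK_RUNNABLE)
                return (int)i;
        for (i = 0; i < n; i++)            /* 2. then the wait queue */
            if (sk_ready(&mods[i])) {
                mods[i].where = SK_RUNNABLE;
                return (int)i;
            }
        for (i = 0; i < n; i++)            /* 3. nothing runnable: tick */
            if (mods[i].ticks > 0) {
                mods[i].ticks--;
                sleepers = 1;
            }
        if (!sleepers)
            return -1;
        sk_clock++;
    }
}
```

The -1 return is a convenience of the sketch so that it always terminates; the text does not say what the real scheduler does when nothing can ever become runnable.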
When a module is scheduled to run, a pointer to the module descriptor is put 
in the global variable "u" and the routine "sp_swap" is called. sp_swap alters the 
Vax stack frame so that the subroutine return address is replaced with the 
address saved when the module was suspended. The Vax subroutine return then 
continues execution at this point. 
The routines "sp_sleep" and "wait_for" are used to suspend execution of a 
module. Each of these routines updates the module descriptor to reflect the event 
that is to wake up the module (either elapsed time or a change in the system 
state), places the module on the wait queue and swaps in the scheduler. sp_sleep 
causes the module to sleep for a number of clock ticks. wait_for causes the module 
to sleep until a specified bit in the system state is set. A module can signal an 
event (set a bit in the state) via the routine "sp~h". The bit can then be cleared 
by a module calling the routine "sp_seen", to indicate that the event has been 
seen. 
Each module of the simulator consists of a large loop in which an instruction 
is interpreted, and then the module calls "sp_sleep" to sleep for the number of 
clock ticks corresponding to the time used by the instruction. Modules are 
automatically suspended when they try to perform some operation that cannot be 
done at that time: for instance, when the PCU tries to take a command off the 
call queue, if the queue is empty, the queue routine suspends the PCU until a 
signal is received indicating something was put on the queue. (Since the signal 
only says that something was put on a queue, without specifying which queue, the 
routine actually has a loop to check if there is something on the proper queue, 
resuspends if not, and calls "sp_seen" when it does get something.) 
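The signal-and-recheck pattern in the parenthetical remark can be sketched in C as below. The event bit, the two counters standing in for queues, and all ev_ names are assumptions for illustration; the real routines suspend the calling module, whereas this sketch only reports whether the caller would have to suspend again.

```c
#include <assert.h>

#define EV_QUEUE   0x1u        /* hypothetical "something was queued" bit */
#define EV_NQUEUES 2

static unsigned ev_state;      /* stands in for the system state variable */
static int ev_count[EV_NQUEUES];

void ev_signal(unsigned bit) { ev_state |= bit; }   /* set an event bit */
void ev_seen(unsigned bit)   { ev_state &= ~bit; }  /* clear it again */

/* Producer side: add an item to queue q and signal the shared event. */
void ev_put(int q)
{
    assert(q >= 0 && q < EV_NQUEUES);
    ev_count[q]++;
    ev_signal(EV_QUEUE);
}

/* Consumer side: one pass of the wait loop for queue q. Returns 1 if an
   item was taken, or 0 if the caller has to suspend again because the
   signal was absent or was meant for a different queue. */
int ev_try_take(int q)
{
    assert(q >= 0 && q < EV_NQUEUES);
    if (ev_count[q] == 0)
        return 0;              /* nothing on this queue: resuspend */
    ev_count[q]--;
    ev_seen(EV_QUEUE);         /* acknowledge only after getting an item */
    return 1;
}
```

Because one bit covers every queue, a consumer woken by a signal for another queue leaves the bit set and goes back to waiting, which is why the loop is needed.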
The file "libasw/imc.c" contains routines for communication between 
modules. These routines are used to define and access intermodule resources. 
The definition of a resource comes from the "item[NITEM]" table. Currently it 
delimits the classes "section of memory", "function", and "bit plane of data". 
"I_creat" makes a resource in one module available to other modules. These 
resources can then be read or written to using the routines "I_read" and "I_write". 
A resource can be loaded from or written to a file via the routine "I_file". I_file 
uses the routines in "libasw/load.c" for loading and unloading resources. 
The file "libasw/memory.c" contains routines for creating and using 
simulated computer memories or register sets. These routines allow all memory 
accesses to be done through a common interface, which checks for illegal 
addresses and performs the trace and breakpoint functions for the ASW 
debugger. The PCU, MCU and ARU memories and registers are all created and 
accessed with these routines. The routine "M_create" creates a memory space by 
placing a descriptor for the memory in a table. This descriptor contains the name 
of the memory (for example, MCU memory is called "lli<;paCe"; this name is for 
printing out messages, such as when a breakpoint is activated), the type (scalar or 
array), access permission, a flag for turning on the trace function, word length, 
the memory's size, an array of breakpoints which can be set by the debugger, and 
a pointer to a chunk of VAX memory to be used as the simulated memory. 
M_create also calls "I_creat" to make the memory available to the debugger. Once 
a memory has been created, it can be read or written to by the routines "M_read" 
and "M_write". "M_address" returns a pointer to some address in a memory - this 
is used mainly by the array routines because it is more efficient to use this pointer 
to perform an operation directly on an ARU plane rather than reading it into a 
buffer, performing the operation, then writing the result back. 
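A common access path of this kind, with permission checks, bounds checks and breakpoint detection in one place, might be sketched in C as follows. The descriptor layout, the flag values, the fixed 8 bit word and all ms_ names are hypothetical; the sketch only counts breakpoint hits and omits the trace flag and the descriptor table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MS_READ   0x1u   /* hypothetical access flags */
#define MS_WRITE  0x2u
#define MS_MAXBRK 4

struct ms_memory {
    const char *name;         /* used when printing messages */
    unsigned access;          /* MS_READ and/or MS_WRITE */
    size_t size;              /* size in bytes (word length fixed at 8 bits) */
    long brk[MS_MAXBRK];      /* breakpoint addresses, -1 = unused */
    int hits;                 /* number of breakpoint hits so far */
    unsigned char *data;      /* host memory backing the simulated memory */
};

/* Create a zero-filled memory; returns NULL on allocation failure. */
struct ms_memory *ms_create(const char *name, unsigned access, size_t size)
{
    struct ms_memory *m = malloc(sizeof *m);
    int i;
    if (m == NULL) return NULL;
    m->data = calloc(size ? size : 1, 1);
    if (m->data == NULL) { free(m); return NULL; }
    m->name = name; m->access = access; m->size = size; m->hits = 0;
    for (i = 0; i < MS_MAXBRK; i++) m->brk[i] = -1;
    return m;
}

/* Shared check: permission, bounds, then breakpoints. 0 = ok, -1 = error. */
static int ms_check(struct ms_memory *m, size_t addr, size_t n, unsigned need)
{
    int i;
    if ((m->access & need) != need) return -1;
    if (addr > m->size || n > m->size - addr) return -1;
    for (i = 0; i < MS_MAXBRK; i++)
        if (m->brk[i] >= 0 && (size_t)m->brk[i] >= addr
            && (size_t)m->brk[i] - addr < n)
            m->hits++;
    return 0;
}

/* Read n bytes starting at addr into out. */
int ms_read(struct ms_memory *m, size_t addr, size_t n, void *out)
{
    if (ms_check(m, addr, n, MS_READ) != 0) return -1;
    memcpy(out, m->data + addr, n);
    return 0;
}

/* Write n bytes from in starting at addr. */
int ms_write(struct ms_memory *m, size_t addr, size_t n, const void *in)
{
    if (ms_check(m, addr, n, MS_WRITE) != 0) return -1;
    memcpy(m->data + addr, in, n);
    return 0;
}
```

Routing both reads and writes through ms_check is what lets a debugger observe every access, including accesses to register sets.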
The file "libasw/queue.c" contains routines for setting up and using queues. 
The routines are general-purpose; a queue can be created with any number of 
elements and the queue elements can be anything desired. The routines are used 
in the simulator for the PE call-queue. The file "h/queue.h" has the declaration 
for the queue structure. This structure has fields for the queue element size in 
bytes, the number of elements, the count of elements currently in the queue, the 
head element of the queue, and a pointer to the block of memory containing the 
queue. The routine "queuesize" creates a new queue. The routine "enq" places an 
element on a queue; "deq" removes the top element. "topq" returns the top 
element without removing it from the queue. "dumpq" empties the queue. When 
a module calls "enq" to place an element on a queue which is already full, it is 
automatically suspended, and the next time an element is removed from the 
queue, a signal is sent to wake up the module and the operation is completed. 
Similarly, when a module attempts a "deq" on an empty queue, it is suspended 
until an element is placed on the queue. 
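A general-purpose queue with the fields listed above could be sketched in C as a circular buffer. The structure, the sq_ names and the first-in first-out reading of "top element" are assumptions; where the simulator would suspend the calling module on a full or empty queue, this sketch returns -1 instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical queue layout modelled on the fields listed above. */
struct sq_queue {
    size_t elem_size;     /* element size in bytes */
    size_t capacity;      /* number of elements the queue can hold */
    size_t count;         /* elements currently in the queue */
    size_t head;          /* index of the oldest element */
    unsigned char *mem;   /* block of memory containing the queue */
};

/* Create a queue; returns NULL if the arguments or allocation are bad. */
struct sq_queue *sq_create(size_t elem_size, size_t capacity)
{
    struct sq_queue *q;
    if (elem_size == 0 || capacity == 0) return NULL;
    q = calloc(1, sizeof *q);
    if (q == NULL) return NULL;
    q->mem = calloc(capacity, elem_size);
    if (q->mem == NULL) { free(q); return NULL; }
    q->elem_size = elem_size;
    q->capacity = capacity;
    return q;
}

/* Append an element. Returns -1 when the queue is full. */
int sq_enq(struct sq_queue *q, const void *elem)
{
    size_t tail;
    if (q->count == q->capacity) return -1;
    tail = (q->head + q->count) % q->capacity;
    memcpy(q->mem + tail * q->elem_size, elem, q->elem_size);
    q->count++;
    return 0;
}

/* Copy out the oldest element without removing it; -1 when empty. */
int sq_top(const struct sq_queue *q, void *out)
{
    if (q->count == 0) return -1;
    memcpy(out, q->mem + q->head * q->elem_size, q->elem_size);
    return 0;
}

/* Remove the oldest element; -1 when empty. */
int sq_deq(struct sq_queue *q, void *out)
{
    if (sq_top(q, out) != 0) return -1;
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return 0;
}

/* Empty the queue. */
void sq_dump(struct sq_queue *q) { q->head = 0; q->count = 0; }
```

Storing the element size in the queue is what allows the same routines to hold elements of any type.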
2.2 Module Structure. 
The simulator currently has 4 modules: the MCU, PCU, IOCU (null 
program) and the ASW debugger. 
Each module (except the debugger) is of the form: 
main_routine () { 
<variable declarations> 
I_creat ("<module name>", I_MEM, u, u, sizeof *u); 
This makes the module descriptor available to the debugger. By 
manipulating this descriptor, the module can be stopped or started by 
the debugger, and the module status can be read. 
<space> = M_create ("<space>", <access>, <word length>, <size>); 
<space> is the name of some memory space or register set. <access> 
is some combination of M_READ, M_WRITE, M_DATA, M_INST, 
M_SPACE or M_ARRAY, ORed together. Any combination is possible, 
except that M_SPACE and M_ARRAY are mutually exclusive. This 
parameter specifies whether the memory is an array or not, and what 
kinds of access are permitted on it. <word length> is the length of a 
word of the memory. For array memories it is in bit-planes, for others, 
in bits. Note that for MCU memory this is 8, because MCU addresses are 
byte addresses. The fact that the MCU works on 2 bytes at a time 
simply means that all memory accesses are in multiples of 2 bytes. 
<size> is the size of the memory, in words (i.e., bytes for MCU 
memory, bit planes for ARU memory). The size of a memory can be 
changed by the debugger at any time. Some memories are created with 0 
words and then dynamically expanded at start up or when loaded with 
data. 
Memories declared with M_create can be accessed with the memory 
management routines in libasw/memory.c, most notably M_read and 
M_write. It is not necessary for each module to use M_create for every 
memory that it uses, but using the memory management routines 
provides several advantages. Bounds checking is done automatically, and 
the trace and breakpoint facilities of the debugger can be used for every 
memory (even registers) using the routines. 
Wait for start command from debugger. 
for (;; sp_sleep (<speed>)) { 
Or similar loop, with "sp_sleep" of appropriate number of clock ticks at 
end. The rest of the program is enclosed in this loop. 
while (ERROR) { 
wait_for (S_CONTROL); 
} 
As long as the module status is "error", wait for signal. 
if (STATUS & SP_SINGLE) { 
STATUS |= SP_WAIT; 
} 
If single stepping, wait for control signal. 
while (STATUS & SP_WAIT) { 
wait_for (S_CONTROL); 
} 
If status is "wait", wait for signal. 
M_read (<space>, <pc>, <nwords>, &inst, M_INST); 
<pc> += <nwords>; 
Read next instruction, increment program counter. 
The rest of the program consists of interpreting the instruction. 
3 Debugger structure. 
The debugger is the overall controller for the simulation. All I/O to the 
terminal and the file system is handled by the debugger. The debugger controls 
and monitors the operation of the MPP emulator by manipulating emulator and 
system variables. Variables which have been made inter-module resources by 
"I_creat" or memories created by "M_create" can be accessed by the "open" and 
"assign" debugger commands. Certain emulator variables are placed in the 
debugger symbol table at start up. These variables can be seen in 
"mpp/mppsyms.c". 
The debugger main routine is "command" in "debug/command.c". This 
routine initializes some variables, uses the C library routines "setjmp" and "signal" 
to arrange for control-C interrupts to cause the debugger to restart (this works 
fine on unix but only about half the time on VMS), and then calls "yyparse", the 
debugger command parser. Input to the debugger is initially taken from the file 
"mpp.ini", which opens channels to emulator resources and defines several 
debugger variables and procedures. Input is then taken from the terminal. 
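The restart-on-interrupt arrangement can be sketched in C as shown below. Note the substitution: the text names "setjmp" and "signal", while this sketch uses the POSIX variants sigsetjmp and siglongjmp so that the signal mask is restored after each jump. The function names are hypothetical, and raise() stands in for the user typing control-C.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf restart_point;
static volatile sig_atomic_t restarts;

/* Control-C handler: jump back to the top of the command loop. */
static void on_interrupt(int signo)
{
    (void)signo;
    restarts++;
    siglongjmp(restart_point, 1);
}

/* Simulate a command loop that is interrupted `interrupts` times before
   it finishes normally. Returns the number of restarts that occurred. */
int command_loop(int interrupts)
{
    restarts = 0;
    signal(SIGINT, on_interrupt);
    sigsetjmp(restart_point, 1);   /* execution resumes here after control-C */
    signal(SIGINT, on_interrupt);  /* reinstall: some systems reset it */
    if (restarts < interrupts)
        raise(SIGINT);             /* stands in for the user typing control-C */
    signal(SIGINT, SIG_DFL);
    return (int)restarts;
}
```

In a real command loop the call to the parser would take the place of the raise() line, so that an interrupt at any point during parsing or evaluation returns control to the top of the loop.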
The debugger uses the routine "read_term" in "libasw/termio.c" to perform 
non-blocking reads on the terminal. If there is no input available when 
attempting to read from the terminal, "sp_sleep" is called. 
The parser is in "debug/y.tab.c". This file was created by the parser 
generator program "yacc" from the file "debug/c.y", a BNF-like grammar for the 
debugger command language. 
The parser's action is to form a parse tree of the statement and then, if currently 
inside a block or procedure definition, to add the parse tree to the structure, or otherwise to 
execute it and print the result. The routine "eval" in "debug/eval.c" evaluates 
parse trees. It recursively evaluates each sub-expression in the tree until the 
entire tree has been evaluated, and passes the result to the calling program (which 
is either itself, one of the routines in "eval.c" which evaluates particular functions 
or expression types, or the parser). 
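Recursive evaluation of a parse tree can be illustrated with the following C sketch. The node type, the set of operators and all pt_ names are hypothetical, since the text does not describe the debugger's actual node layout or command language in this detail.

```c
#include <assert.h>
#include <stddef.h>

enum pt_kind { PT_NUM, PT_ADD, PT_SUB, PT_MUL, PT_NEG };

/* Hypothetical parse tree node; the debugger's real node type is not
   described in the text. */
struct pt_node {
    enum pt_kind kind;
    long value;                    /* used by PT_NUM only */
    const struct pt_node *left;    /* operand of PT_NEG, left operand else */
    const struct pt_node *right;   /* right operand of binary operators */
};

/* Evaluate a tree by evaluating each sub-expression first, then
   combining the results at the current node. */
long pt_eval(const struct pt_node *n)
{
    assert(n != NULL);
    switch (n->kind) {
    case PT_NUM: return n->value;
    case PT_NEG: return -pt_eval(n->left);
    case PT_ADD: return pt_eval(n->left) + pt_eval(n->right);
    case PT_SUB: return pt_eval(n->left) - pt_eval(n->right);
    case PT_MUL: return pt_eval(n->left) * pt_eval(n->right);
    }
    return 0;   /* not reached for valid trees */
}
```

Each recursive call returns its result to its caller, which is either another call to the evaluator or the code that requested the evaluation in the first place.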
The routines in "libasw/imc.c" are used to transfer data between the 
debugger and the other modules of the simulator. 
4 Simulator execution. 
The file "debug/master.c" contains the main routine for the simulator. 
This routine does the following: 
Calls the C library routine "signal" to set up signal handling. The "trap" 
and "pipe" signals (unix signals) are ignored; the "terminate" (control-C) 
signal causes a call to "quit", which halts the simulator. 
Interprets the command line options, which, in unix fashion, are of the 
form "-<option>". The options are "-i <dir>", which tells the 
simulator to use the directory "<dir>" rather than the default directory 
for the debugger initialization file, "-p", which causes writes to the 
terminal to be held until after a carriage return when something is being 
typed by the user, and "-w", which is the opposite of "-p" (and the 
default) - writes to the terminal are sent immediately. 
Calls "queuesize" to create the PCU call queue. 
Calls "sp_exec" to start each of the simulator modules running. 
Calls "lexfile" to open the debugger initialization file. 
Calls "multi_task" to take over and start running the simulator modules. 
At this point, each of the modules described above (MCU, PCU and debugger) 
begins to operate. The MCU and PCU are in a wait state, waiting for input. The 
user can now use debugger commands to load a program into memory and start 
the MCU running. The PCU will wake up upon a call from the MCU. 
Each of the modules runs in a non-terminating loop, interpreting its 
instructions, or simply idling and waiting for input. An abort command to the 
debugger halts the simulation. 
5 Adding a module to the simulator. 
To add a module to the simulator (for example, a staging memory 
emulator), one must first write a program for the new module. This can be done 
pretty much independently of the rest of the simulator, with only a few 
subroutine calls to connect it to the rest of the system. The call 
I_creat ("<module name>", I_MEM, u, u, sizeof *u); 
should be put at the top of the main routine. If any memories are to be created 
using the memory routines in memory.c, a call such as 
<space> = M_create ("<space>", <access>, <word length>, <size>); 
should be included for every such memory. This can be followed by 
STATUS = SP_WAIT; 
to indicate that the module is to wait for a signal from the debugger to begin 
execution. The following sequence of code should be placed before each iteration: 
while (ERROR) { 
wait_for (S_CONTROL); 
} 
if (STATUS & SP_SINGLE) { 
STATUS |= SP_WAIT; 
} 
while (STATUS & SP_WAIT) { 
wait_for (S_CONTROL); 
} 
Finally, the program should execute "sp_sleep" after every iteration, to sleep for 
the simulated amount of time used by the module. 
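The control flow of such a module can be tried out in isolation with the C sketch below. The status bit values, the cost of 2 ticks per instruction, and the stub_ routines standing in for the libasw calls are all assumptions: in this sketch a control signal simply clears the wait and error bits, whereas in the simulator the module would be suspended until the debugger acts.

```c
#include <assert.h>

#define SP_WAIT   0x1u   /* status bits; the values are assumptions */
#define SP_SINGLE 0x2u
#define SP_ERROR  0x4u

static unsigned status;      /* module status word */
static unsigned long now;    /* simulated clock, in ticks */
static int controls;         /* number of control signals consumed */

/* Stand-in for waiting on the control signal: the "debugger" responds
   at once by clearing the wait and error bits. */
static void stub_wait_for_control(void)
{
    status &= ~(SP_WAIT | SP_ERROR);
    controls++;
}

/* Stand-in for sleeping: advance the simulated clock. */
static void stub_sleep(unsigned ticks)
{
    now += ticks;
}

/* Run n iterations of the module loop and return the elapsed ticks. */
unsigned long run_module(unsigned initial_status, int n)
{
    int i;

    assert(n >= 0);
    status = initial_status;
    now = 0;
    controls = 0;
    for (i = 0; i < n; i++, stub_sleep(2)) {
        while (status & SP_ERROR)       /* error: wait for signal */
            stub_wait_for_control();
        if (status & SP_SINGLE)         /* single step: stop every time */
            status |= SP_WAIT;
        while (status & SP_WAIT)        /* wait for the debugger */
            stub_wait_for_control();
        /* fetch and interpret one instruction here */
    }
    return now;
}
```

A module started in the wait state needs one control signal to begin, while a single-stepped module needs one per iteration; the elapsed simulated time is the same in both cases.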
