Design of a massively parallel computer using bit serial processing elements by Khouri, Kamal S. et al.
(NASA-CR-197365) DESIGN OF A
MASSIVELY PARALLEL COMPUTER USING
BIT SERIAL PROCESSING ELEMENTS
(Bucknell Univ.) 23 p
N95-22406
Unclas
G3/62 0040998
https://ntrs.nasa.gov/search.jsp?R=19950015989 2020-06-16T08:10:50+00:00Z
NASA-CR-197365
Design of a Massively Parallel Computer Using Bit
Serial Processing Elements
Maurice F. Aburdene
Kamal S. Khouri
Jason E. Piatt
Jianqing Zheng
Department of Electrical Engineering
Bucknell University
Lewisburg, PA 17837
January 24, 1995
Abstract
A 1-bit serial processor designed for a parallel computer architecture is described.
This processor is used to develop a massively parallel computational engine, with a
single instruction-multiple data (SIMD) architecture. The computer is simulated and
tested to verify its operation and to measure its performance for further development.
1 Introduction
Recent trends in the design of massively parallel computers have shifted from an
emphasis on single instruction-multiple data (SIMD) machines to multiple
instruction-multiple data (MIMD) machines. However, it is our contention that the
trend is based on inappropriately drawn conclusions. Because of the early rush to build
SIMD computers, inadequate attention was devoted to exploring the full potential of
these architectures. Current MIMD machines that are based on commercially available
processors lack the scalability originally intended for massively parallel computers.
These computers employ processors on the order of a hundred or a thousand rather
than a million, and yet consume large amounts of power and space. One of the main
criticisms of these systems is the use of processors designed for workstation rather than
supercomputer applications. Placing such processors in a supercomputer environment
has reduced the performance of each processor to a fraction of its capability
and has required the use of co-processors to help them adapt to the multi-node
environment.
The objective of our project is to design a massively parallel SIMD architecture
with greater than a million processors. A low power, 1-bit serial processor specifically
designed for a SIMD architecture will be utilized.
2 1-Bit Serial Processing Element
As shown in Figure 1, the processing element consists of a bit serial arithmetic and
logic unit (ALU), a distributed bit serial RAM, and a 2-dimensional router. The ALU
is further divided into:
• Computational logic
• Three registers or D-flip-flops - an accumulator register (A), a carry register (C),
and a mask register (M)
• Transmission gates
Figure 1: Registers and ALU of PE
In this design, a half adder is the fundamental component of the ALU. The results
of such operations may be stored in either the accumulator or carry register. The
use of a shift register may considerably reduce the execution time for functions such
as multiplication. However, to reduce the size of the processing element by at least
a factor of five, a shift register is not used. The mask register M enables certain
processors to perform a given operation while the others are "masked" and therefore
do not execute that operation. There are sixteen control signals from which functions
may be constructed:
• RR - Read RAM
- 0 - 0 is fed to PE
- 1 - load value from memory address
• LR- Load RAM
• RI- Read RAM Inverted
• MI - Mask Invert signal
• LA - Load A Register (Accumulator)
• MA - Mask load of A
• LC - Load C Register (Carry)
• MC - Mask load of C
• LM - Load M Register (Mask)
• MM - Mask load of M
• RC - Read from Carry
- 0 - Read from A
- 1 - Read from C
• OC - OR with Carry
Figure 2: Logic for PE Local Control Signals
• AC - Load A/C input buses from:
- 00 - From RAM (FR)
- 01 - From Adder (FA)
- 10 - Activate Router (RT)
- 11 - Activate Global operations network (GB)
Figure 3: Logic for PE Network Level Control Signals
• DR - Router Direction
- 0 - Send Data To North and To West
- 1 - Send Data To South and To East
• IO - Controls external I/O via router
Figure 4: Logic for PE Router Control Signals
Figures 2, 3, and 4 show the logic used to implement the different control signals
listed. Control signals for the PE may originate from a controller chip, a compiler
or control memory. In addition to the 16 control signals, there are 10 AA bits that
represent the address space of the distributed RAM. Communication with RAM is
controlled by RR, RC and LR. When RR=1, data in a selected memory address appears
on the data bus. RC controls where the data on the RAM write bus comes from, and
when LR=1, the selected memory address is written to.
The signals LA, LC and LM cause the A, C and M registers to load, respectively,
while the signals MA, MC and MM cause these same registers to be masked. Hence,
if LA=0 and MA=0, the A register will not be loaded from its input bus. If LA=1
and MA=0, the A register will be unconditionally loaded. On the other hand, if
LA=0 and MA=1, the A register will be loaded from its input bus if and only if
the corresponding M register contains the value 0. If LA=1 and MA=1, the A
register will be loaded only if the corresponding M register contains the
value 1. This set of conditions also holds true for the signal pairs LC/MC, LM/MM
and RI/MI. Unlike the other signals, RI/MI do not control the loading of a register.
Instead, they control the inversion of values coming from RAM. If RI=1, then the
output of RAM is inverted, and MI causes a conditional inversion depending on the
value of M.
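The four load/mask cases described above can be summarized in a small truth function. The following is a minimal Python sketch, not the authors' gate-level logic; the function name load_enabled and its argument names are hypothetical and chosen only for illustration.

```python
def load_enabled(load: int, mask: int, m: int) -> bool:
    """Return True when a load (or RAM inversion) takes effect in one PE.

    load -- the L* signal of the pair (LA, LC, LM, or RI)
    mask -- the matching M* signal (MA, MC, MM, or MI)
    m    -- current content of this PE's M register
    All names are illustrative assumptions, not taken from the report.
    """
    if mask == 0:
        # Unmasked: the register loads unconditionally, or not at all.
        return load == 1
    # Masked: the register loads only when M equals the L* bit
    # (L*=0 requires M=0, L*=1 requires M=1).
    return m == load


# Enumerate the four LA/MA combinations for M = 0 and M = 1.
for la, ma in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(la, ma, [load_enabled(la, ma, m) for m in (0, 1)])
```

Under this reading, the same function covers the LC/MC, LM/MM and RI/MI pairs, with the RI/MI result selecting inversion of the RAM output rather than a register load.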
Designed for CMOS implementation, the processor uses transmission gates to guarantee
that there are no conflicting signals, and that no combination of control signals
will produce a short from the power supply to the ground of any device.
There are three levels from which the processing element may receive and transmit
data. A model of this configuration is shown in Figure 5. These levels are:
1. Local RAM and ALU
2. Router Network
3. Global Operation Network.
Figure 5: Three Levels of PE Network Operation
Each processing element may only communicate with one of levels 2 and 3 at a
time. The first level is used to perform operations on data stored in the bit serial RAM
or any one of the registers. The second level is for inter-processor data transfer and
external I/O, and the third level is for global functions and operations. The signals
AC0 and AC1 control which level data is loaded from and sent to. If AC1=0, then
data transfer is local (ALU or RAM). When AC1=1, data is transferred from levels 2
and 3. In that case, when AC0=0 the data is sent and received from the router, and
when AC0=1 the data is sent through the global operations network.
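The AC0/AC1 selection can be illustrated with a small decoder. This is a hedged sketch in Python: it assumes the two-bit AC code in the control signal list is written in the order AC1 AC0, which is consistent with the description above, and the names AC_SOURCES, decode_ac and network_level are hypothetical.

```python
# Two-bit AC code, keyed as (AC1, AC0); bit order is an assumption.
AC_SOURCES = {
    (0, 0): "FR",  # from RAM (local)
    (0, 1): "FA",  # from adder (local)
    (1, 0): "RT",  # router network
    (1, 1): "GB",  # global operations network
}


def decode_ac(ac1: int, ac0: int) -> str:
    """Return the name of the source selected for the A/C input buses."""
    return AC_SOURCES[(ac1, ac0)]


def network_level(ac1: int, ac0: int) -> int:
    """Return the network level (1 = local, 2 = router, 3 = global)."""
    if ac1 == 0:
        return 1
    return 2 if ac0 == 0 else 3


for ac1 in (0, 1):
    for ac0 in (0, 1):
        print(ac1, ac0, decode_ac(ac1, ac0), network_level(ac1, ac0))
```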
3 Communication Network and Router
The router network allows processors to transmit and receive information from each
other. Any data received may be stored locally and operated on. As mentioned above,
the router is a two-dimensional mesh shown in Figure 6. This means that each element
may communicate directly with any of its four closest neighbors - north, south, east or
west.
Figure 6: Block Representation of a 3x3 Toroidal 2-D Mesh of Bit Serial PE's
Since there are two registers available to store data, we may use both for
communication purposes. The accumulator register is chosen to accommodate east-west data
transfer, while the carry register is used for north-south data transfer. This scheme
optimizes the use of a two-dimensional router and, if need be, allows two sets of
data to be transmitted at a time. For example, one command may be "send carry
register contents north" or it may be "send carry register contents north and accumulator
register contents west."
Each processing element is connected to two routing switches - one switch for north-
south communication and another for east-west communication. As shown in Figure 7,
each switch consists of four transmission gates in a square formation, where opposing
gates are controlled by a common signal - So/Ea and No/We. The corners of each
switch are connected to four data transmission lines as shown in Figure 8 - input of a
register, output of a register, and the switches from the two adjacent processors. The
control signals dictate which transmission gates are on and hence which direction the
data is transmitted. When signal RT=1, the router is activated. The DR signal
controls the direction of data transfer as seen in Figure 4. When DR=0, values in the
C register move north, and values in the A register move west. When DR=1, values
in the C register move south, and values in the A register move east. Figure 8 is an
example of data travelling north and west.
Figure 7: E-W Switch and N-S Switch for Router
Figure 8: Three PE's Connected with 2-D Router Switches
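One router step on the toroidal mesh can be modeled as a wrap-around shift of the A and C register planes. The sketch below is illustrative only: it follows the DR definition in the control signal list (DR=0 sends north and west, DR=1 sends south and east), assumes row 0 is the northern edge and column 0 the western edge, and uses the hypothetical names shift_torus and router_step.

```python
def shift_torus(grid, d_row, d_col):
    """Move every value by (d_row, d_col) with wrap-around at the edges."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[(r + d_row) % rows][(c + d_col) % cols] = grid[r][c]
    return out


def router_step(a_regs, c_regs, dr):
    """One router transfer: A moves east-west, C moves north-south.

    Assumed orientation: row 0 is north, column 0 is west.
    """
    if dr == 0:
        # DR=0: A register contents go west, C register contents go north.
        return shift_torus(a_regs, 0, -1), shift_torus(c_regs, -1, 0)
    # DR=1: A register contents go east, C register contents go south.
    return shift_torus(a_regs, 0, 1), shift_torus(c_regs, 1, 0)


# 3x3 mesh with a single 1 in the north-west PE.
a = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
c = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
new_a, new_c = router_step(a, c, dr=0)
print(new_a)
print(new_c)
```

With the toroidal connections in place, a value leaving the western edge re-enters at the eastern edge, and likewise for north and south.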
4 Global Operations
Often it is necessary to inquire about the status of all the processors. These types of
operations are described as global operations, and require the use of an independent
network. This is the third level network for data transmission that consists of a linear
mesh of OR gates. Figure 9 shows one row of OR gates for N processors. The areas
surrounded by a dashed square are blocks repeated for each processor in the row. A
similar setup is used for columns of processors, connecting the C register and the
north/south routing switches. The register contents of each PE are ORed and the
result of the operation is placed in each processor.
It may be seen in Figure 9 that, on the edges of the mesh, a series of external data
lines have been added to allow for external I/O. I/O may be accessed from either the
east or south ends of the mesh. When the external I/O is activated, the router network
remains on, but any toroidal connection must be removed to allow data to enter from only
one side of the mesh.
Figure 9: Global OR Function Network
The global AND function may also be implemented using the global OR network.
This is done by inverting the inputs to the OR gates, and inverting the output. The
inversion process may be done locally by each processor. This eliminates the need for
a separate network.
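The reuse of the OR network for a global AND is an application of De Morgan's law. A minimal Python sketch follows; the function names are hypothetical, and the list of bits stands in for the register contents of one row of PEs.

```python
def global_or(bits):
    """Model the chain of OR gates: the result is 1 if any PE holds a 1."""
    result = 0
    for b in bits:
        result |= b
    return result


def global_and(bits):
    """Global AND built on the OR network.

    Each PE inverts its own bit locally, the network ORs the inverted
    bits, and the result is inverted again: AND(x) = NOT OR(NOT x).
    """
    inverted = [1 - b for b in bits]
    return 1 - global_or(inverted)


row = [1, 1, 0, 1]
print(global_or(row), global_and(row))
```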
5 Suggested Operating Parameters
Parameters such as the number of floating point operations per second (FLOPS),
instructions per second and execution speed are difficult to describe without using a
working prototype. The following is an approximation to these parameters assuming
we have 1 million processors in a 1000x1000 2-dimensional mesh and a clock running
at 100 MHz. It is also assumed that the delay associated with a command reaching a
processor from the control unit is the same for all processors.
The commands multiply and processor-to-processor crossover exchange are the two
longest commands to execute, each requiring 11 clock cycles for one bit. In other words,
one bit requires a total of 0.11 µs to be processed. For a million processors, this
produces 9.09 TeraFLOPS. To achieve PetaFLOPS operation, at least 110 million
processors would be required.
The bandwidth of such a mesh would be 1000 bits in any one direction. A bit
requires 3 clock cycles to travel to its closest neighbor. In the worst case, a
data bit in a torus-connected mesh would have to travel 1000 steps (500 for each
dimension). Hence it would take 30 µs to transfer data; with 1000 bits moving in
parallel, this gives a slowest data transfer rate of about 33 million bits per second.
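The figures in this section follow from simple arithmetic on the stated assumptions. The Python sketch below reproduces them; the constant and key names are hypothetical, and the throughput is counted as one single-bit operation per processor per 11 cycles, which is the quantity the text reports as FLOPS.

```python
CLOCK_HZ = 100e6           # 100 MHz clock, as assumed in the text
PROCESSORS = 1000 * 1000   # 1000x1000 mesh
CYCLES_PER_BIT = 11        # longest commands, per bit
CYCLES_PER_HOP = 3         # nearest-neighbor transfer
WORST_CASE_HOPS = 1000     # 500 steps in each dimension
MESH_WIDTH_BITS = 1000     # bits moving in parallel in one direction


def estimates():
    """Return the operating-parameter estimates as a dictionary."""
    bit_time = CYCLES_PER_BIT / CLOCK_HZ                 # seconds per bit
    worst_transfer = CYCLES_PER_HOP * WORST_CASE_HOPS / CLOCK_HZ
    return {
        "bit_time_s": bit_time,
        "ops_per_second": PROCESSORS / bit_time,
        "processors_for_peta": 1e15 * bit_time,
        "worst_transfer_s": worst_transfer,
        "slowest_rate_bps": MESH_WIDTH_BITS / worst_transfer,
    }


for name, value in estimates().items():
    print(f"{name}: {value:.4g}")
```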
6 Progress to Date
The group has successfully simulated a 3x3 toroidal mesh of processing elements using
circuit design software. The simulation included all local operations. In addition, the
router and global networks have been designed, and we are currently in the process of
simulating them. Plans are to:
• Simulate a larger network to measure propagation delay for a million processors
- this requires the use of a more powerful software package and/or computer.
• Begin to develop a VLSI prototype - test processor performance and reliability
under real conditions.
• Replace the global OR with a tree structure - speed global functions
• Add more global functions such as XOR, ADD
• Develop a fault detection and avoidance system - detect faulty processors and
bypass them on the network
• Develop a multiple controller SIMD structure - this may be a solution to propa-
gation delay.
Appendix
A Primitive Function Descriptions
The following section contains functions and the lines of code needed to describe their
execution. Each line of code consists of a list of signals, a memory address location
(AA), and a comment line (comment lines start with "//") or the loop statements
"repeat" and "while (i)." It is assumed that "repeat" does not take any time, and
"while (i)" is an end of loop comparison combined with the last instruction of the loop.
All other lines of code take one clock cycle to execute. If a signal is represented in
the line of code, then it takes on a value of 1, otherwise it is 0 for that instruction.
Names written in lower case, such as op1, op2 and sz, are counter and register values
generated by the array control unit when an instruction is being executed by the array.
The notation AA(op1++) means address "op1" will be used in the current
instruction to access array memory address AA, while it is being incremented in the
control unit.
Addition:
    LC                                          // clear C
    repeat
        RR AC0 RC LC LA       AA(op1++) in--    // 1st half add
        RR AC0 OC LC LA       AA(op2++)         // 2nd half add
        LR                    AA(op3++)
    while (in)
Subtraction:
    RI LC                                       // set C
    repeat
        RR AC0 RC LC LA       AA(op1++) in--    // 1st half add
        RR AC0 OC LC LA RI    AA(op2++)         // 2nd half sub
        LR                    AA(op3++)
    while (in)
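The Addition and Subtraction primitives build a full adder from two half adds per bit, with the OC signal ORing the second carry into C. The following Python sketch is a behavioral model under that reading, not a cycle-exact simulation; the function names are hypothetical, operands are LSB-first bit lists, and the cycle count assumes one clock per microcode line.

```python
def half_add(x, y):
    """Half adder: return (sum, carry)."""
    return x ^ y, x & y


def bit_serial_add(a_bits, b_bits, subtract=False):
    """Add (or subtract) two LSB-first bit lists of equal length.

    Returns (result_bits, final_carry, clock_cycles).
    """
    c = 1 if subtract else 0          # LC clears C; RI LC sets C
    cycles = 1
    result = []
    for a, b in zip(a_bits, b_bits):
        if subtract:
            b = 1 - b                 # RI inverts the second operand bit
        acc, c = half_add(a, c)       # 1st half add: RAM bit with C
        acc, carry2 = half_add(b, acc)  # 2nd half add: RAM bit with A
        c = c | carry2                # OC: OR the new carry with C
        result.append(acc)            # LR: store A
        cycles += 3
    return result, c, cycles


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


total, carry, cycles = bit_serial_add(to_bits(5, 4), to_bits(3, 4))
print(from_bits(total), carry, cycles)
```

For a one-bit operand this model takes four cycles, which agrees with the four command lines counted for the ADD simulation in Appendix B.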
Multiplication:
    // op1 -- LSB of multiplier
    // op2 -- LSB of multiplicand
    // op3 -- LSB of product
    // sz1 -- size of multiplier
    // sz2 -- size of multiplicand
    // load multiplicand into product
    // load A with 1st bit of multiplier
    RR LA                     AA(op1++) (sz->in) (sz1->inn) (op3->tp3) (op2->tp2)
    // load M from A, clear A and C
    LM LR LC                  (op3++) (inn--)
    repeat                    // load product masked
        RR LA MA              AA(tp2++) (in--)
        LR                    AA(tp3++)
    while (in)
    LR RC LA LC               AA(tp3)
    // perform multiply
    repeat
        RR LR                 AA(op1++) (sz->in) (op3->tp3) (op2->tp2)
        LM                    (op3++) (inn--)
        repeat                // add in multiplicand masked
            RR AC0 RC LC LA          AA(tp3++) in--
            RR AC0 OC LC MC LA MA    AA(tp2++)
            LR                       AA(tp3++)
        while (in)
        LR RC LA LC           AA(tp3)
    while (inn)
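The Multiplication primitive follows a shift-and-add scheme in which each multiplier bit is loaded into M and the multiplicand is added into the product under that mask. The Python sketch below paraphrases that idea for a single PE rather than translating the microcode line by line; all function names are hypothetical and bit lists are LSB-first.

```python
def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


def bit_serial_multiply(multiplier_bits, multiplicand_bits):
    """Shift-and-add multiply using a masked bit-serial addition."""
    n2 = len(multiplicand_bits)
    product = [0] * (len(multiplier_bits) + n2)
    for shift, m in enumerate(multiplier_bits):   # m models the M register
        carry = 0
        for i, b in enumerate(multiplicand_bits):
            total = product[shift + i] + b + carry
            if m == 1:
                # Unmasked PE: keep the new sum bit and carry.
                product[shift + i] = total & 1
                carry = total >> 1
            # Masked PE (m == 0): the same steps run, but nothing is loaded.
        if m == 1:
            product[shift + n2] = carry           # store the final carry
    return product


p = bit_serial_multiply(to_bits(6, 3), to_bits(5, 3))
print(from_bits(p))
```

In the SIMD array every PE steps through the same instructions; the mask only decides which PEs keep the result, which is why the masked branch above still runs the loop.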
Divide:
    // op1 -- LSB of divisor
    // op2 -- MSB of dividend
    // sz1 -- size of divisor
    // sz2 -- size of dividend
    // clear (sz1) bits more significant than op2
    LA                        ((op2+1)->tp2) (sz1->in1)    // clear A
    repeat
        LR                    AA(op2++) (in1--)
    while (in1)
    LA RR                     AA(op2) (op2->tp2) (sz2->in2) (op1->tp1)
    repeat
        LC LM LR              AA(tp2) (sz1->in1) (op2->tp2) in2--
        repeat
            RR AC0 RC LC LA          AA(tp2++) in1--
            RR AC0 RC LC LA RI MI    AA(tp1++)
            LR                       AA(tp2++)
        while (in1)
        RI AC0 LA             (op2--) (op1->tp1)
    while (in2)
    LR                        AA(tp2)
    // op2 -- LSB of remainder (size=sz1)
    // tp2 -- LSB of quotient (size=sz2)
Logic Functions
And:
    repeat
        RR LC                 AA(op1++) in--
        RR RC AC0 LC          AA(op2++)
        RC LR                 AA(op1++)
    while (in)
Or:
    RI LC
    repeat
        RR LC                 AA(op1++) in--
        RR OC AC0 LC          AA(op2++)
        RC LR                 AA(op1++)
    while (in)
Xor:
    repeat
        RR LA                 AA(op1++) in--
        RR AC0 LA             AA(op2++)
        LR                    AA(op1++)
    while (in)
Not:
    repeat
        RR RI LA              AA(op1++) in--
        LR                    AA(op2++)
    while (in)
Comparison
    // op1 -- MSB of 1st operand
    // op2 -- MSB of 2nd operand
    // C -- greater than flag
    // M -- equal flag
    LA RI
    repeat
        LC MC LA MA RR LM     AA(op1--) sz--
        RR RI LA MA           AA(op2--)
    while (sz)
Equal:
    // A -- result flag
    perform Comparison
    RI MI LA
GreaterThanOrEqual:
    // A -- result flag
    perform Comparison
    perform Equal
    RC AC0 MA
LessThanOrEqual:
    // A -- result flag
    perform Comparison
    perform Equal
    RI RC AC0 MA
NotEqual:
    // A -- result flag
    perform Comparison
    MI LA
GreaterThan:
    // A -- result flag
    perform Comparison
    perform NotEqual
    RC AC0 LA MA
LessThan:
    // A -- result flag
    perform Comparison
    perform NotEqual
    RI RC AC0 LA MA
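The Comparison primitive walks both operands from the MSB down and leaves a "greater than" flag in C and an "equal" flag in M, from which the six relational results are derived. The Python sketch below models only that behavior, not the individual control signals; the function names are hypothetical and operands are MSB-first bit lists of equal length.

```python
def compare_msb_first(x_bits, y_bits):
    """Return (greater, equal) flags after scanning from the MSB down."""
    greater, equal = 0, 1
    for x, y in zip(x_bits, y_bits):
        if equal and x != y:
            # The first differing bit decides the comparison.
            greater = 1 if x > y else 0
            equal = 0
    return greater, equal


def greater_than(x_bits, y_bits):
    greater, _ = compare_msb_first(x_bits, y_bits)
    return greater


def greater_than_or_equal(x_bits, y_bits):
    greater, equal = compare_msb_first(x_bits, y_bits)
    return greater | equal


def less_than(x_bits, y_bits):
    greater, equal = compare_msb_first(x_bits, y_bits)
    return 1 - (greater | equal)


def less_than_or_equal(x_bits, y_bits):
    greater, _ = compare_msb_first(x_bits, y_bits)
    return 1 - greater


print(compare_msb_first([1, 0, 1], [0, 1, 1]))
```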
Memory to Memory Moves:
Move:
    // op1 -- LSB of source
    // op2 -- LSB of destination
    repeat
        LA RR                 AA(op1++) sz--
        LR                    AA(op2++)
    while (sz)
Move (masked):
    // op1 -- LSB of source
    // op2 -- LSB of destination
    // ms -- exchange mask
    LA RR                     AA(ms)
    LM
    repeat
        LA RR                 AA(op2) sz--
        LA MA RR              AA(op1++)
        LR                    AA(op2++)
    while (sz)
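The masked move can be read as: load the old destination bit into A, overwrite A with the source bit only where M is 1, then store A. The Python sketch below models that behavior across several PEs; the function name and the memory layout (one list of bits per PE) are assumptions made for illustration.

```python
def masked_move(memories, src, dst, size, mask_addr):
    """Copy `size` bits from src to dst in every PE whose mask bit is 1.

    memories  -- one list of RAM bits per PE (hypothetical layout)
    src, dst  -- LSB addresses of source and destination
    mask_addr -- address of the mask bit loaded into M
    """
    for mem in memories:              # all PEs execute the same steps
        m = mem[mask_addr]            # LA RR AA(ms), then LM
        for i in range(size):
            a = mem[dst + i]          # LA RR AA(op2): old destination bit
            if m == 1:
                a = mem[src + i]      # LA MA RR AA(op1++): load if M = 1
            mem[dst + i] = a          # LR AA(op2++)
    return memories


pes = [
    [1, 1, 0, 0, 1],   # mask = 1: destination is overwritten
    [0, 1, 0, 0, 1],   # mask = 0: destination is preserved
]
print(masked_move(pes, src=1, dst=3, size=2, mask_addr=0))
```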
Memory to Memory Exchange:
Exchange:
    // op1 -- location 1
    // op2 -- location 2
    repeat
        RR LA                 AA(op1) sz--
        RR LC                 AA(op2)
        LR RC                 AA(op1++)
        LR                    AA(op2++)
    while (sz)
Exchange (masked):
    // op1 -- location 1
    // op2 -- location 2
    // ms -- exchange mask
    LA RR                     AA(ms)
    LM
    repeat
        RR LA MA MC           AA(op1) sz--
        RR MA LC MC           AA(op2)
        LR RC                 AA(op1++)
        LR                    AA(op2++)
    while (sz)
Processor to Processor Moves:
MoveN:
    // op1 -- LSB of source
    // op2 -- LSB of destination
    // sz -- size of operands
    // ds -- distance
    repeat
        LC,RR                 AA(op1++) sz--
        repeat
            LC,AC1            ds--              // if moving north
            // LC,AC1,DR                        // if moving south
        while (ds)
        LR,RC                 AA(op2++)
    while (sz)
MoveW:
    // op1 -- LSB of source
    // op2 -- LSB of destination
    // sz -- size of operands
    // ds -- distance
    repeat
        LA,RR                 AA(op1++) sz--
        repeat
            LA,AC1            ds--              // if moving west
            // LA,AC1,DR                        // if moving east
        while (ds)
        LR                    AA(op2++)
    while (sz)
Processor to Processor Moves (masked):
MoveN (masked):
    // op1 -- LSB of source
    // op2 -- LSB of destination
    // sz -- size of operands
    // ds -- distance
    // ms -- exchange mask
    LA RR                     AA(ms)
    LM
    repeat
        LC RR                 AA(op1++) sz--
        repeat
            LC AC1            ds--              // if moving north
            // LC AC1 DR                        // if moving south
        while (ds)
        MA RR                 AA(op2)
        LR                    AA(op2++)
    while (sz)
MoveW (masked):
    // op1 -- LSB of source
    // op2 -- LSB of destination
    // sz -- size of operands
    // ds -- distance
    // ms -- exchange mask
    LA RR                     AA(ms)
    LM
    repeat
        LA RR                 AA(op1++) sz--
        repeat
            LA AC1            ds--              // if moving west
            // LA AC1 DR                        // if moving east
        while (ds)
        MC RR                 AA(op2)
        LR RC                 AA(op2++)
    while (sz)
Processor to Processor Exchange (masked):
ExchangeNS (masked):
    // op1 -- LSB of source
    // sz -- size of operands
    // ds -- distance
    // nm -- northern mask
    // em -- exchange mask
    LC,RR                     AA(ms)
    LM
    repeat
        LC,RR                 AA(op1) sz-- (ds->dss)    // load C
        repeat
            LC,AC1            dss--                     // north
        while (dss)
        RI,AC0,LC                                       // move C to A
        LC,RR                 AA(op1) (ds->dss)         // load C again
        repeat
            LC,AC1,DR         ds--                      // south
        while (ds)
        LC,RR                 AA(nm)                    // load western mask
        LM
        LC,MC,AC0                                       // load C from A in northern PEs
        LC,RR                 AA(em)                    // load exchange mask
        LM
        MC,RR,RC                                        // load C masked if not exchanging
        LR                    AA(op1++)                 // store A
    while (sz)
Processor to Processor Crossover Exchange:
CrossOverNS:
    // op1 -- LSB of source
    // op2 -- LSB of source
    // sz -- size of operands
    // ds -- distance
    // ms -- northern mask
    LC,RR                     AA(ms)
    LM
    repeat
        LC,RR                 AA(op1) sz-- (ds->dss)    // load A op1
        repeat
            LC,AC1            dss--                     // north
        while (dss)
        RI,AC0,LA                                       // move C to A
        LC,RR                 AA(op2) (ds->dss)         // load C op2
        repeat
            LC,AC1,DR         ds--                      // south
        while (ds)
        LC,MC,RC              AA(op1)                   // load C from AA(op1)
        LR                    AA(op1++)                 // store C
        LA,MA                 AA(op2)                   // load A from AA(op2)
        LR                    AA(op2++)                 // store A
    while (sz)
Global Operations:
Global OR:
    // op1 -- parallel operand
    // op2 -- pointer into scalar operand register (SS)
    repeat
        LC RR                 AA(op1++) sz--
        AC0 AC1               SS(op2++)
    while (sz)
    repeat
        RC FA                 AA(op1++) sz--
        AC0 AC1               SS(op2++)
    while (sz)
B Simulation Results and Timing
The following is an example of a simulation for the ADD function. Figure 10 is the
timing diagram. Two binary numbers, 1 and 0, are added in that sequence, and the
result is stored in memory. The ToRAM line shows the result of this addition, 01.
Figure 10: Simulation Timing Diagram for ADD function
(Traces include FromRAM, ToRAM, Ain, Cout, Cin, Aout, I/O, MI, MM, MC, MA, RI, LM, LC,
LA, RT, RR, DR, No/We, CLK, LR, FA, AC0, FR and AC1.)
The process takes 4 clock cycles to complete - compare this to the primitive com-
mand description, which is composed of 4 command lines each requiring a clock cycle
to complete.
Figure 11 represents a continuous addition process, where 1 is repeatedly added to
itself. The result may be seen as a 10101010... sequence.
Figure 11: Simulation Timing Diagram for continuous ADD function
