Supercomputing on massively parallel bit-serial architectures by Iobst, Ken
N87129135
Supercomputing on Massively Parallel Bit-Serial Architectures
Consider the idea that supercomputing is a synergy of generic
algorithms, languages and architectures and that real breakthroughs in
parallel computing will be achieved by considering all three together in a
simulated software environment. Engineering tradeoffs could be made between
performance, machine transparency, standardization and program portability
before any new machines are actually built. Standardized languages could be
developed for generic subclasses of parallel machines; languages that really
give high performance and encourage free parallel expression and "thinking in
parallel".
My own research on the Goodyear MPP (Massively Parallel Processor)
suggests that high-level parallel languages are practical and can be
designed with powerful new semantics that allow algorithms to be efficiently
mapped to the real machines. For the MPP these semantics include parallel/
associative array selection for both dense and sparse matrices, variable
precision arithmetic to trade accuracy for speed, micro-pipelined "train"
broadcast, and conditional branching at the PE control unit level.
The preliminary design of a FORTRAN-like parallel language for the MPP
has been completed and is being used to write programs to perform sparse
matrix array selection, min/max search, matrix multiplication, Gaussian
elimination on single bit arrays and other generic algorithms. The MPP
timing estimate for Gaussian elimination of a 4K by 4K single bit matrix is
under one second -- the equivalent of approximately 64 billion scalar
operations. Parallel Gauss-Jordan matrix inversion is also being
investigated. The estimated time to invert a 128 X 128, 32 bit real matrix
using full pivoting on the MPP is 50 msec. This is roughly equivalent to a
100 MFLOP scalar rate.
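The two estimates in this paragraph can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check: it assumes an operation count of about n^3 for eliminating an n by n matrix and about 2n^3 floating point operations for a Gauss-Jordan inversion. Both counts are textbook approximations and are not stated in the abstract, and the function names are hypothetical.

```python
# Back-of-envelope check of the quoted estimates.
# Assumptions (not from the abstract): ~n**3 scalar operations to eliminate
# an n x n matrix, ~2*n**3 flops for a Gauss-Jordan inversion.

def elimination_ops(n):
    """Approximate scalar operation count for n x n elimination."""
    return n ** 3

def inversion_mflops(n, seconds):
    """Equivalent scalar rate in MFLOPS for inverting an n x n matrix."""
    return 2 * n ** 3 / seconds / 1e6

ops_4k = elimination_ops(4096)        # 2**36, i.e. 64 * 2**30 operations
rate = inversion_mflops(128, 0.050)   # same order as the quoted 100 MFLOP

print(ops_4k, round(rate, 1))
```

Under these assumptions the 4K by 4K case comes to 64 x 2^30 operations, and the 50 msec inversion corresponds to a rate in the high tens of MFLOPS, consistent with "roughly" 100 MFLOP.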
The MPP is a SIMD machine of 16384 single bit processors arranged in a
128 X 128 array. Individual PE's are interconnected with their four nearest
neighbors. Each PE can address 1024 bits of its own local memory. A 32 bit
shift register in each PE allows for micro-pipelining of long words and
faster partial sum accumulation for multiplication. The machine can execute
160 billion micro-instructions per second which translates to 800 GOPS for
some instructions. Operations include single bit logical, shift, and add as
well as column I/O and one or two dimensional routing in a spiral,
cylinder, or torus. All operations can be directly or indirectly masked.
The logical "or" of one bit per PE (SUMOR) can be used to pass array
information back to the PE control unit for broadcast to other PE's, scalar
I/O or conditional branching. If a second MPP were ever built, it might
look considerably different from the current MPP. For example, it would
certainly have greater memory depth -- at least 64K bits per PE. It might
also have a reconfigurable bit/byte serial ALU, staged PE's for table lookup
arithmetic, and pipelined SUMOR logic.
Ken Iobst
4/15/85
https://ntrs.nasa.gov/search.jsp?R=19870019702 2020-03-20T09:58:02+00:00Z
SUPERCOMPUTING ON MASSIVELY PARALLEL
BIT-SERIAL ARCHITECTURES
o SUPERCOMPUTING DOMAIN
o NEW DIMENSIONS IN PARALLEL COMPUTING
o SOME GENERIC ALGORITHMS
o THE GOODYEAR MPP
o SOME MPP SPECIFIC ALGORITHMS CODED IN A FORTRAN-LIKE
  BIT-SERIAL PROGRAMMING LANGUAGE
o WHAT MIGHT A SECOND GENERATION MPP LOOK LIKE?
SUPERCOMPUTING DOMAIN
[Figure: ALGORITHMS, LANGUAGES and ARCHITECTURES brought together in a
SIMULATED SOFTWARE ENVIRONMENT. Other labels: PARALLEL PROGRAMS, CREATIVE
THOUGHT, STANDARDIZATION, FOR NEW COMPILERS, HARDWARE CAPABILITIES,
DESIGN SPEC'S FOR NEW MACHINES.]
NEW DIMENSIONS IN PARALLEL COMPUTING
[Figure: generic algorithms placed against BIT-SERIAL and PARALLEL axes.
Labels: DIVISION BY 2^N ± 1, MIN/MAX SEARCH, MATRIX MULTIPLICATION,
COLUMN BROADCAST, GAUSSIAN ELIMINATION, LINEAR RECURRENCE.]

PERFORMANCE WITH INTEGER OPERANDS
[Figure: plot for a 128 x 128 ARRAY at a 10 MHz CLOCK RATE, with curves
for ADD ARRAY TO ARRAY, SUBTRACT ARRAY FROM ARRAY, and MULTIPLY ARRAY BY
ARRAY. Horizontal axis: OPERAND WORD LENGTHS, BITS (8 to 64).]
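DIVISION BY 2^N ± 1 EXAMPLE
FROM THE BINOMIAL THEOREM,
$$\frac{1}{1 \pm x} = 1 \mp x + x^2 \mp x^3 + \cdots$$
BY A CHANGE OF VARIABLE z = 1/x THEN
$$\frac{1}{z \pm 1} = \frac{1}{z} \mp \frac{1}{z^2} + \frac{1}{z^3} \mp \cdots$$
NOW LET z = 2^N AND DIVISION BY 2^N ± 1
REDUCES TO A SHORT SEQUENCE OF BINARY
SHIFTS AND ADDS (AND/OR SUBTRACTS),
$$\frac{V}{2^N \pm 1} = \frac{V}{2^N} \mp \frac{V}{2^{2N}} + \frac{V}{2^{3N}} \mp \cdots$$
FOR EXAMPLE, LET V = 237658 AND N = 10
THEN
$$\frac{V}{2^{10} - 1} = \frac{237658}{1023} \approx 232.315$$
AFTER 3 SHIFTS AND 2 ADDS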
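The shift-and-add series can be tried out directly. The following is a minimal sketch, not the MPP implementation: it assumes a fixed-point representation with 32 fractional bits so that the shifted terms keep their fractional part, and the function name and parameters are hypothetical.

```python
def div_pow2_pm1(v, n, sign=-1, terms=3, frac_bits=32):
    """Approximate v / (2**n + sign), sign = -1 or +1, by shifts and adds.

    Uses the series v/2**n -/+ v/2**(2n) + v/2**(3n) ... truncated to
    `terms` terms, in fixed point with `frac_bits` fractional bits
    (the fixed-point format is an assumption of this sketch).
    """
    acc = 0
    term = v << frac_bits              # v as a fixed-point number
    for k in range(terms):
        term >>= n                     # one more division by 2**n
        if sign == -1 or k % 2 == 0:
            acc += term                # divisor 2**n - 1: every term adds
        else:
            acc -= term                # divisor 2**n + 1: signs alternate
    return acc / (1 << frac_bits)

approx = div_pow2_pm1(237658, 10)      # 3 shifts and 2 adds, as on the slide
print(round(approx, 3), round(237658 / 1023, 3))
```

With three terms the truncated series already agrees with the exact quotient to well under one part in a million for this example, which is why so few shifts and adds are needed when N is large.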
THE GOODYEAR MPP
o SIMD MACHINE OF 16384 SINGLE BIT PROCESSORS ARRANGED IN A
  128 X 128 ARRAY
o NEAREST NEIGHBOR INTERCONNECTIVITY
o 1024 BITS OF MEMORY PER PE
o 32 BIT SHIFT REGISTER ALLOWS FOR MICRO-PIPELINING AND
  FASTER MULTIPLICATION
o EXECUTION SPEED OF 160 BILLION MICRO-INSTRUCTIONS PER SECOND
  WHICH TRANSLATES TO 800 GOPS FOR SOME INSTRUCTIONS
o OPERATIONS INCLUDE SINGLE BIT LOGICAL, SHIFT, AND ADD AS
  WELL AS COLUMN I/O AND ONE OR TWO DIMENSIONAL ROUTING IN
  A SPIRAL, CYLINDER, OR TORUS
o ALL OPERATIONS CAN BE DIRECTLY OR INDIRECTLY MASKED
o THE LOGICAL "OR" OF ONE BIT PER PE (SUMOR) CAN BE USED TO
  PASS ARRAY INFORMATION BACK TO THE PE CONTROL UNIT FOR
  BROADCAST, SCALAR I/O, OR CONDITIONAL BRANCHING
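The quoted execution speed follows from the array size and the clock. A minimal sketch, assuming the 10 MHz clock rate given on the integer performance slide and one micro-instruction per PE per cycle; the ratio between 800 GOPS and 160 billion micro-instructions is simply computed here and is not explained in the source.

```python
PES = 128 * 128               # 16384 processing elements
CLOCK_HZ = 10_000_000         # 10 MHz clock rate (from the performance slide)

# One micro-instruction per PE per clock cycle (assumption of this sketch).
micro_ops_per_sec = PES * CLOCK_HZ

# Ratio implied by the two quoted figures, 800 GOPS and 160 billion per second.
ops_per_micro_op = 800e9 / 160e9

print(micro_ops_per_sec)      # about 160 billion per second
print(ops_per_micro_op)
```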
PARALLEL/ASSOCIATIVE ARRAY SELECTION
[Figure: MPP array and CONTROL UNIT]
REAL S(8:24),A[64,256](8:24)
S=SUMOR(A[64,256])
MAXIMUM OF 32 BIT INTEGER ARRAY
(OF UNIQUE VALUES)
      BIT MAX[ ]                  ; DECLARE MAX AS BIT MASK
                                    OVER ALL PE's
      INTEGER A[128,128](0:32)    ; DECLARE A AS A 128 X 128
                                    UNSIGNED INTEGER ARRAY
      MAX=1                       ; INITIALIZE MAX TO 1 OVER
                                    ALL PE's
      DO 1 I=1,32                 ; SCAN BITS IN A FROM MOST
                                    TO LEAST SIGNIFICANT BITS
      IF (SUMOR(A[MAX](I))) MAX=A[MAX](I)
                                  ; REPLACE MAX WITH A NEW
                                    SUBSET OF MAXIMUM VALUES
                                    FOR EACH NON ZERO BIT
                                    PLANE OF A
    1 CONTINUE
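The bit-plane search above can be imitated on a conventional machine. The sketch below is a sequential Python stand-in, not MPP code: a list of booleans plays the role of the MAX mask, `any()` plays the role of SUMOR, and the function name is hypothetical.

```python
def bit_serial_max_mask(values, width=32):
    """Return a boolean mask marking the elements equal to the maximum.

    Scans bit planes from most to least significant, as in the listing:
    whenever some still-selected element has a 1 in the current plane,
    the selection shrinks to those elements.
    """
    mask = [True] * len(values)                    # MAX = 1 over all PEs
    for bit in range(width - 1, -1, -1):           # most to least significant
        plane = [m and bool((v >> bit) & 1) for v, m in zip(values, mask)]
        if any(plane):                             # SUMOR: global OR of one bit per PE
            mask = plane                           # keep only the survivors
    return mask

print(bit_serial_max_mask([5, 12, 3, 9]))          # only the element 12 stays selected
```

The loop runs once per bit plane regardless of the number of elements, which is the point of the technique on an array of single bit processors.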
MAXIMUM OF 32 BIT INTEGER ARRAY
(GENERAL CASE)
      BIT MAX[ ],T[ ](46),INDEX[ ](14)
      INTEGER A[128,128](0:32)
      COMMON /INIT/INDEX
      MAX=1
      T=A.CON.INDEX
      DO 1 I=1,46
      IF (SUMOR(T[MAX](I))) MAX=T[MAX](I)
    1 CONTINUE
SAME ALGORITHM AS BEFORE EXCEPT A ARRAY IS FIRST
CONCATENATED WITH THE PE ADDRESS FIELD TO ENSURE
UNIQUENESS OF RESULT
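The effect of concatenating the PE address can be shown with a small sequential sketch. It assumes the 14 bit address is appended as the low-order bits of each 32 bit value, giving the 46 bit keys of the listing; the function name is hypothetical and a Python list index stands in for the PE address.

```python
def argmax_with_address(values, width=32, addr_bits=14):
    """Return the index of one maximal element using bit-plane search.

    Each value is concatenated with its index (T = A.CON.INDEX), so all keys
    are unique and exactly one element survives. Ties between equal values
    go to the highest index. Requires len(values) <= 2**addr_bits.
    """
    keys = [(v << addr_bits) | i for i, v in enumerate(values)]
    mask = [True] * len(keys)                       # MAX = 1 over all PEs
    for bit in range(width + addr_bits - 1, -1, -1):
        plane = [m and bool((k >> bit) & 1) for k, m in zip(keys, mask)]
        if any(plane):                              # SUMOR
            mask = plane
    return mask.index(True)

print(argmax_with_address([7, 9, 9, 2]))            # one of the two 9s is selected
```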
MATRIX MULTIPLICATION EXAMPLE
      REAL A[8,16,128](8:32),B[8,16,128](8:32),
     &     C[8,16,128](8:32),T[8,16,128](8:32)
      READ A[,,1],B[1,,]
      T=A[,,1...]*B[1...,,]
      C=T[,+,]
      PRINT C[,1,]
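One reading of the listing is a broadcast, multiply, and reduce: A is replicated along the third axis, B along the first, all products are formed in one step, and the middle axis is summed. The sketch below illustrates that reading in plain sequential Python; it is an interpretation of the notation rather than a confirmed description of the language semantics, and the function name is hypothetical.

```python
def broadcast_multiply_reduce(a, b):
    """Multiply an n x k matrix by a k x m matrix via a 3-D product array."""
    n, k, m = len(a), len(b), len(b[0])
    # T[i][p][j] = A[i][p] * B[p][j]: on the MPP every product would be
    # formed at once, one per PE (here they are formed in nested loops).
    t = [[[a[i][p] * b[p][j] for j in range(m)] for p in range(k)]
         for i in range(n)]
    # C = T[,+,]: sum along the middle axis.
    return [[sum(t[i][p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

print(broadcast_multiply_reduce([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With the dimensions declared in the listing, an 8 x 16 matrix times a 16 x 128 matrix needs 8 x 16 x 128 = 16384 products, one per PE.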
COLUMN BROADCAST EXAMPLE
[Figure: 128 X 128 array]
      REAL A[128,128](8:32)
      A=A[,J...]
OR
      REAL A[128,128](8:32)
      BIT M[ ]
      M=[128,128;,J]
      A=A[.NOT.M][,128->]
COLUMN BROADCAST EXAMPLE
PROBLEM: TO BROADCAST A COLUMN OF FLOATING POINT NUMBERS
ACROSS THE MPP ARRAY
SOLUTION #1: WITH PE'S INTERCONNECTED IN AN E/W CYLINDER,
LOAD, SHIFT AND STORE THE 32 BIT VALUES
ACROSS THE ARRAY. THIS TAKES APPROXIMATELY
3 X 32 X 128 = 12288 CYCLES.
SOLUTION #2: WITH PE'S INTERCONNECTED IN AN E/W CYLINDER,
"TRAIN" BROADCAST THE 32 BIT VALUES ACROSS
THE ARRAY. THIS CAN BE VIEWED AS A MICRO-PIPELINING
OPERATION AND TAKES ONLY 207 CYCLES.
THE ALGORITHM IS AS FOLLOWS:
o GET "TRAIN" OF 1 STOP BIT + 32 BIT VALUES
  OUT ONTO THE E/W PE CHANNEL (~ 33 CYCLES)
o CIRCULATE "TRAIN" ONCE AROUND (~ 128 CYCLES).
  DURING THIS PROCESS INDIVIDUAL PE'S WILL
  STORE THE "TRAIN" IN THEIR SHIFT REGISTERS.
  SHIFTING STOPS WHEN THE STOP BIT ENTERS THE
  CONDITIONAL MASK REGISTER OF EACH PE.
o STORE ALL SHIFT REGISTERS (~ 32 CYCLES).
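The cycle counts for the two solutions can be tallied as follows. This is only an arithmetic sketch using the figures on the slide; the three itemized phases of the train broadcast sum to slightly less than the quoted total of 207 cycles, and the source does not say where the remaining cycles go, so the difference is reported rather than explained.

```python
WORD_BITS = 32
ARRAY_WIDTH = 128

# Solution #1: load, shift and store each bit across each column.
naive_cycles = 3 * WORD_BITS * ARRAY_WIDTH

# Solution #2: the three itemized phases of the "train" broadcast.
launch = 1 + WORD_BITS        # stop bit + 32 data bits onto the E/W channel
circulate = ARRAY_WIDTH       # once around the cylinder
store = WORD_BITS             # store all shift registers
train_itemized = launch + circulate + store

QUOTED_TRAIN_CYCLES = 207     # total quoted on the slide
unitemized = QUOTED_TRAIN_CYCLES - train_itemized
speedup = naive_cycles / QUOTED_TRAIN_CYCLES

print(naive_cycles, train_itemized, unitemized, round(speedup, 1))
```

Using the quoted total, the train broadcast is close to 60 times faster than the load, shift and store approach.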
GAUSSIAN ELIMINATION EXAMPLE
SINGLE BIT MATRIX
[Figure: 4000 X 4000 single bit matrix stored as 1000 BIT PLANES on the
128 X 128 MPP ARRAY]
GAUSSIAN ELIMINATION EXAMPLE
      BIT A[4000,4](1000),M[4000,4],USED(4000)
      INTEGER PIVOT(4000,0:14),J1(0:2),J2(0:12),J(0:14)
      EQUIVALENCE (J1,J(1)),(J2,J(3))
      READ A                            ; READ IN ARRAY
      DO 1 I=1,4000                     ; INITIALIZE HISTORY MATRIX
      USED(I)=0
    1 CONTINUE
      DO 7 I=1,4000
      DO 2 J2=1,1000                    ; SEARCH FOR A 1 IN ROW I
      IF (SUMOR(A[I,](J2))) GO TO 3       IN STEPS OF 4 COLUMNS
    2 CONTINUE
      GO TO 8                           ; ROW OF ALL 0's - EXIT
    3 CONTINUE
      DO 4 J1=1,4                       ; FIND WHICH COLUMN OF 4
      IF (SUMOR(A[I,J1](J2))) GO TO 5
    4 CONTINUE
    5 CONTINUE
      PIVOT(I)=J                        ; SAVE HISTORY INFORMATION
      USED(J)=1
      M=A[ ](J2).AND..NOT.[4000,4;I,J1] ; SAVE PIVOT COLUMN IN NEW
                                          MATRIX M, ZEROING THE PIVOT
                                          ROW VALUE
      DO 6 J2=1,1000                    ; ELIMINATE 4 COLUMNS AT A TIME
      A[ ](J2)=A[ ](J2).XOR.M[,J1...]     BY BROADCASTING THE PIVOT
                                          COLUMN ACROSS THE M ARRAY
    6 CONTINUE
    7 CONTINUE
    8 CONTINUE
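For readers without an MPP, elimination on a single bit matrix can be sketched with bit-packed rows and XOR. The code below is a conventional row-oriented elimination over GF(2), swapped in for illustration; it is not the column-broadcast mapping of the listing above, and the function name and the integer bit-packing are assumptions of the sketch.

```python
def gf2_eliminate(rows, ncols):
    """Reduce a single bit matrix over GF(2).

    `rows` is a list of ints; bit j of rows[i] is the entry in row i,
    column j. Returns (reduced_rows, pivot_columns).
    """
    rows = list(rows)
    pivots = []
    r = 0                                      # next pivot row position
    for col in range(ncols):
        bit = 1 << col
        # Search for a row at or below r with a 1 in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i] & bit), None)
        if pivot is None:
            continue                           # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate the column from every other row with one XOR per row.
        for i in range(len(rows)):
            if i != r and rows[i] & bit:
                rows[i] ^= rows[r]
        pivots.append(col)
        r += 1
    return rows, pivots

reduced, pivot_cols = gf2_eliminate([0b011, 0b110, 0b101], 3)
print(reduced, pivot_cols)                     # third row is the XOR of the first two
```

Each XOR here updates a whole row at once, which loosely mirrors how the MPP listing updates many columns per step by operating on bit planes.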
GAUSS-JORDAN MATRIX INVERSION
WITH FULL PIVOTING
PARALLEL DATA STRUCTURES
REAL ARRAYS
U = [ A : I ]   AUGMENTED MATRIX
V = [ 0 : 0 ]   WORKING ARRAY
W = [ 0 : 0 ]   WORKING ARRAY
BIT MASKS
X               PIVOTED ROW/COLUMNS
Y = [ 1 : 1 ]   PIVOT ROW
WHERE I IS THE IDENTITY MATRIX
      1 IS THE UNITY MATRIX
      0 IS THE ZERO MATRIX
OTHER DATA STRUCTURES
SCALARS
DET = 1
PIVOT
PARALLEL APPROACH TO MATRIX INVERSION
REPEAT FOLLOWING STEPS N TIMES
o FIND NEXT PIVOT
o UPDATE DETERMINANT (OPTIONAL)
o ZERO PIVOT ROW AND COLUMN IN X
o ZERO PIVOT ROW IN Y
o NORMALIZE PIVOT ROW IN U
o BROADCAST PIVOT ROW N TIMES INTO V
o BROADCAST PIVOT COLUMN 2N TIMES INTO W
o PERFORM PARALLEL ROW OPERATIONS FOR A
  SINGLE PIVOT
o RESET PIVOT ROW IN Y
THEN REORDER ROWS IN U TO FORM
U = [ I : A^-1 ]
PARALLEL MATRIX INVERSION ALGORITHM
FOR I = 1 TO N
  PIVOT = MAX|U| PER X
  DET = DET * PIVOT
  X[PIVOT ROW AND COLUMN] = 0
  Y[PIVOT ROW] = 0
  U[PIVOT ROW] = U[PIVOT ROW] / PIVOT
  V = U[PIVOT ROW], BROADCAST N TIMES
  W = U[PIVOT COLUMN], BROADCAST 2N TIMES
  U = U - V * W PER Y
  Y[PIVOT ROW] = 1
END I
FOR J = 1 TO N
  FOR I = 1 TO N
    IF U[I,J] = 1 THEN V[J,*] = U[I,*]
  END I
END J
U = V
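The steps above can be followed in a short sequential sketch. It is an illustration under stated assumptions, not the MPP program: Python lists replace the parallel arrays, two boolean lists stand in for the mask X, the pivot positions are recorded directly instead of testing U[I,J] = 1 during reordering, and the determinant is returned as a magnitude because the sign of the row permutation is not tracked. The function name is hypothetical.

```python
def invert_full_pivot(a):
    """Invert a square matrix by Gauss-Jordan elimination with full pivoting.

    Returns (inverse, det_magnitude).
    """
    n = len(a)
    # U = [A : I], the augmented matrix.
    u = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    row_free = [True] * n              # together these play the role of mask X
    col_free = [True] * n
    pivot_row_of_col = [None] * n
    det = 1.0
    for _ in range(n):
        # Find next pivot: largest |U| among rows and columns not yet pivoted.
        p, q = max(((i, j) for i in range(n) if row_free[i]
                    for j in range(n) if col_free[j]),
                   key=lambda ij: abs(u[ij[0]][ij[1]]))
        pivot = u[p][q]
        if pivot == 0.0:
            raise ValueError("matrix is singular")
        det *= pivot                                  # update determinant
        row_free[p] = col_free[q] = False             # zero pivot row/column in X
        pivot_row_of_col[q] = p
        u[p] = [x / pivot for x in u[p]]              # normalize pivot row
        for i in range(n):                            # row operations on all
            if i != p:                                # rows except the pivot row
                w = u[i][q]
                u[i] = [x - w * v for x, v in zip(u[i], u[p])]
    # Reorder rows so the left half becomes I; the right half is then A^-1.
    inverse = [u[pivot_row_of_col[j]][n:] for j in range(n)]
    return inverse, abs(det)

inv, det_mag = invert_full_pivot([[4, 7], [2, 6]])
print(inv, det_mag)
```

On the MPP the inner loop over rows disappears, since the update U = U - V * W is applied to every unmasked element at once; the sequential loop here only shows what that update computes.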
MPP II:
WHAT MIGHT IT LOOK LIKE?
o MUCH GREATER MEMORY DEPTH: AT LEAST 64K BITS
  PER PE, WITH AT LEAST ONE LEVEL OF INDIRECT
  ADDRESSING.
o RECONFIGURABLE BIT/NIBBLE/BYTE SERIAL ALU
o STAGED PE'S FOR TABLE LOOKUP ARITHMETIC.
  HOW MANY TABLES? WHAT SIZE? RAM OR ROM?
o PIPELINED SUMOR LOGIC
