The hypercluster: A parallel processing test-bed architecture for computational mechanics applications by Blech, Richard A.
NASA Technical Memorandum 89823 
The Hypercluster: A Parallel Processing Test-Bed Architecture for Computational Mechanics Applications
Richard A. Blech 
Lewis Research Center 
Cleveland, Ohio 
Prepared for the 
Summer Computer Simulation Conference 
sponsored by the Society for Computer Simulation 
Montreal, Canada, July 27-30, 1987 
https://ntrs.nasa.gov/search.jsp?R=19870011334 2020-03-20T11:54:16+00:00Z
THE HYPERCLUSTER: A PARALLEL PROCESSING TEST-BED ARCHITECTURE
FOR COMPUTATIONAL MECHANICS APPLICATIONS
Richard A. Blech
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio 44135
ABSTRACT 
The development of numerical methods and software tools for parallel processors can be aided through the use of a hardware test-bed. The test-bed architecture must be flexible enough to support investigations into architecture-algorithm interactions. One way to implement a test-bed is to use a commercial parallel processor. Unfortunately, most commercial parallel processors are fixed in their interconnection and/or processor architecture. In this paper, we describe a modified n-cube architecture, called the hypercluster, which is a superset of many other processor and interconnection architectures. The hypercluster is intended to support research into parallel processing of computational fluid and structural mechanics problems which may require a number of different architectural configurations. An example of how a typical partial differential equation solution algorithm maps on to the hypercluster is given.
INTRODUCTION 
Two research areas which are critical to the future progress of aerospace technology are computational fluid mechanics (CFM) and computational structural mechanics (CSM). The practical limits of applications in both of these areas are set by the state-of-the-art in computer architecture and software techniques. Parallel processing is an architectural concept which has the potential for vastly improving the performance of future computer systems. However, the use of parallel processing architectures will require a reassessment of numerical methods and software techniques that are currently used for CFM/CSM. Likewise, CFM/CSM requirements may impact future parallel architectures.

Most CFM/CSM problems require the numerical solution of a system of nonlinear partial differential equations (PDE). There are many algorithms for solving systems of PDE's on computers. The ideal algorithm for a given application minimizes computation time and the amount of memory required. A considerable amount of research has been done in this area for uniprocessor computers, resulting in many accepted approaches for solving various PDE systems. The continuing demand for more computing power and the emergence of supercomputer architectures employing parallel processors has prompted research into new approaches to solving systems of PDE's (Ortega and Voigt 1985). The goal of that research is the development of higher performance CFM/CSM codes that can effectively utilize the new parallel architectures.

The development of algorithms for parallel processors is not a straightforward task. Algorithms for parallel processors must be able to be partitioned into independent tasks that can be allocated to multiple processors for simultaneous execution. A high degree of parallelism does not guarantee higher performance, however. The development of parallel algorithms can be complicated by the hardware aspects of parallel processors. The communication mechanism between processors is one example. The algorithm should be analyzed to determine if fast, tightly coupled communication between processors is required, or if a slower, loosely coupled mechanism will suffice.

The individual processor architecture can also impact the performance of an algorithm. For example, a vector processor architecture operates most effectively by performing a single mathematical operation on large arrays of data. The performance of a vector processor is dependent on the length of the data arrays, or vectors. Therefore, it is desirable to develop algorithms which make use of long vector operations. If parallel vector processors are used, then any partitioning of the numerical method should avoid shortening the vector length to the point of degrading performance.

The memory hierarchy employed in a parallel processor is another consideration in the development of parallel algorithms. The use of local processor memory and/or global shared memory are examples of memory hierarchy within a parallel architecture. Cache memory, interleaved memory and mass storage are levels of the memory hierarchy local to the processors in a parallel processing system. An efficient parallel algorithm must make optimum use of the existing memory hierarchy. This requires maximizing the amount of computation occurring in the lowest (i.e., fastest) level of the hierarchy.

To summarize, the development of parallel algorithms requires cognizance of a large number of hardware and architectural parameters. This makes the evaluation of algorithm performance a critical step in the development process. To some extent, this can be done analytically. A detailed analytical performance evaluation would be cumbersome, however, especially if the number of hardware and architectural parameters is high. A preferable approach would be the evaluation using a hardware test-bed. Then hardware and architectural parameters could be directly implemented, or efficiently emulated.
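
The vector-length consideration raised in this section can be made concrete with a small calculation. The sketch below is not taken from the paper: it assumes the two-parameter timing model usually associated with Hockney and Jesshope, in which a vector operation of length n takes time proportional to (n + n_half), where n_half is the vector length at which half the peak rate is reached. The function names and the numerical values are illustrative only, and communication cost is ignored.

```python
def vector_rate(n: int, r_inf: float, n_half: float) -> float:
    """Achieved rate for a length-n vector operation, assuming t = (n + n_half) / r_inf."""
    return r_inf * n / (n + n_half)


def partitioned_speedup(n: int, processors: int, n_half: float) -> float:
    """Speedup when a length-n vector is split evenly over several vector processors.

    Assumption: each processor pays the same start-up penalty (n_half) on its
    shorter sub-vector, and interprocessor communication is free.
    """
    t_single = n + n_half
    t_parallel = n / processors + n_half
    return t_single / t_parallel


if __name__ == "__main__":
    # Hypothetical machine with n_half = 100 elements and four vector processors.
    for n in (100, 1000, 10000):
        print(n, round(partitioned_speedup(n, 4, n_half=100.0), 2))
```

Under these assumptions, splitting a 100-element vector over four processors gives a speedup of only about 1.6, while a 10000-element vector gives about 3.9, which is the degradation from short vectors that the text warns against.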
A research effort at the NASA Lewis Research Center is devoted to studying the application of parallel processing to CFM/CSM. This effort is an outgrowth of work previously done on the Real-Time Multiprocessor Simulator (RTMPS) project (Arpasi 1985; Blech and Arpasi 1985; Cole 1985; Arpasi and Milner 1985). To facilitate the investigation of algorithm-architecture interactions and the evaluation of software tools, a reconfigurable hardware test-bed is being assembled. This paper discusses the requirements driving the design of the parallel processing test-bed and describes the test-bed architecture being implemented at NASA Lewis. An example of how a typical PDE solution algorithm would map on to the architecture is presented.
Parallel Processing Test-Bed Requirements
In general, the purpose of a parallel processing test-bed is to support the development of parallel algorithms and the evaluation of software tools. Since many of the architectural requirements for a particular algorithm or software tool usually are not known, the test-bed must provide a degree of flexibility in configuration. This suggests some of the following desirable capabilities for any parallel processing test-bed.

(1) Ability to incorporate processors of various architectures within the parallel processing configuration. This allows evaluation of how the architecture and performance of the individual processing elements within a parallel system architecture can affect overall performance. Some processor architectural characteristics to be considered are vector processing capability, memory configuration (cache memory, interleaving), and specialized coprocessors (floating-point, graphics).

(2) Ability to emulate a wide variety of parallel processing architectures. The impact of interprocessor communication overhead is a critical issue in parallel processing research. The ability to vary the system architecture (and thereby the interprocessor communication paths) allows investigations into architecture-algorithm interactions.

(3) Ability to modify the I/O structure of the parallel processor. Input and/or output processing are the dominant time consumers for some applications. The ability to modify or augment the I/O structure allows researching of distributed database techniques and partitioning of the I/O task.

(4) Capability to expand to a large scale parallel system. This is necessary to evaluate algorithms requiring a large number of processors for effectiveness.
The usefulness of a parallel processing research test-bed having the above characteristics was recognized by researchers involved in IBM's Research Parallel Processor Project (RP3) (Pfister 1985). However, the RP3 architecture is neither commercially available nor easy to replicate. Commercial versions of some parallel processing architectures have recently become available. In most cases, the architecture is fixed and/or the user has limited capabilities for architectural or processor modifications. For example, Alliant's FX/8 machine (Alliant Computer Systems 1985) has multiple vector processors interconnected through shared memory. However, the current architecture is limited to eight processors. The BBN Butterfly (Crowther et al. 1985) has a large number of scalar processors communicating through shared memory, but lacks vector processing capability. Flexible Computers' FLEX/32 (Manuel 1985) combines both message passing and shared memory communication mechanisms, but again lacks vector processing capability.

The n-cube architecture, also known as the hypercube, is becoming a popular architecture due to its expandability and capability to emulate other architectures. A hypercube that has vector processing capability at each node (two commercial versions of which are discussed in Gustafson et al. 1986 and Robinson 1985) meets several of the requirements for a parallel processing test-bed. The strong points are: 1) an architecture which is expandable in a systematic manner, 2) vector and scalar processing capability at each node, and 3) the ability to emulate a limited number of other architectures. Emulation of shared memory architectures on the hypercube is difficult, however. This is especially true for applications exhibiting a fine-grained parallel structure, such as linear algebra. The difficulty can be traced to the interprocessor communications in the hypercube, which exhibit high overhead for two reasons. First, a routing algorithm is required for all but those applications which directly map on to the hypercube network. This consumes processor resources since the communication path from one processor to another must be calculated. Second, most commercial versions of the hypercube implement the network interconnections with fixed serial links. The net throughput rate on these links is relatively low when packetization and software protocol are taken into account. In addition, the link connections cannot be reconfigured.
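
To illustrate the path calculation that a hypercube routing algorithm must perform, the following minimal sketch assumes dimension-ordered (often called e-cube) routing, a common hypercube scheme. The paper does not say which routing algorithm the commercial machines use, so the scheme and the function names here are assumptions made for illustration.

```python
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    """Nodes whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def ecube_route(src: int, dst: int, dim: int) -> list[int]:
    """Dimension-ordered route: fix the differing address bits from lowest to highest."""
    path = [src]
    current = src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit  # traverse the link along this dimension
            path.append(current)
    return path


if __name__ == "__main__":
    print(hypercube_neighbors(0, 2))  # [1, 2]
    print(ecube_route(0, 3, 2))       # [0, 1, 3]
```

Every message between nodes that are not direct neighbors needs such a calculation plus one store-and-forward step per hop, which is one source of the overhead described above.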
Hypercluster Architecture
A modified version of the hypercube architecture, called the hypercluster, is proposed to overcome the difficulties described above. The hypercluster retains the hypercube network structure between processor nodes, but each node now consists of multiple processors communicating through a shared memory. This concept is illustrated in Figure 1, for a dimension 2 (D-2) cube. Each circle labelled 'M' represents a shared memory at a node. Each square labelled 'P' is a processing element interconnected to the shared memory in some fashion. Processors can have local memory in addition to shared memory. Communication links between nodes form the hypercluster network. The hypercluster supports both tightly coupled interprocessor communication via shared memory (within a node) and loosely coupled communication through the hypercube network (between nodes). The hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. An arbitrary number of processors may be assigned to a cluster, limited only by the hardware constraints of the shared memory interconnect and/or power requirements.

Figure 2 shows a more detailed diagram of the D-2 hypercluster configuration being implemented at the NASA Lewis Research Center. The nodes consist of multiple board-level computers interconnected by a commercial bus. Although a bus is not the highest performance shared-memory interconnect mechanism available, it does allow for convenient implementation. In addition, the use of a commercial bus allows a variety of processor architectures to be incorporated within a node. Thus each node has an architectural 'personality' determined by the type of processor boards connected to the bus. The NASA Lewis D-2 hypercluster has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general purpose microcomputer boards. The incorporation of vector processors is crucial in the investigation of CFM/CSM algorithms because many CFM/CSM algorithms contain large arrays of independent computations that are best handled by a vector architecture. The availability of multiple vector processors allows very large arrays of calculations to be broken up and distributed for a parallel solution.

There are two types of communication links. Internode communication links form the hypercluster network as described before. Additional links provide communication paths between each node and a front-end processor (FEP). The FEP allows a user to interact with the hypercluster. Each communication link consists of two control processors (CP) interconnected by a dual-port memory. The CP's coordinate communication over the links and supervise the operation of processors within a node. Executive software in each CP performs these functions. For the D-2 hypercluster, it is both practical and advantageous to have a communication link between each node and the FEP. However, as the hypercluster is expanded to more nodes, the associated size and cost constraints make this approach impractical. In that case, most nodes will not have an FEP link. Software will then be necessary to route information from nodes with an FEP link to those without.

Shared memory within a node consists of memory boards connected to the node's bus, and/or dual-ported memory on the processor boards. Dual-ported memory has become a standard feature on many commercial computer boards, and is particularly useful in the hypercluster environment. Through software, memory segments can be allocated as local to a processor, global to all processors in a node, or a combination of local and global segments. This allows emulation of the different memory hierarchies used in parallel processing systems.

Each node of the hypercluster can have its own local I/O capability. For example, each node can have a disk control processor and hard disk drive. This arrangement would allow research in distributed I/O and database techniques, aimed at eliminating the I/O bottleneck present in many applications.
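
The two communication levels of the hypercluster described in this section can be summarized in a short model. This is a sketch under stated assumptions, not the NASA Lewis software: the `Processor` class, the function names, and the use of the Hamming distance between node addresses as the hop count are all introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    node: int   # hypercube address of the cluster (node) holding this processor
    index: int  # position of the processor within its cluster


def build_hypercluster(dim: int, procs_per_node: int) -> list[Processor]:
    """Enumerate the processors of a dimension-`dim` hypercluster."""
    return [Processor(node, i)
            for node in range(2 ** dim)
            for i in range(procs_per_node)]


def communication_path(a: Processor, b: Processor) -> tuple[str, int]:
    """Return (mechanism, network hops) for a message from a to b."""
    if a.node == b.node:
        # Same cluster: tightly coupled exchange through the node's shared memory.
        return ("shared memory", 0)
    # Different clusters: loosely coupled exchange over the hypercube links.
    hops = bin(a.node ^ b.node).count("1")  # Hamming distance between node addresses
    return ("hypercube network", hops)


if __name__ == "__main__":
    # D-2 hypercluster with four processor boards per node, as in the text.
    processors = build_hypercluster(dim=2, procs_per_node=4)
    print(len(processors))
    print(communication_path(Processor(0, 0), Processor(0, 3)))
    print(communication_path(Processor(0, 0), Processor(3, 1)))
```

The model deliberately omits the control processors and front-end links; it only shows which of the two mechanisms a given processor pair would use.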
An Algorithm Example
The alternating direction implicit (ADI) algorithm is a technique commonly used for the solution of partial differential equations (PDE) (Gerald 1980). The two stages of the ADI algorithm are shown in Figure 3 for a 4 by 4 grid. In the first stage, equations are formed which are implicit (i.e., depend on current time step information) in the X direction only. Thus a coefficient matrix A and vector b can be generated to form the system Ax = b which describes one row of points. Several such systems are formed to describe the entire grid. The matrix A is a tridiagonal matrix (the matrix is block tridiagonal if several PDE's are solved at each grid point). The second stage of the ADI algorithm begins after the equations from the first stage are solved. It is identical to the first stage except that now the equations are implicit in the Y direction only.

Each system of equations for a row or column is independent (in the current time step) of information from neighboring rows or columns. Only information from past time steps for neighboring rows or columns is used. This characteristic of the ADI algorithm makes it particularly attractive for solution on a parallel processor. The solution of rows or columns can be done in parallel. Each row or column can be allocated to a processor, if sufficient processors are available. Otherwise, groups of rows or columns must be formed, where the number of groups would equal the number of available processors.

The first stage of the ADI algorithm would map onto the hypercluster as shown in Figure 4. For the simple 4 by 4 grid and D-2 hypercluster shown, each row would be solved on a hypercluster node. If the grid were larger, groups of rows would be assigned to the node, or more nodes could be added. The processor allocation described thus far could be accomplished on any hypercube implementation. The advantage of the hypercluster architecture for the ADI algorithm is the ability to apply the tightly coupled multiple processors within each node to the simultaneous solution of the equation systems. The allocation of rows to hypercluster nodes results in one or more block tridiagonal equation systems which must be solved at each node. After the parallel solution of the equation sets is completed, information to and from neighboring rows is transmitted between nodes via the hypercube network. Then the second stage of the ADI algorithm can proceed with columns allocated to hypercluster nodes. After solution of both stages, the results are checked for convergence. If convergence has not been achieved, the whole process is repeated. The parallel ADI algorithm is outlined by the pseudocode in Figure 5.

The ADI algorithm is only one of many algorithms available to solve a system of partial differential equations. The partitioning of the calculations for the ADI algorithm described above is one of many possible methods. This example has been given only to demonstrate the usefulness of the hypercluster architecture for implementing a particular algorithm. A number of parallel processing algorithms for solving partial differential equations have been proposed in Hockney and Jesshope 1981. Some of these algorithms are also vectorizable. Future work using the hypercluster as a test-bed will attempt to determine which algorithms (combined with the appropriate architecture) are optimum for CFM/CSM applications.

CONCLUDING REMARKS

The hypercluster architecture is intended to provide a reconfigurable test-bed on which various parallel processing algorithms, programming and operating tools can be developed. There is still a considerable amount of uncertainty as to the optimum parallel processing architecture for specific applications such as CFM and CSM. There is also a definite lack of programming and operating software that will allow researchers to easily take advantage of parallel processing. Future work using the hypercluster test-bed will attempt to address some of these issues. It will allow CFM/CSM research at the NASA Lewis Research Center to readily adapt to the rapidly developing discipline of parallel processing.
REFERENCES 
1. Ortega, J. M. and Voigt, R. G., "Solution of Partial Differential Equations on Parallel and Vector Computers," SIAM Review, Vol. 27, No. 2, June 1985.
2. Arpasi, D. J., "Real-Time Multiprocessor Programming Language (RTMPL) Users Manual," NASA Technical Paper 2422, June 1985.
3. Blech, R. A. and Arpasi, D. J., "Hardware for a Real-Time Multiprocessor Simulator," NASA TM 83805, January 1985.
4. Cole, G. L., "Operating System for a Real-Time Multiprocessor Propulsion System Simulator," NASA Technical Paper 2426, January 1985.
5. Arpasi, D. J. and Milner, E. J., "Partitioning and Packing Mathematical Simulation Models for Calculation on Parallel Computers," NASA TM 87170, November 1985.
6. Pfister, G. F., et al., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," Proceedings of the 1985 International Conference on Parallel Processing, August 1985.
7. Alliant Computer Systems, "FX/Series Product Summary," June 1985.
8. Crowther, W., et al., "Performance Measurements on a 128-Node Butterfly Parallel Processor," Proceedings of the 1985 International Conference on Parallel Processing, August 1985.
9. Manuel, Thomas, "Parallel Machine Expands Indefinitely," Electronics Week, May 13, 1985.
10. Gustafson, J., et al., "The Architecture of a Homogeneous Vector Supercomputer," Proceedings of the 1986 International Conference on Parallel Processing, August 1986.
11. Robinson, Brian, "Hypercube Sprouts Vector Processors to Challenge Supercomputers," Electronic Engineering Times, April 1985.
12. Gerald, C. F., "Applied Numerical Analysis," Addison-Wesley Publishing Co., 1980.
13. Hockney, R. W. and Jesshope, C. R., "Parallel Computers," Adam Hilger Ltd., Bristol, 1981.
P = PROCESSOR    M = SHARED MEMORY
FIGURE 1. - TWO-DIMENSIONAL (D-2) HYPERCLUSTER CONFIGURATION.
M = SHARED MEMORY
CP = CONTROL PROCESSOR
SP = SCALAR PROCESSOR
VP = VECTOR PROCESSOR
FEP = FRONT-END PROCESSOR
FIGURE 2. - NASA LEWIS IMPLEMENTATION OF D-2 HYPERCLUSTER.
FIGURE 3. - SWEEP PATTERN FOR ALTERNATING DIRECTION IMPLICIT METHOD. (Axis label: X DIRECTION.)
REPEAT
  FOR ROW = 1 TO NROWS DO IN PARALLEL
  BEGIN
    CALCULATE COEFFICIENTS OF MATRIX A, VECTOR b
    SOLVE A x^(k+1) = b VIA PARALLEL ALGORITHM
  END
  TRANSFER DATA TO NEIGHBORING NODES
  FOR COLUMN = 1 TO NCOLUMNS DO IN PARALLEL
  BEGIN
    CALCULATE COEFFICIENTS OF MATRIX A, VECTOR b
    SOLVE A x^(k+2) = b VIA PARALLEL ALGORITHM
  END
  TRANSFER DATA TO NEIGHBORING NODES
UNTIL Δx < ε
FIGURE 5. - PSEUDOCODE FOR PARALLEL ADI ALGORITHM.
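
The loop structure of Figure 5 can be sketched in Python for a small model problem. Everything specific in this sketch is an assumption rather than the paper's implementation: the model problem is Laplace's equation on a square grid with fixed boundary values, the two stages use a Peaceman-Rachford style iteration with an illustrative parameter `rho`, each tridiagonal system is solved with the Thomas algorithm, and a thread pool stands in for the hypercluster nodes purely to show that the row and column systems are independent.

```python
from concurrent.futures import ThreadPoolExecutor


def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_line(rho, rhs, bc_first, bc_last):
    """Solve -x[i-1] + (rho + 2) x[i] - x[i+1] = rhs[i] with fixed end values."""
    n = len(rhs)
    rhs = list(rhs)
    rhs[0] += bc_first
    rhs[-1] += bc_last
    return thomas_solve([-1.0] * n, [rho + 2.0] * n, [-1.0] * n, rhs)


def adi_laplace(u, rho=1.0, tol=1e-10, max_iters=500, workers=4):
    """Iterate both ADI stages until the largest change per sweep is below tol."""
    n = len(u) - 2  # interior points per direction (square grid assumed)
    inner = range(1, n + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for iteration in range(1, max_iters + 1):
            # Stage 1: implicit in X, one independent system per row.
            def row_task(r, grid=u):
                rhs = [(rho - 2.0) * grid[r][c] + grid[r - 1][c] + grid[r + 1][c]
                       for c in inner]
                return solve_line(rho, rhs, grid[r][0], grid[r][n + 1])
            half = [row[:] for row in u]
            for r, line in zip(inner, pool.map(row_task, inner)):
                half[r][1:n + 1] = line
            # Stage 2: implicit in Y, one independent system per column.
            def col_task(c, grid=half):
                rhs = [(rho - 2.0) * grid[r][c] + grid[r][c - 1] + grid[r][c + 1]
                       for r in inner]
                return solve_line(rho, rhs, grid[0][c], grid[n + 1][c])
            new = [row[:] for row in half]
            for c, line in zip(inner, pool.map(col_task, inner)):
                for r, value in zip(inner, line):
                    new[r][c] = value
            change = max(abs(new[r][c] - u[r][c]) for r in inner for c in inner)
            u = new
            if change < tol:  # convergence check after both stages
                return u, iteration
    return u, max_iters


def demo_grid(n=4):
    """Grid with n by n interior points; boundary value equals the column index."""
    size = n + 2
    return [[float(c) if r in (0, size - 1) or c in (0, size - 1) else 0.0
             for c in range(size)] for r in range(size)]


if __name__ == "__main__":
    solution, iterations = adi_laplace(demo_grid(4))
    print(iterations, [round(v, 6) for v in solution[2]])
```

With the boundary values of `demo_grid`, the converged interior values should approach the column index, since a linear function satisfies the discrete Laplace equation exactly. Python threads do not give a real speedup here; on the hypercluster the row and column tasks would instead be distributed over nodes, with the data transfer steps of Figure 5 between stages.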
REPORT DOCUMENTATION PAGE
Report No.: NASA TM-89823
Title and Subtitle: The Hypercluster: A Parallel Processing Test-Bed Architecture for Computational Mechanics Applications
Author: Richard A. Blech
Performing Organization Code: 505-62-21
Performing Organization Report No.: E-3469
Performing Organization Name and Address: National Aeronautics and Space Administration, Lewis Research Center, Cleveland, Ohio 44135
Type of Report and Period Covered: Technical Memorandum
Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Washington, D.C. 20546
Supplementary Notes: Prepared for the Summer Computer Simulation Conference sponsored by the Society for Computer Simulation, Montreal, Canada, July 27-30, 1987.
Distribution Statement: Unclassified - unlimited. STAR Category 62.
Security Classification (of this report): Unclassified
Security Classification (of this page): Unclassified
No. of pages: 10
Price: A02
For sale by the National Technical Information Service, Springfield, Virginia 22161