The Cube-Connected-Cycles: A Versatile Network for Parallel Computation by Preparata, Franco P. & Vuillemin, Jean
ACT-20 NOVEMBER, 1979
COORDINATED SCIENCE LABORATORY
APPLIED COMPUTATION THEORY GROUP
THE CUBE-CONNECTED-CYCLES:
A VERSATILE NETWORK
FOR PARALLEL COMPUTATION
FRANCO P. PREPARATA
JEAN VUILLEMIN
APPROVED FOR PUBLIC RELEASE. DISTRIBUTION UNLIMITED.
REPORT R-874 UILU-ENG 80-2206
UNIVERSITY OF ILLINOIS - URBANA, ILLINOIS
REPORT DOCUMENTATION PAGE
Security classification of this page: UNCLASSIFIED
4. Title (and subtitle): THE CUBE-CONNECTED-CYCLES: A VERSATILE NETWORK FOR PARALLEL COMPUTATION
5. Type of report & period covered: Technical Report
6. Performing org. report number: R-874 (ACT-20); UILU-ENG 80-2206
7. Author(s): Franco P. Preparata, Jean Vuillemin
8. Contract or grant number(s): NSF MCS-78-13642; N00014-79-C-0424
9. Performing organization name and address: Coordinated Science Laboratory,
   University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
11. Controlling office name and address: National Science Foundation;
    Joint Services Electronics Program Contract
12. Report date: November, 1979
13. Number of pages: 27
15. Security class (of this report): UNCLASSIFIED
16. Distribution statement (of this report): Approved for public release; distribution unlimited
19. Key words: Parallel processing, VLSI design, sorting, Fourier Transform
20. Abstract: We introduce an interconnection pattern of processing elements, the
    cube-connected-cycles (CCC), which can be used as a general purpose parallel
    processor. Because its design complies with present technological constraints,
    the CCC can also be used in the layout of many specialized large scale
    integrated circuits (VLSI). By combining the principles of parallelism and
    pipelining, the CCC can emulate the cube-connected machine and the perfect
    shuffle with no significant degradation of performance but with a more compact
    structure. We describe in detail how to program the CCC for efficiently
    solving a large class of problems, which includes Fast-Fourier-Transform,
    sorting, permutations, and derived algorithms.
THE CUBE-CONNECTED-CYCLES: A VERSATILE NETWORK
FOR PARALLEL COMPUTATION
by
Franco P. Preparata and Jean Vuillemin
This work was supported in part by the National Science Foundation
under Grant MCS 78-13642 and Joint Services Electronics Program under
Contract N00014-79-C-0424.
Reproduction in whole or in part is permitted for any purpose of
the United States Government.
A preliminary version of this report was issued in June 1979 by I.R.I.A.,
Institut de Recherche d'Informatique et d'Automatique, 78150 Rocquencourt,
France.
Approved for public release. Distribution unlimited.
THE CUBE-CONNECTED-CYCLES: A VERSATILE NETWORK FOR PARALLEL COMPUTATION
Franco P. Preparata
Coordinated Science Laboratory, University of Illinois, Urbana, Illinois 61801
Jean Vuillemin
Laboratoire de Recherche en Informatique, Université de Paris-Sud, 91405 Orsay, France
Abstract: We introduce an interconnection pattern of processing elements,
the cube-connected-cycles (CCC), which can be used as a general purpose
parallel processor. Because its design complies with present technological
constraints, the CCC can also be used in the layout of many specialized
large scale integrated circuits (VLSI). By combining the principles of
parallelism and pipelining, the CCC can emulate the cube-connected machine
and the perfect shuffle with no significant degradation of performance but
with a more compact structure. We describe in detail how to program the CCC
for efficiently solving a large class of problems, which includes
Fast-Fourier-Transform, sorting, permutations, and derived algorithms.
Keywords: Parallel processing, VLSI design, sorting, Fourier Transform.
CR C ategories :
† This work was partially supported by National Science Foundation
Grant MCS-78-13642, by the Joint Services Electronics Program Contract
DAAG-29-78-C-0016, by I.R.I.A., Institut de Recherche en Informatique
et Automatique, 78150 Le Chesnay, France, and by ERA 452 "al Khowarizmi"
of Centre National de la Recherche Scientifique.
A preliminary version of this paper has been presented at the
20th Annual Symposium on Foundations of Computer Science, Puerto Rico,
Oct. 1979.
1. INTRODUCTION
The great technological progress embodied in very large scale
integration (VLSI) of electronic circuits has made it possible to
conceive large systems of processing elements cooperating in the
execution of parallel algorithms. This has motivated considerable research
interest in parallel computation. Unfortunately, here the situation is
very different from that of serial computation, where the RAM machine
[1] represents a universally accepted model. The difficulty of choosing
a specific interconnection is frequently bypassed by assuming a model
(shared-memory-machine) where each pair of processors is connected (or
an equivalent system) [2-5]. Although not without merit, because it aims
at uncovering the inherent data-dependence of given problems, this
approach ignores the technological constraints of VLSI, particularly as
regards the communication among the processing elements [6]. At the
opposite end, other workers [7-11] suggest that processor interconnection
should be limited to planar links between topologically neighboring cells
(arrays or meshes). Such designs are certainly well suited for current
VLSI technology, and they have cleverly been used in implementing algorithms for
matrices or graph problems [9-12], for example. This type of connection,
however, is not suited for efficiently implementing algorithms for
various fundamental problems, such as sorting and convolution. Indeed,
good algorithms for solving these problems intrinsically require data
movement between processors which are topologically far apart; for
example, sorting on an n processor array such as ILLIAC IV requires
time $\Omega(\sqrt{n})$ [8].
The purpose of the paper is to propose and analyze a new interconnection
of processors, called the cube-connected-cycles, which is remarkably
suited for implementing efficient algorithms such as Fast-Fourier-Transform
(FFT), sorting, etc. The geometric structure underlying the
interconnections is that of the k-dimensional cube. This structure, which
has already been studied in relation to parallel computation [13], is
not readily usable for VLSI design, since each of the $2^k$ processors is
connected to k other processors.
By combining parallelism and pipelining we are able to achieve the
following results:
(1) The number of connections per processor is reduced to 3.
(2) Processing time is not significantly increased with respect
to that achievable on the k-cube structure.
(3) Programs for the individual modules are obtained in a systematic
way from a standard description of the global algorithms.
(4) The overall structure complies with the basic requirements of
VLSI technology: modularity, ease of layout, simplicity of communication
among the processing elements, simplicity in timing and control of the
entire system [14]. We also propose a wire layout of the CCC, which can
be physically realized with two orthogonal layers of wires. This layout
is optimal for several problems, according to a recently proposed VLSI
model [18].
(5) Finally we are able, without resorting to any drastic departure
from classical algol-like languages, to provide fully accurate and
hopefully easily understandable descriptions of our parallel programs.
This is a favorable sign that parallel processing may possibly be endowed
with suitable high level programming languages.
This paper is organized as follows. Section 2 introduces a class
of algorithms comprising many important applications, such as merging,
sorting, Fourier Transform, and data rearrangement. Section 3
presents models of module connections, including the CCC, allowing for
efficient parallel execution of the algorithms in Section 2. Section 4
describes the implementation of such algorithms on the CCC, and Section 5
is devoted to optimality considerations regarding a layout of the machine
for VLSI realizations.
2. A CLASS OF HIGHLY PARALLEL ALGORITHMS
To describe our algorithms, assume that input data $t_0, t_1, \ldots, t_{n-1}$
are stored respectively in storage locations T[0], T[1], ..., T[n-1], and
that $n = 2^k$, i.e., the number of inputs is a power of 2. We say that an
algorithm is in the DESCEND class if it performs a sequence of basic
operations on data which are successively $2^{k-1}, \ldots, 2^j, \ldots, 2^0 = 1$
locations apart. Each basic operation OPER(m,j;U,V) modifies the two data
items present in storage locations U and V; the computation performed
affects only the contents of U,V and it may depend upon parameters m and
j, which are integers $0 \le m < n$, $0 \le j < k$.
Algorithms in the DESCEND class are then specified as:
proc DESCEND
  for j ← k-1 step -1 until j = 0
  do foreach m: 0 ≤ m < n
     pardo if bit_j(m) = 0 then OPER(m, j; T[m], T[m+2^j]) fi
     odpar
  od
corp DESCEND.
Here, $\mathrm{bit}_j(m)$ is the coefficient of $2^j$ in the binary representation of
$m = \sum_{j \ge 0} \mathrm{bit}_j(m)\, 2^j$. The language construct foreach m: <cond(m)> pardo
<action> odpar obviously indicates that all instructions <action>
corresponding to values of m satisfying <cond(m)> can be performed simultaneously.
On machines where such parallelism can be realized, DESCEND algorithms run
in $k = \log_2(n)$ elementary steps.
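The control structure of DESCEND, and of its dual ASCEND, can be sketched as a sequential simulation. This is only an illustration: the function names `descend` and `ascend` and the calling convention for `oper` (a Python function returning the new pair of values) are assumptions of this sketch, not part of the report.

```python
def bit(j, m):
    """Coefficient of 2**j in the binary representation of m."""
    return (m >> j) & 1


def _run(T, oper, order):
    """Apply oper to pairs 2**j apart, for each j in the given order."""
    steps = 0
    for j in order:
        # The pairs (m, m + 2**j) with bit_j(m) = 0 are disjoint, so this
        # inner loop models what a parallel machine does in one step.
        for m in range(len(T)):
            if bit(j, m) == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
        steps += 1
    return steps


def descend(T, oper):
    """Simulate DESCEND on a list T of length n = 2**k; returns k."""
    k = len(T).bit_length() - 1
    assert len(T) == 1 << k, "n must be a power of 2"
    return _run(T, oper, range(k - 1, -1, -1))


def ascend(T, oper):
    """Simulate ASCEND (j = 0, 1, ..., k-1) on a list T of length 2**k."""
    k = len(T).bit_length() - 1
    assert len(T) == 1 << k, "n must be a power of 2"
    return _run(T, oper, range(k))
```

For instance, with `oper = lambda m, j, u, v: (u + v, u + v)` every location ends up holding the sum of all inputs after k steps.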
We also introduce the dual class ASCEND, where the control of the
algorithm is changed to
  for j ← 0 step 1 until j = k-1,
i.e., OPER is performed on data which are successively
$1 = 2^0, 2^1, \ldots, 2^j, \ldots, 2^{k-1}$ locations apart. To clarify the duality between
ASCEND and DESCEND consider the binary representation of
$m = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^i$
and define $\overline{m} = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^{k-1-i}$, the integer whose binary
representation is the reversal of that of m. Once k is fixed, the function
$m \to \overline{m}$ is an involutory permutation of $0, 1, \ldots, 2^k - 1$ known as the bit
reversal permutation (BRP). For example, for k = 3, the BRP of
(0 1 2 3 4 5 6 7) is (0 4 2 6 1 5 3 7).
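The bit reversal permutation can be checked directly on the k = 3 example. The sketch below is illustrative; the names `bit_reverse` and `brp` are chosen here and do not come from the report.

```python
def bit_reverse(m, k):
    """Integer whose k-bit binary representation is the reversal of m's."""
    rev = 0
    for i in range(k):
        rev |= ((m >> i) & 1) << (k - 1 - i)
    return rev


def brp(seq):
    """Bit reversal permutation of a sequence of length 2**k."""
    n = len(seq)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    return [seq[bit_reverse(m, k)] for m in range(n)]
```

Applying `brp` twice returns the original sequence, which is the involution property stated above.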
By first applying the BRP to its inputs, an ASCEND algorithm can be
transformed into a dual DESCEND algorithm (figure 1) whose basic operation
OPER' is related to the original OPER by:
  OPER'(m,j;U,V) = OPER($\overline{m}$, k-1-j; U,V)
[Figure 1: DESCEND (steps j = 2, 1, 0 on inputs 0 1 2 3 4 5 6 7) and ASCEND
(steps j = 0, 1, 2 on the bit-reversed inputs 0 4 2 6 1 5 3 7).]
Figure 1. Dual algorithms; operands are denoted by their original
addresses, connecting lines show interacting operands,
and priming indicates the number of operations through
which an operand has been processed.
It is now time to exhibit algorithms for solving specific interesting
problems. Some applications - such as bitonic merge and cyclic shift -
are directly within the ASCEND or DESCEND classes (simple algorithms);
for these applications, all we have to do is specify OPER(m,j;U,V).
Other applications (such as permutation, shuffle, unshuffle,
bit-reversal (BRP), odd-even-merge, Fast-Fourier-Transform, convolution,
matrix transposition) have programs consisting of a short sequence
of algorithms (cascaded algorithms) in the preceding class, and thus
run in $O(\log n)$ parallel steps.
We also have applications - such as bitonic sort, odd-even-sort,
and calculations of symmetric functions - for which the combining step
of the two results of a recursive call is itself an algorithm in one
of the two preceding categories. These algorithms, which we call
composite, run in $O((\log n)^2)$ parallel steps.
2.1 Bitonic Merge
The elegant algorithm for bitonic merge, due to K. E. Batcher
[15], is ideally suited for implementation within the DESCEND class.
All we need is to specify OPER(m,j;U,V) as a comparison-exchange.
Precisely, in order to handle sequences which are sorted either in
increasing or in decreasing order, we define ORIENTCOMPEXCHANGE(m,j;U,V)
as
  if bit_j(m) = 0 then (U,V) ← (min(U,V), max(U,V))
                  else (U,V) ← (max(U,V), min(U,V))
  fi.
Batcher's odd-even merge [15,16] can also be programmed as a cascaded
algorithm, running in $O(\log n)$ parallel steps.
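As an illustration, the sketch below runs a DESCEND pass with a plain comparison-exchange on a bitonic input. It is a simplification assumed for this example: the orientation is fixed to increasing order rather than selected as in ORIENTCOMPEXCHANGE, and the name `bitonic_merge` is hypothetical.

```python
def bitonic_merge(T):
    """Sort a bitonic list of length 2**k in increasing order, in place."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    for j in range(k - 1, -1, -1):          # DESCEND: distances 2**(k-1), ..., 1
        for m in range(n):                  # one parallel step on a real machine
            if (m >> j) & 1 == 0:
                u, v = T[m], T[m + (1 << j)]
                T[m], T[m + (1 << j)] = min(u, v), max(u, v)
    return T
```

The input must be bitonic, for example an increasing run followed by a decreasing run; on arbitrary input a single pass does not sort.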
2.2 Radix-2 Fast-Fourier-Transforms and Convolution
The important FFT algorithm can be set in the ASCEND class. Let
$\omega$ be a primitive root of unity of order $n = 2^k$. If $\langle A_0, \ldots, A_{n-1} \rangle$
is the Fourier Transform of vector $\langle a_0, \ldots, a_{n-1} \rangle$, it is well-known
that $A_j = U_j + \omega^j V_j$ and $A_{j+2^{k-1}} = U_j - \omega^j V_j$, where the U's and V's
are respectively the Fourier Transforms, with primitive root $\omega^2$, of
the "even" subsequence $\langle a_0, a_2, \ldots, a_{2^k-2} \rangle$ and the "odd" subsequence
$\langle a_1, a_3, \ldots, a_{2^k-1} \rangle$; we call the $\omega^j$'s the combining root powers.
The above relationships indicate that the sequence $\langle a_0, \ldots, a_{n-1} \rangle$
must be initially rearranged by means of the bit-reversal permutation.
Once the desired reconfiguration has been achieved, we may proceed with
the actual FFT computation, which is in the ASCEND class.
Its basic operation OPER(m,j;U,V) is specified by
  $(U,V) \leftarrow (U + \alpha V,\; U - \alpha V)$ where $\alpha = \omega^{m \cdot 2^{k-1-j}}$.
It is not hard to show that $\alpha$ can be computed efficiently at each step;
precisely, the time used by each module to compute, by successive squaring,
the required combining root powers for the entire algorithm is
$O((\log\log n)^2) = o(\log n)$. Using a sequence of two inverse Fourier transforms
in the classical manner [1] allows one to compute the convolution of two
sequences, from which a wealth of applications can be derived (see [1]).
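The FFT in ASCEND form can be sketched as follows and checked against a direct evaluation of the transform. The choices made here are assumptions of the sketch: $\omega = e^{2\pi i/n}$ (the report does not fix a sign convention), the combining root power is taken as $\omega^{m \cdot 2^{k-1-j}}$, and the function names are hypothetical.

```python
import cmath


def bit_reverse(m, k):
    """Integer whose k-bit binary representation is the reversal of m's."""
    rev = 0
    for i in range(k):
        rev |= ((m >> i) & 1) << (k - 1 - i)
    return rev


def fft_ascend(a):
    """Fourier Transform of a (length 2**k): BRP followed by an ASCEND pass."""
    n = len(a)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    omega = cmath.exp(2j * cmath.pi / n)      # assumed primitive n-th root of unity
    T = [a[bit_reverse(m, k)] for m in range(n)]
    for j in range(k):                        # ASCEND: distances 1, 2, ..., 2**(k-1)
        for m in range(n):
            if (m >> j) & 1 == 0:
                alpha = omega ** ((m << (k - 1 - j)) % n)
                u, v = T[m], T[m + (1 << j)]
                T[m], T[m + (1 << j)] = u + alpha * v, u - alpha * v
    return T


def dft(a):
    """Direct O(n**2) evaluation with the same root, used only as a reference."""
    n = len(a)
    omega = cmath.exp(2j * cmath.pi / n)
    return [sum(a[i] * omega ** ((i * j) % n) for i in range(n)) for j in range(n)]
```

The exponent is reduced modulo n because $\omega^n = 1$; since $\mathrm{bit}_j(m) = 0$ in every executed operation, only the j low-order bits of m contribute to $\alpha$.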
2.3 Data Rearrangements
Being able to efficiently permute the data is obviously important
for many applications. For example, the BRP rearrangement is a necessary
preliminary step to the FFT algorithm of the preceding section. Some
permutations, such as cyclic shifts, shuffle, and unshuffle can be
computed by algorithms in ASCEND or DESCEND, as the reader will
enjoy discovering for himself (here "shuffle" of $(0, 1, 2, \ldots, 2^k - 1)$ is
$(0, 2^{k-1}, 1, 2^{k-1}+1, \ldots, 2^{k-1}-1, 2^k-1)$ and "unshuffle" is the inverse
permutation). Other permutations, such as BRP or matrix transpose, are
computed by cascaded algorithms. In general, we can emulate a Benes
permutation network [21] by a sequence ASCEND;DESCEND, thus in time
$O(\log n)$; it must be pointed out, however, that to realize an arbitrary
permutation, the exchange information must be precomputed.
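The shuffle and unshuffle permutations, as defined in the parenthetical remark above, can be written down directly. This sketch only states the permutations themselves; it does not show how they are obtained as ASCEND or DESCEND algorithms, and the function names are chosen here.

```python
def shuffle(seq):
    """Interleave the two halves: (0, 2**(k-1), 1, 2**(k-1)+1, ...)."""
    half = len(seq) // 2
    out = []
    for i in range(half):
        out.append(seq[i])
        out.append(seq[half + i])
    return out


def unshuffle(seq):
    """Inverse of shuffle: even-position items first, then odd-position items."""
    return list(seq[0::2]) + list(seq[1::2])
```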
2.4 Sorting and Calculation of Symmetric Functions
The previously described merge routines can be used as the basis of
efficient sorting algorithms. A sequence of input keys is divided into
two halves, each of which is recursively sorted (in opposite order in
the case of bitonic sort), and then merged using either of the above
merge routines. Both algorithms run in time $O((\log n)^2)$.
One can compute symmetric functions in a completely analogous
fashion: apply recursive calls to each half of the data, and compute the
convolution of the two resulting sequences, again in time $O((\log n)^2)$.
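The composite structure of bitonic sort can be sketched by unrolling the recursion: blocks of size 2, 4, ..., n are merged in turn, each merge being a DESCEND pass over the low-order bits. The orientation rule used here (direction taken from bit s of m when merging blocks of size $2^s$) is the usual bitonic-sort convention and is an assumption of this sketch, as is the name `bitonic_sort`.

```python
def bitonic_sort(T):
    """Sort a list of length 2**k in place; returns the number of parallel steps."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "length must be a power of 2"
    steps = 0
    for s in range(1, k + 1):               # merge blocks of size 2**s
        for j in range(s - 1, -1, -1):      # DESCEND restricted to bits s-1, ..., 0
            for m in range(n):
                if (m >> j) & 1 == 0:
                    u, v = T[m], T[m + (1 << j)]
                    if (m >> s) & 1 == 0:   # increasing block
                        T[m], T[m + (1 << j)] = min(u, v), max(u, v)
                    else:                   # decreasing block
                        T[m], T[m + (1 << j)] = max(u, v), min(u, v)
            steps += 1
    return steps
```

The step count is 1 + 2 + ... + k = k(k+1)/2, which matches the $O((\log n)^2)$ bound quoted above.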
2.5 Matrix Multiplications and Other Algorithms
To compute the matrix product C = A × B of two n × n matrices, we
must obviously first store $A = (A_0 \ldots A_{n-1})$ in row major order, and
$B = (B_0 \ldots B_{n-1})$ in column major order. Assuming we have enough space
and processors, i.e., $2^k \ge n^3$, we copy A and B into the pattern:
$A_0 B_0\, A_0 B_1 \ldots A_0 B_{n-1}\, A_1 B_0 \ldots A_i B_j \ldots A_{n-1} B_{n-1}$. All this can be achieved
with simple-minded cascaded algorithms, in time $O(\log n)$.
Each of the scalar products $c_{i,j} = A_i \cdot B_j = \sum_{0 \le l < n} a_{i,l} \cdot b_{l,j}$ is computed in
parallel, within $O(\log n)$ additional time units. The results $c_{i,j}$ are
then regrouped, according to the output format (say, row major).
Although the details of this algorithm are a bit tedious to describe,
it should be clear that matrix multiplication can be computed in time
$O(\log n)$, within our class of algorithms. In fact, a surprising number
of other algorithms can be efficiently implemented within this framework,
including all of the interesting algorithms for parallel processing known
to the authors.
3. DESCRIPTION OF THE SCHEME
In order to efficiently implement algorithms in the DESCEND class,
the most natural interconnection of modules is that of the k-dimensional
binary cube (k-cube) where each of the $2^k$ processors is numbered from
0 to $2^k - 1$ and is connected to each of the k processors whose binary
numbering differs in exactly one binary position (figure 2). Although
an ASCEND or DESCEND algorithm can be implemented on such a machine in
$\log_2 n$ parallel steps, this proposal is not feasible mainly because the
number $k = \log_2 n$ of connections for each processor is too large. The
unfolded k-cube and the perfect shuffle interconnections have been proposed
[17] (figure 3), as attempts to remedy this difficulty.
Figure 2. The 3-cube.
Figure 3. Unfolded 3-cube (left) and perfect shuffle (right) interconnections.
Although both structures have a fixed number (4) of connections per
processor, their intrinsic topology makes them inferior, as regards physical
layout (see section 5), to the scheme we now describe.
Our parallel computing system, the cube-connected-cycles (CCC), is
a network of identical processors, called modules. A module has 3
interconnection ports. Each interconnection line linking two modules can be
used for the bidirectional transmission of one operand, and it is
irrelevant here whether operand transmission is serial or parallel. For
correctly executing the algorithms described in the preceding sections,
it makes no difference whether we synchronize the entire system through a
central clock, which defines time units for all modules, or let synchronization
problems be settled at the level of each communication line, thus achieving
a globally asynchronous system. In order to describe the
interconnections, we assume for simplicity that n, the number of modules,
is a power of two, i.e., $n = 2^k$, and, moreover, assume that k is of the
form $k = r + 2^r$; the modifications resulting when k is arbitrary are
straightforward (in the latter case, r is the smallest integer for which
$r + 2^r \ge k$). Each module has a k-bit address m which in turn is
expressed as a pair $(\ell, p)$ of integers represented with (k-r) and r bits
respectively, such that $\ell \cdot 2^r + p = m$.
As mentioned earlier, each module has three ports: F, B, and L
(mnemonic for forward, backward, lateral), whose connection is entirely
determined by the module address $(\ell, p)$, that is:
  F($\ell$,p) is connected to B($\ell$, (p+1) mod $2^r$)
  B($\ell$,p) is connected to F($\ell$, (p-1) mod $2^r$)
  L($\ell$,p) is connected to L($\ell + \epsilon 2^p$, p)
where $\epsilon = 1 - 2\,\mathrm{bit}_p(\ell)$. The interconnection scheme is displayed in
figure 4. In words, the modules are grouped into $2^{k-r}$ cycles, each
cycle consisting of $2^r$ modules, cyclically connected by the F-B lines.
The cycles are in turn interconnected as a (k-r)-cube; if
$x_0, x_1, \ldots, x_{k-r-1}$ are the dimensions of the (k-r)-cube, all edges
along dimension $x_i$, called collectively sheaf i, link modules whose
addresses are $(\cdot, i)$. The total number of interconnection links is
$\frac{3}{2} n$.
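The three connection rules can be turned into a small enumeration of all lines of a CCC, which also confirms the link count. This is a sketch under the stated assumption $k = r + 2^r$; the function name `ccc_links` and the tuple encoding `(port, cycle, position)` are chosen here for illustration.

```python
def ccc_links(r):
    """Enumerate the lines of a CCC with k = r + 2**r; returns (k, links)."""
    cycle_len = 1 << r
    k = r + cycle_len
    num_cycles = 1 << (k - r)
    links = []
    for l in range(num_cycles):
        for p in range(cycle_len):
            # F(l, p) is connected to B(l, (p + 1) mod 2**r).
            links.append((("F", l, p), ("B", l, (p + 1) % cycle_len)))
            # L(l, p) is connected to L(l + eps * 2**p, p), eps = 1 - 2*bit_p(l).
            eps = 1 - 2 * ((l >> p) & 1)
            partner = l + eps * (1 << p)
            if l < partner:                  # record each lateral line once
                links.append((("L", l, p), ("L", partner, p)))
    return k, links
```

For r = 2 this gives k = 6, n = 64 modules and 96 lines, i.e. 3n/2, with every one of the 3n ports used exactly once.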
Each module contains an operand register T, a few memory locations,
and possesses basic arithmetic and logical capabilities. It is controlled
by a stored program or a circuit implementation of such a program.
For the time being, we make the hypothesis of unlimited parallelism,
that is, the number of modules is tailored to the problem size; under
this hypothesis, the one or two memory locations mentioned earlier suffice.
Subsequently (section 4.3), under the hypothesis of limited
parallelism, we shall endow each module with a small private random
access memory. In either case, each module is somewhat simpler than
a current microprocessor but not basically different from it.
Figure 4. The CCC interconnection scheme.
4. EMULATION OF THE k-CUBE ON THE CCC
In order to implement DESCEND on the CCC, we prune the k-cube so
as to use only connections existing in the CCC. The first stage consists in
removing the sheaves corresponding to dimensions 0, 1, ..., r-1, and
using instead the cycle connections F and B, as introduced in section 3.
Our original DESCEND program is thus transformed to:
proc DESCEND
  for j ← k-1 step -1 until j = r
  do foreach m: 0 ≤ m < n
     pardo if bit_j(m) = 0 then OPER(m, j; T[m], T[m+2^j]) fi
     odpar
  od;
  foreach ℓ: 0 ≤ ℓ < 2^(k-r) pardo LOOPOPER(ℓ) odpar
corp DESCEND.
Here procedure LOOPOPER(ℓ) processes the data within cycle
ℓ to compute the desired result in $O(2^r)$ parallel steps, as we
show later. Note that the running time is still $O(k-r) + O(2^r) = O(\log n)$.
The second transformation consists in removing, for all
j = 0, ..., k-r-1, the k-cube links pertaining to sheaf (r + j), except
those existing between modules whose addresses are of the form $(\cdot, j)$:
the resulting interconnection is then exactly the one of the CCC, as
introduced in Section 3.
The computation corresponding to the for loop of the above
algorithm can no longer be performed in one parallel step. Using
repeated circular shifts within cycles, however, each operand in the
cycle can be successively brought to reside for one time unit in module
$(\cdot, j-r)$, where OPER(·,j;·,·) can then be executed. Although the execution
of OPER(·,j;·,·) for all operands in a cycle now requires $2^r$ time units,
this computation can be pipelined (overlapped) with the analogous
operations OPER(·,i;·,·) for $r \le i < k$. To achieve pipelining thus
requires a new function BSHIFT(ℓ), which performs a cyclic backward
shift of the operands in cycle ℓ, that is:
  foreach j: 0 ≤ j < 2^r pardo T[ℓ·2^r + ((j-1) mod 2^r)] ← T[ℓ·2^r + j]
  odpar.
The final version of DESCEND is thus:
proc DESCEND
  for i ← 2^r - 1 step -1 until i = -2^r
  do foreach ℓ: 0 ≤ ℓ < 2^(k-r)
     pardo foreach p: max(i,0) ≤ p < min(2^r, 2^r + i)
           pardo if bit_p(ℓ) = 0 then OPER(a,b;U,V)
                    where a = ℓ·2^r + ((p-i-1) mod 2^r),
                          b = p+r,
                          U = T[ℓ·2^r + p],
                          V = T[(ℓ+2^p)·2^r + p]
                 fi
           odpar;
           BSHIFT(ℓ) comment backwards shift of cycle ℓ;
  od;
  comment end of treatment on sheaves k-1, k-2, ..., r;
  foreach ℓ: 0 ≤ ℓ < 2^(k-r) pardo LOOPOPER(ℓ) odpar
corp DESCEND.
The inner operation of the for loop is executed in two time units;
one for OPER, then one for BSHIFT. The total running time is thus $4 \cdot 2^r$
plus the time for executing LOOPOPER. If we can ensure that LOOPOPER
can be processed in time linear in the cycle size, the entire procedure
will be executed on the CCC in time $O(\log n)$.
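The pipelined treatment of sheaves k-1, ..., r can be simulated and compared with the unpipelined loop. The sketch below makes several assumptions of its own: the function names are hypothetical, BSHIFT is applied to all cycles after each round, and the original address of the operand sitting in module $(\ell, p)$ at iteration i is recomputed as $\ell \cdot 2^r + ((p - i - 1) \bmod 2^r)$, which is what a backward shift of one position per iteration implies.

```python
def high_sheaves_plain(T, k, r, oper):
    """Reference: DESCEND restricted to j = k-1, ..., r on a full k-cube."""
    for j in range(k - 1, r - 1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])


def high_sheaves_ccc(T, k, r, oper):
    """Pipelined version using only lateral links and backward shifts."""
    c = 1 << r                         # cycle length
    cycles = 1 << (k - r)
    time_units = 0
    for i in range(c - 1, -c - 1, -1):             # i = 2**r - 1, ..., -2**r
        for l in range(cycles):
            for p in range(max(i, 0), min(c, c + i)):
                if (l >> p) & 1 == 0:
                    a = l * c + ((p - i - 1) % c)  # original address of U
                    u = l * c + p
                    v = (l + (1 << p)) * c + p
                    T[u], T[v] = oper(a, p + r, T[u], T[v])
        for l in range(cycles):                    # BSHIFT on every cycle
            base = l * c
            T[base:base + c] = T[base + 1:base + c] + T[base:base + 1]
        time_units += 2                            # one for OPER, one for BSHIFT
    return time_units
```

After the $2 \cdot 2^r$ shifts every operand is back in its home module, so the two versions can be compared location by location; the returned count is $4 \cdot 2^r$ time units.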
Figure 5 provides a schematic view of DESCEND on the CCC, and
conventions used are those of figure 1, which depicts DESCEND on the
k-cube. Here we assume k = 3, thus the CCC consists of 4 cycles of
length 2.
[Figure 5: schematic view of DESCEND on the CCC for k = 3 (data positions against time).]
4.1 Computation Within the Cycles
The next question to be addressed is the implementation of
LOOPOPER(ℓ), so that it runs in time linear in the cycle length.
Obviously, we are constrained to using only the F and B cycle links
existing in the CCC. Our objective is to emulate, on the cycle of
length $2^r$, the operation OPER as it would be executed on hypothetical
r-cube sheaves. Since OPER may take place in the cycle only between
adjacent modules, particular care must be exercised to ensure that the
desired adjacencies, corresponding to all sheaves, be globally
realized in time linear in the cycle length. The key permutations for
this task are based on the perfect unshuffle [16,17]. Specifically,
UNSHUFFLE(ℓ,i) performs the perfect-unshuffle operation on each of
the $2^{r-i-1}$ contiguous blocks of length $2^{i+1}$ into which $T[\ell \cdot 2^r :: (\ell+1) \cdot 2^r - 1]$
is subdivided, and is realized as follows:
proc UNSHUFFLE(ℓ, i)
  for b ← 2^i step -1 until b = 2
  do foreach m: m = ℓ·2^r + (2·s+1)·2^i + p
        where 0 ≤ s < 2^(r-i-1), -b < p < b,
              (p mod 2) = (b mod 2)
     pardo T[m-1] ↔ T[m] odpar
  od
corp UNSHUFFLE.
Clearly, UNSHUFFLE(ℓ,i) runs in $(2^i - 1)$ parallel steps. It is also easy to
realize that the program
proc BRP(ℓ)
  for i ← r-1 step -1 until i = 1 do UNSHUFFLE(ℓ,i) od
corp BRP
realizes the bit-reversal permutation of $T[\ell \cdot 2^r :: (\ell+1) \cdot 2^r - 1]$ with reference
to the r least-significant bits of the addresses.
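UNSHUFFLE and BRP use only exchanges between adjacent modules, which a short simulation makes concrete. The sketch below follows the two procedures above step by step; the function names and the returned step counts are additions made for illustration.

```python
def unshuffle_by_exchanges(T, l, r, i):
    """UNSHUFFLE(l, i) on cycle l of T; returns the number of parallel steps."""
    base = l << r
    steps = 0
    for b in range(1 << i, 1, -1):                 # b = 2**i, ..., 2
        for s in range(1 << (r - i - 1)):          # one block per value of s
            centre = base + (2 * s + 1) * (1 << i)
            for p in range(-b + 1, b):             # -b < p < b
                if p % 2 == b % 2:
                    m = centre + p
                    T[m - 1], T[m] = T[m], T[m - 1]   # adjacent exchange
        steps += 1
    return steps


def brp_by_unshuffles(T, l, r):
    """BRP(l): bit reversal of cycle l by successive unshuffles."""
    steps = 0
    for i in range(r - 1, 0, -1):
        steps += unshuffle_by_exchanges(T, l, r, i)
    return steps
```

Within one value of b the exchanged pairs are disjoint, so each pass of the outer loop corresponds to one parallel step; UNSHUFFLE(ℓ,0) performs no exchange at all.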
We can now elucidate the general format of LOOPOPER, which consists
of a sequence of unshuffle-operation pairs, each emulating a sheaf
operation. This is preceded by BRP, so that, upon completion, the results
are in the correct order (see figure 6). In the description below
the parameter a gives the original address of the operand which is
brought to module $(\ell, p)$ by the sequence: BRP; UNSHUFFLE(ℓ,0);
UNSHUFFLE(ℓ,1); ...; UNSHUFFLE(ℓ,r-1-j). (Recall that $\overline{q}$ denotes the
integer whose binary representation is the reversal of that of the
integer q.)
proc LOOPOPER(ℓ)
  BRP(ℓ);
  for j ← r-1 step -1 until j = 0
  do foreach q: 0 ≤ q < 2^r, bit_0(q) = 0
     pardo OPER(a, j; T[ℓ·2^r+q], T[ℓ·2^r+q+1])





0    1    2    3    4    5    6    7
0    2    4    6    1    3    5    7
0    4    2    6    1    5    3    7
0'   4'   2'   6'   1'   5'   3'   7'
0'   2'   1'   3'   4'   6'   5'   7'
0''  2''  1''  3''  4''  6''  5''  7''
0''  1''  2''  3''  4''  5''  6''  7''
0''' 1''' 2''' 3''' 4''' 5''' 6''' 7'''
Figure 6. A schematic presentation of LOOPOPER for r = 3.
With respect to execution time, we noted that UNSHUFFLE(·,i) runs
in time $O(2^i)$; thus BRP and LOOPOPER jointly run in
$O(1 + 2 + 2^2 + \ldots + 2^{r-1}) = O(2^r)$
steps, linear in the cycle length.
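The emulation within one cycle can be sketched as follows. The placement of the unshuffles (UNSHUFFLE(ℓ,j) applied right after the OPER step for dimension j) is inferred here from the rows of figure 6, not taken from the listing; the original address a is tracked explicitly by carrying it along with each operand instead of being computed by a formula, and all function names are hypothetical.

```python
def unshuffle_blocks(T, i):
    """Perfect unshuffle of each contiguous block of length 2**(i+1)."""
    size = 1 << (i + 1)
    out = []
    for start in range(0, len(T), size):
        block = T[start:start + size]
        out.extend(block[0::2] + block[1::2])
    return out


def loopoper(cycle, oper):
    """Emulate DESCEND over r dimensions using adjacent pairs only."""
    r = len(cycle).bit_length() - 1
    assert len(cycle) == 1 << r, "cycle length must be a power of 2"
    T = list(enumerate(cycle))                 # (original position, value)
    for i in range(r - 1, 0, -1):              # BRP as a cascade of unshuffles
        T = unshuffle_blocks(T, i)
    for j in range(r - 1, -1, -1):
        for q in range(0, len(T), 2):          # bit_0(q) = 0: adjacent pairs
            (a, u), (a2, v) = T[q], T[q + 1]
            assert a2 == a + (1 << j)          # partners are 2**j apart
            u, v = oper(a, j, u, v)
            T[q], T[q + 1] = (a, u), (a2, v)
        T = unshuffle_blocks(T, j)             # prepare the next dimension
    assert [a for a, _ in T] == list(range(len(T)))   # results back in order
    return [v for _, v in T]
```

For r = 3 the successive arrangements of the original positions reproduce the rows of figure 6.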
4.2 Programs for each Module of the CCC
From the preceding global description of DESCEND, it is rather
straightforward to produce the sequential program of module $(\ell, p)$. The
program MODULE(ℓ,p) for a given DESCEND algorithm is of the form:
HIGHSHEAVES(ℓ,p); LOWSHEAVES(ℓ,p), which respectively implement the
(k-r)-cube operation and LOOPOPER. The entire MODULE(ℓ,p) is of a
very simple nature: it basically counts up time and, at each time unit
numbered t, it tests a simple logical condition involving ℓ, p, and t;
depending on this test, either it does nothing, or it exchanges operands,
or it exchanges operands and performs an operation on them. The details
of these programs are omitted for the sake of brevity.
The precise execution time of DESCEND (or ASCEND) on the CCC is
given by the formula:
\[ T = 4 \cdot 2^r \cdot T_{CCC} + (r + 2^r)\, T_{oper} \]
where $T_{CCC}$ is the time required for stepping up the control variable t,
testing it and performing one data exchange on some of the links; $T_{oper}$
is the time required for computing OPER(m,j;U,V) within each module.
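As a quick numerical illustration, the sketch below evaluates the execution time, reading the formula as T = 4·2^r·T_CCC + (r + 2^r)·T_oper for a CCC whose cycles have length 2^r and which therefore holds 2^(r + 2^r) modules. The function names and the sample unit times are assumptions made for the example only.

```python
def ccc_modules(r):
    """Number of modules when cycles have length 2**r (n = 2**r * 2**(2**r))."""
    return 2 ** (r + 2 ** r)


def descend_time(r, t_ccc, t_oper):
    """Time of DESCEND/ASCEND, reading the formula as
    T = 4 * 2**r * T_CCC + (r + 2**r) * T_oper."""
    cycle_len = 2 ** r
    return 4 * cycle_len * t_ccc + (r + cycle_len) * t_oper


# Hypothetical unit costs: one time unit per exchange step, two per OPER.
for r in (1, 2, 3):
    print(r, ccc_modules(r), descend_time(r, t_ccc=1, t_oper=2))
```

The second term counts one OPER per address bit, r + 2^r = log2(n) in all, while the first term accounts for the exchange steps, whose number grows linearly with the cycle length.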
4.3 Limited Parallelism
So far, we have assumed that the size n of the CCC was tailored to
the application. To cope with the realistic situation where the number
N of inputs is larger than the size n of the CCC, we suggest letting
each module of the CCC be a full-fledged microprocessor endowed with a
private RAM.
Assuming for simplicity that $N = sn$, with $s = 2^q$ an integer, we
require that the RAM of each module be of size s and denote
by $T[m, 0 : 2^q - 1]$ the private memory locations of module m. The input
$a_0, \dots, a_{N-1}$ is divided into consecutive blocks of size s, each block
being stored within a module of the CCC, so that $T[m,j] = a_{2^q m + j}$
for $0 \le j < 2^q$.
The only modification concerns the program MODULE(ℓ,p) (see
Section 4.2), which now assumes the format HIGHSHEAVES(ℓ,p); LOWSHEAVES(ℓ,p);
LOCAL(ℓ,p). Programs for HIGHSHEAVES and LOWSHEAVES are the same as before,
except that each operation and data transmission is now successively
performed on the $2^q$ data items of each module. As for LOCAL:
proc LOCAL(ℓ,p)
    u ← m·2^q
    for j ← q-1 step -1 until j = 0
    do for i ← 0 step 1 until i = 2^q - 1
        do if bit_j(i) = 0




It should be clear by now that all of the algorithms described in
Section 1 can be applied here. A direct analysis shows that, on a CCC
consisting of n processors, each processor having memory $N/n$, we can
process N inputs in time $O(\frac{N}{n} \log N)$ for algorithms in the classes ASCEND or
DESCEND, thus achieving the optimal speed-up possible with n processors.
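The block storage rule T[m,j] = a_{2^q·m+j} and the purely local part of the computation can be sketched as follows. This is an illustration under stated assumptions, not the report's program: the inner body of LOCAL is assumed to apply OPER to the local pair (i, i + 2^j) whenever bit_j(i) = 0, in line with the DESCEND scheme, and OPER is instantiated with a compare-exchange (the bitonic-merge step) purely as an example. All function names are hypothetical.

```python
def distribute(a, n):
    """Store consecutive blocks of size s = N/n, so that T[m][j] = a[s*m + j]."""
    s = len(a) // n
    return [a[m * s:(m + 1) * s] for m in range(n)]


def local_phase(block, oper):
    """Assumed body of LOCAL: for j = q-1, ..., 0 apply OPER to the local
    pairs (i, i + 2**j) with bit_j(i) = 0. Modifies the block in place."""
    s = len(block)
    q = s.bit_length() - 1
    for j in range(q - 1, -1, -1):
        for i in range(s):
            if (i >> j) & 1 == 0:
                block[i], block[i + (1 << j)] = oper(block[i], block[i + (1 << j)])


def compare_exchange(u, v):
    """Example OPER (bitonic merge step): smaller value goes to the lower address."""
    return (u, v) if u <= v else (v, u)


# Two modules (n = 2), each holding a bitonic block of s = 8 items.
T = distribute([1, 4, 6, 7, 5, 3, 2, 0, 8, 9, 11, 15, 14, 13, 12, 10], n=2)
for block in T:
    local_phase(block, compare_exchange)
print(T)
```

Because the low-order q address bits never leave a module, this phase needs no communication; each module spends time proportional to q·2^q on it, in keeping with the O((N/n) log N) bound.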
5. LAYOUT OF THE CCC FOR VLSI
It is interesting to examine the CCC just described within the
framework of the "VLSI model of computation" recently proposed [14,18,19].
In this model, each wire has unit width on the silicon chip and transmits
a unit of information in a unit of time; information is taken from, or
delivered to, special areas on the chip, called nexuses, each associated
with a module. Within this model, which takes realistic account of the
placement of modules and interconnection, C. D. Thompson has studied the
implementation of the Fast Fourier Transform [18] and has elucidated
significant relationships between input size n, chip area A, processing
time T, and the so-called minimal bisection width $\omega$.$^{(1)}$ Thompson has shown
that $A \ge \omega^2/4$ in general, and that, for the n-point FFT, $T \ge n/(2\omega)$, thus
establishing the bound $AT^2 \ge n^2/16$. The lower bound for time applies to a
wider class of problems, as shown by the following proposition, which we
state without proof:
Proposition: In the VLSI model (Thompson [18]), time $T \ge n/(2\omega)$ is required to
merge two sorted sequences of length n/2, or to realize the data rearrangement
specified by some permutation drawn from a transitive group of
permutations.$^{(2)}$
As a consequence, we have $AT^2 \ge n^2/16$ for all such problems.
$^{(1)}$ For a graph G = (V,E) the minimal bisection width $\omega$ is defined as the
smallest integer such that $\omega = |\{(u,v) \in E : u \in V_1, v \in V_2\}|$, where
$\{V_1, V_2\}$ is a partition of V with $|V_1| \le |V_2| \le |V_1| + 1$.
$^{(2)}$ A subgroup G of the symmetric group $S_n$ is said to be transitive if
$\forall\, 1 \le i, j \le n,\ \exists\, \sigma \in G : \sigma(i) = j$, meaning that data located in any
position of the machine may be moved into any other position of the machine.
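To make the definition of minimal bisection width concrete, the following sketch computes it by exhaustive search on small graphs. It is a brute-force illustration only (exponential in the number of vertices), and the helper names are hypothetical; it enumerates every subset V_1 of size ⌊|V|/2⌋, which covers all partitions with |V_1| ≤ |V_2| ≤ |V_1| + 1.

```python
from itertools import combinations


def bisection_width(vertices, edges):
    """Smallest number of edges crossing a balanced partition {V1, V2}."""
    vertices = list(vertices)
    half = len(vertices) // 2
    best = None
    for part in combinations(vertices, half):
        v1 = set(part)
        # An edge crosses the cut when exactly one endpoint lies in V1.
        cut = sum(1 for u, v in edges if (u in v1) != (v in v1))
        if best is None or cut < best:
            best = cut
    return best


def hypercube_edges(k):
    """Edges of the k-cube: vertices whose labels differ in exactly one bit."""
    return [(u, u ^ (1 << b))
            for u in range(2 ** k)
            for b in range(k)
            if u < (u ^ (1 << b))]


def cycle_edges(m):
    """Edges of a cycle on m vertices."""
    return [(i, (i + 1) % m) for i in range(m)]


print(bisection_width(range(8), hypercube_edges(3)))
print(bisection_width(range(6), cycle_edges(6)))
```

A sparse structure such as a single cycle is cut by very few edges, whereas the 3-cube requires cutting half as many edges as it has vertices; it is this quantity that drives Thompson's area bound.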
With the CCC, we have shown that operations such as FFT, merging,
cyclic shifts, shuffles, etc., are all realizable in the minimal achievable
time $T = O(\log n)$. We now demonstrate that $A = O(n^2/\log^2 n)$, thus achieving
the lower bound exactly; this means that the CCC is optimal in the VLSI
model for FFT, merging of sorted sequences, and realization of permutations
drawn from a transitive group. In contrast, known layouts for the k-cube
or the perfect shuffle have area of a larger order.
To achieve $A = O((n/\log n)^2)$ for the CCC, consider a layout which
uses two sheaves of evenly spaced wires, horizontal and vertical, used
respectively for cube and cycle connections. Figure 7 pictorially
provides base, inductive hypothesis, and extension, to prove that an $n = s \cdot 2^s$
module CCC can be placed on a $2^s \times (2 \cdot 2^s - 1)$ chip; since $s \approx \log_2(n/\log_2 n)$,
the chip size is about $(n/\log_2 n) \times (2n/\log_2 n - 1) = O((n/\log n)^2)$. Slightly
more complicated constructions yield somewhat more efficient module
placements, as suggested by Figure 8.
Figure 7. A standard layout for the interconnection of $4 \cdot 2^4$ modules.
Figure 8. A more economical layout for the interconnection of $4 \cdot 2^4$ modules.
For pedagogical reasons, the CCC introduced so far has a number
$n = s \cdot 2^s$ of processing modules with $s = 2^r$ a power of 2. A more general
version of the CCC can be designed, comprising $n = h \cdot 2^s$ modules. Each
of the $2^s$ cycles of the machine has $h \ge s$ modules. The lower $s \times 2^s$
modules of the cycles exhibit the horizontal interconnection of the
standard CCC, while the $(h-s) \times 2^s$ higher modules only have vertical
(cycle) connections, as indicated in Figure 9. Such a layout has height
$2^s + h - s$ and width $2^{s+1}$ (in unit wire width). The programs presented in
Section 4 can be adapted to run on such a machine by simply ignoring
operations pertaining to non-existing horizontal (external) links, and
their running time is proportional to the cycle length h. We see that, for
any value of h satisfying $\log_2 n \le h \le \sqrt{n}$, the area $\times$ (time)$^2$ product
\[ AT^2 = \left(\frac{n}{h} + h - \log\frac{n}{h}\right) \times \frac{n}{h} \times h^2 = n^2 + nh^2 - nh\log\frac{n}{h} = O(n^2) \]
meets the optimal theoretical bound, to within a constant factor. Of
particular interest is the choice $h = O(\sqrt{n})$, which leads to a running
time $T = O(\sqrt{n})$ and uses the minimal achievable area $A = O(n)$.
Figure 9. A standard layout for an $h \times 2^s$ CCC (h = 6, s = 4).
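The dimensions quoted for the generalized layout can be tabulated with a small Python sketch. It takes the height 2^s + h - s and width 2^(s+1) stated in the text, and assumes a running time of exactly h time units (the text only says the time is proportional to h), so the resulting AT^2 values are meaningful only up to a constant factor. The function name is hypothetical.

```python
def ccc_layout(h, s):
    """Dimensions of an h x 2**s CCC layout, in unit wire widths."""
    if h < s:
        raise ValueError("each cycle must have h >= s modules")
    n = h * 2 ** s                 # number of modules
    height = 2 ** s + h - s
    width = 2 ** (s + 1)
    area = height * width
    time = h                       # assumption: proportionality constant of 1
    return {"n": n, "height": height, "width": width,
            "area": area, "at2": area * time ** 2}


# Parameters of Figure 9, then a few longer cycles for the same s.
for h in (4, 6, 16, 64):
    layout = ccc_layout(h, s=4)
    print(h, layout, layout["at2"] / layout["n"] ** 2)
```

For the parameters of Figure 9 (h = 6, s = 4) this gives n = 96 modules on an 18 × 32 grid, and the ratio AT^2 / n^2 remains a small constant as long as h does not grow much beyond √n.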
6. CONCLUSION
In this paper, we have proposed a structure which can be used for
direct hardware implementation of specific useful algorithms, or, as
suggested in Section 4.3, as a general purpose parallel processing
system.
We expect the CCC to be practically feasible in the present state of
the technology, and to be capable of executing efficiently a wide variety
of algorithms. The extent of the class of algorithms amenable to efficient
CCC processing is not yet well understood, but it goes beyond the
applications described in Section 1; in particular, it includes a variety
of matrix and graph algorithms, as well as arithmetic and algebraic
problems.
Another salient feature of this work is the possibility, which appears
to exist, of developing a high level, general purpose language for parallel
programming, which would nevertheless be automatically compilable on systems
such as the CCC.
REFERENCES
[1] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis
of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.
[2] D. Heller, "A survey of parallel algorithms in numerical linear
algebra," Dept. of Comp. Sci., Carnegie-Mellon University,
Pittsburgh, Pa., Feb. 1976.
[3] D. S. Hirschberg, "Fast parallel sorting algorithms," Communications
of the ACM, vol. 21, no. 8, pp. 657-661, 1978.
[4] L. G. Valiant, "Parallelism in comparison problems," SIAM Journal on
Computing, vol. 4, no. 3, pp. 348-355, Sept. 1975.
[5] F. P. Preparata, "New parallel sorting schemes," IEEE Transactions on
Computers, vol. C-27, no. 7, pp. 669-673, July 1978.
[6] W. M. Gentleman, "Some complexity results for matrix computation on
parallel processors," Journal of the ACM, vol. 25, no. 1, pp. 112-115,
Jan. 1978.
[7] G. H. Barnes et al., "The ILLIAC IV computer," IEEE Transactions on
Computers, vol. C-17, pp. 746-757, 1968.
[8] C. D. Thompson and H. T. Kung, "Sorting on a mesh connected computer,"
Proc. ACM-SIGACT Symp. on Theory of Computing, Hershey, Pa.,
pp. 58-64, May 1976.
[9] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays,"
Symposium on Sparse Matrix Computations, Knoxville, Tenn., Nov. 1978.
[10] D. Nassimi and S. Sahni, "Bitonic sort on a mesh-connected parallel
computer," IEEE Transactions on Computers, vol. C-28, no. 1, pp. 2-7,
Jan. 1979.
[11] L. J. Guibas, H. T. Kung and C. D. Thompson, "Direct VLSI
implementation of combinatorial algorithms," Research Report, Dept. of Comp.
Sci., Carnegie-Mellon University, Pittsburgh, Pa., March 1979.
[12] K. N. Levitt and W. H. Kautz, "Cellular arrays for the solution of
graph problems," Comm. of the ACM, vol. 15, no. 9, pp. 789-801, 1972.
[13] M. C. Pease, "The indirect binary n-cube microprocessor array,"
IEEE Transactions on Computers, vol. C-26, no. 5, pp. 458-473,
May 1977.
[14] C. A. Mead and L. A. Conway, Introduction to VLSI Systems.
Textbook in preparation (1979).
[15] K. E. Batcher, "Sorting networks and their applications,"
Proc. AFIPS Spring Joint Computer Conference, vol. 32,
pp. 307-314, April 1968.
[16] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and
Searching. Addison-Wesley, Reading, Mass., 1973.
[17] H. S. Stone, "Parallel processing with the perfect shuffle,"
IEEE Transactions on Computers, vol. C-20, pp. 153-161, 1971.
[18] C. D. Thompson, "Area-time complexity for VLSI," Proc. of the 11th
Annual ACM Symp. on Theory of Computing, pp. 81-88, 1979.
[19] C. D. Thompson, "A complexity theory for VLSI," Ph.D. Thesis,
Carnegie-Mellon University, Dept. of Comp. Sci., 1980.
[20] M. C. Pease, "An adaptation of the Fast Fourier transform for
parallel processing," Journal of the ACM, vol. 15, no. 2, pp. 252-264,
April 1968.
[21] A. Waksman, "A permutation network," Journal of the ACM, vol. 15, no. 1,
pp. 159-163, 1968.
