Automatic data distribution for nearest neighbor networks by Philippsen, Michael
Automatic Data Distribution for Nearest Neighbor Networks
Michael Philippsen
Universitat Karlsruhe
D Karlsruhe FRG
phlipp	iraukade
This paper appeared in Frontiers The Fourth Symposium on the Frontiers of Massively
Parallel Computation pages 	
	 McLean Virginia October 
 
Abstract
An algorithm for mapping an arbitrary multidimen
sional array onto an arbitrarily shaped multidimen
sional nearest neighbor network of a distributed mem
ory machine is presented The individual dimensions
of the array are labeled with highlevel usage descrip
tors that can either be provided by the programmer or
can be derived by sophisticated static compiler analy
sis
The presented algorithm achieves an appropriate
exploitation of nearest neighbor communication and
allows for ecient address calculations
We describe the integration of this technique into
an optimizing compiler for Modula and derive ex
tensions that render ecient translation of nested par
allelism possible and provide some support for thread
scheduling
  Introduction
An important problem facing numerous projects on
distributed memory machines is that of distributing
array data over the available processors There is
widespread agreement about the two goals of data dis
tribution 	
 Data locality All data elements should
be stored local to a processor that executes some state
ments upon it reducing the amount of communication
and resulting in minimal runtime 	 Parallelism
Perfect data locality and the minimal communication
cost are achieved by employing only a single processor
In general however the runtime can be improved by
exploiting the full degree of parallelism provided by
the hardware An appropriate tradeo between the
conicting goals of data locality and parallelism must
be found
Whereas the goals are agreed upon totally dier
ent approaches to reach them have been developed
In many programming languages the user must pro
gram the data layout explicitly Some languages re
quire an explicit mapping of the data onto the topol
ogy 
  
 
  others are more abstract and
provide either sets of directives for the compiler or in
teractive or knowledgebased environments that help
determine alignment of array dimensions and mapping
functions  
 
 
 
 
  Even a data distri
bution language has been developed for this purpose
 Recent work    
 

 focuses on static
compiletime analysis to automatically nd data de
composition that achieves both goals
The approaches based on directives and on auto
matic decomposition often use a highlevel interme
diate representation where transformations are done
dimensions of dierent arrays are grouped together
superarrays and corresponding index transfer func
tions are derived usage patterns are exploited etc
After this complex compiler phase the resulting data
structures which can again be understood as arrays
are straightforwardly mapped onto the available pro
cessors of the hardware
However we feel that the mapping of the 	result
ing arrays onto the topology could be improved to
reach the following goals 	 Exploit fast communi
cation patterns if there is special hardware support
for a particular communication pattern 	 Simple
address calculations The computation of processor
numbers and addresses of data elements must be fast
Otherwise evaluation of complex expressions easily
consumes too much computational power
Therefore we present an algorithm for mapping an
arbitrary multidimensional array onto an arbitrarily
shaped multidimensional nearest neighbor network of
a distributed memory machine
After a short introduction of nearest neighbor net
works and the notation we use we identify some allo
cators that are commonly used In the fourth section
we present our strategy for determining the data dis
tribution The strategy consists of several steps each
of which is based on the available information and
renes the distribution function In the last two sec
tions we describe the integration of this technique into
an optimizing compiler for Modula and derive ex
tensions that render ecient translation of nested par
allelism possible and provide some support for thread
scheduling



kCoordinate ndimensional Nearest
Neighbor Networks
In a

kcoordinate ndimensional neighbor network

k
NN
n
 each processor is connected to two neighbors in
each of the n dimensions In other words each node
participates in n dierent NN
 
subnetworks one per
dimension The number of processors in each of these
dimensions is given by

k  	k
 
 k

     k
n
 In a NN
n
the degree of each node is n
Some NN
n
have been used successfully in parallel
machines and thus have received special names For
example a

kNN
n
with

k  	       is called a
hypercube A two or threedimensional processor grid
is a NN

or a NN

 resp For our purposes it does not
matter whether a given

kNN
n
is a hypertorus where
the NN
 
subnetworks are rings or a hypermesh where
they are linear arrays We restrict our considerations
to

kNN
n
with dimensions that are powers of two ie
there exist 
i
such that k
i
 

i
for all 
   i   n
The number of bits that are necessary to encode
the address of a node along dimension i is given by 
i

Along a dimension the nodes are numbered continu
ously The complete address of a node in the network
needs
P
n
i 

i
bits It is constructed by concatenating
the address segments of length 
i
 Bits are numbered
as follows
         
 
  
 z 
 
 
  
 
       
 
 

  
 z 
 

      
n  
X
i 

i
      
n
X
i 

i
  
 z 
 
n
From the address of a given node the address of a
neighboring node along dimension i can be derived by
incrementing or decrementing the according segment
of 
i
bits 	
i  
 
i  
 
     
i
 
 ie by adding

  

i  
 We dene 

  for notational conve
nience
 Array Allocation
By studying several approaches for explicit declaration
of layout patterns 
  
 
  
 
  
 
 


 and for intermediate representation of allocation
information in modern compilers    
 

 

we realized that a set of three machineindependent
types of distribution information is used most often
Consider a mdimensional array Each of the di
mensions can be labeled with either spread cycle
or local These hints for data distribution are either
derived from the explicit specication by the user or
they are a result of compiler analysis of reference pat
terns The semantics of these hints are as follows
 spread Dimensions that are attributed spread
are divided into segments one for each of the
available processors A vector with l elements is
assigned to P processors by allocating a segment
of length dlP e to each processor While utilizing
all available processors it minimizes the cost of
nearestneighbor communication
 cycle Dimensions labeled with cycle are dis
tributed in a roundrobin fashion over the avail
able processors In contrast to spread cycle
maximizes the cost of nearestneighbor commu
nication neighboring array elements are always
in dierent processors leading to better proces
sor utilization if a parallel algorithm operates on
subsegments of a vector at a time
 If either spread or cycle apply to several dimen
sions the distribution of the data on the proces
sors must try to generalize the above communica
tion behavior and requirements for potential par
allelism for all these dimensions
 local Array elements whose indices dier only
in a dimension that is labeled local are associat
ed with the same processor This facility is used
to avoid distribution of data in a given dimen
sion By unrolling all given local dimensions
into one pseudodimension with a length that is
the product of the lengths of the individual di
mensions only one single local dimension must
be considered
We focus our considerations on an array with m di
mensions The vector

l  	l
 
 l

     l
m
 denotes the
number of elements in each of the dimensions We as
sume that the dimensions are rounded up to the next
power of two ie there exist 
j
such that l
j
 

j
for
all 
   j   m
The number of bits that are necessary to encode
the subscript of one dimension j is given by 
j
 In to
tal
P
m
j 

j
bits are required to encode all subscripts
Without loss of generality we consider the dimensions
of the array to be sorted such that the following label
ing holds
j 




    s  spread
s  
   c  cycle
c 
  m  local
Note that replication and other distribution patterns
eg wrappedk are reducible to the allocators given
above by adding additional address bits or by simply
splitting the original array address in two segments
 Data Distribution
To distribute array data we present a mapping func
tion that operates in four major steps Mapping is
based on bit representations of the addresses of da
ta elements in a virtual address space First we de
cide how many bits of this virtual address space are
mapped onto both the processor address and the local
address Second we allot the limited number of bits in
the processor address to dimensions of a given array
Third we fulll the intention of spread and cycle by
deciding which of the address bits of the dimensions
will go into the processor address Finally the bits in
the processor address are reshued to align spread di
mensions with axes in the network to support ecient
nearest neighbor communication
 Processor Bits and Local Bits
The
P
m
j 

j
address bits that are necessary to encode
the oset of the array elements must be mapped onto
a real topology Therefore the virtual address is split
into two parts on distributed memory machines
 the number of bits used for the processor address
b
p
 min	
n
X
i 

i

c
X
j 

j

 the number of bits used for the local part of the
address referring memory cells in the local mem
ory of the processors
b
l









m
P
jc 

j
if
c
P
j 

j
 
n
P
i 

i
m
P
j 

j

n
P
i 

i
otherwise
In this paper we deal with the general case where
there are fewer processor bits available than are re
quired in total by all spread and cycle dimensions
Therefore the processor fraction of the address of a
data element consists of b
p

P
n
i 

i
bits and the lo
cal fraction consists of the remaining b
l

P
m
j 

j
b
p
bits
 Allotting of Processor Bits
In step 
 we have determined how many of the bits
that are required to address all array elements are im
plemented by the node number and how many are
mapped to addresses in the local memory of the nodes
In step  we allot the available number of processor
bits to individual dimensions of the array ie given
the number of bits that are necessary to encode a sub
script in a particular dimension we decide now how
many thereof are mapped onto the processor address
For each dimension j we nd numbers p
j
 IN and
q
j
 IN with 
j
 p
j
 q
j
 Let p
j
denote the num
ber of bits that are mapped onto bits of the processor
address The number of local bits is denoted by q
j

We propose a distribution that reects the relation
of the number of elements in individual dimensions
ie a dimension that has twice the number of ele
ments than another dimension should receive a twice
as many processor bits One can think of other al
lotting schemes eg a simple even distribution solely
based on the number of dimensions or an allotment
driven by weights that are based on detailed program
analysis For simplicity however we favor the follow
ing distribution
p
j








min	
j

l
j
c
P
 
l

 b
p
 if 
   j   c
 if c 
   j   m
The number of local bits q
j
for a dimension 
   j   m
is thus q
j
 
j
 p
j
 Work on comparing dierent al
lotting schemes and on deriving allotting information
from programs is in progress
 Semantics of spread and cycle
Although for the address in each dimension the num
ber of bits that end up in the processor address or in
the local fraction of the address have been determined
it is still unclear to which group each particular bit be
longs if both p
j
and q
j
are nonzero
The selection is based on the semantics of spread
and cycle
 spread by assigning the rst p
j
bits of the coor
dinate address to the processor bits the semantics
of spread is fullled neighboring data elements
whose addresses dier only in the lowest bits re
side in the same processor
 cycle for cyclic dimensions the processor bits
are assigned from the rear Thus neighboring
data elements end up in dierent processors
For example consider the data distribution for an ar
ray with one dimension 	length l
 
   on a ma
chine with k processors ie a kNN
 
 Although

 
 
 bits are required to represent the address of
an array element only 
 
 
  p
 
bits of processor
number are available
A
lsbhsb
processor number local
address
SPREAD
A
lsbhsb
processor numberlocaladdress
CYCLE
The gure shows the selection of those p
 
bits of the
address that end up in the processor address 	shaded
in case of both a spread and a cycle labeling of the
dimension
Assume a subscript vector 	x
 
 x

     x
m
 For the
local address each dimension with q
j
  contributes
q
j
bits These bits are concatenated ie shifted ac
cording to the sum of all q

of earlier dimensions and
added together For spread dimensions the last q
j
bits of x
j
are extracted by computing x
j
mod 
q
j
 The
rst q
j
bits of a subscript x
j
in a cycle dimensions
are derived by x
j
div 
p
j

localAdr 
m
X
j 
x

j
 
j  
P
 
q

with x

j




x
j
mod 
q
j
if j  f
   sg
x
j
div 
p
j
if j  fs 
    cg
x
j
if j  fc 
   mg
Analogously the preliminary processor address is de
rived Here the div and mod operations are exchanged
since for spread dimensions the rst 	div 
q
j
 and for
cycle dimensions the last 	mod 
p
j
 bits must be ex
tracted
prelimProcessorNo 
m
X
j 
x

j
 
j  
P
 
p

with x

j




x
j
div 
q
j
if j  f
    sg
x
j
mod 
p
j
if j  fs  
    cg
x
j
if j  fc 
   mg
Note that these address computations can be done by
masking and shifting ie by bit operations that can
eciently be executed on virtually all computers
 Network Topology
Although a preliminary processor address has been
determined in step  this mapping does not use the
topology of the network Due to the semantics of
spread it is desirable to exploit as many of the n near
est neighbor links provided by a NN
n
as possible for
mapping of spread dimensions This can be achieved
by mapping the p
j
bits of a spread dimension onto
a segment of the processor address that represents a
NN
 
subnetwork Therefore the p
j
bits ideally should
start at bit position  
 
     or 
n  

Step  gives a permutation for the bits in the pre
liminary processor address This permutation results
in an ordering such that the least signicant bits rep
resenting spread dimensions are aligned with those
processor bits that correspond to axes of the network
We present the permutation by means of an example
before it is given formally
We assume that the spread dimensions of the given
array are sorted so that their relative sizes conform to
the relative sizes of the dimensions of the underlying
network
Consider a 	

 

 
	
NN

and an array with p 
	    and s   c   ie  bits of the rst
dimension  bits of the second dimension  bits of
third dimension and  bits of the fourth dimension
mapped onto the processor address The rst three
dimensions 	s   are labeled with spread 	The
dimensions may have q
j
  but we do not consider
the local part of the address in this example The
bits in the fourth dimension are not relevant to step 
since cycle dimensions do not need neighbor links
The mapping after step  and before considering
the topology of the network is
1
2345678
910111213
141516
0
bits dimension 3 bits dimension 2 bits dim 1
In general the bits belonging to a spread dimension
are not aligned with axes of the given NN

 To achieve
alignment processor address bits must be permuted
The rst dimension of the array is already aligned
with bit position  of the processor address Formally
we apply the 	identical permutation
	
 

 



The second dimension of the array 	shaded above is
not aligned with the bits representing the second axis
of the network This can be achieved by shifting the
address left by  positions
	
        

      
  


The rst seven entries in the permutation implement
the shift of the bits that belong to the second dimen
sion of the array to the desired position 	 
 The
last two entries move those bits that are overwritten
by the shift to the released positions
01
2345678
910111213
141516
12345678
910111213
141516 0
bits dimension 3 bits dimension 2 bits dim 1
The nal permutation aligns the third dimension
	shaded above with the third axis of the network
Note that due to previous permutations bits that be
long to the third dimension are in various positions of
the processor address Therefore the results of previ
ous permutations on earlier dimensions must be pre
served Since ve bits t into the third segment of the
processor address all ve bits 	
 must be moved
In this example three target positions must be cleared
by moving the occupying bits away

B
B
B
B

	 	
 

 
 
 
 
 

   

 
 
 
 
 	 	
 


   

C
C
C
C
A
12345678
910111213
141516 0
bits dimension 3 bits dimension 2 bits dim 1
1234567910111213
141516
08
We now present the permutation formally Let the bits
in the processor address be numbered from  to b
p


Then the following permutation must be applied 
  
   min	n s times ie for each spread dimensions
the permutation is applied as long as the number of
axes is sucient Initially 	i  i holds


i 























j 
  
P
 


if i  j 
  
P
 
p


    j  minp

  


iminp

  

 if
  
P
 


 i 
  
P
 
p

i minp

  

 if
  
P
 
p

 i minp

  


 i minp

  

 
  
P
 


i otherwise
The rst of the above cases maps the appropriate num
ber of the bits belonging to dimension  to their tar
get positions Cases  and  clean target positions by
moving the occupying bits away
Note that again these address computations can be
done by simple bit operations After the permutation
is applied a sequence of bits that originally belong
to a single array dimension might run accros sever
al network dimensions in the processor address espe
cially when mapping lowdimensional arrays on high
dimensional networks eg on a hypercube In this
case performance can be improved by graycoding of
the bit sequence which is also a fast bit operations
 Data Allocation in Modula
In Modula 	complete denition in 
 there is no
explicit mapping of data elements to processors Da
ta layouts are derived automatically by the compiler
The mechanism is based on 	
 directives in the dec
laration of the arrays that describe the intended use
of individual dimensions of the array on a high level
and 	 on the analysis of runtime constants inside of
forall statements and their use in array subscripts
In the gure the general architecture of the opti
mization phase of our Modula compilers is depict
ed The compiler detects redundant synchronization
events by data dependence analysis The elimination
process is described in more detail in 
  This
transformation is a prerequisite for an ecient transla
tion for MIMD machines where synchronization is ex
pensive and it improvesmachines utilization on SIMD
machines since larger code sections can be fused into
single virtualization loops
Data allocation works on a perprocedure basis ie
only one procedure at a time is analyzed To separate
machineindependent from machinedependent parts
of the allocation strategy and to limit the size of the
problem we split data allocation hierarchically into
two parts
Virtualization
Standard Optimization
Code
Data Allocation
inter-forall Heuristic
intra-forall Heuristic
machine
dependent
cost
functions
Elimination of Synchronization Events
Modula-2* Program
equivalent Modula-2* Program
The interforall heuristic called Inter considers the
individual forall statements as black boxes Inside
a given forall the data distribution is xed The
interforall heuristic decides whether the actual distri
bution remains unchanged between successive foralls
or not This decision is based on the result of the
intraforall heuristic Intra which processes the se
quence of statements inside a forall founded upon
known distributions of data elements The questions
are where to compute and to store intermediate re
sults and where to handle scalar values 	frontend
single processor redundancy etc Whereas Inter is
machineindependent because it is based solely on the
cost values determined by Intra the latter needs to
know about computation costs the topology of the
network and the communication costs
Inter is based on Knobes work   we con
struct a graph of machineindependent alignment pref
erences ie we decide which dimensions of two arrays
should be aligned to achieve locality of the elements
Alignment is done by constructing superarrays and
by providing functions that map the original arrays
onto the superarray Dimensions of dierent arrays
that are often used together in general end up in the
same dimension of the superarray However often
this graph of alignment preferences contains cycles
Breaking a cycle is equivalent to redistributing array
data Often there are dierent ways to break cycles
ie there are dierent possible redistributions
Based on this rst phase of alignment the super
arrays are distributed on the machine by means of the
mechanism presented in this paper The distribution
is a prerequisite for realistic estimation of computation
and communication costs
 Thread Scheduling in Modula
The presented distribution technique can be extend
ed to allocate processors to threads and to determine
virtualization Nested parallelism can be handled as
well
Thread Scheduling
Parallelism in Modula is not limited to simple vec
tor operations In a forall all Modula statements
are allowed eg ifwhile procedure calls and nested
foralls Therefore it is not sucient to analyze ar
ray occurences in a program Data layout and thread
scheduling can be combined by the technique present
ed above
A forall statement creates a number of threads
Each thread has a runtime constant to which a
unique value of an enumeration type is assigned
These threads must be distributed to the processes
Since a thread is identied by its runtime constant
the array of these runtime constants describes all
threads created by the forall Thus by deriving a
data layout for this array the threads are scheduled
The analysis of the patterns in which the forall
constant is used results 	
 probably in a linear trans
formation function reecting oset and stride of the
usage patterns of the runtime constant and 	 prob
ably in an alignment with a particular array dimension
used inside of the forall if any
Consider the following example We assume that A
is a      array and N  
FORALL iN BEGIN
Acdi  	
END
Program analysis gives us that the runtime index i
which is created by the forall should be aligned with
the third dimension of array A By applying the data
distribution technique both for the array A and the
array of runtime constants i we nd the following
situation
For the assignment statement of the example only


 
 of the processors are active which is indicated
by the shaded fraction of the address bits of i The
white fraction determines the size of the virtualization
loops Since one bit reserved for i ends up in the local
address each of the active processors must iterate and
perform two assignments Thus it is easy to derive
from the allocation of i to which processor the threads
are scheduled and what type of strip mining must be
done
A
local
address
processor number
i
SPREAD SPREAD SPREAD
Nested Parallelism
The above concept is neatly extended for both nested
and recursive parallelism as can be seen from a slight
extension of the previous example
FORALL iN BEGIN
FORALL jN BEGIN
Ajdi  	

P
END
END
Again program analysis derives the set of active pro
cessors and the iteration space for the virtualization
loops
Consider the call of the procedure P Assume P it
self spawns additional parallelism From the layout of
the runtime constants it is obvious which processors
must call and virtualize it and which bits of the pro
cessor address are free to be used for the additional
parallelism In the example bits  to  can be used for
scheduling of the additional threads
Because of control statements 	if while     only
a small fraction of the total number of threads may
A
local
address
processor number
SPREAD SPREAD SPREAD
47
j i
be active at a procedure call Thus not all the bits of
the processor address assigned to the initial runtime
constants are really required If the number of active
thread needs fewer bits to be coded rescheduling these
threads ie redistributing the context stacks enables
a higher degree of parallelism for the called procedure
	 Conclusion
An important problem facing numerous projects on
distributed memory machines is that of distributing
array data over the available processors Although
some work has been done on aligning data structures
to reach data locality and a sucient degree of paral
lelism by improving the nal mapping of the data to
the processors hardware supported nearest neighbor
communication can be exploited and address calcula
tions can be simplied In this paper we have pre
sented an algorithm for mapping an arbitrary mul
tidimensional array onto an arbitrarily shaped multi
dimensional nearest neighbor network This mapping
achieves these goals It has partly been implemented
in optimizing compilers for a parallel language and
it supports thread scheduling and the translation of
nested parallelism
References

 American National Standards Institute Inc
Washington DC ANSI Programming Language
Fortran Extended Fortran 	 ANSI X
	
 

 Ingo Barth Thomas Br aunl Stefan Engelhardt
and Frank Sembach Parallaxis 	vers  us
er manual Technical Report !
 Universit at
Stuttgart February 


 Barbara M Chapman Heinz Herbeck and
Hans P Zima Automatic support for data dis
tribution In Proc	 of the th Distributed Memo
ry Computing Conference pages 
" Portland
Oregon April  " May 
 


 Georey Fox Seema Hiranandani Ken Kennedy
Charles Koelbel Uli Kremer ChauWen Tseng
and MinYou Wu Fortran D language specica
tion Technical Report CRPCTR Center
for Research on Parallel Computation Rice Uni
versity December 

 Manish Gupta and Prithviraj Banerjee Auto
matic data partitioning on distributed memory
multiprocessors In Proc	 of the th Distribut
ed Memory Computing Conference pages "
Portland Oregon April  " May 
 


 Ernst A Heinz Automatische Elimination
von Synchronisationsbarrieren in synchronen
FORALLs Masters thesis University of Karl
sruhe Department of Informatics November



 K Kennedy and HP Zima Virtual shared mem
ory for distributedmemory machines In Proc	
of the th Conference on Hypercubes Concurrent
Computers and Applications volume 
 pages

" Monterey CA 
 ACM Press New
York
 Kathleen Knobe Joan D Lukas and Guy L
Steele Data optimization Allocation of arrays
to reduce communication on SIMD machines
Journal of Parallel and Distributed Computing
	
"

 February 

 Kathleen Knobe and Venkataraman Natarajan
Data optimization Minimizing residual interpro
cessor data motion on SIMD machines In Joseph
J#aJ#a editor Frontiers The Third Symposium
on the Frontiers of Massively Parallel Computa
tion College Park University of Maryland Oc
tober "
 


 Charles Koelbel and Piyush Mehrotra Support
ing shared data structures on distributed memo
ry architectures In Proc	 of the nd ACM SIG
PLAN Symposium on Principles and Practice of
Parallel Programming PPOPP pages 
"

March 



 Jingke Li and Marina Chen Index domain align
ment Minimizing cost of crossreferencing be
tween distributed arrays In Joseph J#aJ#a editor
Frontiers  The Third Symposium on the Fron
tiers of Massively Parallel Computation pages
" College Park University of Maryland
October "
 


 MasPar Computer Corporation MasPar Parallel
Application Language MPL Reference Manual
September 


 Piyush Mehrotra and John Van Rosendale The
BLAZE language A parallel language for scien
tic programming Parallel Computing "

 November 


 Michael Metcalf and John Reid Fortran  Ex
plained Oxford Science Publications 


 Michael Philippsen and Walter F Tichy Modula
 and its compilation In First Internation
al Conference of the Austrian Center for Paral
lel Computation Salzburg Austria  pages

"
 Springer Verlag Lecture Notes in Com
puter Science 
 


 Prentice Hall Englewood Clis New Jersey
INMOS Limited Occam Programming Manual



 J Ramanujam and P Sadayappan Access based
data decomposition for distributed memory ma
chines In Proc	 of the th Distributed Memory
Computing Conference pages 
"
 Portland
Oregon April  " May 
 



 M Rosing R Schnabel and R Weaver DINO
Summary and example In Proc	 of the Third
Conference on Hypercube Concurrent Computers
and Applications pages "
 Pasadena CA

 ACM Press New York

 Thinking Machines Corporation Cambridge
Massachusetts Lisp Reference Manual Version
	 

 Thinking Machines Corporation Cambridge
Massachusetts C Language Reference Manual
April 



 Walter F Tichy and Christian G Herter
Modula An extension of Modula for high
ly parallel portable programs Technical Report
No ! University of Karlsruhe Department of
Informatics January 

 US Government Ada Joint Program Oce
ANSIMILStd  A Reference Manual for the
Ada Programming Language January 

