Tiling for Heterogeneous Computing Platforms. by Boulet, Pierre et al.
HAL Id: hal-02102006
https://hal-lara.archives-ouvertes.fr/hal-02102006
Submitted on 17 Apr 2019
Tiling for Heterogeneous Computing Platforms.
Pierre Boulet, Jack Dongarra, Yves Robert, Frédéric Vivien
To cite this version:
Pierre Boulet, Jack Dongarra, Yves Robert, Frédéric Vivien. Tiling for Heterogeneous Computing
Platforms. [Research Report] LIP RR-1998-08, Laboratoire de l’informatique du parallélisme. 1998,
2+18p. hal-02102006
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
January 1998
Research Report No RR-1998-08
Ecole Normale Supérieure de Lyon
E-mail: lip@ens-lyon.fr
Telephone: (+33) (0)4.72.72.80.00    Fax: (+33) (0)4.72.72.80.80
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Tiling for Heterogeneous Computing Platforms
Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien
LIP, Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France
Department of Computer Science, University of Tennessee, Knoxville, TN, USA
Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN, USA
e-mail: {Pierre.Boulet, Yves.Robert, Frederic.Vivien}@ens-lyon.fr
e-mail: dongarra@cs.utk.edu
Abstract
In the framework of fully permutable loops, tiling has been extensively studied as a
source-to-source program transformation. However, little work has been devoted to the mapping and
scheduling of the tiles on physical processors. Moreover, targeting heterogeneous computing
platforms has, to the best of our knowledge, never been considered. In this paper we extend
tiling techniques to the context of limited computational resources with different-speed
processors. In particular, we present efficient scheduling and mapping strategies that are
asymptotically optimal. The practical usefulness of these strategies is fully demonstrated by MPI
experiments on a heterogeneous network of workstations.
Key words: tiling, communication-computation overlap, mapping, limited resources,
different-speed processors, heterogeneous networks
1 Introduction
Tiling is a widely used technique to increase the granularity of computations and the locality of
data references. This technique applies to sets of fully permutable loops. The basic idea
is to group elemental computation points into tiles that will be viewed as computational units (the
loop nest must be permutable so that such a transformation is valid). The larger the tiles, the more
efficient are the computations performed using state-of-the-art processors with pipelined arithmetic
units and a multilevel memory hierarchy (this feature is illustrated by recasting numerical linear
algebra algorithms in terms of blocked Level 3 BLAS kernels). Another advantage of tiling
is the decrease in communication time (which is proportional to the surface of the tile) relative to
the computation time (which is proportional to the volume of the tile). The price to pay for tiling
may be an increased latency; for example, if there are data dependencies, the first processor must
complete the whole execution of the first tile before another processor can start the execution of
the second one. Tiling also presents load-imbalance problems: the larger the tile, the more difficult
it is to distribute computations equally among the processors.
Footnote: This work was supported in part by the National Science Foundation Grant No. ASC-...; by the
Defense Advanced Research Projects Agency under contract DAAH..., administered by the Army Research
Office; by the Office of Scientific Computing, U.S. Department of Energy, under Contract DE-AC...OR...;
by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-...;
by the CNRS-ENS Lyon-INRIA project ReMaP; and by the Eureka Project EuroTOPS. Yves Robert's work was
conducted while he was on leave from Ecole Normale Supérieure de Lyon and partly supported by
DRET/DGA under contract ERE .../DRET/DS/SR.
Tiling has been studied by several authors and in different contexts. Rather than providing a
detailed motivation for tiling, we refer the reader to the papers by Calland, Dongarra, and Robert
and by Högstedt, Carter, and Ferrante, which provide a review of the existing literature. Briefly,
most of the work amounts to partitioning the iteration space of a uniform loop nest into tiles whose
shape and size are optimized according to some criterion (such as the communication-to-computation
ratio). Once the tile shape and size are defined, the tiles must be distributed to physical
processors and the final scheduling must be computed.
A natural way to allocate tiles to physical processors is to use a cyclic allocation of tiles to
processors. Several authors suggest allocating columns of tiles to processors in a purely
scattered fashion (in HPF words, this is a CYCLIC(1) distribution of tile columns to processors).
The intuitive motivation is that a cyclic distribution of tiles is quite natural for load-balancing
computations. Once the distribution of tiles to processors is fixed, there are several possible
schedulings; indeed, any wavefront execution that goes along a left-to-right diagonal is valid.
Specifying a columnwise execution may lead to the simplest code generation.
When all processors have equal speed, it turns out that a pure cyclic columnwise allocation
provides the best solution among all possible distributions of tiles to processors, provided that
the communication cost for a tile is not greater than the computation cost. Since the communication
cost for a tile is proportional to its surface, while the computation cost is proportional to its
volume, this hypothesis will be satisfied if the tile is large enough.
However, the recent development of heterogeneous computing platforms poses a new challenge:
that of incorporating processor speed as a new parameter of the tiling problem. Intuitively, if the
user wants to use a heterogeneous network of computers where, say, some processors are twice as
fast as some other processors, we may want to assign twice as many tiles to the faster processors.
A cyclic distribution is not likely to lead to an efficient implementation. Rather, we should use
strategies that aim at load-balancing the work while not introducing idle time. The design of such
strategies is the goal of this paper.
The rest of the paper is organized as follows. In Section 2 we formally state the problem of tiling
for heterogeneous computing platforms. All our hypotheses are listed and discussed, and we give a
theoretical way to solve the problem by casting it in terms of a linear programming problem. The
cost of solving the linear problem turns out to be prohibitive in practice, so we restrict ourselves to
columnwise allocations. Fortunately, there exist asymptotically optimal columnwise allocations, as
shown in Section 3, where several heuristics are introduced and proved. We then provide MPI
experiments that demonstrate the practical usefulness of our columnwise heuristics on a network
of workstations. Finally, we state some conclusions.
2 Problem Statement
In this section, we formally state the scheduling and allocation problem that we want to solve. We
provide a complete list of all our hypotheses and discuss each in turn.
Footnote: For example, for two-dimensional tiles, the communication cost grows linearly with the tile
size, while the computation cost grows quadratically.
Footnote: Of course, we can imagine a theoretical situation in which the communication cost is so
large that a sequential execution would lead to the best result.
2.1 Hypotheses
(H1) The computation domain (or iteration space) is a two-dimensional rectangle of size $N_1 \times N_2$.
Tiles are rectangular, and their edges are parallel to the axes (see Figure 1). All tiles have
the same fixed size. Tiles are indexed as $T_{i,j}$, $1 \le i \le N_1$, $1 \le j \le N_2$.
(H2) Dependences between tiles are summarized by the vector pair
\[ \left\{ \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\}. \]
In other words, the computation of a tile cannot be started before both its left and upper
neighbor tiles have been executed. Given a tile $T_{i,j}$, we call both tiles $T_{i+1,j}$ and $T_{i,j+1}$ its
successors, whenever the indices make sense.
Figure 1: A tiled iteration space with horizontal and vertical dependencies (axes $i$ and $j$,
domain of size $N_1 \times N_2$, tile $T_{2,3}$ labeled).
H There are P available processors interconnected as a virtual ring Processors are numbered
from  to P  Processors may have dierent speeds let tq the time needed by processor Pq
to execute a tile  for   q   P  While we assume the computing resources are heterogeneous 
we assume the communication network is homogeneous if two adjacent tiles T and T   are
not assigned to the same processor  we pay the same communication overhead Tcom  whatever
the processors that execute T and T  
H Tiles are assigned to processors by using a scheduling  and an allocation function proc
both to be determined Tile T is allocated to processor procT   and its execution begins
at timestep T  The constraints induced by the dependencies are the following for each
tile T and each of its successors T    we have 
T   tprocT   T
  if procT   procT  
T   tprocT   Tcom  T
  otherwise
Footnote: In fact, the dimension of the tiles may be greater than 2. Most of our heuristics use a
columnwise allocation, which means that we partition a single dimension of the iteration space into
chunks to be allocated to processors. The number of remaining dimensions is not important.
Footnote: The actual underlying physical communication network is not important.
Footnote: There are other constraints to express (e.g., any processor can execute at most one tile at
each timestep). See Section 2.3 for a complete formalization.
The makespan $MS(\sigma, \mathrm{proc})$ of a schedule-allocation pair $(\sigma, \mathrm{proc})$ is the total execution time
required to execute all tiles. If execution of the first tile $T_{1,1}$ starts at timestep $t = 0$, the makespan
is equal to the date at which the execution of the last tile is completed:
\[ MS(\sigma, \mathrm{proc}) = \sigma(T_{N_1,N_2}) + t_{\mathrm{proc}(T_{N_1,N_2})}. \]
A schedule-allocation pair is said to be optimal if its makespan is the smallest possible over all
(valid) solutions. Let $T_{opt}$ denote the optimal execution time over all possible solutions.
2.2 Discussion
We survey our hypotheses and assess their motivations, as well as the limitations that they may
induce.
Rectangular iteration space and tiles. We note that the tiled iteration space is the outcome
of previous program transformations, as explained in the tiling literature. The first step in
tiling amounts to determining the best shape and size of the tiles, assuming an infinite grid
of virtual processors. Because this step will lead to tiles whose edges are parallel to extremal
dependence vectors, we can perform a unimodular transformation and rewrite the original
loop nest along the edge axes. The resulting domain may not be rectangular, but we can
approximate it using the smallest bounding box (however, this approximation may impact
the accuracy of our results).
Dependence vectors. We assume that dependencies are summarized by the vector pair
$V = \{(0,1)^t, (1,0)^t\}$. Note that these are dependencies between tiles, not between elementary
computations. Hence, having right- and top-neighbor dependencies is a very general
situation if the tiles are large enough. Technically, since we deal with a set of fully permutable
loops, all dependence vectors have nonnegative components only, so that $V$ permits all other
dependence vectors to be generated by transitivity. Note that having a dependence vector
$(0,a)^t$ with $a \ge 2$ between tiles, instead of having vector $(0,1)^t$, would mean unusually long
dependencies in the original loop nest, while having $(0,a)^t$ in addition to $(0,1)^t$ as a
dependence vector between tiles is simply redundant. In practical situations, we might have an
additional diagonal dependence vector $(1,1)^t$ between tiles, but the diagonal communication
may be routed horizontally and then vertically, or the other way round, and even may be
combined with any of the other two messages (because of vectors $(0,1)^t$ and $(1,0)^t$).
Computationcommunication overlap Note that in our model  communications can be over
lapped with the computations of other independent tiles Assuming communicationcompu
tation overlap seems a reasonable hypothesis for current machines that have communication
coprocessors and allow for asynchronous communications posting instructions ahead  or us
ing active messages We can think of independent computations going along a thread while
communication is initiated and performed by another thread 	 An interesting approach
has been proposed by Andonov and Rajopadhye 	 they introduce the tile period Pt as the
time elapsed between corresponding instructions of two successive tiles that are mapped to
the same processor  while they dene the tile latency Lt to be the time between corresponding
instructions of two successive tiles that are mapped to di erent processors The power of this
approach is that the expressions for Lt and Pt can be modied to take into account several
architectural models A detailed architectural model is presented in 	  and several other
models are explored in 	
 With our notation  Pt  ti and Lt  ti  Tcom for processor Pi

Finally, we briefly mention another possibility for introducing heterogeneity into the tiling
model. We chose to have all tiles of the same size and to allocate more tiles to the faster processors.
Another possibility is to evenly distribute tiles to processors, but to let their size vary according to
the speed of the processor they are allocated to. However, this strategy would severely complicate
code generation. Also, allocating several neighboring fixed-size tiles to the same processor will have
similar effects as allocating variable-size tiles, so our approach will cause no loss of generality.
2.3 ILP Formulation
We can describe the tiled iteration space as a task graph $G = (V, E)$, where vertices represent the
tiles and edges represent dependencies between tiles. Computing an optimal schedule-allocation
pair is a well-known task graph scheduling problem, which is NP-complete in the general case.
If we want to solve the problem as stated (hypotheses (H1) to (H4)), we can use an integer linear
programming formulation. Several constraints must be satisfied by any valid schedule-allocation
pair. In the following, $T_{\max}$ denotes an upper bound on the total execution time. For example,
$T_{\max}$ can be the execution time when all the tiles are given to the fastest processor:
$T_{\max} = N_1 \times N_2 \times \min_{0 \le i \le P-1} t_i$.
We now translate these constraints into equations. In the following, let $i \in \{1, \ldots, N_1\}$ denote
a row number, $j \in \{1, \ldots, N_2\}$ a column number, $q \in \{0, \ldots, P-1\}$ a processor number, and
$t \in \{0, \ldots, T_{\max}\}$ a timestep.
1. Number of executions. Let $B_{i,j,q,t}$ be an integer variable indicating whether the execution
of tile $T_{i,j}$ began at timestep $t$ on processor $q$: if this is the case, then $B_{i,j,q,t} = 1$, and
$B_{i,j,q,t} = 0$ otherwise. Each tile must be executed once, and thus starts at one and only one
timestep. Therefore, the constraints are
\[ \forall i, j, q, t: \; B_{i,j,q,t} \ge 0 \quad \text{and} \quad \forall i, j: \; \sum_{q=0}^{P-1} \sum_{t=0}^{T_{\max}} B_{i,j,q,t} = 1. \]
2. Execution place and date. Using $B_{i,j,q,t}$, we can compute the date $D_{i,j}$ at which tile $(i,j)$
starts execution. We can also check which processor $q$ processes tile $(i,j)$. The result is
stored in $P_{i,j,q}$:
\[ \forall i, j: \; D_{i,j} = \sum_{q=0}^{P-1} \sum_{t=0}^{T_{\max}} t \, B_{i,j,q,t} \quad \text{and} \quad \forall i, j, q: \; P_{i,j,q} = \sum_{t=0}^{T_{\max}} B_{i,j,q,t}. \]
 Communications There must be a communication delay between the end of execution of
tile i   j resp i j   and the beginning of execution of tile i j if and only the
two tiles are not executed by the same processor  that is  if and only if there exists q such
that Pi j q  Pi  j q resp Pi j q  Pi j  q The boolean result is stored in vi j resp hi j
vi j   if tiles i   j and i j are not executed by the same processor  and vi j  
otherwise We have a similar denition for hi j suing tiles i j  and i j The equations
are
i  
 j q vi j  Pi j q  Pi  j q  vi j  Pi  j q  Pi j q
i j  
 q hi j  Pi j q  Pi j  q vi j  Pi j  q  Pi j q
Note that if a communication delay is needed between the execution of tile i  j and that
of tile i j  then vi j will impose one If none is needed  vi j may still be equal to   as long
as this does not increase the total execution time



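\[
\begin{array}{ll}
\text{minimize} & D_{N_1,N_2} + \sum_{q} P_{N_1,N_2,q} \, t_q \\[4pt]
\text{subject to} & \sum_{t'=t-t_q+1}^{t} \sum_{i,j} B_{i,j,q,t'} \le 1, \quad 0 \le q \le P-1, \; t_q - 1 \le t \le T_{\max} \\
& D_{i,j} \ge D_{i-1,j} + v_{i,j} T_{com} + \sum_{q} P_{i-1,j,q} \, t_q, \quad 2 \le i \le N_1, \; 1 \le j \le N_2 \\
& D_{i,j} \ge D_{i,j-1} + h_{i,j} T_{com} + \sum_{q} P_{i,j-1,q} \, t_q, \quad 1 \le i \le N_1, \; 2 \le j \le N_2 \\
& v_{i,j} \ge P_{i,j,q} - P_{i-1,j,q}, \quad 2 \le i \le N_1, \; 1 \le j \le N_2, \; 0 \le q \le P-1 \\
& v_{i,j} \ge P_{i-1,j,q} - P_{i,j,q}, \quad 2 \le i \le N_1, \; 1 \le j \le N_2, \; 0 \le q \le P-1 \\
& h_{i,j} \ge P_{i,j,q} - P_{i,j-1,q}, \quad 1 \le i \le N_1, \; 2 \le j \le N_2, \; 0 \le q \le P-1 \\
& h_{i,j} \ge P_{i,j-1,q} - P_{i,j,q}, \quad 1 \le i \le N_1, \; 2 \le j \le N_2, \; 0 \le q \le P-1 \\
& P_{i,j,q} = \sum_{t} B_{i,j,q,t}, \quad 1 \le i \le N_1, \; 1 \le j \le N_2, \; 0 \le q \le P-1 \\
& D_{i,j} = \sum_{q} \sum_{t} t \, B_{i,j,q,t}, \quad 1 \le i \le N_1, \; 1 \le j \le N_2 \\
& \sum_{q} \sum_{t} B_{i,j,q,t} = 1, \quad 1 \le i \le N_1, \; 1 \le j \le N_2 \\
& B_{i,j,q,t} \ge 0, \quad 1 \le i \le N_1, \; 1 \le j \le N_2, \; 0 \le q \le P-1, \; 0 \le t \le T_{\max}
\end{array}
\]
Figure 2: Integer linear program that optimally solves the schedule-allocation problem.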
4. Precedence constraints. The execution of tile $(i-1,j)$ (resp. $(i,j-1)$) must be finished,
and the data transferred, before the beginning of execution of tile $(i,j)$:
\[ \forall i \ge 2, j: \; D_{i,j} \ge D_{i-1,j} + v_{i,j} T_{com} + \sum_{q=0}^{P-1} P_{i-1,j,q} \, t_q, \]
\[ \forall i, j \ge 2: \; D_{i,j} \ge D_{i,j-1} + h_{i,j} T_{com} + \sum_{q=0}^{P-1} P_{i,j-1,q} \, t_q. \]
 Number of tiles executed at any timestep	 A processor executes at most one tile at
the time Therefore processor q can start executing at most one tile in any interval of time
tq as tq is the time to execute a tile by processor q
q tq    t  Tmax
tX
t	ttq 
N X
i	 
NX
j	 
Bi j q t  
Now that we have expressed all our constraints in a linear way, we can write the whole linear
programming system. We need only to add the objective function: the minimization of the
timestep at which the execution of the last tile $T_{N_1,N_2}$ is terminated. The final linear program is
presented in Figure 2. Since an optimal rational solution of this problem is not always an integer
solution, this program must be solved as an integer linear program.
The main drawback of the linear programming approach is its huge cost. The program shown in
Figure 2 contains more than $P N_1 N_2 T_{\max}$ variables and inequalities. The cost of solving such a
problem would be prohibitive for any practical application. Furthermore, even if we could solve
the linear problem, we might not be pleased with the solution. We probably would prefer "regular"
allocations of tiles to processors, such as columnwise or rowwise allocations.
Fortunately, such allocations can lead to asymptotically optimal solutions, as shown in the
next section.
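To get a feeling for the size of this program, one can simply count the $B_{i,j,q,t}$ variables. The sketch below is a back-of-the-envelope illustration only: the function name `ilp_size` and the sample processor times are assumptions, and the bound $T_{\max}$ is the one suggested in the text (all tiles on the fastest processor).

```python
def ilp_size(n1, n2, t):
    """Return (T_max, number of B variables) for an n1 x n2 tile grid.

    t[q] is the (integer) time processor q needs to execute one tile.
    """
    p = len(t)
    t_max = n1 * n2 * min(t)             # all tiles on the fastest processor
    b_vars = n1 * n2 * p * (t_max + 1)   # one variable per (i, j, q, t), t in 0..T_max
    return t_max, b_vars


# Hypothetical platform: two processors, the second one five times slower.
print(ilp_size(10, 10, [1, 5]))
print(ilp_size(100, 100, [1, 5]))
```

Because $T_{\max}$ itself grows with $N_1 N_2$, the number of variables grows quadratically with the number of tiles, which is consistent with the remark that the approach is prohibitive in practice.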

3 Columnwise Allocation
Before introducing asymptotically optimal columnwise (or rowwise) allocations, we give a small
example to show that columnwise allocations (or equivalently rowwise allocations) are not optimal.
3.1 Optimality and Columnwise Allocations
Consider a tiled iteration space with $N_2 = 2$ columns, and suppose we have $P = 2$ processors such
that $t_1 = 5 t_0$: the first processor is five times faster than the second one. Suppose for the sake
of simplicity that $T_{com} = 0$. If we use a columnwise allocation,
• either we allocate both columns to processor 0, and the makespan is $MS = 2 N_1 t_0$;
• or we allocate one column to each processor, and the makespan is greater than $N_1 t_1$ (a lower
bound on the time for the slow processor to process its column).
The best columnwise solution is to have the fast processor execute all tiles. But if $N_1$ is large
enough, we can do better by allocating a small fraction of the first column (the last tiles) to the
slow processor, which will process them while the first processor is active executing the first tiles
of the second column. For instance, if $N_1 = 6n$ and if we allocate the last $n$ tiles of the first
column to the slow processor (see Figure 3), the execution time becomes
$MS = 11 n t_0 = \frac{11}{6} N_1 t_0$, which is better than the best columnwise allocation.
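Figure 3: Allocating tiles for a two-column iteration space (processor $P_0$ executes the first $5n$
of the $6n$ tiles of the first column and the whole second column; processor $P_1$ executes the
last $n$ tiles of the first column).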
This small example shows that our target problem is intrinsically more complex than the
instance with same-speed processors: as shown in earlier work, a columnwise allocation would be
optimal for our two-column iteration space with two processors of equal speed.
3.2 Heuristic Allocation by Block of Columns
Throughout the rest of the paper we make the following additional hypothesis:
(H5) We impose the allocation to be columnwise: for a given column $j$, all tiles $T_{i,j}$, $1 \le i \le N_1$,
are allocated to the same processor.
Footnote: The allocation of Figure 3 is not the best possible allocation, but it is superior to any
columnwise allocation.
Footnote: Note that the problem is symmetric in rows and columns. We could study rowwise allocations
as well.
We start with an easy lemma to bound the optimal execution time $T_{opt}$:
Lemma 1
\[ T_{opt} \ge \frac{N_1 \times N_2}{\sum_{i=0}^{P-1} \frac{1}{t_i}}. \]
Proof. Let $x_i$ be the number of tiles allocated to processor $i$, $0 \le i \le P-1$. Obviously,
$\sum_{i=0}^{P-1} x_i = N_1 N_2$. Even if we take into account neither the communication delays nor the
dependence constraints, the execution time $T$ is greater than the computation time of each
processor: $T \ge x_i t_i$ for all $0 \le i \le P-1$. Rewriting this as $x_i \le T / t_i$ and summing over $i$, we get
\[ N_1 N_2 = \sum_{i=0}^{P-1} x_i \le \left( \sum_{i=0}^{P-1} \frac{1}{t_i} \right) T, \]
hence the result.
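As a quick numerical companion to Lemma 1, the lower bound can be evaluated exactly with rational arithmetic. This is a minimal sketch; the function name `lower_bound` is an assumption, and the tile times $t_i$ are taken to be positive integers.

```python
from fractions import Fraction


def lower_bound(n1, n2, t):
    """Lemma 1: T_opt >= N1 * N2 / sum(1 / t_i), returned as an exact fraction."""
    return Fraction(n1 * n2) / sum(Fraction(1, ti) for ti in t)


# Two-column example of Section 3.1 with n = 2 (N1 = 12, N2 = 2, t = [1, 5]).
print(lower_bound(12, 2, [1, 5]))  # 20
```

For the two-column example, the bound of 20 is below the makespan $11 n t_0 = 22$ obtained there, as it must be, since the bound ignores both dependencies and communications.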
The proof of Lemma 1 leads to the (intuitive) idea that tiles should be allocated to
processors in proportion to their relative speeds, so as to balance the workload. Specifically, let
$L = \mathrm{lcm}(t_0, t_1, \ldots, t_{P-1})$, and consider an iteration space with $L \sum_{i=0}^{P-1} \frac{1}{t_i}$ columns: if we
allocate $\frac{L}{t_i}$ tile columns to processor $i$, all processors need the same number of timesteps to
compute all their tiles: the workload is perfectly balanced. Of course, we must find a good schedule
so that processors do not remain idle, waiting for other processors because of dependence constraints.
We introduce below a heuristic that allocates the tiles to processors by blocks of columns whose
size is computed according to the previous discussion. This heuristic produces an asymptotically
optimal allocation: the ratio of its makespan over the optimal execution time tends to 1 as the
number of tiles (the domain size) increases.
In a columnwise allocation, all the tiles of a given column of the iteration space are allocated
to the same processor. When contiguous columns are allocated to the same processor, they form a
block. When a processor is assigned several blocks, the scheduling is the following:
1. Blocks are computed one after the other, in the order defined by the dependencies. The
computation of the current block must be completed before the next block is started.
2. The tiles inside each block are computed in a rowwise order: if, say, 3 consecutive columns
are assigned to a processor, it will execute the three tiles in the first row, then the three tiles
in the second row, and so on. Note that, given 1, this strategy is the best to minimize the
latency (for another processor to start the next block as soon as possible).
The following lemma shows that dependence constraints do not slow down the execution of two
consecutive blocks (of adequate size) by two different-speed processors.
Lemma 2 Let $P_1$ and $P_2$ be two processors that execute a tile in time $t_1$ and $t_2$, respectively.
Assume that $P_1$ was allocated a block $B_1$ of $c_1$ contiguous columns and that $P_2$ was allocated the
block $B_2$ consisting of the following $c_2$ columns. Let $c_1$ and $c_2$ satisfy the equality $c_1 t_1 = c_2 t_2$.
Assume that $P_1$, starting at timestep $s_1$, is able to process $B_1$ without having to wait for any
tile to be computed by some other processor. Then $P_2$ will be able to process $B_2$ without having to
wait for any tile computed by $P_1$ if it starts at time $s_2 \ge s_1 + c_1 t_1 + T_{com}$.
Proof. $P_1$ (resp. $P_2$) executes its block row by row. The execution time of a row is $c_1 t_1$ (resp.
$c_2 t_2$). By hypothesis, it takes the same amount of time for $P_1$ to compute a row of $B_1$ as for $P_2$ to
compute a row of $B_2$.
Since $P_1$ is able to process $B_1$ without having to wait for any tile to be computed by some other
processor, it finishes computing the $i$th row of $B_1$ at time $s_1 + i c_1 t_1$.
$P_2$ cannot start processing the first tile of the $i$th row of $B_2$ before $P_1$ has computed the last
tile of the $i$th row of $B_1$ and has sent that data to $P_2$, that is, at timestep $s_1 + i c_1 t_1 + T_{com}$.
Since $P_2$ starts processing the first row of $B_2$ at time $s_2$, where $s_2 \ge s_1 + c_1 t_1 + T_{com}$, it
is not delayed by $P_1$. Later on, $P_2$ will process the first tile of the $i$th row of $B_2$ at time
$s_2 + (i-1) c_2 t_2 = s_2 + (i-1) c_1 t_1 \ge s_1 + c_1 t_1 + T_{com} + (i-1) c_1 t_1 = s_1 + i c_1 t_1 + T_{com}$; hence
$P_2$ will not be delayed by $P_1$.
We are ready to introduce our heuristic.
Heuristic
Let $P_0, \ldots, P_{P-1}$ be $P$ processors that respectively execute a tile in time $t_0, \ldots, t_{P-1}$. We allocate
column blocks to processors by chunks of $C = L \times \sum_{i=0}^{P-1} \frac{1}{t_i}$ columns, where
$L = \mathrm{lcm}(t_0, t_1, \ldots, t_{P-1})$. For the first chunk, we assign the block $B_0$ of the first $L/t_0$ columns
to $P_0$, the block $B_1$ of the next $L/t_1$ columns to $P_1$, and so on until $P_{P-1}$ receives the last
$L/t_{P-1}$ columns of the chunk. We then repeat the same scheme with the second chunk (columns $C+1$
to $2C$), and so on until all columns are allocated (note that the last chunk may be incomplete).
As already said, processors will execute blocks one after the other, row by row within each block.
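The allocation rule of the heuristic is easy to state in code. The following sketch assumes integer tile times; the function names `chunk_blocks` and `columnwise_allocation` and the sample times are introduced here for illustration, and columns are 0-indexed.

```python
from math import lcm  # multi-argument lcm requires Python 3.9+


def chunk_blocks(t):
    """Block sizes L / t_i of one chunk, with L = lcm(t_0, ..., t_{P-1})."""
    big_l = lcm(*t)
    return [big_l // ti for ti in t]


def columnwise_allocation(n2, t):
    """Owner of each of the n2 columns: chunks are repeated cyclically and
    the last chunk may be incomplete."""
    blocks = chunk_blocks(t)
    owner = []
    while len(owner) < n2:
        for q, size in enumerate(blocks):
            owner.extend([q] * size)
    return owner[:n2]


# Hypothetical tile times: L = 6, so the blocks hold 3, 2 and 1 columns (C = 6).
print(chunk_blocks([2, 3, 6]))              # [3, 2, 1]
print(columnwise_allocation(8, [2, 3, 6]))  # [0, 0, 0, 1, 1, 2, 0, 0]
```

With these block sizes every processor needs the same time $L$ to execute one row of its block, which is the hypothesis $c_1 t_1 = c_2 t_2$ of Lemma 2.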
Lemma 3 The difference between the execution time of the heuristic allocation by columns and the
optimal execution time is bounded as
\[ T - T_{opt} \le (P-1) \, T_{com} + (N_1 + P - 1) \, \mathrm{lcm}(t_0, t_1, \ldots, t_{P-1}). \]
Proof. Let $L = \mathrm{lcm}(t_0, t_1, \ldots, t_{P-1})$. Lemma 2 ensures that, if processor $P_i$ starts working at
timestep $s_i = i (L + T_{com})$, it will not be delayed by other processors. By definition, each processor
executes one block in time $L N_1$. The maximal number of blocks allocated to a processor is
\[ n = \left\lceil \frac{N_2}{L \sum_{i=0}^{P-1} \frac{1}{t_i}} \right\rceil. \]
The total execution time, $T$, is equal to the date the last processor terminates execution. $T$ can be
bounded as follows:
\[ T \le s_{P-1} + n \times L N_1. \]
On the other hand, $T_{opt}$ is bounded below by Lemma 1. We derive
\[ T - T_{opt} \le (P-1)(L + T_{com}) + L N_1 \left\lceil \frac{N_2}{L \sum_{i=0}^{P-1} \frac{1}{t_i}} \right\rceil - \frac{N_1 \times N_2}{\sum_{i=0}^{P-1} \frac{1}{t_i}}. \]
Since $\lceil x \rceil \le x + 1$ for any rational number $x$, the last two terms sum to at most $L N_1$, and we
obtain the desired formula.
Proposition 1 Our heuristic is asymptotically optimal: letting $T$ be its makespan and $T_{opt}$ be the
optimal execution time, we have
\[ \lim_{N_2 \to +\infty} \frac{T}{T_{opt}} = 1. \]
Footnote: Processor $P_{P-1}$ is not necessarily the last one, because the last chunk may be incomplete.
The two main advantages of our heuristic are (i) its regularity, which leads to an easy
implementation, and (ii) its guarantee: it is theoretically proved to be close to the optimal. However, we
will need to adapt it to deal with practical cases, because the number
$C = L \sum_{i=0}^{P-1} \frac{1}{t_i}$ of columns in a chunk may be too large.
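The asymptotic behaviour stated in Proposition 1 can be observed numerically by comparing the upper bound used in the proof of Lemma 3 with the lower bound of Lemma 1. The sketch below evaluates these two closed-form bounds and does not simulate the schedule; it therefore inherits the proof's assumption that no processor waits in steady state. The function names and the sample values of $t$, $N_1$ and $T_{com}$ are hypothetical.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+


def proof_makespan(n1, n2, t, t_com):
    """Upper bound s_{P-1} + n * L * N1 from the proof of Lemma 3."""
    p, big_l = len(t), lcm(*t)
    chunk = sum(big_l // ti for ti in t)   # C columns per chunk
    n_blocks = -(-n2 // chunk)             # ceil(N2 / C) with integers only
    return (p - 1) * (big_l + t_com) + n_blocks * big_l * n1


def ratio_bound(n1, n2, t, t_com):
    """Upper bound on T / T_opt: proof makespan over the Lemma 1 lower bound."""
    lower = Fraction(n1 * n2) / sum(Fraction(1, ti) for ti in t)
    return proof_makespan(n1, n2, t, t_com) / lower


for n2 in (3, 30, 300):
    print(n2, float(ratio_bound(10, n2, [1, 2], 1)))
```

For these sample values the ratio decreases from 1.15 to 1.0015 as $N_2$ grows from 3 to 300, in line with the limit of Proposition 1.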
4 Practical Heuristics
In the preceding section, we described a heuristic that allocates blocks of columns to processors in a
cyclic fashion. The size of the blocks is related to the relative speed of the processors. However, the
execution time variables $t_i$ are not known accurately in practice, and a straightforward application
of our heuristic would lead to difficulties, as shown next in Section 4.1. We explain how to modify
the heuristic (computing different block sizes) in Section 4.2.
4.1 Processor Speed
To expose the potential difficulties of the heuristic, we conducted experiments on a heterogeneous
network of eight Sun workstations. To compute the relative speed of each workstation, we used
a program that runs the same piece of computation that will be used later in the tiling program.
Results are reported in Table 1.
Name                  nala   bluegrass  dancer  donner  vixen  rudolph  zazu  simba
Description           Ultra  SS         SS      SS      SS     SS       SS    SS
Execution time $t_i$
Table 1: Measured computation times showing relative processor speeds
To use our heuristic, we must allocate chunks of size $C = L \sum_{i=0}^{7} \frac{1}{t_i}$ columns, where
$L = \mathrm{lcm}(t_0, t_1, \ldots, t_7)$. The resulting value of $C$ would require
a very large problem size indeed. Needless to say, such a large chunk is not feasible in practice.
Also, our measurements for the processor speeds may not be accurate, and a slight change may
dramatically impact the value of $C$. Hence, we must devise another method to compute the sizes
of the blocks allocated to each processor (see Section 4.2). In Section 4.3, we present simulation
results and discuss the practical validity of our modified heuristics.
  Modied Heuristic
Our goal is to choose the best block sizes allocated to each processor while bounding the total
size of a chunk We rst dene the cost of a block allocation and then describe an algorithm to
compute the best possible allocation  given an upper bound for the chunk
Cost Function
As before, we consider heuristics that allocate tiles to processors by blocks of columns, repeating
each chunk in a cyclic fashion. Consider a heuristic defined by $C = (c_1, \ldots, c_P)$, where $c_i$ is the
number of columns in each block allocated to processor $P_i$.
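As a minimal sketch of what such an allocation means in practice, the function below maps a column index to the processor that owns it when the chunk $(c_1, \ldots, c_P)$ is repeated cyclically. The function name, the 0-based indexing, and the sample allocation are assumptions made for illustration only.

```python
from bisect import bisect_right
from itertools import accumulate


def column_owner(col, alloc):
    """Return the 0-based processor index owning 0-based column `col`.

    `alloc` is the chunk (c_1, ..., c_P): processor i owns c_i consecutive
    columns inside each chunk, and chunks repeat cyclically.
    """
    chunk = sum(alloc)
    if chunk == 0:
        raise ValueError("the chunk must contain at least one column")
    offset = col % chunk                  # position inside the current chunk
    block_ends = list(accumulate(alloc))  # cumulative end of each block
    # The first block whose end exceeds the offset owns the column;
    # blocks of size 0 are skipped automatically.
    return bisect_right(block_ends, offset)


# Hypothetical allocation (3, 2, 1): chunk of 6 columns.
print([column_owner(k, (3, 2, 1)) for k in range(8)])  # [0, 0, 0, 1, 1, 2, 0, 0]
```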
Footnote: The workstations were not dedicated to our experiments. Even though we were running these experiments
during the night, some other users' processes might have been running. Also, we have averaged the results, so the
error margin roughly lies between  and  .

Denition  The cost of a block size allocation C is the maximum of the computation times citi
of each block divided by the total number of columns computed in each chunk
costC 
maxiP  citiP
iP  ci
Considering the steady state of the computation, all processors work in parallel inside their
block, so that the computation time of a whole chunk is the maximum of the computation times
of the processors. During this time, $s = \sum_{1 \le i \le P} c_i$ columns are computed. Hence, the average
time to compute a single column is given by our cost function. When the number of columns is
much larger than the size of the chunk, the total computation time can well be approximated by
$\mathrm{cost}(C) \times N$, the product of the average time to compute a column by the total number of columns N.
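The cost function and the resulting estimate of the total time can be written down directly. The sketch below uses exact rational arithmetic; the function names and the sample values are hypothetical.

```python
from fractions import Fraction


def cost(alloc, times):
    """cost(C) = max_i(c_i * t_i) / sum_i(c_i), as an exact fraction."""
    s = sum(alloc)
    if s == 0:
        raise ValueError("the chunk must contain at least one column")
    return Fraction(max(c * t for c, t in zip(alloc, times)), s)


def estimated_total_time(alloc, times, n_columns):
    """Steady-state estimate cost(C) * N, valid when N is much larger than the chunk."""
    return cost(alloc, times) * n_columns


# Hypothetical values: allocation (3, 2, 1) with execution times (3, 5, 8).
# Block times are 9, 10 and 8, so a chunk of 6 columns takes 10 time units.
print(cost((3, 2, 1), (3, 5, 8)))                       # 5/3
print(estimated_total_time((3, 2, 1), (3, 5, 8), 600))  # 1000
```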
Optimal Block Size Allocations
As noted before, our cost function correctly models reality when the number of columns in each
chunk is much smaller than the total number of columns of the domain. We now describe an
algorithm that returns the best (with respect to the cost function) block size allocation given a
bound s on the number of columns in each chunk.
We build a function that, given a best allocation with a chunk size equal to n - 1, computes a
best allocation with a chunk size equal to n. Once we have this function, we start with an initial
chunk size n = 0, compute a best allocation for each increasing value of n up to n = s, and select
the best allocation encountered so far.
First we characterize the best allocations for a given chunk size s.
Lemma. Let $C = (c_1, \ldots, c_P)$ be an allocation and let $s = \sum_{1 \le i \le P} c_i$ be the chunk size. Let
$m = \max_{1 \le i \le P} c_i t_i$ denote the maximum computation time inside a chunk. If C verifies
$$\forall i,\ 1 \le i \le P, \quad t_i c_i \le m \le t_i (c_i + 1),$$
then it is optimal for the chunk size s.
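Proof. Take an allocation verifying the above condition. Suppose that it is not optimal. Then
there exists a better allocation $C' = (c'_1, \ldots, c'_P)$ with $\sum_{1 \le i \le P} c'_i = s$, such that
$$m' = \max_{1 \le i \le P} c'_i t_i < m.$$
By definition of m, there exists $i_0$ such that $m = c_{i_0} t_{i_0}$. We can then successively derive
$$
\begin{aligned}
& c_{i_0} t_{i_0} = m > m' \ge c'_{i_0} t_{i_0} \\
& c_{i_0} > c'_{i_0} \\
& \exists i_1,\ c_{i_1} < c'_{i_1} \quad \Big(\text{because } \sum_{1 \le i \le P} c_i = s = \sum_{1 \le i \le P} c'_i\Big) \\
& c_{i_1} + 1 \le c'_{i_1} \\
& t_{i_1} (c_{i_1} + 1) \le t_{i_1} c'_{i_1} \\
& m \le m' \quad \big(\text{since } m \le t_{i_1}(c_{i_1} + 1) \text{ by the condition, and } t_{i_1} c'_{i_1} \le m' \text{ by definition of } m'\big)
\end{aligned}
$$
which contradicts $m' < m$, that is, the assumed non-optimality of the original allocation.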
There remains to build allocations satisfying the condition of the lemma. The following algorithm suffices:
- For the chunk size s = 0, take the optimal allocation (0, 0, ..., 0).
- To derive an allocation C' verifying the condition with chunk size s from an allocation C
verifying the condition with chunk size s - 1, add 1 to a well-chosen $c_j$, one that verifies
$$t_j (c_j + 1) = \min_{1 \le i \le P} t_i (c_i + 1).$$
In other words, let $c'_i = c_i$ for $1 \le i \le P$, $i \ne j$, and $c'_j = c_j + 1$.
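The two steps above can be sketched as follows. The function names, the 0-based processor indices, the tie-breaking rule (lowest index wins), and the sample execution times are assumptions made for illustration.

```python
def grow(alloc, times):
    """Add one column to a processor j minimizing t_j * (c_j + 1).

    Returns the new allocation and the selected (0-based) index j.
    Ties are broken in favor of the lowest index (an arbitrary choice).
    """
    j = min(range(len(times)), key=lambda i: times[i] * (alloc[i] + 1))
    return alloc[:j] + (alloc[j] + 1,) + alloc[j + 1:], j


def allocation_sequence(times, s):
    """Allocations for chunk sizes 0, 1, ..., s, starting from (0, ..., 0)."""
    alloc = (0,) * len(times)
    sequence = [alloc]
    for _ in range(s):
        alloc, _selected = grow(alloc, times)
        sequence.append(alloc)
    return sequence


# Hypothetical execution times (assumption, not measured data).
for size, alloc in enumerate(allocation_sequence((3, 5, 8), 6)):
    print(size, alloc)
```

With these sample times, the sequence ends with the allocation (3, 2, 1) for a chunk of 6 columns.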
Lemma. This algorithm is correct.
Proof. We have to prove that allocation C', given by the algorithm, verifies the condition of the previous lemma.
Since allocation C verifies this condition, we have $t_i c_i \le m \le t_j (c_j + 1)$. By definition of j from
the selection rule of the algorithm, we have
$$m' = \max_{1 \le i \le P} t_i c'_i = \max\Big( t_j (c_j + 1),\ \max_{1 \le i \le P,\ i \ne j} t_i c_i \Big) = t_j c'_j.$$
We then have $t_j c'_j \le m' \le t_j (c'_j + 1)$, and
$$\forall i \ne j,\ 1 \le i \le P, \quad t_i c'_i = t_i c_i \le m \le m' = t_j c'_j = \min_{1 \le k \le P} t_k (c_k + 1) \le t_i (c_i + 1) = t_i (c'_i + 1),$$
so the resulting allocation does verify the condition.
To summarize, we have built an algorithm to compute good block sizes for the heuristic
allocation by blocks of columns. One selects an upper bound on the chunk size, and our algorithm
returns the best block sizes, according to our cost function, with respect to this bound.
The complexity of this algorithm is O(Ps), where P is the number of processors and s the
upper bound on the chunk size. Indeed, the algorithm consists of s steps where one computes a
minimum over the processors. This low complexity allows us to perform the computation of the
best allocation at runtime.
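A minimal sketch of the complete procedure is given below: it performs one greedy step per chunk size (a minimum over P processors, hence O(Ps) overall) and keeps the cheapest allocation seen. The function name and the sample execution times are hypothetical; the sample also shows that the cost is not monotone in the chunk size.

```python
from fractions import Fraction


def best_allocation(times, bound):
    """Return (cost, allocation) of the cheapest chunk of size 1..bound.

    Runs `bound` greedy steps, each taking a minimum over the P processors.
    The earliest chunk size wins when several sizes reach the same cost.
    """
    if bound < 1:
        raise ValueError("bound must be at least 1")
    P = len(times)
    alloc = [0] * P
    best = None
    for s in range(1, bound + 1):
        # Give the next column to a processor that would finish it earliest.
        j = min(range(P), key=lambda i: times[i] * (alloc[i] + 1))
        alloc[j] += 1
        c = Fraction(max(a * t for a, t in zip(alloc, times)), s)
        if best is None or c < best[0]:
            best = (c, tuple(alloc))
    return best


# Hypothetical execution times (assumption, not measured data).
print(best_allocation((3, 5, 8), 7))   # (Fraction(5, 3), (3, 2, 1))
print(best_allocation((3, 5, 8), 10))  # (Fraction(8, 5), (5, 3, 2))
```

With a bound of 7, the selected chunk has only 6 columns, because the chunk of size 7 has a higher cost (12/7) than the chunk of size 6 (5/3).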
A Small Example. To understand how the algorithm works, we present a small example with
P = 3, $t_1$ =  , $t_2$ =  , and $t_3$ =  . In the table below, we report the best allocations found by the
algorithm up to s =  . The entry "Selected j" denotes the value of j that is chosen to build the
next allocation. Note that the cost of the allocations is not a decreasing function of s. If we allow
chunks of size not greater than  , the best solution is obtained with the chunk ( ,  ,  ) of size  .
Finally, we point out that our modified heuristic converges to the original asymptotically
optimal heuristic. For a chunk of size $C = L \sum_{i=1}^{P} \frac{1}{t_i}$ columns, where $L = \mathrm{lcm}(t_1, t_2, \ldots, t_P)$,
we obtain the optimal cost
$$\mathrm{cost}_{opt} = \frac{L}{C} = \left( \sum_{1 \le i \le P} \frac{1}{t_i} \right)^{-1},$$
which is the harmonic mean of the execution times divided by the number of processors.
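The convergence claim can be checked numerically on a small instance. In the sketch below, the greedy algorithm is run up to the chunk size $C = L \sum_i 1/t_i$, and the last cost is compared with $\mathrm{cost}_{opt}$. The function names and the sample execution times are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm_all(times):
    """Least common multiple of a sequence of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), times)


def optimal_cost(times):
    """cost_opt = 1 / sum_i(1 / t_i), as an exact fraction."""
    return 1 / sum(Fraction(1, t) for t in times)


def greedy_costs(times, s_max):
    """Costs of the greedy allocations for chunk sizes 1..s_max."""
    alloc = [0] * len(times)
    costs = []
    for s in range(1, s_max + 1):
        j = min(range(len(times)), key=lambda i: times[i] * (alloc[i] + 1))
        alloc[j] += 1
        costs.append(Fraction(max(c * t for c, t in zip(alloc, times)), s))
    return costs


# Hypothetical execution times (assumption, not measured data).
times = (3, 5, 8)
L = lcm_all(times)               # 120
C = sum(L // t for t in times)   # 40 + 24 + 15 = 79
costs = greedy_costs(times, C)
print(costs[-1], optimal_cost(times))  # 120/79 120/79
```

For these sample times, no smaller chunk does better than the chunk of size C, since $\mathrm{cost}_{opt}$ is a lower bound on the cost of any allocation.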


Chunk Size c c  c Cost Selected j
    
     

    
 
 
   
 

 
   
 
     
  
   
  
  
Table 
 Running the algorithm with  processors t    t     and t  
MPI Experiments
We report several experiments on the network of workstations presented in the section "Processor Speed". After
comments on the experiments, we focus on cyclic and block-cyclic allocations and then on our
modified heuristics.
General Remarks
We study different columnwise allocations on the heterogeneous network of workstations presented
in the section "Processor Speed". Our simulation program is written in C using the MPI library for communication.
It is not an actual tiling program, but it simulates such behavior: we have not inserted the code
required to deal with the boundaries of the computation domain. The domain has  rows and a
number of columns varying from  to  by steps of  . An array of doubles is communicated
for each communication; its size is the square root of the tile area.
The actual communication network is an Ethernet network. It can be considered as a bus,
not as a point-to-point connection ring; hence our model for communication is not fully correct.
However, this configuration has little impact on the results, which correspond well to the theoretical
conditions.
As already pointed out, the workstations we use are multiple-user workstations. Although our
simulations were made at times when the workstations were not supposed to be used by anybody
else, the load may vary. The timings reported in the figures are the average of several measures
from which aberrant data have been suppressed.
In the figures for the cyclic allocations and for our modified heuristics, we show for reference the sequential time as measured on the fastest machine,
namely nala.
Cyclic Allocations
We have experimented with cyclic allocations on the  fastest machines, on the  fastest machines,
and on all 8 machines. Because cyclic allocation is optimal when all processors have the same
speed, this will be a reference for other simulations. We have also tested a block-cyclic allocation
with block size equal to  , in order to see whether the reduced amount of communication helps.
The figure "Experimenting with cyclic and block-cyclic allocations" presents the results for these  allocations:  purely cyclic allocations using  ,  , and
 machines, and  block-cyclic allocations.
We comment on the results of this figure as follows:
- With the same number of machines, a block size of  is better than a block size of 1 (pure
cyclic).
- With the same block size, adding a single slow machine is disastrous, and adding the second
one only slightly improves the disastrous performance.
- Overall, only the block-cyclic allocation with block size  and using the  fastest machines
gives some speedup over the sequential execution.
Figure: Experimenting with cyclic and block-cyclic allocations. Execution time (seconds) versus number of columns, for the sequential execution and the cyclic(b, m) allocations.
Remark: cyclic(b, m) corresponds to a block-cyclic allocation with block size b, using the m
fastest machines of the table of measured computation times.
Footnote: Some results are not available for  columns because the chunk size is too large.
We conclude that cyclic allocations are not efficient when the computing speeds of the available
machines are very different. For the sake of completeness, we show in the figure "Cyclic allocations with different block sizes" the execution times
obtained for the same domain ( rows and  columns) and the  fastest machines, for block-cyclic
allocations with different block sizes. We see that the block size has a small impact on the
performance, which corresponds well to the theory: all cyclic allocations have the same cost.
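The claim that all cyclic allocations have the same cost follows directly from the cost function: with a uniform block size b, the cost is $b \max_i t_i / (bP) = \max_i t_i / P$, whatever b. A minimal sketch, with hypothetical function names and sample execution times:

```python
from fractions import Fraction


def cost(alloc, times):
    """cost(C) = max_i(c_i * t_i) / sum_i(c_i), as an exact fraction."""
    return Fraction(max(c * t for c, t in zip(alloc, times)), sum(alloc))


def block_cyclic_cost(times, b):
    """Cost of a block-cyclic allocation giving b columns to every processor."""
    return cost((b,) * len(times), times)


# Hypothetical execution times (assumption, not measured data).
times = (3, 5, 8)
print({block_cyclic_cost(times, b) for b in range(1, 21)})  # {Fraction(8, 3)}
# A speed-aware allocation has a lower cost than any block-cyclic one.
print(cost((3, 2, 1), times))                               # 5/3
```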
		 Using our modied heuristic
Let us now consider our heuristics In Table   we show the block sizes computed by the algorithm
described in Section 
 for dierent upper bounds of the chunk size The best allocation computed
with bound u is denoted as Cu
The time needed to compute these allocations is completely negligible with respect to the
computation times a few milliseconds versus several seconds

Figure: Cyclic allocations with different block sizes. Execution time (seconds) versus block size (in columns).
Table: Block sizes for different chunk size bounds (columns: nala, bluegrass, dancer, donner, vixen, rudolph, zazu, simba, cost, chunk; one row per allocation $C_u$).
The figure "Experimenting with our modified heuristics" presents the results for these allocations. Here are some comments:
- Each of the allocations computed by our heuristic is superior to the best block-cyclic allocation.
- The more precise the allocation, the better the results.
- For  columns and allocation C , we obtain a speedup of  , and  for allocation C ,
which is very satisfying (see below).
The optimal cost for our workstation network is $\mathrm{cost}_{opt} = L / C$ =  /  =  . Note that the
cost cost(C ) =  is very close to the optimal cost. The peak theoretical speedup is equal
to $\min_i t_i / \mathrm{cost}_{opt}$ =  . For  columns, we obtain a speedup equal to  for C . This is satisfying
considering that we have here only  chunks, so that side effects still play an important role. Note
also that the peak theoretical speedup has been computed by neglecting all the dependencies in
the computation and all the communication overhead. Hence, obtaining a twofold speedup with
 machines of very different speeds is not a bad result at all!
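The peak theoretical speedup used above can be computed as follows. Since $\mathrm{cost}_{opt} = (\sum_i 1/t_i)^{-1}$, the ratio $\min_i t_i / \mathrm{cost}_{opt}$ equals $\sum_i t_{min} / t_i$: each machine counts for its speed relative to the fastest one. The function name and the sample execution times are hypothetical and do not reproduce the measured values.

```python
from fractions import Fraction


def peak_speedup(times):
    """min_i(t_i) / cost_opt, with cost_opt = 1 / sum_i(1 / t_i).

    Dependencies and communication costs are ignored, as in the text.
    """
    cost_opt = 1 / sum(Fraction(1, t) for t in times)
    return min(times) / cost_opt


# Hypothetical execution times (assumption, not measured data).
# 1 + 3/5 + 3/8 = 79/40, i.e. slightly less than a twofold speedup.
print(peak_speedup((3, 5, 8)))         # 79/40
print(float(peak_speedup((3, 5, 8))))  # 1.975
# With identical machines, the bound is simply the number of machines.
print(peak_speedup((4, 4, 4, 4)))      # 4
```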
Figure: Experimenting with our modified heuristics. Execution time (seconds) versus number of columns, for the sequential execution and the allocations $C_u$.
Conclusion
In this paper, we have extended tiling techniques to deal with heterogeneous computing platforms.
Such platforms are likely to play an important role in the near future. We have introduced an
asymptotically optimal columnwise allocation of tiles to processors. We have modified this heuristic
to allocate column chunks of reasonable size, and we have reported successful experiments on a
network of workstations. The practical significance of the modified heuristics should be emphasized:
processor speeds may be inaccurately known, but allocating small but well-balanced chunks turns
out to be quite successful.
Heterogeneous platforms are ubiquitous in computer science departments and companies. The
development of our new tiling techniques allows for the efficient use of older computational resources
in addition to newer available systems.
References
	 A Agarwal  DA Kranz  and V Natarajan Automatic partitioning of parallel loops and
data arrays for distributed sharedmemory multiprocessors IEEE Trans Parallel Distributed
Systems  "
  
	
 Rumen Andonov  Had Bourzou  and Sanjay Rajopadhye Twodimensional orthogonal
tiling from theory to practice In International Conference on High Performance Computing
HiPC  pages 

"
  Trivandrum  India   IEEE Computer Society Press
	 Rumen Andonov and Sanjay Rajopadhye Optimal tiling of twodimensional uniform recur
rences Journal of Parallel and Distributed Computing  to appear Available as Technical
Report LIMAVRR   at http##wwwunivvalenciennesfr#limav
	 Pierre Boulet  Alain Darte  Tanguy Risset  and Yves Robert penultimate tiling$ Integra
tion the VLSI Journal  "  

	 PierreYves Calland and Tanguy Risset Precise tiling for uniform loop nests In P Cappello
et al  editors  Application Specic Array Processors ASAP 	  pages " IEEE Computer
Society Press  
	 PY Calland  J Dongarra  and Y Robert Tiling with limited resources In L Thiele  J Fortes 
K Vissers  V Taylor  T Noll  and J Teich  editors  Application Specic Systems Achitectures
and Processors ASAP
  pages 

"
 IEEE Computer Society Press   Extended
version available on the WEB at http##wwwenslyonfr#	yrobert
	 YS Chen  SD Wang  and CM Wang Tiling nested loops into maximal rectangular blocks
Journal of Parallel and Distributed Computing  

"
  
	 J Choi  J Demmel  I Dhillon  J Dongarra  S Ostrouchov  A Petitet  K Stanley  D Walker 
and R C Whaley ScaLAPACK A portable linear algebra library for distributed memory
computers  design issues and performance Computer Physics Communications  "  
also LAPACK Working Note %
	 Ph Chretienne Task scheduling over distributed memory machines In M Cosnard  P Quin
ton  M Raynal  and Y Robert  editors  Parallel and Distributed Algorithms  pages "
North Holland  
	 Alain Darte  GeorgesAndre Silber  and Frederic Vivien Combining retiming and scheduling
techniques for loop parallelization and loop tiling Parallel Processing Letters   Special
issue  to appear Also available as Tech Rep LIP  ENSLyon  RR  and on the WEB at
http##wwwenslyonfr#LIP
	 J J Dongarra and D W Walker Software libraries for linear algebra computations on high
performance computers SIAM Review  
"  
	
 K Hogstedt  L Carter  and J Ferrante Determining the idle time of a tiling In Principles
of Programming Languages  pages " ACM Press   Extended version available as
Technical Report UCSDCS  and on the WEB at http##wwwcseucsdedu#	carter
	 Fran&cois Irigoin and Remy Triolet Supernode partitioning In Proc 	th Annual ACM Symp
Principles of Programming Languages  pages "
  San Diego  CA  January 
	 AmyW Lim and Monica S Lam Maximizing parallelism and minimizing synchronization with
ane transforms In Proceedings of the th Annual ACM SIGPLANSIGACT Symposium on
Principles of Programming Languages ACM Press  January 
	 Naraig Manjikian and Tarek S Abdelrahman Scheduling of wavefront parallelism on scalable
shared memory multiprocessor In Proceedings of the International Conference on Parallel
Processing ICPP  CRC Press  
	 H Ohta  Y Saito  M Kainaga  and H Ono Optimal tile size adjustment in compiling general
DOACROSS loop nests In 	 International Conference on Supercomputing  pages 
"

ACM Press  
	 Peter Pacheco Parallel programming with MPI Morgan Kaufmann  
	 J Ramanujam and P Sadayappan Tiling multidimensional iteration spaces for multicomput
ers Journal of Parallel and Distributed Computing  
"
  


	 Robert Schreiber and Jack J Dongarra Automatic blocking of nested loops Technical Report
  The University of Tennessee  Knoxville  TN  August 
	
 S Sharma  CH Huang  and P Sadayappan On data dependence analysis for compiling pro
grams on distributedmemory machines ACM Sigplan Notices  
  January  Extended
Abstract
	
 M E Wolf and M S Lam A data locality optimizing algorithm In SIGPLAN Conference
on Programming Language Design and Implementation  pages " ACM Press  
	

 Michael E Wolf and Monica S Lam A loop transformation theory and an algorithm to
maximize parallelism IEEE Trans Parallel Distributed Systems  

"  October 

