Determining the Idle Time of a Tiling: New Results by Desprez, Frédéric et al.
HAL Id: hal-02102014
https://hal-lara.archives-ouvertes.fr/hal-02102014
Submitted on 17 Apr 2019
Determining the Idle Time of a Tiling: New Results.
Frédéric Desprez, Jack Dongarra, Fabrice Rastello, Yves Robert
To cite this version:
Frédéric Desprez, Jack Dongarra, Fabrice Rastello, Yves Robert. Determining the Idle Time of a Tiling: New Results. [Research Report] Laboratoire de l'informatique du parallélisme. 1997, 2+17p. hal-02102014
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
October 1997
Research Report No 
Ecole Normale Supérieure de Lyon
E-mail: lip@lip.ens-lyon.fr
Telephone: (+33) (0)4.72.72.80.00    Fax: (+33) (0)4.72.72.80.80
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Abstract
In the framework of fully permutable loops, tiling has been studied extensively as a source-to-source program transformation. We build upon recent results by Hogstedt, Carter, and Ferrante [12], who aim at determining the cumulated idle time spent by all processors while executing the partitioned (tiled) computation domain. We propose new, much shorter proofs of all their results and extend these in several important directions. More precisely, we provide an accurate solution for all values of the rise parameter that relates the shape of the iteration space to that of the tiles, and for all possible distributions of the tiles to processors. In contrast, the authors in [12] deal only with a limited number of cases and provide upper bounds rather than exact formulas.
Keywords: Tiling, fully permutable loops, idle time.
Resume
Dans le cadre des boucle compl
etement permutables  le pavage a ete beaucoup etudie comme
une transformation source
asource Nous nous basons sur des travaux recents de Hogsted 
Carter  et Ferrante   dont le but est de determiner le temps dattente cumule passe par
tous les processeurs pendant lexecution le domaine de calcul partionne pave	 Nous proposons
des nouvelles preuves  plus courtes  de tous leurs resultats et nous les etendons dans plusieurs
directions importantes Nous donnons une solution plus precise pour toutes les valeurs du
param
etre rise qui relie la forme de lespace diteration 
a celle des tuiles  et pour toutes les
distributions possibles des tuiles sur les processeurs Les auteurs dans  ne traitent quun
nombre limite de cas et fournissent des bornes superieures plutot que des formules exactes
Motscles  Pavage  boucles completement permutables  temps dattente
Determining the Idle Time of a Tiling: New Results
J. Dongarra and Y. Robert, Department of Computer Science, University of Tennessee, Knoxville, TN, USA ({dongarra, yrobert}@cs.utk.edu)
F. Desprez and F. Rastello, LIP, ENS Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France ({desprez, frastell}@lip.ens-lyon.fr)
October 1997
Jack Dongarra and Yves Robert are with the Department of Computer Science, University of Tennessee, Knoxville, TN, USA. Jack Dongarra is also with the Mathematical Sciences Section, Oak Ridge National Laboratory, Oak Ridge, TN, USA. Yves Robert is on leave from Ecole Normale Supérieure de Lyon and is partly supported by DRET/DGA under contract ERE ADRETDSSR. This work was supported in part by the National Science Foundation Grant No. ASC; by the Defense Advanced Research Projects Agency under contract DAAH, administered by the Army Research Office; by the Department of Energy Office of Computational and Technology Research, Mathematical, Information, and Computational Sciences Division under Contract DEAC OR; by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR; by the CNRS-ENS Lyon-INRIA project ReMaP; and by the Eureka Project EuroTOPS. The authors acknowledge the use of the Intel Paragon XPS computer, located in the Oak Ridge National Laboratory Center for Computational Sciences, funded by the Department of Energy's Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational and Technology Research. Corresponding author: Yves Robert, Yves.Robert@inria.fr.

1 Introduction
Tiling is a widely used technique to increase the granularity of computations and the locality of data references. This technique was originally restricted to perfect loop nests with uniform dependencies, as defined by Banerjee [3], but has been extended to sets of fully permutable loops [?].
The basic idea of tiling is to group elemental computation points into tiles that will be viewed as computational units. The larger the tiles, the more efficient the computations performed using state-of-the-art processors with pipelined arithmetic units and a multilevel memory hierarchy (as illustrated by recasting numerical linear algebra algorithms in terms of blocked Level 3 BLAS kernels [?]). Another advantage of tiling is the decrease in communication time (which is proportional to the surface of the tile) relative to the computation time (which is proportional to the volume of the tile). The price to pay for tiling may be an increased latency (if there are data dependencies, for example, we need to wait for the first processor to complete the whole execution of the first tile before another processor can start the execution of the second one, and so on), as well as some load-imbalance problems: the larger the tile, the more difficult it is to distribute computations equally among the processors.
Tiling has been studied by several researchers and in different contexts [?]. Rather than providing a detailed motivation for tiling, we refer the reader to the papers by Calland, Dongarra, and Robert [6] and by Hogstedt, Carter, and Ferrante [12], which provide a review of the existing literature. Most of the work amounts to partitioning the iteration space of a uniform loop nest into tiles whose shape and size are optimized according to some criteria (such as the communication-to-computation ratio). Once the tile shape and size are defined, there remains to distribute the tiles to physical processors and to compute the final scheduling.
In this paper, we build upon the work of Hogstedt, Carter, and Ferrante [12]. Given a tiled domain, they aim at determining the cumulated idle time spent by all processors. This cumulated idle time heavily depends upon the tile and domain shapes. A new parameter, called the tile rise, is introduced in [12] in order to relate the shape of the iteration domain to that of the tiles, and it is shown to have a significant impact on the idle time. Both parallelogram-shaped and trapezoidal-shaped iteration spaces are considered. We summarize the results of [12] in Section 2. Then we introduce a slightly different model of computation, which enables us to propose new, much shorter proofs of these results, and we extend them in several important directions. More precisely, we provide an accurate solution for all values of the rise parameter and for all possible distributions of the tiles to processors, while the authors in [12] deal only with a limited number of cases and provide upper bounds rather than exact formulas. These new results are presented in Section 3. In Section 4, we apply our results to the problem of hierarchical tiling, that is, when multiple levels of memory and parallelism hierarchy are involved. In Section 5, we state our conclusions and discuss directions for future research.
2 Determining the Idle Time of a Tiling
In this section, we summarize the results of Hogstedt, Carter, and Ferrante [12], who make the following hypotheses:
(H1) There are $P$ available processors interconnected as a ring. Processors are numbered from $0$ to $P-1$.
(Footnote 1) This small list is far from being exhaustive.
Figure 1: An example of a parallelogram-shaped iteration space with parallelogram-shaped tiles ($r = -1$, $P = 7$, $M = 11$). Arrows represent dependences between tiles.
H Tiles are parallelograms with vertical left and right edges The size and shape of the tiles are
given  so that we deal only with a partitioned already tiled iteration space  as in Figure 
H The iteration space is a twodimensional parallelogram or trapezoid  with vertical left and
right boundaries The rst column and all columns in case of a parallelogramshaped iteration
space has M tiles
H Tiles are assigned to processors using either a onedimensional full block distribution or a
onedimensional cyclic distribution In other words 
 for a block distribution  there are P columns in the iteration space  and all the tiles in
column j    j  P     are assigned to processor j and
 for a cyclic distribution  there are bP columns in the iteration space  and all the tiles in
column j    j  bP     are assigned to processor j mod P 
H	 The rise parameter relates the shape of the iteration space to that of the tiles It is dened
as follows
 Let the slope in reference to the horizontal axis of the top and bottom edges of the
tiles be rtile
 If the iteration domain is a parallelogram  let riter be the slope of the top and bottom
boundaries In this case  Hogsted  Carter  and Ferrante 	 dene the rise r as
r  riter   rtile 
 If the iteration domain if a trapezoid  let riter top and riter bottom be the slopes of the top
and bottom boundaries  respectively In this case  Hogsted  Carter  and Ferrante 	 let
rt  riter top  rtile be the rise at the top of the iteration space and rb  riter bottom  rtile
be the rise at the bottom of the iteration space
See Figure 	 for an illustration
H
 Each tile depends upon both its left neighbor and its bottom neighbor see Figure 

Figure 2: Shape of the iteration space: the rise is positive in (a) and negative in (b).
H Because of hypotheses H
 and H  the scheduling of the tiles is by column Each processor
starts executing its rst column of tiles as soon as possible After having executed a whole
column of tiles  a processor moves on to its next column The time needed to process a tile
is Tcomp with the notations of 	  Tcomp  hw  where h and w are the normalized height
and width of a tile The time needed to communicate data from a tile to its right neighbor
is Tcomm  c  Tcomp As stated in 	  c may take any positive value even though we
expect c   for large tiles  because the communication volume grows linearly with the tile
perimeter  while the computation volume is proportional to the tile volume
Communications can be overlapped with the computations of other (independent) tiles. Moreover, no communication cost is paid between a tile and its top neighbor, because both are assigned to the same processor.
We summarize in Table 1 the results obtained in [12]. In this table, $I_a$ denotes the cumulated idle time spent by the $P$ processors while executing the tiled iteration space. As pointed out in [12], idle time can occur for two different reasons: (i) a processor may have to wait for data from another processor, or (ii) a processor may have finished all of the tiles assigned to it and be waiting for the last processor to terminate execution. In Table 1, condition (C) is a technical condition ($M \ge (1 + c + r)P$) stating that no processor is kept idle when ending the processing of one column of tiles assigned to it; in other words, it can move on to its next column without waiting for any data to be communicated.
3 New Results
In this section we propose new proofs and extend the work of [12].
3.1 Task Graph Framework
The key to our approach is the following: rather than laboriously computing the idle time of each processor and then summing up the results to get the total idle time $I_a$, we compute the parallel
(Footnote 2) A precise modeling of the communication and computation costs for state-of-the-art machines can be found in [?].
Parallelogram shapedBlock distribution
r    Ia  P P    r cTcomp
r   	 Ia  maxc 
 
r
  r

PTcomp
Parallelogram shapedCyclic distribution
r     condi
tion C
Ia  P P    r cTcomp
r   	 Ia  maxc 
 
r 
 r
 bPTcomp
Trapezoidal shapedBlock distributionrb  rt
rb    Ia  P P    
rtrb


cTcomp
rb   	    
rt
Ia  maxc 
  r 
r
 
 P    rtrb

PTcomp
rb  rt    
rb   	
Ia  maxc 
  r 
r
 
 P    rtrb

  rt

PTcomp
Trapezoidal shapedBlock distributionrt  rb
   rt  rb Ia  P P    
rtrb


cTcomp
   c  rt 
 
Ia  P 
rtrb

c  rt


   rb  PTcomp
rt     c 
 
Ia  P 
rb rt
  
rt
 PTcomp
   rb
rt  rb   	 
rb   	
Ia  maxc 
  r 
r
 
  P    rt rb  
rt
 PTcomp
Table 1: Summary of the results of Hogstedt, Carter, and Ferrante.
execution time $t_P$ with $P$ processors, and we state that
$$P \times t_P = I_a + T_{seq},$$
where $T_{seq}$ is the sequential time, that is, the sum of all tile weights (see Figure 3).
We describe the tiled iteration space as a task graph $G = (V, E)$, where vertices represent the tiles and edges represent dependencies between tiles. A handy view of the graph is obtained by rotating the iteration space so that $r_{tile} = 0$. Dependencies between tiles are now summarized by the vector pair $\{(1, 0)^T, (0, 1)^T\}$. See Figure 4, where we have rotated the iteration space of Figure 1.
Figure 3: Active and idle processors during execution, illustrating the formula $P \times t_P = I_a + T_{seq}$.
Figure 4: Another view of Figure 1 after rotation ($r = -1$, $P = 7$, $M = 11$).
Computing the parallel execution time $t_P$ is a well-known task graph scheduling problem. Since the allocation of tiles to processors is given, the task amounts to computing the longest path in the dependency graph, where the weight of a path is the sum of the weights of its vertices and edges. All vertices have the same weight $T_{comp}$. Horizontal edges have weight $T_{comm}$ (they imply a communication cost), while vertical edges have zero weight (no communication cost, due to the allocation). The problem has complexity $O(|V| + |E|)$ (simply traverse the directed acyclic graph $G$), but we aim at finding a closed-form formula for $t_P$, specifically, an analytical expression in the problem parameters $M$, $P$, $r$, and $c$.
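The graph traversal mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `simulate_block` is hypothetical, the rise `r` is restricted to integers so that every tile is a full tile, and column `j` (run by processor `j`, block distribution) is assumed to hold the tiles at rows `j*r` to `j*r + M - 1` after the rotation that makes $r_{tile} = 0$.

```python
def simulate_block(M, P, r, c, t_comp=1.0):
    """Finish time of the last tile, i.e. the longest path in the tile DAG.

    Assumed model (block distribution, integer rise r): processor j runs
    column j bottom-up; a tile waits for its bottom neighbour (no cost) and
    for its left neighbour plus one communication of c * t_comp.
    """
    t_comm = c * t_comp
    prev = {}                       # finish times of column j-1, keyed by row
    t_parallel = 0.0
    for j in range(P):
        cur = {}
        t = 0.0                     # time at which processor j is free
        for row in range(j * r, j * r + M):
            left = prev.get(row)    # left neighbour, if the row exists there
            if left is not None:
                t = max(t, left + t_comm)
            t += t_comp             # execute the tile
            cur[row] = t
        t_parallel = max(t_parallel, t)
        prev = cur
    return t_parallel


if __name__ == "__main__":
    # Parameters of Figure 1 (r = -1, P = 7, M = 11) with an assumed c = 0.5.
    print(simulate_block(M=11, P=7, r=-1, c=0.5))
```

Each tile is visited once and each dependence is examined once, which matches the $O(|V| + |E|)$ cost quoted in the text.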
3.2 Preview of Results
A summary of our results is given in Table 2. A few comments are in order:
- In Table 2, we assume that $M$ is sufficiently large (see Sections 3.3 to 3.6 for a more precise statement). This hypothesis was implicit in the results in Table 1 quoted from [12] (see Remark 2).
- We use a slightly different model for border tiles (the first and the last one in each column). We view the number of tiles in each column as a continuous function of the rise $r$, which introduces partial tiles. These partial tiles are assigned a weight that is proportional to their area, exactly as in [12]. For instance, consider Figure 5, where $r = -1.5$. In the first column, there is no tile below the horizontal axis. In the second column, there are two tiles below the horizontal axis: the first one is a full tile, while the area (hence the weight) of the second one, on the border of the domain, is half that of a full tile. If we had $r = $ …, the weight would be … $T_{comp}$ for the boundary tile. Since partial tiles may only occur at the bottom and at the top of the iteration space, their weight has little impact on the total execution time.
- For trapezoidal shaped iteration spaces, the total idle time $I_a$ is not reported in the table. However, it can be computed straightforwardly from the relation $P t_P = I_a + T_{seq}$, where
$$T_{seq} = bP\left(M + \frac{bP-1}{2}(r_t - r_b)\right)T_{comp}$$
(let $b = 1$ for a block distribution).
(Footnote 3) We have checked this using a comprehensive simulation program; see [?].
Figure 5: Partial tiles on the boundary ($r = -1.5$).

Table 2: Summary of the results of this paper.
Parallelogram shaped, block distribution
  $1 + r + c \le 0$:  $t_P = M\,T_{comp}$;  $I_a = 0$
  $1 + r + c \ge 0$:  $t_P = (M + (P-1)(1 + r + c))\,T_{comp}$;  $I_a = P(P-1)(1 + r + c)\,T_{comp}$
Parallelogram shaped, cyclic distribution
  $1 + r + c \le 0$:  $t_P = bM\,T_{comp}$;  $I_a = 0$
  $1 + r + c \ge 0$ and $M \le (1 + r + c)P$:  $t_P = (M + (1 + r + c)(bP-1))\,T_{comp}$;  $I_a = P((bP-1)(1 + r + c) - (b-1)M)\,T_{comp}$
  $1 + r + c \ge 0$ and $M \ge (1 + r + c)P$:  $t_P = (bM + (1 + r + c)(P-1))\,T_{comp}$;  $I_a = P(P-1)(1 + r + c)\,T_{comp}$
Trapezoidal shaped, block distribution
  $1 + r_b + c \le 0$ and $r_t \le r_b$:  $t_P = M\,T_{comp}$
  $1 + r_b + c \le 0$ and $r_t \ge r_b$:  $t_P = (M + (P-1)(r_t - r_b))\,T_{comp}$
  $1 + r_b + c \ge 0$ and $1 + r_t + c \le 0$:  $t_P = M\,T_{comp}$
  $1 + r_b + c \ge 0$ and $1 + r_t + c \ge 0$:  $t_P = (M + (P-1)(1 + r_t + c))\,T_{comp}$
Trapezoidal shaped, cyclic distribution
  $1 + r_b + c \le 0$ and $r_t \le r_b$:  $t_P = \left(bM + \frac{b(b-1)}{2}P(r_t - r_b)\right)T_{comp}$
  $1 + r_b + c \le 0$ and $r_t \ge r_b$:  $t_P = \left(bM + \left(\frac{b(b-1)}{2}P + b(P-1)\right)(r_t - r_b)\right)T_{comp}$
  $1 + r_b + c \ge 0$:  see Proposition 4
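As a quick sanity check on the sequential time $T_{seq}$ given above, the following sketch sums the column workloads $M + j(r_t - r_b)$ directly and compares the result with the closed form. The function names are hypothetical and the parameter values are arbitrary; the per-column workload is the one used later in the proofs of Section 3.

```python
def column_workloads(M, P, r_t, r_b, b=1):
    """Weight (in tiles) of each of the b*P columns: C_j = M + j (r_t - r_b)."""
    return [M + j * (r_t - r_b) for j in range(b * P)]


def t_seq(M, P, r_t, r_b, b=1, t_comp=1.0):
    """Closed form T_seq = bP (M + (bP - 1)/2 (r_t - r_b)) T_comp."""
    n = b * P
    return n * (M + (n - 1) / 2 * (r_t - r_b)) * t_comp


def idle_time(t_parallel, M, P, r_t, r_b, b=1, t_comp=1.0):
    """Cumulated idle time from the relation P * t_P = I_a + T_seq."""
    return P * t_parallel - t_seq(M, P, r_t, r_b, b, t_comp)


if __name__ == "__main__":
    # Arbitrary illustrative values: 8 columns, one extra tile per column.
    print(sum(column_workloads(10, 4, 0.5, -0.5, b=2)))
    print(t_seq(10, 4, 0.5, -0.5, b=2))
```

With $r_t = r_b$ the function reduces to $T_{seq} = bPM\,T_{comp}$, the parallelogram case.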
   Parallelogram shapedBlock distribution
This is the simplest case  and we work it out in full detail In the formula below  we use the notation
a to denote the positive part of a real number a
a 

a if a  
 if a  
Proposition  Assume a parallelogram shaped iteration space of size M  P block distribution
and rise r If M  P   jrj Then
tP  M  P     r c
Tcomp 
Equivalently
Ia 

P P     r  cTcomp if   r c  
 otherwise
Proof. All processors have the same workload $M T_{comp}$. Because of the dependencies, processor $P-1$ is always the last one to terminate execution. We discuss separately the case $r < 0$ and the case $r \ge 0$.
If $r < 0$, processor $q = P-1$ can start processing its first $-(P-1)r$ tasks at time-step $t = 0$. Then, at time-step $t = -(P-1)r\,T_{comp}$, it can continue the processing of its column (i.e., the remaining $M + (P-1)r$ tiles) only if data communicated along the horizontal axis is already available. Otherwise it must wait. Processing and communicating data from the first $P-1$ tasks of the horizontal axis takes $(P-1)(1+c)\,T_{comp}$. Therefore, the longest path in the dependency graph has length
$$(P-1)\max(-r,\ 1+c)\,T_{comp} + (M + (P-1)r)\,T_{comp}.$$
This longest path is represented in Figure 6.
If $r \ge 0$, processor $q = P-1$ must wait $(P-1)r\,T_{comp}$ time-steps for processor $q = 0$ to complete its first $(P-1)r$ tasks. Then processor $q = P-1$ must wait another $(P-1)(1+c)\,T_{comp}$ time-steps for executing tiles and communicating data along the horizontal dependence path that leads to its first task. Only then, at time-step $t = (P-1)(1 + r + c)\,T_{comp}$, can processor $q = P-1$ start the execution of its $M$ tasks, and it will not be further delayed during this processing. The longest path in the dependency graph is represented in Figure 7.
Figure 6: Longest path when $r < 0$ and $M \ge (P-1)|r|$ ($r = -1$, $P = 7$, $M = 11$).
Figure 7: Longest path when $r \ge 0$ and $M \ge (P-1)|r|$.
Since $r + \max(-r,\ 1+c) = (1 + r + c)^+$, we summarize both cases with the single formula
$$t_P = \left(M + (P-1)(1 + r + c)^+\right)T_{comp}.$$
The formula for $I_a$ is derived from the equation $P t_P = I_a + T_{seq}$, with $T_{seq} = MP\,T_{comp}$.
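The two formulas of Proposition 1 are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the value $c = 0.5$ is an arbitrary choice combined with the parameters of Figure 1.

```python
def positive_part(a):
    """a^+ = a if a >= 0, else 0."""
    return a if a > 0 else 0.0


def t_parallel_block(M, P, r, c, t_comp=1.0):
    """t_P = (M + (P-1)(1+r+c)^+) T_comp, assuming M >= (P-1)|r|."""
    return (M + (P - 1) * positive_part(1 + r + c)) * t_comp


def idle_time_block(M, P, r, c, t_comp=1.0):
    """I_a = P * t_P - T_seq, with T_seq = M * P * T_comp."""
    return P * t_parallel_block(M, P, r, c, t_comp) - M * P * t_comp


if __name__ == "__main__":
    print(t_parallel_block(11, 7, -1, 0.5))   # 14.0
    print(idle_time_block(11, 7, -1, 0.5))    # 21.0 = P(P-1)(1+r+c)
    print(idle_time_block(11, 4, -3, 0.5))    # 0.0, since 1 + r + c < 0
```

The last call shows the regime in which the rise is small enough ($r \le -1 - c$) for the idle time to vanish.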
Remark 1. We see that the results in [12], as reported in Table 1, are inaccurate. A small rise does not prevent a quadratic idle time: the precise condition is $1 + c + r \le 0$, which makes good sense because the communication-to-computation ratio of the target architecture has to play a role. In a word, when $1 + r + c \le 0$, the rise is so small ($r \le -1 - c$) that all tile columns can be processed independently. On the other hand, when $1 + r + c > 0$, which is always true when $r \ge -1$, the total idle time grows quadratically with the number of processors.
Remark 2. The assumption $M \ge (P-1)|r|$ is needed to ensure the validity of the formulas. When it does not hold (see Figure 8), one of the longest paths is given by the processor $q = Q$, where $M = Q|r|$, and the parallel time is
$$t_P = \left(M + Q(1 + r + c)^+\right)T_{comp}.$$
Note that $Q$ may be much smaller than $P$ in this case. The idle time becomes
$$I_a = \begin{cases} P\,\dfrac{M}{|r|}\,(1 + r + c)\,T_{comp} & \text{if } 1 + r + c \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Figure 8: The case $r < 0$ and $M < (P-1)|r|$, with processor $Q$ marked.
  Trapezoidal ShapedBlock Distribution
Proposition  Assume a trapezoidal shaped iteration space of size M P block distribution and
rises rb bottom and rt top If M  P   jrtj jrbj then
tP  M  P     rb  c
  rt   rb
Tcomp 
Proof Let Cj  M  jrt   rb for   j  P    be the workload of processor j  that is 
the total weight of column j If   rb  c    then all columns can be processed independently
The total time is given by the largest processor workload tP  C  M if rt  rb  and
tP  CP    M  P   rt  rb otherwise
If   rb  c    all processors spend some idle time due to horizontal communications The
discussion is similar to that in the proof of Proposition  If r    processor q    q  P    
can start processing its rst j  rb tiles at timestep t    but then needs to wait until timestep
t    cjTcomp before processing the rest of its column  that is  the remaining Cj  rbj tiles
If r    processor j has to wait until timestep t    c  rjTcomp before starting to work In
both cases  processor j terminates the execution of its column at timestep
t    c rbj M  jrt   rbTcomp 
Depending upon the sign of   c  rb  rt   rb    c rt  this quantity is maximum either
for j   or for j  P   
Altogether  we assemble the results of our case analysis in the above formula
Again  we point out that the condition   rb  c   is the key to minimizing idle time If
this condition holds  the only idle time that remains is due to the unbalanced workload with a
trapezoidal iteration space  processors have dierent workloads  but no overhead is due to data
dependencies and to the communications they incur

  Parallelogram ShapedCyclic Distribution
Proposition  Assume a parallelogram shaped iteration space of size M Pb cyclic distribution
and rise r If M  P   br then
tP 


bMTcomp
if   r  c  
  r  cbP    M Tcomp
if   r  c  M    r  cP
  r  cP     bM Tcomp
if   r  c  M    r  cP
0 1 2
(P-1)(1+c) P(1+c) P(1+c)
M
-(P-1)r
P-1 2P-1 bP-1
M+(P-1)r
M+(2P-1)r
M+(bP-1)r
-(2P-1)r
-(bP-1)r
Figure  Sketch of the proof with rb  
Proof All processors have the same workload bMTcomp If   r  c    all tile columns can be
processed independently  and the rst part of the result follows
If   r  c    processor P    is always the latest one to terminate execution We discuss
the two cases r   and r   separately If r    the work of processor P    which is assigned
columns P    	P           bP     can be decomposed as follows see Figure 
tP  max  c rP   
max of propagating data along horizontal axis
and of computing tiles below axis in column P   

Pb  
k max  cPM   Pr
remaining tiles in column kP   
and tiles below axis in column k  P   
M  bP   r
remaining tiles in column bP 

For r    the same decomposition leads to see Figure 
tP    r  cP   
start up time

Pb 
kmax  c rPM
tiles in column j  kP
M
tiles in column j  b  P
It turns out that  because   r  c    the two expressions for tP coincide in other words 
the last expression is valid for both r   and r   This directly leads to the result
0 1 2 P-1 2P-1 bP-1
{Pr
{(P-1)(1+c) Pr
{
P(1+c)
P(1+c)
Pr
M
M
M
M
Figure  Sketch of the proof with rb  
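Proposition 3 can be cross-checked against a direct simulation of the cyclic schedule. The sketch below makes several assumptions that are not in the text: the function names are hypothetical, the rise is an integer (so that all tiles are full), $P \ge 2$ (so that a left neighbor always lives on another processor), and times are in units of $T_{comp}$.

```python
def simulate_cyclic(M, P, b, r, c):
    """Simulated parallel time for b*P columns, column j on processor j % P.

    Assumed model: column j holds rows j*r .. j*r + M - 1; a tile waits for
    the processor to be free and for its left neighbour plus c.
    """
    proc_free = [0.0] * P           # time at which each processor is free
    prev = {}                       # finish times of column j-1, keyed by row
    for j in range(b * P):
        p = j % P
        t = proc_free[p]
        cur = {}
        for row in range(j * r, j * r + M):
            left = prev.get(row)
            if left is not None:
                t = max(t, left + c)
            t += 1.0                # execute the tile (T_comp = 1)
            cur[row] = t
        proc_free[p] = t
        prev = cur
    return max(proc_free)


def closed_form_cyclic(M, P, b, r, c):
    """The three cases of Proposition 3, in units of T_comp."""
    s = 1 + r + c
    if s <= 0:
        return float(b * M)
    if M <= s * P:
        return s * (b * P - 1) + M
    return s * (P - 1) + b * M


if __name__ == "__main__":
    for args in [(10, 3, 2, 0, 0.5), (3, 3, 2, 0, 0.5), (12, 3, 2, -1, 0.5)]:
        print(args, simulate_cyclic(*args), closed_form_cyclic(*args))
```

The second parameter set has $M < (1 + r + c)P$, so the processors wait between consecutive columns; the other two satisfy condition (C) and only pay the start-up delay.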
  Trapezoidal ShapedCyclic Distribution
The trapezoidal shapedcyclic distribution is the most dicult case We have the following result
Proposition  Assume a trapezoidal shaped iteration space of size M  bP cyclic distribution
and rises rb bottom and rt top If M  P   bjrtj jrbj then
tP 


bM  bb  

P rt   rbTcomp
if   r  c   rt  rb
bM   bb   P  bP   rt  rbTcomp
if   r  c   rt  rb
maxtj   j  P   
if   r  c  

where tj  cj
Pb 
kmaxcPM PrbjkP rt rbMjb P rtTcomp
Proof If rbc    all tile columns can be processed independently In this case  the processor
that has the largest workload is processor  if rt  rb and processor P    otherwise The workload
of processor j is
Pb  
kCjkP   where CjkP  M jkP rt  rb is the weight of column
	
jkP     j  P     k  b  This leads to the rst part of the result if rt  rb  the maximum
is achieved for processor  and
Pb  
kCkP   bM
bb  

P rt rb  while if rt  rb  the maximum
is achieved for processor P    and
Pb  
kCP     kP   bM  
bb  

P  bP   rt  rb
If   rb  c    it is more dicult to determine the longest path If rb    we use the same
decomposition as in the proof of Proposition  to decompose the work of processor j    j  P  
tj  max  c rbj
max of propagating data along horizontal axis
and of computing tiles below axis in column j

Pb 
kmax  cPM   Prb  j  kP rt   rb
remaining tiles in column j  kP
and tiles below axis in column j  k  P
M  j  b  P rt
remaining tiles in column j  b  P
We have to take the maximum value of these quantities to obtain the parallel execution time
tP  maxtj  j  P    
Now if rb    the same decomposition leads to the expression for processor j
tj    c rbj
start up time

Pb 
kmax  c rbPM  j  kP rt   rb
tiles in column j  kP
M  j  b  P rt  rb
tiles in column j  b  P
Again  this last expression for tj coincides with the one when rb    hence the result
Remark  It is not dicult to analytically compute the value of j    j  P   that maximizes
tj in Proposition 
 This is a simple but tedious case analysis depending upon the problem
parameters P   M   c  rb and rt
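Rather than carrying out the case analysis of Remark 3 by hand, the maximum over $j$ can simply be evaluated numerically. The sketch below implements the expression for $t_j$ from Proposition 4 in the regime $1 + r_b + c \ge 0$; the function names are hypothetical, times are in units of $T_{comp}$, and the parameter values are arbitrary.

```python
def t_j(j, M, P, b, r_b, r_t, c):
    """Finish time of processor j (units of T_comp), for 1 + r_b + c >= 0."""
    total = (1 + c) * j
    for k in range(b - 1):          # k = 0 .. b-2
        total += max((1 + c) * P,
                     M - P * r_b + (j + k * P) * (r_t - r_b))
    total += M + (j + (b - 1) * P) * r_t
    return total


def t_parallel_trapezoid_cyclic(M, P, b, r_b, r_t, c):
    """t_P = max over processors j = 0 .. P-1 of t_j."""
    return max(t_j(j, M, P, b, r_b, r_t, c) for j in range(P))


if __name__ == "__main__":
    # Trapezoid widening with j: r_b = -1, r_t = 1, two columns per processor.
    print(t_parallel_trapezoid_cyclic(20, 4, 2, -1, 1, 0.5))
    # With r_t = r_b the value agrees with Proposition 3.
    print(t_parallel_trapezoid_cyclic(12, 3, 2, -1, -1, 0.5))
```

Setting `b=1` makes the sum empty, and the function then reduces to the block-distribution finish time $M + j(1 + c + r_t)$ used in Proposition 2.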
4 Hierarchical Tiling
As pointed out by Hogstedt, Carter, and Ferrante [12], tiling may be used for multiple levels of memory and parallelism hierarchy. One important motivation for determining the idle time of a tiling in [12] was, in fact, to demonstrate that such an idle time can have a significant impact on real performance for a large application.
We reuse the example in [12] to illustrate this point. A large rectangular iteration space with horizontal and vertical dependencies is partitioned into supertiles. In turn, each supertile is partitioned into second-level tiles that are assigned to processors. See Figure 11, where supertiles and second-level tiles are rectangular, as opposed to the situation in Figure 12, where supertiles and second-level tiles have a parallelogram shape. To motivate this example, think of a large out-of-core problem, where data is stored on disk. Supertiles are brought in from disk and distributed among the processor main memories; there is an implicit synchronization between two consecutive supertiles. Which is the best strategy: rectangular tiles as in Figure 11, or parallelogram-shaped tiles as in Figure 12? It is stated in [12] that rectangular tiles incur a substantial idle time penalty, whereas parallelogram-shaped tiles do not (at least in steady state; partial tiles do incur a penalty, too).
Figure 11: Partitioning the iteration space into rectangular supertiles and second-level tiles; the rise is $r = 0$.
Figure 12: Partitioning the iteration space into parallelogram-shaped supertiles and second-level tiles; the rise is $r = $ ….
The results of the preceding sections enable us to answer this question: we analytically compute the best partition shape as a function of the iteration space parameters and of the target machine characteristics.
Let $h$ and $w$ be the normalized height and width of second-level tiles, whose processing requires $T_{comp} = hw$ time-steps. Assume a block distribution of tiles to processors, so that each supertile is of size $Mh \times Pw$; in other words, in a supertile there are $P$ columns of $M$ tiles each. We have $P = 4$ and $M = $ … in Figures 11 and 12. Let the size of the whole iteration space be $D_1 h \times D_2 w$, where $D_1 = d_1 M$ and $D_2 = d_2 P$. With the rectangular partitioning, there are $d_1 \times d_2$ supertiles. With the parallelogram-shaped partitioning, there are $(d_1 + 1) \times d_2$ supertiles, and the first and last supertiles in each column are partial. The following lemma is a direct consequence of the results in Table 2.
Lemma 1. With the previous notations, assume that $M \ge (P-1)|r|$. The total execution time to process the iteration space is
$$T_{rect} = (M + (P-1)(1 + c))\,d_1 d_2\,T_{comp}$$
for rectangular tiles, and
$$T_{rise}(r) = \left(M + (P-1)(1 + c + r)^+\right)(d_1 + 1)\,d_2\,T_{comp}$$
for parallelogram tiles with rise $r$.
For rectangular tiles, we rewrite $T_{rect}$ as
$$T_{rect} = (M + (P-1)(1 + c))\,\frac{D_1}{M}\,\frac{D_2}{P}\,T_{comp}$$
to show that it is a decreasing function of $M$. In other words, $M$ should be chosen as large as possible, namely, $M = M_{max}$, where $M_{max}$ is such that $M_{max}$ tiles (i.e., $M_{max}hw$ computational points of the iteration space) fit in the cache memory of a single processor. For the same value of $M$, we choose for $r$ the value of smallest magnitude such that $(1 + r + c)^+ = 0$, i.e., $r = -1 - c$, and we derive that
$$T_{rise}(-1 - c) = M\left(\frac{D_1}{M} + 1\right)\frac{D_2}{P}\,T_{comp}.$$
We formulate the following proposition.
Proposition 5. If $M_{max} \le (P-1)(1 + c)\,\dfrac{D_1}{M_{max}}$, then $T_{rise}(-1 - c) \le T_{rect}$.
The condition in Proposition 5 will always be true for large enough domains. In other words, parallelogram-shaped supertiles will lead to the best performance.
5 Conclusion
In this paper, we have extended results by Hogstedt, Carter, and Ferrante [12], and we have been able to accurately determine the idle time of a tiling for both parallelogram-shaped and trapezoidal-shaped iteration spaces. We have provided a closed-form expression of the idle time for all values of the rise parameter, for a block distribution as well as for a cyclic distribution.
Furthermore, we have used our new results in the context of hierarchical tiling. Although we have dealt only with a particular instance of the multilevel tiling problem, we believe our approach is general enough to be applied in several situations such as those described in [?].
Finally, we point out that the recent development of heterogeneous computing platforms may well lead to using tiles whose size and shape will depend upon the characteristics of the processors they are assigned to. An interesting research direction would be to extend our approach so as to incorporate processor speed as a new parameter of the tiling problem.

References
 A Agarwal  DA Kranz  and V Natarajan Automatic partitioning of parallel loops and
data arrays for distributed sharedmemory multiprocessors IEEE Trans Parallel Distributed
Systems  
	  
	 Rumen Andonov and Sanjay Rajopadhye Optimal tiling of twodimensional uniform recur
rences Journal of Parallel and Distributed Computing  to appear Available as Technical
Report LIMAVRR   http!!wwwunivvalenciennesfr!limav!andonov
 Utpal Banerjee An introduction to a formal theory of dependence analysis The Journal of
Supercomputing  	
  

 Pierre Boulet  Alain Darte  Tanguy Risset  and Yves Robert penultimate tiling Integra 
tion the VLSI Journal    

 PierreYves Calland and Tanguy Risset Precise tiling for uniform loop nests In P Cappello
et al  editors  Application Specic Array Processors ASAP   pages  IEEE Computer
Society Press  
 PY Calland  J Dongarra  and Y Robert Tiling with limited resources In L Thiele  J Fortes 
K Vissers  V Taylor  T Noll  and J Teich  editors  Application Specic Systems Achitectures
and Processors ASAP	
  pages 			 IEEE Computer Society Press  
 L Carter  J Ferrante  S F Hummel  B Alpern  and KS Gatlin Hierarchical tiling a
methodology for high performance Technical Report CS  University of California at
San Diego  San Diego  CA   Available at http!!wwwcseucsdedu!carter
 YS Chen  SD Wang  and CM Wang Tiling nested loops into maximal rectangular blocks
Journal of Parallel and Distributed Computing  		  
 J Choi  J Demmel  I Dhillon  J Dongarra  S Ostrouchov  A Petitet  K Stanley  D Walker 
and R C Whaley ScaLAPACK A portable linear algebra library for distributed memory
computers  design issues and performance Computer Physics Communications    
also LAPACK Working Note "
 Alain Darte  GeorgesAndr#e Silber  and Fr#ed#eric Vivien Combining retiming and scheduling
techniques for loop parallelization and loop tiling Parallel Processing Letters   Special
issue  to appear Also available as Tech Rep LIP  ENSLyon  RR

 J J Dongarra and D W Walker Software libraries for linear algebra computations on high
performance computers SIAM Review  	  
	 K Hogstedt  L Carter  and J Ferrante Determining the idle time of a tiling In Principles
of Programming Languages  pages  ACM Press   Extended version available as
Technical Report UCSDCS

 Fran$cois Irigoin and R#emy Triolet Supernode partitioning In Proc th Annual ACM Symp
Principles of Programming Languages  pages 	  San Diego  CA  January 

 AmyW Lim and Monica S Lam Maximizing parallelism and minimizing synchronization with
ane transforms In Proceedings of the th Annual ACM SIGPLAN SIGACT Symposium on
Principles of Programming Languages ACM Press  January 

 Naraig Manjikian and Tarek S Abdelrahman Scheduling of wavefront parallelism on scalable
shared memory multiprocessor In Proceedings of the International Conference on Parallel
Processing ICPP  CRC Press  
 H Ohta  Y Saito  M Kainaga  and H Ono Optimal tile size adjustment in compiling general
DOACROSS loop nests In  International Conference on Supercomputing  pages 		
ACM Press  
 J Ramanujam and P Sadayappan Tiling multidimensional iteration spaces for multicomput
ers Journal of Parallel and Distributed Computing  		  	
 Fabrice Rastello Techniques de partitionnement Masters thesis  Ecole Normale Sup#erieure
de Lyon  June 
 Robert Schreiber and Jack J Dongarra Automatic blocking of nested loops Technical Report
  The University of Tennessee  Knoxville  TN  August 
	 S Sharma  CH Huang  and P Sadayappan On data dependence analysis for compiling pro
grams on distributedmemory machines ACM Sigplan Notices  	  January  Extended
Abstract
	 M E Wolf and M S Lam A data locality optimizing algorithm In SIGPLAN Conference
on Programming Language Design and Implementation  pages 

 ACM Press  
		 Michael E Wolf and Monica S Lam A loop transformation theory and an algorithm to
maximize parallelism IEEE Trans Parallel Distributed Systems  	

	
  October 

