Load Balancing HPF Programs by Migrating Virtual Processors by Pérez, Christian
HAL Id: hal-02101882
https://hal-lara.archives-ouvertes.fr/hal-02101882
Submitted on 17 Apr 2019
To cite this version:
Christian Perez. Load balancing HPF programs by migrating virtual processors. [Research Report]
Laboratoire de l’informatique du parallélisme. 1996, 2+21p. hal-02101882
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
E-mail: lip@lip.ens-lyon.fr
Telephone: (+33) 72.72.80.00    Fax: (+33) 72.72.80.80
46 Allée d’Italie, 69364 Lyon Cedex 07, France
October  
Abstract
This paper explores the integration of load balancing features in the data-parallel language HPF,
targeting semi-regular applications. We show that the HPF virtual processors are good candidates
to be the unit of migration. Then we compare three possible implementations and show that threads
provide a good tradeoff between efficiency and ease of implementation. We finally describe a
preliminary implementation. The experimental results obtained with Gaussian elimination with
partial pivoting are promising.
Keywords: Data-parallel languages, HPF, compilation, multi-thread environment, thread migration
Resume
Ce papier etudie lintegration dans le langage HPF dun equilibrage de charge visant les
applications semi reguli	eres Nous montrons que les processeurs virtuels du langage sont
de bons candidats pour 
etre lunite de migration Nous comparons alors  implemen 
tations possibles et il appara
t que les processus legers representent un bon compromis
entre lecacite et la facilite dimplementation Nous decrivons ensuite notre implemen 
tation preliminaire Les resultats experimentaux obtenus avec lelimination de Gauss
avec pivot partiel sont prometteurs
Motscles  Langages 	a parallelisme de donnees HPF compilation environnement dexecution
multi thread migration de processus legers
Load balancing HPF programs by migrating virtual processors
Christian Perez†
November   
Contents
1 Introduction
2 The HPF model
3 Irregularity in HPF
  3.1 Regularity and irregularity
  3.2 Extending the second level: irregular data distributions
  3.3 Extending the third level: irregular virtual processor mapping
4 Virtual processors as migrating threads
  4.1 One virtual processor per process
  4.2 Virtual processors emulated by virtualization loops
  4.3 Virtual processors as threads
  4.4 Summary
  4.5 HPF and threads: related work
5 Description of a preliminary implementation
  5.1 Global and local variable issues
  5.2 Data and computation distributions
  5.3 Thread localization
  5.4 Data localization
  5.5 Scalar codes
  5.6 Thread migration
6 Experimental results
  6.1 One processor, one thread
  6.2 One processor, many threads
  6.3 Overhead of the load balancing module
  6.4 Performance of our implementation
7 Conclusion

†Supported by the INRIA project ReMaP and the French CNRS Coordinated Research Project on
Parallelism (PRS).

1 Introduction
HPF-like data-parallel languages seem to have found increasing popularity thanks to the continuous
improvement of compilers. Efficient at first only on a small set of very regular programs, current
compilers include many optimizations. For example, they can overlap communications with
computations by introducing pipelining techniques. They can also optimize irregular communications
when an irregular pattern is used many times.
However the load balancing issues are seldom taken into account Some papers   propose
irregular data distributions in order to provide mechanisms that allow the user to balance statically
the load More recently HPF  includes now general block distributions and irregular distributions
as approved extensions  Yet this is a redistribution notion which has not all the dynamic load
balancing features One of the advantages of a dynamic load balancing system over irregular data
redistributions is that it is almost transparent for the programmer The user has only to give some
knowledge to the compiler if the compiler does not succeed extracting the properties of the program
So the code remains basically the same and the user does not have to bother too much with load
balancing problems The compiler can introduce calls to the load balancing module according
to some user specications or can do it automatically The portability and the maintenance of
the code are improved Moreover a dynamic load balancing system allows an execution to be
ecient on a static andor dynamic heterogeneous environment The load balancing module may
be interfaced with the operating system to share informations about the application needs and the
system constraints
The aim of this paper is to show how the HPF definition allows the notion of virtual processor to
be extended so that it constitutes the unit of a load balancing system. The second goal is to show
that threads are good candidates to encapsulate migrating virtual processors, as thread migration
is becoming available.
The rest of this paper is organized as follows. Section 2 sums up the HPF definitions concerning
alignment and distribution. Section 3 deals with the different levels of HPF data mapping where
irregularity can be introduced. We compare three implementation solutions in Section 4. In
Section 5 we review some implementation issues based on our experiment. Preliminary experiments
are described in Section 6.
2 The HPF model
High Performance Fortran, abbreviated as HPF, is a data-parallel language based on Fortran.
HPF 1 was based on Fortran 90 and HPF 2 on Fortran 95. A very important feature of this language
is that it defines a three-level mapping of data, which in practice means arrays. The first level
is an alignment level. Arrays are aligned relative to one another or to a template, which is an
abstract reference array. The language provides the ALIGN directive for this purpose. Then arrays
are distributed onto abstract processors using the DISTRIBUTE directive of HPF. This distribution
is defined onto a rectilinear arrangement of virtual processors. The last level is declared to be
an "optional implementation-dependent directive". It consists in mapping virtual processors onto
physical processors. Currently, most compilers map exactly one virtual processor onto one process.
Processes are assumed to be one per physical processor. So they make very poor use of the
virtual-to-physical processor mapping. A schema of the three levels of mapping and distribution of
the HPF model is displayed in Figure 1.

[Figure: Arrays -> (alignment of arrays) -> Template -> (distribution of templates) -> Virtual
processors -> (mapping of virtual processors) -> Physical processors]
Figure 1: The HPF model. Arrays are aligned with a template, which is distributed onto abstract
(or virtual) processors. These processors are mapped onto physical processors.
The ALIGN directive: Usually the alignment phase aims at minimizing communications by forcing
elements of different arrays to be aligned at the same location as some element of the template.
Whatever the distribution is, all the elements of arrays aligned with a given element of the
template are guaranteed to be mapped onto the same processor.
The DISTRIBUTE directive: The distribution phase deals with spreading data onto virtual
processors. The basic distributions are the block (BLOCK) and cyclic (CYCLIC) distributions, which
can be seen as special cases of the block-cyclic distribution (CYCLIC(n)). The distribution of a
k-dimensional array representing a template onto an l-dimensional array of virtual processors,
with l smaller than or equal to k, is specified dimension by dimension. If l is strictly smaller
than k, the remaining dimensions are replicated onto all the virtual processors.
The PROCESSORS directive: This directive defines a user-declared Cartesian mesh of abstract (or
virtual) processors. Abstract processors represent "memory regions". The implementor can use the
same number or a smaller number of physical processors to implement it. The language also provides
an intrinsic function that returns the number of physical processors. It is useful for computing
the number of virtual processors according to the number of physical processors. Figure 2
illustrates the distribution of an array according to the two basic distributions.
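As a rough illustration of the DISTRIBUTE rules described above, the following Python sketch computes which virtual processor owns each element of an array A(4,6) distributed (BLOCK, CYCLIC) onto a mesh P(2,3). The function names are hypothetical, processor coordinates are 0-based for convenience, and the block size is assumed to be the ceiling of the extent divided by the number of processors.

```python
def block_owner(i, n, p):
    """Owner of 1-based index i under BLOCK, for n elements over p processors."""
    block = -(-n // p)  # ceiling division: assumed block size
    return (i - 1) // block


def cyclic_owner(i, p, k=1):
    """Owner of 1-based index i under CYCLIC(k) over p processors."""
    return ((i - 1) // k) % p


def owner_map(rows, cols, prow, pcol):
    """Owners of A(rows, cols) distributed (BLOCK, CYCLIC) onto P(prow, pcol).

    Returns a list of rows; each entry is a 0-based (p1, p2) coordinate.
    """
    return [[(block_owner(i, rows, prow), cyclic_owner(j, pcol))
             for j in range(1, cols + 1)]
            for i in range(1, rows + 1)]


A_OWNERS = owner_map(4, 6, 2, 3)
```

In this sketch BLOCK behaves like CYCLIC(k) with k equal to the block size, and CYCLIC is CYCLIC(1), which matches the remark that both are special cases of the block-cyclic distribution.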
[Figure: ARRAY A(4,6), PROCESSOR P(2,3)]
Figure 2: Example of distribution. The array A(4,6) is mapped onto the array of virtual processors
P(2,3) by the distribution directive DISTRIBUTE A(BLOCK,CYCLIC) ONTO P. The first dimension of A
is distributed block-wise, whereas the second is distributed cyclically. The different colors of
the array elements represent the abstract processors onto which they are mapped. The alignment
phase to a template is often bypassed when there is only one array to distribute.
3 Irregularity in HPF
3.1 Regularity and irregularity
HPF is well known for its efficiency in the case of very regular computations. Some extensions of
the language propose irregular data distributions. Their goal is to allow HPF to be efficient with
irregular computations. The problem that arises is that load balancing becomes a burden for the
programmer. Indeed, irregular data distributions are used to balance the load dynamically. Our
goal is to improve HPF by integrating dynamic load balancing features into it. A dynamic load
balancing system also brings efficient execution in a heterogeneous environment. However, we do
not want to restrict such an extension to purely dynamic load balancing. We want to set up a
framework that supports load balancing in order to study the relation between compile-time
knowledge and dynamic load balancing. The goal is to guide the dynamic load balancing by the
properties of the source code. At this time, properties are assumed to be detected and declared to
the compiler by the user. Later we may consider automating the detection of relevant properties.
Let us review the benefits of regularity that we want to preserve and the irregular problems that
we want to handle.
Regularity
We do not want to lose regularity because it is a concept with many advantages.
Efficient accesses to the memory: Regularity allows memory to be described linearly. As a
consequence, the cache hit rate, and thus the effective memory bandwidth, is very high, because
the usual data cache management policies are optimized for linear memory runs.
Simple computation of regular communications: The regularity of the data distribution enables
simple representations of the data distributions. So the functions that map data onto virtual
processors, and their inverses, are simple functions. A particular consequence is the ease of
computing communication schedules in regular codes.
Basis of HPF compilation techniques: The notion of regularity in the distributions is the
foundation of the HPF language. Almost all previous compilation techniques are based on this
concept. By keeping it valid, they also remain valid. The techniques we are referring to relate to
linear algebra, which is useful for compiling regular computations.
Irregularity
We need to handle irregularity because it is a key to high performance.
Real codes have a part of irregularity: Irregularity may come from an unbalanced distribution of
computations or from a repeated irregular communication pattern. In these two cases, an irregular
distribution may balance the load and/or reduce the overall communication cost, in terms of number
of messages and/or in terms of communication volume, according to the architecture of the
underlying communication network.
Heterogeneous execution environments are irregular: Dedicated massively parallel machines are
still an important issue, but networks of workstations (NOW) are becoming more and more popular as
parallel machines for intensive computations. In these environments, irregularity stems from the
heterogeneity of the processors and/or from the multi-user environment. Irregular distributions
can adapt the distribution of the computation to the static and dynamic specificities of these
environments (non-uniform processor power, variations of available processor power, etc.).
The denition of the HPF language can be extent in two dierent ways The rst one consists
in extending the diversity of data distributions allowed for the distribution of arrays onto abstract
processors This is achieved by extending the second level of HPF model The second one is to
control the mapping of abstract processors onto physical processors by extending the third level
Let us discuss these two possibilities
3.2 Extending the second level: irregular data distributions
The second level describes the distribution of arrays onto abstract processors (Figure 1). The
usual distribution is CYCLIC(n). It is not well suited for describing irregular distributions. Now
assume that we have a directive that specifies, for each element of the array, the virtual
processor to which it must be assigned, and that this assignment can change. Thus we have the most
general definition for a distribution directive, one that can support all distributions. The main
drawback is a complete loss of regularity. It leads to a high overhead for codes that have a part
of regularity. That is why more regular (but still irregular) distributions have been proposed.
This approach constitutes the basic idea of many works, because the limitations of the basic
distributions were quickly reached. Real codes, like unstructured mesh codes or molecular dynamics
codes, need irregular distributions to do load balancing.
Vienna Fortran has many ways to handle irregular distributions. It provides a general block
distribution, which generalizes the usual block distribution by allowing the block size to vary. A
more general distribution is the indirect distribution, which uses an array to specify on which
processor each element must be mapped. Finally, user-defined distribution functions constitute the
most general mechanism to specify distributions: they can express any arbitrary mapping between
array indices and processors, including partial or total replication.
Annai provides general block distributions and user-defined data distributions. User-defined data
distributions consist of indirect distributions and mapping functions. Mapping functions are
functions that map global to local indices, local to global indices, and global indices to
processors, and that define the size of the local array, which may be different on each processor.
Contrary to the user-defined distribution functions of Vienna Fortran, mapping functions are
assumed to introduce more overhead than indirect distributions.
Fortran D has two different approaches to the problem. The first one is, as in other compilers, an
index-based distribution. The user may specify indirect distributions through an indirection
array. The second approach is a value-based distribution. It consists in partitioning distributed
arrays according to their values, or to the values of another array, rather than according to
their indices. It can be seen as an indirect distribution where the indirection array is computed
automatically by the run-time system.
The High Performance Fortran Forum decided to include general block distributions and irregular
distributions as approved extensions in the definition of HPF 2. The general block distribution
and irregular distributions are restricted to arrays of dimension one. They can use a variable to
define the mapping when applied with the REDISTRIBUTE directive.
3.3 Extending the third level: irregular virtual processor mapping
The third level controls the mapping of abstract processors onto physical processors, as shown in
Figure 1. Irregularity can be obtained by using an irregular distribution of virtual processors.
If we have more virtual processors than physical processors, we can balance the load by
distributing virtual processors in such a way that the load of each physical processor is as close
as possible to the average load. Figure 3 shows an irregular distribution of virtual processors
onto 3 physical processors.
[Figure: ARRAY A(4,6), PROCESSOR P(2,3), and physical processors A, B and C]
Figure 3: Example of an irregular virtual processor distribution. Let processors A and B each be a
given factor faster than processor C. The array A(4,6) is distributed onto the virtual processor
mesh P(2,3) using the directive DISTRIBUTE A(BLOCK,CYCLIC) ONTO P. Then the load balancing system
can map these 6 virtual processors onto the 3 physical processors in a balanced way, because it
has some knowledge of the execution environment.
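One simple way to picture such a mapping is a greedy assignment that gives each virtual processor to the physical processor that would finish its share earliest. The sketch below is an assumption-laden illustration, not the paper's load balancing module: the function name is hypothetical, all virtual processors are assumed to carry the same load, and the relative speeds 3:2:1 are made-up values that do not come from the figure.

```python
def map_virtual_processors(n_vp, speeds):
    """Greedily assign n_vp equally loaded virtual processors to physical processors.

    speeds maps a physical processor name to its relative speed (hypothetical values).
    """
    mapping = {name: [] for name in speeds}
    for vp in range(n_vp):
        # Estimated completion time of each processor if it also received this VP.
        best = min(speeds, key=lambda name: (len(mapping[name]) + 1) / speeds[name])
        mapping[best].append(vp)
    return mapping


# Six virtual processors (a 2 x 3 mesh) onto three physical processors.
MAPPING = map_virtual_processors(6, {"A": 3.0, "B": 2.0, "C": 1.0})
COUNTS = {name: len(vps) for name, vps in MAPPING.items()}
```

With these assumed speeds every physical processor ends up with the same estimated completion time, which is the "as close as possible to the average load" goal stated in the text.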
The regularity is embedded in the notion of virtual processor. The code of a virtual processor may
be seen as the code currently generated at the process level by HPF compilers (for the purpose of
this discussion, we leave aside for a while the management of the virtual processors). The
irregularity is achieved by an irregular distribution of the virtual processors onto the physical
processors. As this possibility has not been taken into account, to our knowledge, we see two
complementary approaches to defining and using irregular distributions of virtual processors. One
is predictive, whereas the other is reactive.
  First the programmer may give informations to the compiler and so to the run time system
by some new directives when some knowledge about the code behavior is available in advance
  Second when good distributions cannot be predicted dynamic techniques must be used in
order either to balance the load of physical processors or to reduce communication costs and
to maintain load balancing
This approach is a tradeoff between regularity and irregularity. We transform a fine-grained
parallelism into a coarse-grained parallelism. The virtual processor becomes the central notion
for both irregularity and regularity: it embeds a regular part of memory, allowing static
optimizations, and it becomes the unit of irregularity. This makes sense because the
coarse-grained level appears to be better than the fine-grained one for the following reasons:
- The data level seems to be too low for current machines, because their latencies and
  bandwidths, relative to their computing power, are better adapted to coarse-grained
  applications. Furthermore, a load balancing decision is taken only when the benefits exceed the
  communication costs of the balancing phase. This condition is generally not satisfied for a
  small volume of data.
- Another advantage of the coarse-grained level over the low level comes from communication
  costs. Reducing communication overheads, even at the price of an unbalanced distribution of the
  load, should be better in terms of performance than having a perfectly balanced distribution
  with a lot of communication overhead. Overhead may result from communication computation costs
  and bad use of the memory cache, as well as from the increase in communication volume generally
  generated by very irregular data distributions. This last effect is especially important when a
  computation depends on its neighbors, that is to say when locality is a main feature of the
  application.
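The rule that a load balancing decision is taken only when the benefits exceed the communication costs can be written as a small cost model. The Python sketch below is purely illustrative: the function and parameter names are hypothetical, a time step is assumed to last as long as the slowest physical processor needs, and the migration cost is assumed to be a latency plus a volume divided by a bandwidth.

```python
def step_time(loads, speeds):
    """Duration of one step: the slowest processor determines it."""
    return max(load / speed for load, speed in zip(loads, speeds))


def migration_pays_off(loads, speeds, src, dst, vp_load, vp_bytes,
                       latency_s, bytes_per_s, remaining_steps):
    """Return True if moving one virtual processor from src to dst is worthwhile."""
    after = list(loads)
    after[src] -= vp_load
    after[dst] += vp_load
    # Benefit: time saved per step, accumulated over the remaining steps.
    gain = (step_time(loads, speeds) - step_time(after, speeds)) * remaining_steps
    # Cost: one message carrying the data of the virtual processor.
    cost = latency_s + vp_bytes / bytes_per_s
    return gain > cost


LONG_RUN = migration_pays_off([4.0, 2.0], [1.0, 1.0], 0, 1, 1.0,
                              1_000_000, 0.001, 1_000_000.0, 10)
SHORT_RUN = migration_pays_off([4.0, 2.0], [1.0, 1.0], 0, 1, 1.0,
                               1_000_000, 0.001, 1_000_000.0, 1)
```

In this toy setting the same migration is worthwhile when ten steps remain but not when only one remains, which illustrates why a small volume of work rarely justifies the balancing phase.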
In conclusion we propose the use of the third mapping level of the HPF model to handle the
load balancing issues and not to lose the regularity The two major advantages are the transfer
of the main part of the load balancing responsibility to the system and the ability for the system
to take into account external heterogeneity The user has only to distribute the arrays in a way
that maximizes regularity inside a virtual processor without generating too many communications
across virtual processors
4 Virtual processors as migrating threads
We now have to face the problem of implementing virtual processors. We have three main choices. We
can embed a virtual processor into a process, into a virtualization loop, or into a thread. To
decide, we list the main advantages and drawbacks of each of these solutions. The features we
consider are the ease of implementation, the time overhead generated, the behavior of
intra-processor communications (between virtual processors mapped onto the same physical
processor), the number of virtual processors per physical processor (memory overhead), and the
time needed for virtual processor migration.
4.1 One virtual processor per process
This usual solution allocates each virtual processor to an independent process (see Figure 4).
This is what current HPF compilers assume.
Advantage
- This solution is very easy to implement. Current HPF compilers can generate more processes than
  physical processors. So we can have several processes per physical processor, leading us to
  allocate several virtual processors per physical processor. The migration of processes can be
  achieved by one of the numerous systems that support process migration, like MPVM or Cocheck.

[Figure: physical processors, processes, and parts of the array A(4,6) distributed with
DISTRIBUTE (Block,Cyclic)]
Figure 4: Array A is distributed onto a grid of virtual processors. Each virtual processor is
embedded into a process. The physical machine has fewer processors than there are virtual
processors, so each physical processor receives several processes.
Drawbacks
- The context switch between processes is expensive.
- Communications between processes on the same processor are very slow compared to memory
  accesses. Thus communications between virtual processors on the same physical processor, but in
  different processes, are slow.
- The replication of the process code limits the number of processes per processor: it is a huge
  waste of memory.
- The migration of processes is expensive due to the process size, which amounts to millions of
  bytes. Even on a distributed file system like NFS, it still represents a lot of bytes.
4.2 Virtual processors emulated by virtualization loops
This approach maps several virtual processors into a process, and hence onto a processor if we
consider one process per processor. The process code essentially emulates the virtual processors
by virtualization loops. Basically, for a sequential machine, it consists of embedding each
data-parallel instruction in a loop that enumerates all virtual processors. This technique is
often used in compiling explicit data-parallel languages like C*. Figure 5 illustrates this
approach.
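A minimal Python sketch of the idea, assuming two dependence-free data-parallel statements (B = 2 * A, then C = B + 1) and a hypothetical function name: each statement is wrapped in its own loop that enumerates the virtual processors held by the process.

```python
def run_virtualized(local_blocks):
    """Emulate several virtual processors inside one process.

    local_blocks maps a virtual processor id to its local elements of A.
    """
    b = {}
    # Data-parallel statement 1 (B = 2 * A), wrapped in a virtualization loop.
    for vp, a_block in local_blocks.items():
        b[vp] = [2 * x for x in a_block]
    c = {}
    # Data-parallel statement 2 (C = B + 1), wrapped in a second virtualization loop.
    for vp, b_block in b.items():
        c[vp] = [x + 1 for x in b_block]
    return c


RESULT = run_virtualized({0: [1, 2], 1: [3]})
```

Because these statements carry no dependence between virtual processors, the enumeration order does not matter here; with dependences, the loops would need an explicit schedule of the virtual processors.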
Advantages
- Direct memory communications between different virtual processors are available.
- As only a small memory overhead is induced by the virtualization loops, a process can manage
  many virtual processors.
- The cost of virtual processor migration is almost minimal, because only the data and some
  management variables need to be sent.

[Figure: physical processors, processes, virtual processors, and parts of the array A(4,6)
distributed with DISTRIBUTE (Block,Cyclic)]
Figure 5: Array A is distributed onto a grid of virtual processors. Virtual processors are
emulated by virtualization loops. The physical machine has fewer processors than there are virtual
processors, so each physical processor supports several virtual processors emulated by a single
process.
Drawbacks
- Contrary to explicit data-parallel languages, dependences exist in HPF. Thus this solution does
  not seem easy to implement, because the virtual processors must be scheduled. The scheduling is
  needed to satisfy the dependences in virtualization loops.
- Each virtualization loop creates a kind of context switch overhead, which is very low. But the
  overall overhead is very high because virtualization loops are used so intensively.
4.3 Virtual processors as threads
A parallelism of processes can easily be transformed into a parallelism of threads when we do not
consider interference from the operating system, because a thread is a process without the
interface to the outside world. This solution consists in embedding a virtual processor into a
thread, as shown in Figure 6.
Advantages
- This solution may be seen as an extension of previous HPF compilation approaches. Indeed, the
  code of a thread looks like the current process code. That is why this solution is relatively
  simple to implement.
- Direct memory communications between threads in the same process may be implemented, allowing
  efficient virtual-to-virtual processor communications.
- A thread only needs a private stack. Hence the memory overhead is not too large, and many
  threads can reside on a processor.
- The thread migration cost is low, because a migration message contains only the data, the used
  stack, and some thread management variables. The data are generally the main part of the
  message.

[Figure: physical processors, processes, threads, and parts of the array A(4,6) distributed with
DISTRIBUTE (Block,Cyclic)]
Figure 6: Array A is distributed onto a grid of virtual processors. Each virtual processor is
embedded into a thread. The physical machine has fewer processors than there are virtual
processors, so each physical processor receives several threads, which are embedded in a process.
Drawback
- The main limitations stem from the multi-threaded environment. The limitations concern (but are
  not restricted to) portability, efficiency of thread creation and thread synchronization,
  thread context switching, integration of communication, etc.
4.4 Summary
                                Processes             Virtualization           Threads
Ease of implementation          Easy                  Lot of work              Medium
Context switch cost             high                  very high                low
(overall overhead)
Number of virtual processors    few                   a lot                    many
per physical processor
Migrating data                  code + data + stack   data + virtualization    data + stack + thread
                                                      management variables     management variables
Migration cost                  expensive             cheap                    cheap
Table 1: Features of some implementation solutions for virtual processors
We sum up all the previous observations in Table 1. Embedding a virtual processor into a process
does not seem to be a good choice for high performance computing, mainly because a process needs a
lot of memory. This limits the number of processes per processor and represents a lot of data to
be transferred. Virtualization would look like a good solution if it did not need a lot of work
and if virtualization loops did not generate so much overhead. We choose to embed a virtual
processor into a thread because of the good behavior of current multi-thread environments and the
ease of implementation of this solution.
 HPF and threads related work
Some compilers generate several threads per process but their purpose is to overlap communications
with computations The multi threaded approach oers a good solution according to the numerous
works that are using it However we could not nd any that considers threads as a load balancing
facility in a HPF context Threads are just used as an optimization technique
The Paradigm compiler of the University of Illinois integrates threads for the purpose of
overlapping communications with computations. Their first approach was to generate more virtual
processors than physical processors and to use a multi-threaded runtime package. They observe that
this technique minimizes the compiler support required to obtain multi-threaded code, but at the
expense of a complex, machine-dependent runtime support. Then they consider a message-driven model
that simplifies the design of the runtime system, moving the complexity to the compiler and also
increasing portability. Multi-threading generates super-linear speedup because of cache effects.
The main problem, however, is to determine the optimal number of threads per processor.
Multi-threading is also used to overlap communications with computations in other HPF-like
languages. Whereas Sohn et al. use one thread per virtual processor, Andre uses only four
different threads. One thread is responsible for sending data to other processes. The Receive
thread receives each message and stores it. The Receipt manager thread deals with received data.
When the data needed for a computation have been received, the Computation thread, informed by the
Receipt manager thread, starts this computation. So the number of threads is constant whatever the
amount of data.
The use of threads to support the execution of data-parallel programs has also been studied. The
overheads induced by the multi-threaded environment, such as thread creation, thread
synchronization, communication, and thread migration, are studied experimentally. That work shows
that the choice between thread creation and thread synchronization is not obvious.
We also nd systems like Athapascan Nexus  or Chant  Athapascan is the run time
support of the APACHE project and is divided in  parts Athapascan  is a portable standard
library that allows a parallel application to be described in terms of parallel procedure decompo 
sition Athapascan  is an interface for applications that provides load balancing facilities Nexus
and Chant are portable runtime interface for task parallel programming languages They provide
support for multi threading and oer a global memory model A particular aspect of these two
systems is their integration of communications in multi threaded system   Nexus is the com 
piler target of two task parallel languages which are Fortran M  and Compositional C 
Thread migration is not supported by any of these systems at the best of our knowledge
5 Description of a preliminary implementation
In this section we describe the most relevant aspects of our experimental implementation. Our
hand-coded program is directly derived from the code generated by the source-to-source compiler
Adaptor. It is FORTRAN code with calls to a run-time library. The run-time library is written in C
and implements high-level functionalities, in cooperation with the multi-thread environment PM2,
like scalar and array broadcasts, global synchronization, data migration, etc. Our goal was not to
create a complete environment but only to have some running examples in order to test our
approach. Feasibility was one of our main criteria.
We use the multi thread environment PM because it has thread migration But it is not
an environment specially designed to be the run time system of a data parallel language compiler
output The use of a system like Nexus or Athapascan   if only they supported thread migration
 would have simplied either the run time appropriateness to a data parallel program or the
integration of advanced load balancing features
 Global and local variable issues
There are two levels of variables: global variables defined in the main program and local
variables used in subroutines. Global variables are unique in a process and have the important
property of being allocated at the same address in all processes. This results from the fact that
global variables are statically allocated. This property is very important for thread migration
because it allows a migrated thread to have direct access to global variables. Local variables are
local to threads and are not at a fixed address because they are allocated on the stack.
Fortunately, the multi-thread system PM2 deals with stack migration: it can compute the shift to
apply to each pointer addressing a local variable. The pointer only needs to be declared to PM2.
However, the FORTRAN compiler we use generates a lot of pointers when it optimizes the code. As it
does not know about threads, it does not declare these pointers to PM2. That is why our
experiments are done with a low level of optimization (low setting of the -O flag on Alpha).
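The pointer shift described above can be pictured with a toy model. This is only a sketch under stated assumptions: addresses are plain integers, `relocate` and `in_stack` are hypothetical helpers, and nothing here is the PM2 interface. It shows why a pointer that was never declared ends up dangling after the stack has moved.

```python
# Toy model of stack relocation after a thread migration (not the PM2 API).
STACK_SIZE = 0x1000


def in_stack(pointer, base, size=STACK_SIZE):
    """True if the pointer falls inside the stack starting at base."""
    return base <= pointer < base + size


def relocate(registered_pointers, old_base, new_base):
    """Shift every declared pointer by the displacement of the stack."""
    shift = new_base - old_base
    return [p + shift for p in registered_pointers]


old_base, new_base = 0x7000, 0x9400
registered = [0x7010, 0x7128]   # pointers declared to the thread system
hidden = 0x7040                 # compiler temporary, never declared

moved = relocate(registered, old_base, new_base)
# Declared pointers follow the stack; the hidden one still targets the old stack.
```

In this model the declared pointers land inside the new stack, while `hidden` keeps its old value, which is the situation that aggressive compiler optimization can create.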
 Data and computation distributions
Assume that we have the HPF code listed in the figure below. We must generate n1 x n2 threads, one
per virtual processor, distributed over at most n1 x n2 processes. Each thread has to handle a
block of the array A. This block is defined by 4 variables, 2 per dimension: A_LOW1 and A_HIGH1 for
the first dimension and A_LOW2 and A_HIGH2 for the second. These variables are initialized at run
time according to the values of N, n1 and n2 and the logical number of each thread. The FORTRAN
code of the hand-compiled program is given in the second figure of this section. We can see that
it is similar to the code currently generated by HPF compilers.

          real A(N,N)
    !hpf$ processors P(n1,n2)
    !hpf$ distribute A(block,block) onto P
    !hpf$ independent
          do j = 1, N
    !hpf$ independent
             do i = 1, N
                A(i,j) = i + j
             enddo
          enddo

Figure: An HPF code example: a loop nest of depth 2 without dependence.
As previously mentioned, the main advantage of this approach is that a thread encapsulates HPF
compilation knowledge like the one presented here. So a multi-thread extension of an HPF compiler
does not seem to be difficult. The first major change is that the address space is modified. The
second major change is the need to introduce synchronization barriers.
          A_LOW1  = (N * mod(rg_t, n1)) / n1 + 1
          A_HIGH1 = (N * (mod(rg_t, n1) + 1)) / n1
          A_LOW2  = (N * (rg_t / n1)) / n2 + 1
          A_HIGH2 = (N * (rg_t / n1 + 1)) / n2
          call allocate(rg_t, A, A_LOW1, A_HIGH1, A_LOW2, A_HIGH2)
          do j = A_LOW2, A_HIGH2
             do i = A_LOW1, A_HIGH1
                A(i,j) = i + j
             enddo
          enddo

Figure: The FORTRAN hand-compiled program of the previous HPF program. rg_t is the logical number
of the thread and lies in [0, n1 x n2 - 1].
  Thread localization
When a thread needs to communicate with another one, it first has to locate its partner. The
location of a thread is not static because threads can migrate. In each process, an array hidden
in the run-time library contains, for each logical thread number, the logical number of the
process where the thread lives.
This solution has the benefit of being simple and efficient, but at the price of a global knowledge
of thread migrations. Without such global knowledge, another mechanism is needed: for example, a
migrating thread can leave a pointer behind, so that a message can follow the pointer chain until
it reaches the thread. Further optimizations may collapse chains of pointers.
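Both localization strategies can be sketched in a few lines of Python. The names (`LocationTable`, `follow_forwarding`) and the dictionary layout are assumptions made for illustration; the report does not show the run-time library's actual data structures.

```python
class LocationTable:
    """One copy per process: logical thread number -> logical process number."""

    def __init__(self, placement):
        self.where = list(placement)

    def locate(self, rg_t):
        return self.where[rg_t]

    def record_migration(self, rg_t, dest):
        # Must be applied in every process: this is the global knowledge.
        self.where[rg_t] = dest


def follow_forwarding(forward, home, rg_t):
    """Alternative without global knowledge: start at the home process of the
    thread, follow the forwarding pointers left behind by each migration, then
    collapse the chain that was walked."""
    proc, visited = home[rg_t], []
    while rg_t in forward[proc]:
        visited.append(proc)
        proc = forward[proc][rg_t]
    for p in visited:            # chain collapsing
        forward[p][rg_t] = proc
    return proc


# Thread 5 started on process 0, moved to process 2, then to process 1.
forward = {0: {5: 2}, 1: {}, 2: {5: 1}}
print(follow_forwarding(forward, {5: 0}, 5))
```

The table gives a constant-time lookup but requires every process to learn about every migration, whereas the forwarding scheme only touches the processes a thread has visited.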
 Data localization
Once a thread is located, the base address of one of its arrays has to be found. Once again, it
is a dynamic value, as arrays are re-allocated after a migration, generally at a different address.
To find the base address of a distributed array, we use the same technique as for locating a
thread: there is one array per process and per distributed array, which contains, for each thread,
the base address of its block of the distributed array (the address is valid only if the thread is
present). These arrays are updated at each array allocation, that is, at array creation and after
a migration. An illustration of this technique is given in the figure below.
The benefits are still simplicity and efficiency, but this time the drawback does not stem from a
global knowledge but from a memory overhead. Indeed, only some entries of each array are used in
each process on average: the entries related to the arrays of the present threads. As the
percentage of present threads decreases when the number of processes increases, most entries are
unused when many processes are used, which wastes memory.

    allocate(A, size, rg_t):
        p <- malloc(size)
        global_ptr_A[rg_t] <- p

    Process P, thread number rg_t: entry number rg_t of global_ptr_A receives p.

Figure: Illustration of the dynamic array management. The allocation function of the virtual
processor's part of the distributed array A writes the base address of the allocation into the
adequate global array. The cell written is at the index corresponding to the logical number of the
thread doing the allocation.
A solution may be to store the base addresses of the arrays of present threads only. The memory
used then depends on the number of present threads instead of the number of threads in the system.
However, accesses are slowed down by a factor that depends on the storage structure used. Accesses
are on average in O(log(number of present threads)) if a tree structure is used, or in O(number of
present threads) if a list is used, instead of O(1) for the full array storage.
 Scalar codes
Scalar codes in parallel routines generally have to be executed once per process. There are 3
main possibilities to implement them:
1. a scalar thread executes scalar codes;
2. the parallel threads are destroyed at the beginning of each scalar code and re-created at its
end;
3. one thread among the parallel threads executes scalar codes. This is achieved by using mutual
exclusion on a boolean test which specifies whether the code has been executed.
We have chosen the last solution because the thread synchronization needed for its implementation
is cheaper than thread creation in PM2. Moreover, the extraction of scalar codes required by the
first approach is a complex problem in general, whose benefits do not seem better than those of
the third solution.
The most general scheme for our implementation of scalar code is shown in the figure below.
This implementation is not optimal because give_permission_to_the_first_thread() and
end_of_permission() use mutual exclusions that may generate context switches. A more efficient
solution would be to merge these calls with those of the barriers. Further optimizations can focus
on removing these calls: if two scalar parts are not separated by a call to the load balancing
module, the inner calls to give_permission_to_the_first_thread() and end_of_permission() can be
removed.

General scheme that embeds scalar code:

    1      call barrier(sync_local)
    2      call give_permission_to_the_first_thread(bool)
    3      if (bool) then
    4         ... scalar code ...
    5      endif
    6      call barrier(sync_local)
    7      call end_of_permission(bool)

Explanation:
    1  Intra-process synchronization of threads before the scalar part, to ensure that all
       previous parallel code has terminated.
    2  Only one thread among the parallel threads sees its variable bool set to TRUE. All others
       see bool set to FALSE.
    6  Intra-process synchronization to wait for the completion of the scalar code and to be sure
       that all threads have gone beyond the test guarding the scalar code.
    7  End of the critical section.

Figure: General FORTRAN scheme that embeds scalar code, and its explanation.
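The scheme can be mimicked with standard Python threads to check that each scalar section runs exactly once. This is a sketch under assumptions: `ScalarGuard` and `run` are illustrative names, `threading.Barrier` plays the role of the intra-process barrier, and a lock-protected flag plays the role of the boolean test; the PM2 primitives themselves are not shown in the report.

```python
import threading


class ScalarGuard:
    """Lets exactly one thread of a process run each scalar section."""

    def __init__(self, n_threads):
        self.barrier = threading.Barrier(n_threads)
        self.lock = threading.Lock()
        self.taken = False

    def give_permission_to_the_first_thread(self):
        with self.lock:              # mutual exclusion on the boolean test
            if self.taken:
                return False
            self.taken = True
            return True

    def end_of_permission(self):
        with self.lock:
            self.taken = False       # ready for the next scalar section


def run(n_threads, n_sections):
    guard, executed = ScalarGuard(n_threads), []

    def body():
        for section in range(n_sections):
            guard.barrier.wait()     # previous parallel code has finished
            if guard.give_permission_to_the_first_thread():
                executed.append(section)   # the scalar code
            guard.barrier.wait()     # scalar code done, all past the test
            guard.end_of_permission()

    threads = [threading.Thread(target=body) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

Resetting the flag after the second barrier is safe because no thread can pass the first barrier of the next section before every thread has called `end_of_permission()`.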
Another delicate point of this implementation is that there must always be at least one thread
per processor to execute the scalar code. It seems possible to extend this implementation to take
into account the situation where a processor has no thread, but this is not currently done.
 Thread migration
The thread migration decision is currently known by each process. This knowledge allows processes
to maintain the array of thread locations easily (see the section on thread localization).
The PM2 multi-thread system deals with stack migration, but it is not responsible for moving
allocated data. Fortunately, it provides functions to add data to and remove data from the
migration message. We have used this possibility to transfer allocated arrays with the migration
message, as shown in the figure describing the thread migration message. The re-initialization of
the thread in the destination process consists in re-allocating arrays and in correctly setting
the global variables related to the migrated thread, for example the number of local threads used
by the synchronization barriers.
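The migration steps can be summarized by the following Python sketch. Everything here is a model built for illustration: processes and threads are dictionaries, and `pack_migration`, `unpack_migration` and `migrate` are hypothetical names, not PM2 functions.

```python
import copy


def pack_migration(thread):
    """Migration message: management variables, used stack, allocated arrays."""
    return {"rg_t": thread["rg_t"],
            "stack": bytes(thread["stack"]),
            "arrays": copy.deepcopy(thread["arrays"])}   # added by the run-time


def unpack_migration(message, dest):
    """Re-allocate the arrays at the destination and reset per-process globals."""
    thread = {"rg_t": message["rg_t"],
              "stack": bytearray(message["stack"]),
              "arrays": copy.deepcopy(message["arrays"])}
    dest["threads"][thread["rg_t"]] = thread
    dest["n_local_threads"] = len(dest["threads"])       # used by local barriers
    return thread


def migrate(processes, rg_t, src, dst):
    """Move thread rg_t from process src to process dst."""
    thread = processes[src]["threads"].pop(rg_t)
    processes[src]["n_local_threads"] = len(processes[src]["threads"])
    unpack_migration(pack_migration(thread), processes[dst])
    for proc in processes:           # the decision is known by every process
        proc["location"][rg_t] = dst
```

The last loop reflects the fact that every process updates its own thread location array, which is what keeps the table-based localization consistent.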
 Experimental results
In order to estimate the performance of our approach, we use a well-known program: Gaussian
elimination with partial pivoting. The classical solution to this problem is to use a cyclic
distribution of the columns and to swap the column holding the pivot with the current column k at
column iteration k. In our program, we use a block distribution of the columns. Load balancing is
achieved by migrating threads.
We chose this test program because the block distribution approach needs to be balanced and
because we know the cyclic distribution, which transforms this problem into a regular one. So we
can see how far we are from the time of the regular problem, which achieves the lower bound. As
our code is derived from the code produced by Adaptor, we use Adaptor as the reference.

Figure: Thread migration message. Labels recoverable from the drawing: T1; Global_ptr_array_A
(entries Thread 0, Thread 1, ..., Thread 7); Process P and Process Q; message fields: thread
management variables, part of array A, used stack. The migration of a thread consists in sending
the thread management variables, the stack and the arrays allocated by the thread. These arrays
are not migrated by the multi-thread system PM2, but we can add them to the migration message. At
the destination process, the migrated thread has to re-allocate its arrays and to set global
variables correctly.
All the experiments were done on the Alpha farm of the LIFL. It is composed of  DEC  model  AXP
machines whose processors are DECchip  running at  MHz, with  KB of cache memory and  MB of main
memory. The interconnection network is an FDDI crossbar of optical fiber. Each link has a bandwidth
of  Mb/s. We use PVM with the PvmDontRoute flag, which gives a startup cost of  ms and a bandwidth
of  Mb/s. This constraint is imposed by Version  of PM2. It is removed in Version , which will
be available by the end of .
Our test matrices are very regular because the pivot is always in the right column: on the kth
iteration, the pivot is in column k. Thus a cyclic distribution always achieves a perfect load
balance, whereas a block distribution needs load balancing.
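A small Python model makes the imbalance concrete. It assumes 0-based column indices and that the work of a process at iteration k is proportional to the number of its columns with index at least k; the helper names are illustrative only.

```python
def active_columns(N, P, k, owner):
    """Number of columns j >= k still updated at iteration k, per process."""
    load = [0] * P
    for j in range(k, N):
        load[owner(j)] += 1
    return load


def block_owner(N, P):
    """Column-to-process map for a block distribution of N columns."""
    size = -(-N // P)                # ceil(N / P) columns per process
    return lambda j: j // size


def cyclic_owner(P):
    """Column-to-process map for a cyclic distribution."""
    return lambda j: j % P


N, P, k = 16, 4, 8
print(active_columns(N, P, k, block_owner(N, P)))    # half the processes idle
print(active_columns(N, P, k, cyclic_owner(P)))      # evenly spread
```

Halfway through the elimination, the block distribution leaves the first half of the processes without any active column, while the cyclic distribution keeps all of them equally busy.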
First we examine the behavior of our program for various matrix sizes for one processor without
multi threading Next we introduce multi threading to measure the speedup achieved by cache
eects Finally we report on the tests concerning the load balancing version
 One processor one thread
In order to compare our implementation with the code generated by Adaptor, we have tried to stay
close to the code produced by Adaptor. The differences are essentially the calls to the thread
library, like barrier calls, and the use of LRPC (Light Remote Procedure Call) instead of
send/receive commands. LRPCs can be seen as active messages and are useful for implementing remote
get and put operations. Our small run-time library is less general than Adaptor's and thus suffers
less overhead.

    size of the matrix
    Adaptor block
    Adaptor cyclic
    PM2
    Difference (%)

Table: Times in seconds for several programs and several matrix sizes. Block and cyclic are the
distribution directives used in the HPF version compiled by Adaptor. The PM2 version uses a block
distribution of the matrix. The difference is computed between Adaptor cyclic and PM2.
As shown in the table above, the single-thread uniprocessor times of both programs are similar.
The differences do not exceed  % and the average is less than  %. Adaptor's version is slower
because it has to handle array alignment and distribution for communications, whilst we do not.
 One processor many threads
We also want to measure the impact of multi-threading, so we plot the execution times for various
numbers of threads running on one processor. The results are reported in the figure showing the
times of Gaussian elimination as a function of the number of threads.
First, a super-linear speedup appears for some thread numbers and some matrix sizes. It results
from cache effects. For matrix sizes of  and , we observe a maximal gain of  %.
Second, the thread overhead is mainly due to the thread synchronization barriers. We know that in
PM2 the completion time of one barrier is linear with respect to the number of threads calling it.
This is why the curve becomes linear once all the memory accesses are in the cache. The slope of
the curve for the linear part is around  s per thread. As there are  barriers per loop, we obtain
an average completion time of  s per thread and per barrier.
  Overhead of the load balancing module
In the same figure, we have also plotted the sequential time for the same program with the load
balancing module. As there is only one processor, no migration occurs. We can see that the behavior
is the same and that the slope of the linear part is higher: this time, it is around  s per
thread. The explanation is that we have introduced two more barriers. However, the completion time
per thread and per barrier is still around  s.
The overhead introduced by the load balancing module is quite small, around  %, as shown in the
table comparing the thread implementation without and with the load balancing module. The module
tries to compute a new distribution by moving a thread from another processor to the processor
which sent the last pivot. The choice of the thread to be sent is not arbitrary: we choose the
thread with the highest line number among the threads allocated to the most loaded process.
The goal of such a heuristic is to minimize the number of migrations. We have tried to keep it
simple enough to be done automatically. The idea behind this heuristic is based on the knowledge
that the loop index k only increases. At each iteration k, only the sub-array A(k:n, k:n) is
written, while the whole array is A(1:n, 1:n). Thus the last line to be processed in a process is
the line that has the maximal rank number. By migrating it, we hope to postpone the date of the
next migration.
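The selection rule can be written down as a short Python sketch. Several points are assumptions made for illustration: the load of a process is given as a number, threads with a higher logical number hold lines of higher rank, the destination is the process that sent the last pivot, and a process never gives away its last thread. The function name `choose_migration` is hypothetical.

```python
def choose_migration(threads_on, load, last_pivot_sender):
    """Return (thread, source, destination), or None if no move is useful.

    threads_on[p] lists the logical thread numbers living on process p,
    load[p] is the remaining work of process p.
    """
    src = max(range(len(load)), key=lambda p: load[p])   # most loaded process
    dst = last_pivot_sender
    if src == dst or len(threads_on[src]) <= 1:
        return None              # keep at least one thread per process
    return max(threads_on[src]), src, dst


threads_on = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(choose_migration(threads_on, [0, 0, 4, 4], last_pivot_sender=1))
```

Picking the highest-numbered thread moves the work that will stay active the longest, which is what is expected to delay the next migration.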

Figure: Time in seconds of Gaussian elimination with partial pivoting for different matrix sizes
(panels a to d, one matrix size per panel) and different programs (PM2 block, PM2 block + LB,
Adaptor). LB stands for the load balancing module. The curves are plotted as a function of the
number of threads (x axis: # Threads, from 1 to 64; y axis: Time (s)).
 Performance of our implementation
As we can see in the figure comparing the four programs on several processors, the time obtained
with the load-balanced version is about the time of the cyclic version for the first experiment
and  % to  % worse than the Adaptor program for the second one. For some thread numbers, it
achieves the same performance.
We are able to achieve the performance of the cyclic distribution because communications are slow
compared to the computation power. Thus a moderately unbalanced distribution is masked by the
communication.
The load balancing module brings an improvement of  % between the best time of the program without
the load balancing module and the best time with it. In general, the time of thread migration
represents less than  seconds for the whole execution of the program.

    size of the matrix
    PM2 block
    PM2 block + LB
    Difference (%)

Table: Time in seconds for the thread implementation without and with the load balancing module,
for one thread on one processor.
Figure: Time in seconds for two experiments, each with its own matrix size and number of
processors (left and right panels). The times for Adaptor cyclic and Adaptor block are plotted, as
well as the times for the thread program without and with the load balancing module (PM2 block,
PM2 block + LB). x axis: # Threads (4 to 256 on the left, 8 to 256 on the right); y axis:
Time (s).
 Conclusion
We have presented in this paper a method that exploits the HPF definition in order to introduce a
load balancing mechanism into the run-time. The basic idea is to use the virtual processors of HPF
as the unit of migration. So, independently of the alignment and the distribution, we can balance
the load of the application by controlling the distribution of virtual processors onto physical
processors. We have compared the advantages and the drawbacks of the possible implementations. We
have validated the feasibility of this approach with a preliminary implementation which implements
an HPF virtual processor as a thread and thus migrates threads. Experiments were done with a
well-known program, Gaussian elimination with partial pivoting. Measurements seem to confirm that
this is an interesting direction.
Further work is needed to test this approach on a large range of applications. The integration of
HPF and migrating virtual processors has to be developed, as well as the load balancing module.
However, some questions appear. For example, what is the right number of threads for a given
computation kernel? Should the number of threads remain constant or should it vary?

Acknowledgments
I would like to thank Luc Bouge for his guidance and suggestions I would like also to thank
Thomas Brandes for his explanations and helpful comments I thank also J M Geib J F Mehaut
and R Namyst from LIFL for their hospitality and their help for PM Last thanks to the LIFL
which granted me access to their Alpha farm
References
 C Ancourt F Coelho F Irigoin and R Keryell A linear algebra framework for static HPF
code distribution Technical Report A  CRI CRI ENSM Paris November 
 F Andre A Multi Threads Runtime For The Pandore Data Parallel Compiler Techni 
cal Report  IRISA February  Available at URL http wwwirisafrEXTERNE
biblipipihtml
 P Banerjee J A Chandy M Gupta E W Hodges IV J G Holm A Lain D J Palermo
S Ramaswamy and E Su The Paradigm Compiler for Distributed Memory Multicomputers
Computer  October 
 T Brandes Adaptor HPF compilation system developed at GMD SCAI Available at URL
http wwwgmddeSCAIlabadaptoradaptor homehtml
 T Brandes and F Desprez Implementing pipelined computation and communication in an
HPF compiler In L Bouge P Fraigniaud A Mignotte and Y Robert editors Euro Par
Parallel Processing number  in Lecture Notes In Computer Science pages  Lyon
France August  Springer
 J Casas R Konuru S W Otto R Prouty and J Walpole Adaptive load migration systems
for PVM In Proceedings of Supercomputing	 pages  Washington DC November

 B Chapman H Zima and P Mehrotra Extending HPF for Advanced Data Parallel Appli 
cations IEEE Parallel and Distributed Technology  
 High Performance Fortran Forum High Performance Fortran Language Specication Rice
University Houston Texas November  Version 
 High Performance Fortran Forum High Performance Fortran Language Specication Rice
University Houston Texas October  Version 
 I Foster and K M Chandy Fortran M  A Language for Modular Programming
Journal of Parallel and Distributed Computer  to appear Available at URL
http wwwmcsanlgovfortranmpapers
 I Foster C Kesselman and S Tuecke The Nexus Task parallel Runtime System In st Int
Workshop on Parallel Processing 
 I Foster C Kesselman and S Tuecke The Nexus Approach to Integrating Multithreading
and Communication Journal of Parallel and Distributed Computing to appear Available at
URL http wwwmcsanlgovnexus

 M Haines D Cronk and P Mehrotra On the Design of Chant  A Talking Threads Package
Technical Report   ICASE NASA Lamgley Research Center April 
 J Holm A Lain and P Banerjee Compilation of Scientic Programs into Multithreaded and
Message Driven Computation In Proceeding of the 	 Scalable High Performance Computing
Conference pages  Knoxville TN May 
 IMAG Laboratoire de Modelisation et Calcul LMC Projet APACHE Available at URL
http wwwapacheimagfrapacheindexhtml
 A Lain Compiler and Run Time Support for Irregular Computations PhD thesis University
of Illinois at Urbana Champaign 
 Philippe Marquet Ferme dAlpha du LIFL Available at URL http wwwlifrmarquet
fermeadmininstallinstallhtml
 A M$uller and R R$uhl Extending High Performance Fortran for the support of unstructured
computations In Proceeding of the th ACM International Conference on Supercomputing
pages  July 
 R Namyst and J F Mehaut PM parallel multithreaded machine  A multithreaded envi 
ronment on top of PVM In J Dongarra M Gengler B Tourancheau and X Vigouroux
editors EuroPVM pages  Lyon France September  LIP ENS Lyon Herm	es
 California Institute of Technology The CC Programming Language Available at URL
http compbiocaltecheduCCplusplushtml
 C Perez Utilisation des processus legers pour lexecution de programmes 	a parallelisme de
donnees  etude experimentale Technical Report   LIP ENS de Lyon April 
 R Ponnusamy Y S Hwang R Das J Saltz A Choudhary and G Fox Supporting Irreg 
ular Distributions in FORTRAN DHPF Compilers Technical Report UMICAS TR  
University of Maryland  Available at URL http hyenacsumdedupubpapersirreg
supportpsZ
 A Sohn M Sato N Yoo and J L Gaudiot Eects of multithreading on data and work 
load distribution for distributed memory multiprocessors In Proceeding of the th IEEE
International Parallel Processing Symposium Honolulu Hawaii April 
 G Stellner and J Pruyne Resource management and checkpointing for PVM In J Dongarra
M Gengler B Tourancheau and X Vigouroux editors EuroPVM pages  Lyon
France September  LIP ENS Lyon Herm	es
 Thinking Machine Corporation Cambridge MA C  programming guide 
 Rice University The D System  Tools for machine independent data parallel programming
Available at URL http softlibriceedufortrantoolsDSystemDSystemhtml
 R von Hanxleden K Kennedy and J Saltz Value Based Distributions and Alignments in
Fortran D Technical Report CRPC TR S Center for Research on Parallel Computation
Rice University December 

