Dynamic adaptation to CPU and memory load in scientific applications by Mills, Richard Tran
W&M ScholarWorks 
Dissertations, Theses, and Masters Projects Theses, Dissertations, & Master Projects 
2004 
Dynamic adaptation to CPU and memory load in scientific 
applications 
Richard Tran Mills 
College of William & Mary - Arts & Sciences 
Follow this and additional works at: https://scholarworks.wm.edu/etd 
 Part of the Computer Sciences Commons, and the Mathematics Commons 
Recommended Citation 
Mills, Richard Tran, "Dynamic adaptation to CPU and memory load in scientific applications" (2004). 
Dissertations, Theses, and Masters Projects. Paper 1539623457. 
https://dx.doi.org/doi:10.21220/s2-my1g-nn08 
This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at W&M 
ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an authorized 
administrator of W&M ScholarWorks. For more information, please contact scholarworks@wm.edu. 
DYNAMIC ADAPTATION TO CPU AND MEMORY LOAD IN 
SCIENTIFIC APPLICATIONS
A D issertation 
Presented to
The Faculty of the D epartm ent of Com puter Science 
The College of W illiam and M ary in Virginia
In  P artia l Fulfillment 
Of the Requirem ents for the Degree of 
Doctor of Philosophy
by
Richard Tran Mills 
2004
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPROVAL SHEET
This dissertation is subm itted  in partia l fulfillment of 
the requirem ents for the degree of
Doctor of Philosophy
Richard Tran Mills
Approved by the com m itteq/N ovem ber 2004
Dim itrios Nikolopemlos
(  jV irginia Torczon
Robe'rt V o T g t^ ^ '
/  R ichard Wolski
Department of Computer Science 
University of California, Santa Barbara
ii
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
To my fam ily— especially my mother and fa ther— and the Commodore-64 personal
computer.
iii
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Table o f Contents
A cknow ledgm ents viii
List o f Tables x
List o f F igures xi
A bstract x iv
1 Introduction  2
1.1 C o n tr ib u tio n s .................................................................................................................... 4
1.1.1 Dynamic load balancing of inner-outer iterative so lvers .......................... 5
1.1.2 Dynamic memory adap tation  in scientific a p p lic a t io n s .........................  5
1.2 R e le v a n c e ..........................................................................................................................  6
1.3 O r g a n iz a t io n .................................................................................................................... 7
2 Background and related work 9
2.1 Load b a la n c in g ................................................................................................................  9
2.1.1 Pools of independent tasks ............................................................................ 10
2.1.2 D ata  p a rtitio n in g ................................................................................................  12
2.1.2.1 Geometric p a r t i t i o n in g ..................................................................  13
2.1.2.2 Topological p a r t i t io n in g .................................................................. 14
2.1.2.3 Load balancing adaptive com putations ...................................  16
iv
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
2.1.3 Load balancing under variable external l o a d ............................................... 18
2.2 Addressing memory s h o r t a g e ...................................................................................... 19
2.2.1 Out-of-core a lg o r ith m s ......................................................................................  20
2.2.2 V irtual memory system  m o d if ic a tio n s .......................................................  22
2.2.3 M emory-adaptive a lg o r i th m s ........................................................................  23
2.3 Application-centric adaptive scheduling .................................................................  25
3 Load balancing flexible-phase iterations 28
3.1 Flexible-phase iteration ................................................................................................ 29
3.2 Coarse-grain Jacobi-Davidson im p lem en ta tio n .......................................................  30
3.3 Load balancing J D c g ....................................................................................................... 33
3.3.1 Experim ental e v a lu a tio n ................................................................................... 34
3.4 Load balancing a fully m ultigrain Jacobi-Davidson solver ...............................  36
3.4.1 Benefits of m ultigrain p a ra lle lism .................................................................. 36
3.4.2 Load imbalance in the full m ultigrain c a s e ..................................................  38
3.5 A load balanced Additive-Schwarz preconditioner for F G M R E S ...................... 39
4 Load balancing under m em ory pressure 43
4.1 Load-balanced JDcg under memory p re ssu re ........................................................... 43
4.2 Avoiding thrashing from w ithin the a lg o r ith m .......................................................  44
4.2.1 Choosing the param eters ...............................................................................  48
4.2.2 Experim ental r e s u l t s .........................................................................................  49
4.3 L im itations of the memory-balancing a p p r o a c h ....................................................  52
5 A dynam ic m em ory adaptation  fram ework 55
5.1 A portable framework for memory a d a p t iv i t y .......................................................  55
5.2 Elements of the  im p lem en ta tion ..................................................................................  57
5.3 Algorithms for adapting to  memory a v a ila b il ity ....................................................  58
5.3.1 D etecting memory s h o r ta g e ............................................................................  59
v
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
5.3.2 D etecting memory s u rp lu s ...............................................................................  63
5.3.2.1 Problem s w ith overly aggressive p ro b in g ..................................  63
5.3.2.2 Balancing performance penalties w ith a dynam ic probing 
p o l ic y ................................................................................................... 64
5.4 Further details of the adap tation  a lg o r ith m s ........................................................... 68
5.4.1 Estim ating the static  memory s i z e ................................................................. 69
5.4.2....................................Calculating R p e n .................................................................................  70
5.4.2.1 A low frequency probing a p p ro a c h ..............................................  70
5.4.2.2 Probing w ith higher f r e q u e n c y ...................................................  72
6 D esign o f M M lib: A m em ory-m alleability  library 77
6.1 The general framework supported by M M l i b .......................................................  77
6.2 D ata  s t r u c tu r e s .................................................................................................................  78
6.3 Core................... interface and fu n c tio n a lity ................................................................. 79
6.4 O ptim izations and additional fun c tio n a lity ..............................................................  80
6.4.1 Memory access a t sub-panel g r a n u la r i ty ....................................................  80
6.4.2 Eviction of “most-missing” panels ............................................................... 83
7 E xperim ental evaluation o f m em ory-adaptation  86
7.1 Application k e rn e ls ..........................................................................................................  86
7.1.1 Conjugate Gradient (CG)...................................................................................  86
7.1.2 Modified Gram -Schm idt (MGS) .....................................................................  87
7.1.3 M onte-Carlo Ising model sim ulation (ISING) ..........................................  89
7.2 Experim ental environm ents and m e th o d o lo g y .......................................................  91
7.3 Choosing the M M l ib  p a ra m e te rs ...............................................................................  94
7.3.1 Panel s i z e ..............................................................................................................  95
7.3.2 R p e n -m a x .................................................................................................................  97
7.3.3 Choice of probing t y p e ......................................................................................  98
vi
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
7.3.3.1 Static memory p r e s s u r e ................................................................. 98
7.3.3.2 Highly variable memory p re s s u re ................................................ 100
7.4 Graceful degradation of p e rfo rm a n c e ........................................................................  102
7.4.1 Effects of panel w rite-frequency ....................................................................  103
7.5 Quick response to transient memory p r e s s u r e .......................................................  106
7.6 Adaptive versus adaptive j o b s ...................................................................................... 108
7.7 Performance w ith LRU-friendly access p a t t e r n s .................................................... 110
8 C onclusions and future work 116
8.1 Flexible-phase load b a la n c in g ...................................................................................... 116
8.2 A dynamic m em ory-adaptation fram ew ork ..............................................................  117
A ppendix A 120
Bibliography 125
V ita  133
vii
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
ACKNOWLEDGMENTS
This work reflects the influence of a great many people on my life, and I regret th a t it 
is not possible to  name all of them  here; all of them  deserve thanks.
I wish to  thank  all of the members of my dissertation com m ittee for their insightful 
comments and guidance, which have improved this work tremendously. Foremost among 
them  I thank  my advisor, Andreas Stathopoulos, for teaching me a great deal about scientific 
com puting and for taking an active and enthusiastic role in guiding my research. This work 
reflects a great deal of effort on his part. I thank  Dim itris Nikolopoulos for serving as alm ost 
a second advisor during the m em ory-related portions of this work. Robert Voigt deserves 
special thanks not only for his comments on this work, bu t also for valuable guidance in my 
career and professional development. He has always being willing to listen to  my problems 
when I encountered bum ps on my road through graduate school. I thank  Virginia Torczon 
for her helpful comments, and for teaching me some of the subtleties of scientific com puting 
when I was the grader for her “Scientific C om puting” course. Rich Wolski helped m otivate 
th is work w ith his research on Grid middleware, and deserves special thanks for agreeing 
to serve on my com m ittee w ithout knowing anything about me.
Outside of my comm ittee, other people a t W illiam & M ary deserve thanks. Evgenia 
Smirni taught me me the basics of parallel com puting and collaborated on the early stages of 
the load balancing portion of this work. Xiaodong Zhang impressed upon me the im portance 
of the interplay between algorithm s and architecture in his excellent “Advanced Com puter 
A rchitecture” course. Shiwei Zhang of the D epartm ent of Physics exposed me to  many 
areas of com putational science—of which I might otherwise have rem ained ignorant—in his 
excellent course on com putational physics. Tom Crockett kept SciClone up and running 
and was always ready to resolve technical issues w ith skill and enthusiasm; additionally, 
he provided much-needed friendly conversation on the upper floor of Savage House. Jim  
McCombs provided the Jacobi-Davidson code used in this work, collaborated on the load- 
balancing work, and was an amicable office mate.
I spent the summ er of 2003 working at Los Alamos National Laboratory  w ith Peter 
Lichtner. A lthough the work th a t I did there was unrelated to my dissertation research, 
the experience was im portan t because it rekindled my enthusiasm  for research after several 
years in graduate school had caused it to  wane. I thank  Peter for serving as a m entor 
w ithout parallel during an incredibly intellectually stim ulating summer.
I thank  Uncle M aury for stim ulating my interest in physics in general and com putational 
physics in particular; th a t led me down the road to becoming a com putational scientist.
I thank  my wife, Saffron, for her companionship, love, and support though some difficult 
times as we suffered through our graduate studies together. The years th a t we have spent 
in graduate school have been very trying, bu t thanks to her constant com panionship they 
have also been the best ones of my life.
viii
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Lastly, bu t most im portantly, my parents deserve special thanks for all th a t they have 
done. Besides being wonderful, loving parents, they impressed the im portance of knowledge 
and learning upon me at a young age, and their gentle encouragement is responsible for 
all of my academic (and other) success. I thank  my m other especially for the countless 
hours she spent teaching me reading, arithm etic, and so on when I was a young child; I 
am  very fortunate to have had such a patient teacher of these fundam entals. My father 
deserves credit for teaching me to  th ink  rationally  and scientifically, and to view the world 
w ith curiosity. It is his example th a t inspired me to pursue a career in the sciences. He has 
served as my first and closest mentor, answering my endless questions when I was four years 
old, reading great books to  me and my sister, encouraging my am bitious science projects 
in prim ary and secondary school, helping me select courses to  take as an undergraduate, 
collaborating w ith me on my senior thesis work, and helping me through some tough tim es 
in graduate school. His influence on my thinking has been inestimable.
This work was supported in part by a Department o f Energy Computational Science 
Graduate Fellowship, provided under grant number DE-FG02-97ER25308 and administered 
by the wonderful folks at Krell Institute. Some o f the work was performed using the Sci- 
Clone computational cluster at the College o f W illiam & Mary, which was enabled by grants 
from  Sun Microsystems, the National Science Foundation, and Virginia’s Commonwealth 
Technology Research Fund.
ix
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
List of Tables
3.1 Performance of JDcg w ith and w ithout load balancing on a system  subject
to  various static  external l o a d s .................................................................................. 36
3.2 Performance of JDcg w ith and w ithout load balancing w ith come-and-go
dum m y jobs on each n o d e ............................................................................................  37
3.3 Performance of the m ultigrain JD  running on different node configurations,
w ith and w ithout load b a la n c in g ..............................................................................  40
3.4 Performance of FGM RES when preconditioned with ASM with and w ithout
load balancing ................................................................................................................  42
4.1 Performance of JDcg w ith and w ithout load balancing w ith some nodes ex­
ternally  loaded by static, memory intensive dum my j o b s ..................................  45
4.2 Performance of JDcg using the anti-thrashing scheme w ith a come-and-go
memory intensive dum m y job on one n o d e .............................................................  50
4.3 Performance of JDcg w ith the “ideal” anti-thrashing sch em e........................... 51
4.4 Improvements in execution tim e of the dum m y job because of increased pro­
cessor th roughput due to our anti-thrashing scheme .........................................  52
7.1 Com parison of the two probing types under highly variable memory pressure 103
x
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
List of Figures
3.1 The d a ta  parallel Jacobi-Davidson algorithm  ......................................................  32
3.2 SciClone: The W illiam and M ary heterogeneous cluster of workstations . . 39
5.1 D etecting and responding to  memory shortage w ith the naive (and incorrect)
“solution” ..........................................................................................................................  60
5.2 The algorithm  for detecting and responding to  memory s h o r t a g e .................. 61
5.3 An example of detecting and responding to  memory shortage using the algo­
rithm  in Figure 5 . 2 .........................................................................................................  62
5.4 G raphs depicting the necessity of a dynam ic delay param eter in the algorithm
for detecting memory a v a ila b il ity ..............................................................................  65
5.5 An example of how peakRSS is d e te rm in ed ..............................................................  67
5.6 The combined algorithm  for detecting and adapting to  memory shortage or
s u rp lu s .................................................................................................................................  68
5.7 Overhead as a function of size of the region upon which m in co reO  is called 71
5.8 An example of how troughRSS is d e te r m in e d .......................................................  73
5.9 The final, complete algorithm  for detecting and adapting to memory shortage
or surplus, using low frequency probing ................................................................. 75
5.10 The final, complete algorithm  for detecting and adapting to  memory shortage
or surplus, using high frequency p r o b in g ................................................................. 76
6.1 Extended framework modeling the memory access needs of a wide variety of
scientific ap p lica tio n s ......................................................................................................  78
xi
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
6.2 Benefits of using smaller memory transfer u n i t s ..................................................  82
6.3 Benefits of evicting partially  swapped out p a n e l s ...............................................  85
7.1 M atrix-vector m ultiplication algorithm  for a sparse m atrix  of dimension N
consisting of a num ber of d ia g o n a ls ...........................................................................  88
7.2 The algorithm  executed by our MGS test code, which sim ulates the behavior 
of a GM RES-type solver, generating random  vectors of dimension N which
are added to  an orthonorm al basis via modified G ra m -S c h m id t..................... 89
7.3 The algorithm  for executing a M etropolis sweep through the LxL spin lattice
of the Ising model .........................................................................................................  91
7.4 Two different modes in the behavior of the v irtual memory system  observed
under the same experim ental s e tt in g s .......................................................................  93
7.5 Relative frequencies of iteration times for m em ory-adaptive CG running on a 
Linux 2.4 machine with 128 MB RAM against a memlock job consuming 70
M B ........................................................................................................................................  94
7.6 Effect of panel size on performance of MMLlB-enabled C G ..............................  96
7.7 Effect of panel size on performance of MMLlB-enabled I S IN G .......................  97
7.8 Effects of R pen.m ax  on M M lib  p e rfo rm a n c e .........................................................  99
7.9 Com parison of performance using the two m ethods of calculating R pen . . . 101
7.10 RSS versus tim e for the two types of probing in memory adaptive CG under 
highly variable memory lo a d ........................................................................................  104
7.11 Performance of the application kernels under constant memory pressure . . 105
7.12 Effects of write frequency on M M l ib  performance in IS IN G ...........................  107
7.13 Quick response by M M l ib  to  transient memory p re ssu re .................................. 108
7.14 Profiles of RSS vs. tim e for two instances of memory adaptive ISING jobs 
running simultaneously, using low frequency p r o b in g ......................................... I l l
7.15 Profiles of RSS vs. tim e for two instances of memory adaptive ISING jobs 
running simultaneously, using high frequency p ro b in g .........................................  112
xii
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
7.16 Performance of in-core and memory adaptive versions of CG-LRU, which uses
an LRU-friendly access p a t t e r n .................................................................................. 114
7.17 Performance of m em ory-adaptive CG versus memory pressure when using
the wrong panel replacem ent p o lic y ...........................................................................  115
A .l Pseudocode depicting how the L B lib  library can be used to  load balance the
JDcg s o l v e r .......................................................................................................................  124
xm
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
ABSTRACT
As comm odity com puters and networking technologies have become faster and more 
affordable, fairly capable machines have become nearly ubiquitous while the effective “dis­
tance” between them  has decreased as network connectivity and capacity has multiplied. 
There is considerable interest in developing means to  readily access such vast am ounts of 
com puting power to  solve scientific problems, bu t the complexity of these m odern com­
puting environm ents pose problems for conventional com puter codes designed to  run  on a 
static, homogeneous set of resources. One source of problems is the heterogeneity th a t is 
naturally  present in these settings. More problem atic is the com petition th a t arises be­
tween program s for shared resources in these semi-autonomous environm ents. F luctuations 
in the availability of CPU, memory, and other resources can cripple application perfor­
mance. Contention for CPU tim e between jobs may introduce significant load imbalance in 
parallel applications. Contention for lim ited memory resources may cause even more severe 
performance problems: If the requirem ents of the processes running on a com pute node 
exceed the am ount of memory present, thrashing may increase execution tim es by an order 
of m agnitude or more.
Our goal is to  develop techniques th a t enable scientific applications to achieve good 
performance in non-dedicated environm ents by m onitoring system  conditions and adapting 
their behavior accordingly. In  this work, we focus on two im portan t shared resources, CPU 
and memory, and pursue our goal on two distinct bu t complem entary fronts: F irst, we 
present some simple algorithm ic modifications th a t can significantly improve load balance 
in a class of iterative m ethods th a t form the com putational core of many scientific and 
engineering applications. Second, we introduce a portable framework for enabling scientific 
applications to dynam ically adapt their memory usage according to  current availability of 
m ain memory. An application-specific caching policy is used to  keep as much of the d a ta  set 
as possible in m ain memory, while the rem ainder of the da ta  are accessed in an out-of-core 
fashion. This allows a graceful degradation of performance as memory becomes scarce.
We have developed m odular code libraries to facilitate im plem entation of our techniques, 
and have deployed them  in a variety of scientific application kernels. Experim ental eval­
uation of their performance indicates th a t our techniques provide some im portan t classes 
of scientific applications w ith robust and low-overhead means for m itigating the effects of 
fluctuations in CPU  and memory availability.
xiv
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
DYNAMIC ADAPTATION TO CPU AND MEMORY LOAD IN 
SCIENTIFIC APPLICATIONS
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 1
Introduction
The advent of powerful bu t inexpensive workstations and fast, affordable networking tech­
nologies has fundam entally altered the landscape of parallel and d istribu ted  computing. 
Powerful comm odity components have led to  the emergence of clusters of workstations 
(COWs) as the prim ary means of parallel com puting at many institutions. On a much 
grander scale, researchers have been taking steps towards geographically d istribu ted  com­
puting, building com putational “G rids” th a t link com puting resources between different 
institutions. Grid environm ents promise to  provide the com putational resources necessary 
for analyzing petabyte-scale d a ta  sets and conducting the most com putationally dem anding 
of simulations. The im pact of these technological developments has been particularly  felt 
in scientific applications, which form a great portion of the high-performance com puting 
workload.
The category of “scientific applications” is large and evolving, and therefore difficult 
to  characterize accurately. We make, however, the following broad observations about 
scientific applications in parallel and high-performance com puting (HPC) environments: 
F irst, the  applications often work w ith large quantities of data, because realistic sim ulation 
of a problem  often requires a large num ber of degrees of freedom. Second, they tend to follow 
fairly regular, repetitive da ta  access and comm unication patterns. Third , they usually have 
synchronization points at which collective communications occur.
Many m odern parallel com puting environm ents possess a degree of complexity th a t poses 
problems for conventional scientific applications designed to run  on dedicated, uniform  re-
2
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 1. INTRO DU CTION 3
sources such as massively-parallel processor (M PP) machines. One source of problems is 
the heterogeneity th a t arises naturally  in these new settings. For instance, upgrades to a 
COW  may occur in stages, w ith only specific machines being replaced or added. In  other 
cases, the cluster may simply be a loosely coupled network of personal workstations. In a 
com putational Grid, d isparate resources from different institu tions may be used in conjunc­
tion. Traditional parallel program s a ttem p t to d istribu te  work evenly between all nodes 
because they target environm ents in which compute nodes are homogeneous in processing 
power and memory capacity. W hen nodes differ considerably in processing speed, such a 
strategy can result in severe load imbalance as faster processors idle a t synchronization 
points, waiting for their slower brethren  to  finish. If nodes differ considerably in memory 
capacity, nodes with higher memory capacities may be underutilized.
Beyond architectural heterogeneity, com petition for non-dedicated, shared resources 
poses even greater challenges. Various economic and practical considerations have made 
m ultiprogram m ing of resources commonplace: In decentralized systems, such as Grid en­
vironm ents or loose networks of personal workstations, space sharing simply may not be a 
viable option. Owners of smaller COWs may not be willing to sacrifice the tim e necessary 
to  adm inister a batch queuing system, or to  tolerate the slower tu rnaround  tim es resu lt­
ing from waiting in the queue, especially during the development and debugging cycles. In 
shared memory machines, even when CPUs are space-shared, jobs still compete for memory 
and I /O  resources.
Fluctuations in the availability of CPU, memory, and other resources can cripple ap­
plication performance. Contention for CPU tim e between jobs may introduce significant 
load imbalance in parallel applications. Contention for lim ited memory resources may cause 
even more severe performance problems: If the requirem ents of the processes running on 
a compute node exceed the am ount of memory present, thrashing may increase execution 
times by an order of m agnitude or more.
It is our thesis th a t the dynamic and unpredictable nature of resource contention in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 1. INTRO D U CTIO N 4
m odern parallel environments necessitates a degree of system  awareness and adaptiv ity  th a t 
is not present in today’s software. Our goal is to develop techniques th a t allow scientific 
applications to achieve good performance in non-dedicated environm ents by m onitoring 
system  conditions and adapting their behavior accordingly. In this work, we focus on two 
im portant shared resources, CPU and memory, and pursue our goal on two distinct bu t 
com plem entary fronts:
1. Application-level dynamic load balancing. We present some simple algorithm ic m odi­
fications which can significantly improve load balance in a class of iterative m ethods. 
We focus on these m ethods because they form the com putational core of many scien­
tific and engineering applications; improvements to  the m ethods can lead to  im m ediate 
performance gains in a variety of scientific codes.
2. M emory adaptation in scientific and data-intensive codes. We introduce a portable 
framework for enabling applications to  dynam ically adapt their memory usage accord­
ing to current availability of m ain memory. An application-specific caching policy is 
used to  keep as much of the d a ta  set as possible in main memory, while the rem ainder 
of the da ta  are accessed in an out-of-core fashion. This allows a graceful degradation 
of performance as memory becomes scarce.
The com bination of system-aware mechanisms for dynamic load balancing and memory 
adap tation  has the potential to change the way com putational science is done by enabling 
efficient use of shared resources th a t have h itherto  been underutilized. We provide some spe­
cific tools and techniques for im plem enting these mechanisms in some specific cases, bu t our 
broader goal is to  establish an archetype for building system-aware, adaptive applications.
1.1 C on trib u tion s
The contributions of this work can be outlined as follows:
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 1. INTRO D U CTION 5
1.1 .1  D y n a m ic  load  b a la n c in g  o f  in n er -o u ter  ite r a tiv e  so lvers
We have studied a novel strategy for dynamic load balancing of a class of inner-outer itera­
tive m ethods. Our goal has been to  enable parallel, iterative, sparse m atrix  m ethods—which 
form the core of many scientific applications—to perform  well in heterogeneous and mul­
tiprogram m ed environments. Our techniques are designed for inner-outer iterative solvers 
th a t employ local solvers for the inner iterations, though they are also applicable to  other 
algorithm s th a t possess what we term  a flexible phase (defined in C hapter 3). We have 
w ritten a C-library to facilitate use of our load balancing technique, and have used it to 
implement our scheme in some different solver codes.
We have experim entally dem onstrated th a t our scheme can cope w ith severe load imbal­
ance under unpredictable external loads in a coarse-grained, block Jacobi-Davidson eigen- 
solver th a t independently calculates a different correction vector on each processor. This 
work has been published in [63]. We have also used our load-balancing scheme to enable 
effective use of a collection of heterogeneous sub-clusters of workstations in a version of 
the Jacobi-Davidson solver th a t fully employs the so-called multigrain parallelism; see our 
paper [57]. Finally, we have dem onstrated th a t our m ethod can be used in an additive- 
Schwarz preconditioned linear system solver to sm ooth load imbalances due to  differences 
in conditioning of the subdom ains, and also to sm ooth out small load imbalances due to 
differences in processor speeds.
1 .1 .2  D y n a m ic  m em o ry  a d a p ta tio n  in  sc ien tific  a p p lica tio n s
We have explored a general framework th a t enables scientific and other data-intensive codes 
th a t repetitively access large d a ta  sets to dynam ically adapt their memory requirem ents as 
memory pressure arises. We have developed a m odular supporting library, M M l i b , th a t 
makes it easy to embed m em ory-adaptivity in many scientific kernels w ith little coding 
effort. The biggest technical hurdle we had to overcome to realize our framework is the 
lack of inform ation about memory availability provided by operating systems. We have
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 1. INTRO DU CTION 6
developed an algorithm  th a t can effectively judge memory availability w ith m inim al reliance 
on operating system-provided information.
We have used our library to  inject m em ory-adaptivity into three scientific application 
kernels: a conjugate-gradient linear system  solver, a modified G ram -Schm idt orthogonal- 
ization routine (surrounded by a GMRES-like w rapper), and a M onte-Carlo Ising model 
simulation. Though simple, these kernels are representative elements of scientific sim ula­
tions spanning a diverse range of fields. Experim ental evaluation of these m em ory-adaptive 
kernels in Linux and Solaris environm ents has shown th a t our techniques yield performance 
several tim es be tte r th an  th a t obtained by relying on the v irtual memory system  to handle 
memory pressure. Furtherm ore, we have dem onstrated th a t these performance gains are 
not only due to  enabling use of application-specific replacem ent policies, bu t from avoidance 
of v irtual memory system  overheads and antagonism. We have also shown th a t multiple 
jobs employing our m em ory-adaptation techniques can coexist on a node w ithout ill-effects. 
O ur work is presented in [61] and [62].
1.2 R elevance
This work is relevant to  both  com puter scientists and practitioners of com putational science. 
From a com puter science perspective, it is significant because it explores adap ta tion  to 
fluctuating system  conditions completely a t the application-level; traditionally, adap tation  
approaches have been more system-oriented. An application-centric approach allows us to 
exploit knowledge about the application th a t system  software does not possess.
In our work on load balancing parallel com putations under unpredictable external loads, 
we utilize knowledge of the numerics of an im portan t class of iterative m ethods to dynam i­
cally modify the com putation—w ithout affecting the end result—to achieve good load bal­
ance w ith very little overhead. This departs from conventional load balancing approaches, 
which adaptively schedule a com putation but do not modify it.
In our work on dynamic adap tation  to high levels of memory pressure, we use knowledge
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 1. INTRO DU CTION 7
of the memory-access characteristics of an application to  manage memory using an appropri­
ate  replacem ent policy and to  use units of d a ta  transfer suited to  the granularity  of memory 
access. O ur memory adap tation  work is additionally significant because it has explored the 
interplay of system  and application in m ultiprogram m ed systems under extrem e levels of 
memory demand. System  and application behavior under such conditions can be extremely 
unpredictable and has been relatively unexplored; application-specific v irtual memory m an­
agement schemes, for instance, have usually been designed w ithout m ultiprogram m ing in 
mind.
Prom the perspective of com putational science practitioners, th is work is relevant be­
cause it has the potential to  increase their productivity. M any researchers do not have 
access to  dedicated supercom puters or high-end clusters and m ust meet their com puting 
needs by using either local shared resources such as networks of workstations and small 
SMPs, or resources shared across a com putational grid. The m ethods we present can en­
able some im portan t classes of scientific applications to  effectively use such resources. These 
techniques can also improve the productivity  of researchers who do have access to  dedicated 
supercom puters by allowing them  to run  small to  moderate-sized jobs on readily available 
non-dedicated resources, allowing them  to conserve their supercom puter allocations and 
avoid long wait tim es in compute-center queues.
1.3 O rganization
The rem ainder of this document is organized as follows: C hapter 2 relates our work to the 
current sta te  of knowledge in bo th  load balancing and mechanisms for dealing w ith memory 
shortage. C hapter 3 explains our load balancing scheme and presents experim ental results 
dem onstrating its effectiveness in balancing CPU load. In chapter 4, we illustrate  how 
the load balancing scheme can be extended to  a ttenuate  memory pressure, and we outline 
the shortcomings of this approach. In chapter 5, we present the essential details of our 
memory adap tation  framework, and in chapter 6 we provide an overview of our software
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 1. INTRO DU CTION 8
library, M M l ib , th a t facilitates use of the framework. In chapter 7 we present experim ental 
evaluations of the memory adap tation  strategy as used in three scientific application kernels. 
Finally, conclusions are presented in chapter 8.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 2
Background and related work
The goal of our work is to allow some classes of scientific applications to perform  well in mul­
tiprogram m ed environm ents where CPU  and memory availability can vary unpredictably. 
To meet this goal our techniques m ust m aintain good load balance as external CPU loads 
fluctuate, and adjust memory utilization gracefully as memory availability changes. In this 
chapter, we present a brief overview of related work in load balancing and mechanisms 
for addressing memory shortage. Additionally, we describe related work on application- 
centric adaptive scheduling techniques th a t dynam ically adapt schedules according to  re­
source availability.
2.1 Load balancing
M aintaining good load balance is essential to ensure good performance and efficient u ti­
lization of processors by a parallel application. Otherwise, CPU  cycles will be wasted as 
lightly loaded processors sit idle waiting for their more heavily loaded brethren  to  finish. 
A significant am ount of research has focused on the development of m ethods and software 
for load balancing w ithin parallel applications, and we outline some of those approaches in 
th is section.
Load balancing is essentially a task-m apping problem, and is thus inextricably linked 
w ith the m anner in which a com putation is decomposed into tasks for parallel execution. 
The problem  is to  m ap a num ber of tasks onto a set of processors in a way th a t ensures th a t
9
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 10
each processor is assigned an am ount of work com m ensurate w ith its speed. This m apping 
usually m ust satisfy other objectives as well, such as lim iting the am ount of comm unication 
required. There are many different strategies for accomplishing these goals, bu t we can 
place them  into two broad categories based on whether they take a task-oriented or a data- 
oriented approach to m apping the tasks onto processors. The type of approach th a t is 
appropriate is determ ined by the am ount of d a ta  required to describe and perform  each 
task, and the cost of moving th a t d a ta  to the appropriate processor. (So the choice between 
task  or d a ta  orientation depends on bo th  the properties of the problem  to be solved and 
the properties of the target parallel machine.) If processors can inexpensively obtain  the 
d a ta  required to  perform  a task, a task-oriented approach can be used. Otherwise, a  data- 
oriented approach m ust be employed because the cost of moving the d a ta  for a task is the 
m ain lim iting factor.
2 .1 .1  P o o ls  o f  in d ep en d en t ta sk s
The task-oriented approach is often known as the pool o f tasks paradigm . The problem  
to be solved is broken into a pool of independent tasks. Processors th a t need work fetch 
tasks from the pool, and if additional tasks are generated during the com putation, they 
are placed into the pool. If the size of the tasks is small enough, the pool of tasks ap­
proach usually results in excellent load balance. (Tasks th a t are too large may result in 
one processor continuing to  work while others sit idle because all other tasks have been 
processed.) In  the simplest im plem entation (m aster/slave or m anager/w orker), processors 
fetch or receive tasks from a single processor th a t acts as centralized m anager of the pool. 
In addition to being very easy to  implement, the m aster/slave approach m aintains global 
knowledge of the com putation state, which facilitates even distribu tion  of tasks and greatly 
simplifies term ination detection. The m aster/slave approach cannot scale to large numbers 
of processors, however, because comm unication w ith the m aster becomes a bottleneck.
To improve scalability, a hierarchical approach can be employed, in which workers are
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND  RELATED  W O RK 11
divided into disjoint sets managed by a local sub-m aster th a t reports to  the central m aster. 
If greater scalability is needed, a completely decentralized scheme can be used, m aking the 
pool of tasks a d istribu ted  d a ta  structure. In  a decentralized scheme, when a processor 
runs out of work it sends a potential donor processor a work request. The donor processor 
to  be polled may be chosen in a num ber of ways: The simplest way is to choose a donor 
randomly, w ith all processors having equal probability of being selected. Alternatively, 
donors can be chosen in a round-robin fashion. The round-robin scheme can employ a 
global list of processors, or might be restricted to  the nearest neighbors of each processor 
in the network interconnect. Note th a t if a round-robin scheme is completely decentralized, 
a donor may receive work requests from m ultiple processors simultaneously. To prevent 
this, a hybrid centralized /d istribu ted  system  may be employed. In  such a scheme, the task 
pool is a d istribu ted  d a ta  structure, bu t one processor m aintains a global variable th a t 
points to  the processor to  which the next work request should be sent. This improves the 
d istribu tion  of work requests, though contention for access to the global variable somewhat 
limits scalability. Explanations and analysis of several decentralized pool of tasks schemes 
and appropriate term ination-detection algorithm s can be found in [70, 52, 51].
If a com putation is decomposed into a large num ber of small tasks, fetching only one 
piece of work a t a tim e can result in excessive scheduling overhead. In such situations, 
an allocation m ethod in which a free processor fetches a chunk of several tasks a t a tim e 
is usually employed. This reduces scheduling overhead, but care m ust be taken not to 
make the chunks too large, because the coarser task  granularity may result in increased 
load imbalance. Kruskal and Weiss [50] used probabilistic analysis to  determ ine an optim al 
chunk size th a t balances the trade-offs between very large and very small chunk sizes. 
Their analysis relies on the assum ption of a num ber of tasks large enough to  cancel out 
variations in the task execution times. Because in practice this assum ption often does not 
hold, other developments in chunk-scheduling employ schemes th a t s ta rt w ith larger chunk 
sizes and then  gradually decrease the chunk sizes as the com putation progresses to  prevent
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 12
processors from finishing at different times. Such approaches include guided self-scheduling 
[68], factoring [39], and trapezoidal self-scheduling [85].
The pool of tasks approach can m aintain excellent load balance, even in the face of 
factors such as large variation in the cost of each task, processor heterogeneity, or dynamic 
variation in external processor load on a tim e-shared system. Unfortunately, the approach is 
inapplicable to  many scientific problems, since they cannot be broken into independent tasks 
th a t can be executed asynchronously. Additionally, the independent tasks often cannot 
be described with an am ount of inform ation th a t is small enough to prevent excessive 
comm unication between m aster and slave processes.
2 .1 .2  D a ta  p a r tit io n in g
If the tasks in a com putation cannot be described w ith a small am ount of inform ation, or 
if dependencies between tasks necessitate frequent inter-processor comm unication, a d a ta  
partitioning approach to load balancing should be employed: R ather th an  thinking in term s 
of m apping tasks to  processors, we m ap the da ta  objects in the com putation (e.g. mesh 
points or particles) onto processors. Processors then  execute those tasks associated w ith 
their local data. To load balance a com putation, we partition  the objects into approxim ately 
equal partitions (ensuring load balance) while m inimizing the d a ta  dependencies between 
partitions (thus reducing comm unication costs). If processors vary in speed, the partition  
associated w ith a given processor is weighted by its relative speed. W hat we have called d a ta  
partitioning is usually referred to as dom ain partitioning, because it is usually employed 
to  assign portions of a  com putational dom ain to  processors. Some of the techniques we 
describe here can be applied to  any abstract object th a t can be modeled as a graph, so we 
use the more general term  “da ta  partitioning” .
The partitioning approaches can be grouped into two broad categories: geometric and 
topological. A geometric approach divides a dom ain based on the locations of objects in a 
sim ulation, and is therefore appropriate for applications such as solid mechanics or particle
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND  RELATED  W O RK 13
simulations, where interactions between com putational objects tend to  occur between neigh­
bors. A topological m ethod explicitly bases its partitioning on the connectivity of objects 
in a com putation. Typically a com putational dom ain is represented as a graph which is to 
be equipartitioned in a way th a t minimizes the edge-cut, which approxim ates the am ount 
of comm unication required.
2.1.2.1 G eom etric partitioning
The geometric approaches can be classified into two m ajor types: recursive bisection ap­
proaches and octree approaches. These categories could also be described respectively as 
approaches th a t work w ith coarse-grained or fine-grained divisions of the geometry. The 
basic idea of recursive bisection approaches is to  cut a dom ain into two parts  such th a t each 
part contains an approxim ately equal num ber of objects. Each part is then  further cut into 
two smaller parts, and so on, until the required num ber of partitions have been generated. 
The sim plest such m ethod is Recursive Coordinate Bisection (RCB) [10], also known as 
C oordinate Nested Dissection (CND). RCB recursively divides the set of objects using a 
cutting  plane orthogonal to  the coordinate axis along which the objects are most spread 
out. The cut is orthogonal to the long axis to  reduce the size of the boundaries between 
subdom ains, across which comm unication will have to  occur. RCB is simple, fast, easily 
parallelizable, and requires little memory, bu t also tends to produce lower-quality partition ­
ings. A m odest improvement is Unbalanced Recursive Bisection (URB) [42], which divides 
the geometry, ra ther than  the objects, in half. This minimizes the geometric aspect ratio  of 
the partition  and thus reduces comm unication volume, as the am ount of inform ation th a t 
m ust be exchanged across boundaries is approxim ately proportional to their length or area. 
T he effectiveness of bo th  RCB and URB are lim ited by the somewhat artificial restriction 
th a t cuts are always made along coordinate axes. Recursive Inertial Bisection (RIB) [83] 
eliminates this restriction by calculating the principal axes of inertia  of the collection of 
objects as if they were a collection of particles in a physical system. The dom ain is then  cut
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 14
by a bisecting plane orthogonal to  the principal axis associated w ith the m inim um  moment 
of inertia, which is a  naturally  “long” direction across which to  cut. Geometric bisection 
m ethods need not be lim ited to the use of cu tting  planes. Based on some interesting theo­
retical results [34], G ilbert, Miller, and Teng [32] describe a geometric bisection scheme in 
which circles or spheres are used to bisect domains.
Octree techniques (also known as space-filling curve techniques) for geometric partition­
ing s ta rt w ith a fine-grained division of the com putational domain, and then  aggregate these 
fine-grained pieces into sets th a t will make up the partitions. The fine-grained division is 
obtained by recursively dividing a space into a hierarchy of octants. For a  3D dom ain, the 
root octant is a cube th a t encloses the entire domain. This octan t is divided into child 
octants by sim ultaneously cu tting  the root octan t in half along each axis (forming eight 
child octants in 3D space). Child octants th a t contain multiple objects are then  recursively 
divided into sub-octants, and so on, until each object is enclosed by a  term inal octan t (i.e., 
one th a t is not further subdivided). The relationship of octants in the hierarchy is described 
by a tree d a ta  s tructu re  known as an octree. Octrees are used to represent space in mesh 
generation, com puter graphics, and n-body sim ulations, among other applications. They 
also provide a convenient way to  partition  geometry among processors: By ordering the 
term inal octants according to their positions along a space-filling curve such as a Peano- 
H ilbert curve [55], a A;-way partitioning can easily be generated by cu tting  the list into k 
equal parts [88]. The speed and quality of octree partitioning is roughly equal to  th a t of 
geometric recursive bisection techniques.
2.1.2.2 Topological partitioning
Geometric partitionings can be calculated very quickly and can be quite effective for com­
putations in which interaction between objects is a function of geometric proximity, such as 
in n-body or solid mechanics problems. If geometric proxim ity is a poor indicator of inter­
action between objects, however, using a more expensive topological m ethod th a t explicitly
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 15
considers the connectivity of the objects can yield far superior partitionings.
One of the  sim plest topological m ethods is Levelized Nested Dissection (LND) [31]. LND 
starts  w ith an initial vertex (a pseudo-peripheral vertex is the best choice), and then  visits 
neighboring vertices in a breadth-first manner. W hen half of the vertices have been visited, 
the graph is bisected into the set of vertices th a t have been visited, and those th a t have 
not. This process is applied recursively until the desired num ber of partitions have been 
created. The idea behind LND is actually quite sim ilar to geometric bisection m ethods: 
distance from the initial vertex is used to cut across a “long” dimension of the graph.
Spectral m ethods [69, 78] use a very different approach for com puting partitionings. 
The partitioning problem  is really a discrete optim ization problem: minimize the edge-cut 
while dividing the graph into k  equal parts. This discrete optim ization problem  is NP- 
complete, bu t the problem  can be approxim ated by a continuous optim ization problem  
th a t is more easily solved. Spectral m ethods solve the continuous problem  by finding some 
extrem al eigenvalue/eigenvector pairs of a m atrix  derived from the connectivity inform ation 
of the graph. Spectral m ethods tend to produce high quality partitions, bu t solving the 
eigenproblem can be com putationally expensive.
Partition  refinement m ethods refine a sub-optim al partitioning of a  graph, and are of­
ten  used as graph partitioners themselves by applying them  to  a random  partitioning. 
K ernighan and Lin [48] developed an iterative algorithm  (known as the KL algorithm ) th a t 
improves a graph bisection using a greedy approach to  determ ine pairs of vertices to  swap 
between partitions. Variations of the KL algorithm  are used for refinement in several graph 
partitioning contexts, some of which we discuss below.
Some of the most popular graph partitioning m ethods being used today are multilevel 
m ethods [37, 45]. Multilevel m ethods form a sequence of increasingly coarse approxim ations 
of the original graph by collapsing together selected vertices from finer representations of the 
graph. The graph is initially partitioned at the coarsest level using a single-level algorithm  
such as those described above. This partitioning is then  propagated back through the finer
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 16
approxim ations, being refined at each level via a KL-type algorithm . Multilevel approaches 
are very effective for two reasons. F irst, the initial partitioning can be easily com puted for 
the coarsest graph, because it hides many edges and vertices of the original graph. Second, 
increm ental refinement algorithm s can quickly make large changes to  the partitioning by 
working w ith coarser versions of the graph. Chaco [36], M E T IS /ParM E T IS  [43, 44], and 
JO STLE [87] are some popular software packages th a t implement multilevel methods.
2.1.2.3 Load balancing adaptive com putations
If external load is not a factor, and if the am ount of work per object is known and does 
not vary during the course of the com putation, a sta tic  partitioning can be used. The 
partitioning can be com puted as a sequential preprocessing step a t the beginning of a 
sim ulation, using a package such as M ETIS or Chaco, or in parallel using a package such as 
ParM ETIS or JO STLE. In many com putations, however, the am ount of work assigned to 
each processor can vary unpredictably, either because the  num ber of objects in a  partition  
can vary, or because the am ount of work per object is unknown. For example, if the objects 
are grid points in an adaptive mesh refinement (AMR) calculation [12, 11], the am ount of 
work for a partition  may grow as the mesh is refined. If objects are cells in a particle-in-cell 
sim ulation, the am ount of work associated w ith an object varies as particles advect into 
and out of the cells. Some sim ulations utilize adaptive physics, in which the physics model 
used at a mesh point—and thus the work associated w ith th a t point— may change as the 
calculation progresses. In all of these cases, the com putational dom ain m ust be repartitioned 
as the com putation progresses, because a partitioning th a t results in good load balance at 
the beginning of the com putation may yield very poor load balance later.
Dynamic repartitioning can be done by simply com puting a new global partitioning from 
scratch. However, this generally incurs an unacceptable am ount of costly d a ta  m igration, 
because the new partitioning can be very different from the existing one. Instead, the 
old partitioning should be adjusted in a way th a t m itigates load imbalance, while also
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 17
minimizing the am ount of da ta  m igration needed to  arrive at the new partitioning. Scratch- 
rem ap repartitioners [66] a ttem p t to  do this by calculating a new partitioning of the graph 
from scratch, and then  perm uting the labels of the new partitions so their overlap w ith the 
old partitions is maximized.
Scratch-rem ap approaches result in excellent load balance and low edge-cut, bu t still 
involve a  relatively high am ount of d a ta  movement. For this reason, increm ental approaches 
th a t make a series of small changes to  the partitions to  restore load balance are usually 
favored instead. Most of these m ethods employ a diffusive approach, in which objects move 
from heavily loaded processors to more lightly loaded neighbors by a diffusion-like process. 
The num ber of objects to be moved from a processor is based on the processor workload, 
while the choice of which objects get moved is usually made using KL-like criteria. In 
the simplest incarnations [21], a first-order finite difference discretization of the diffusion 
equation is used to  calculate the am ount of work th a t should be moved between processors 
a t each iteration  of the m ethod. Because the first-order stencil is compact (only involving 
nearest neighbors), such m ethods can be executed locally by a processor. These m ethods 
can be slow to converge to  a load balanced state , however, so more complex m ethods use 
some global inform ation to  accelerate convergence to  load balance. The m ethod of Hu, 
Blake, and Em erson [38] (implem ented in ParM ETIS and JO STLE) uses inform ation from 
all processors to perform  an implicit solve for the steady-state solution of the diffusion 
equation, and then m igrates da ta  in one step to reach this state  of load balance.
A lthough local m ethods are usually slower th an  global m ethods to achieve load balance, 
they can be executed asynchronously, whereas global m ethods cannot. This is a significant 
advantage for two reasons. F irst, load balance can be corrected as it arises, w ithout having 
to wait for the next synchronization. Second, although most scientific applications have 
natu ral synchronization points, certain applications [28, C hapter 14] do not, so the use of 
a global m ethod introduces artificial (and costly) synchronizations.
Parallel partitioning packages such as ParM ETIS or JO STLE provide algorithm s for
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 18
com puting increm ental repartitionings, bu t once the new partitioning is obtained, perform ­
ing the actual da ta  m igration can be a difficult chore for the application program m er. 
Packages such as Zoltan [22] and PREM A [7, 8] provide tools and frameworks for m anaging 
and autom atically m igrating the objects in adaptive com putations. PREM A uses m ulti­
threading to  support asynchronous load balancing, and thus is particularly  suited to very 
irregular com putations.
2 .1 .3  L oad  b a la n c in g  u n d er  variab le  e x te r n a l load
In our discussion so far, we have only considered the load th a t is internal to  an applica­
tion. In a m ultiprogram m ed environm ent, however, the to ta l load on a processor can vary 
unpredictably due to  load from processors external to  a given application.
Provided th a t the granularity of the tasks is not too coarse, the pool of tasks paradigm  
can work well (and autom atically) in the presence of external load. D ata  partitioning 
approaches, however, face m ajor difficulties in dealing w ith such loads. External load can 
conceivably be accounted for by weighting each processor according to  its load and then 
sizing the partitions based on those weights. However, the high cost of repartitioning 
combined w ith the dynamic and unpredictable natu re  of external workloads makes this 
im practical. If external load conditions stay fairly constant, the cost of repartitioning to 
reflect th a t load can be worthwhile. If the decision to repartition  proves wrong— th a t is, if 
the external workload changes dram atically  soon after partitioning— it can be very costly. 
Because of the difficulties involved in external load prediction, repartitioning is generally 
not used to  cope w ith external load.
In C hapter 3 we present a load balancing scheme for a very specific bu t im portan t class of 
scientific algorithm s, inner-outer iterative m ethods th a t employ local inner iterations. These 
algorithm s are not amenable to  a pool-of-tasks approach, and—for the reasons discussed 
above—load balancing by dynamic repartitioning is im practical. We present a scheme th a t 
provides good load balance for these algorithm s in the face of dynam ic and unpredictable
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 19
load, w ithout any reliance on d a ta  m igration or performance predictions.
W hat we call our “load-balancing” scheme differs fundam entally from other load-balancing 
schemes in the conventional sense of the word. The pool-of-tasks and d a ta  partitioning ap­
proaches we have described d istribu te a fixed set of tasks in a balanced way across a set of 
processors. Our load balancing scheme also d istributes tasks in a balanced m anner across 
a set of processors, bu t the set from which these tasks are drawn is not fixed. T h a t is, the 
set includes optional tasks: each task  (hopefully) helps move the algorithm  to  its target, 
bu t not all tasks m ust be completed for this target to be reached. The tasks to  execute 
are chosen to maximize load balance and reach the target as quickly (in wall-clock time) 
as possible: tasks are associated w ith d a ta  on a particu lar processor, and slower processors 
execute fewer of the associated tasks. This flexibility in choosing which tasks to execute 
stems from the intrinsic properties of the inner-outer iterative m ethods.
2.2 A d d ressin g  m em ory sh ortage
Jobs in non-dedicated environm ents contend not only for CPU tim e, bu t for memory re­
sources as well. Memory pressure can cause extrem e degradation of performance. Its effects 
are especially pronounced if an affected node is part of a synchronous parallel job, as the 
immense slowdown experienced by th a t node can cause incredible load imbalance. In  many 
cases th is imbalance cannot be corrected by load balancing techniques th a t may be in place 
w ithin the application. For instance, if the job is running on a shared-m em ory com puter, 
m igrating work to o ther processors on th a t machine fixes nothing. Even in d istribu ted  
memory environments, problems remain: The usually excellent load balance provided by 
a pool-of-tasks scheme may disappear as the slowdown on an affected node increases the 
granularity of its tasks to excessive levels. If load balancing is performed through dom ain 
repartitioning, memory pressure may render the very cost of com puting a  new partition ­
ing prohibitive! Even when calculating the new partitioning is practical, deciding how to 
weight each partition  is a difficult problem, as operating systems typically provide very little
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 2. BACKGROUND AND RELATED  W O RK 20
inform ation about memory usage. For these reasons and others, explicit consideration of 
memory pressure and its m itigation is needed in non-dedicated environm ents, in addition 
to  good CPU-load balancing approaches.
In th is section we discuss several approaches th a t have been used to  remedy memory 
pressure in dedicated and non-dedicated environments.
2 .2 .1  O u t-o f-co re  a lg o r ith m s
Since the earliest days of the com puter age, researchers have been faced w ith problems too 
large to  solve using the m ain memory of a com puter. In response to  these problems, many 
out-of-core algorithm s have been developed. An out-of-core algorithm  [84, 86 , 25, 71] is one 
th a t has been designed to  provide acceptable performance despite the slow natu re  of access­
ing secondary storage. Such an algorithm  may be obtained by taking a  conventional (in-core) 
algorithm , im plem enting it to  access large, contiguous blocks of data , and re-scheduling its 
independent operations in a m anner th a t maximizes the re-use of da ta  th a t have been 
brought into m ain memory. Some out-of-core algorithm s go beyond simply rescheduling 
independent operations, altering conventional algorithm s so th a t their d a ta  dependencies 
are more amenable to da ta  re-use. These alterations may sacrifice some num erical stability  
or require more CPU  operations, bu t adm it much better out-of-core schedules.
Out-of-core algorithm s bear many sim ilarities to  cache-optimized algorithm s, as the goal 
of both  is to  maximize use of faster levels of a memory hierarchy by scheduling operations 
in a m anner th a t provides good spatial and tem poral locality. Both types of algorithm s 
benefit from many of the same optim ization techniques: S tructuring loop nests to elim inate 
strided accesses is im portan t a t all levels of the memory hierarchy. M any dense m atrix  
algorithm s are optim ized for out-of-core use by structuring  them  to operate on blocks of 
a m atrix  a t a time. Such blocking techniques are also used to  achieve cache optim ality 
in numerical libraries such as BLAS [53, 24, 23] and LAPACK [2]; they are so im portant, 
in fact, th a t com puter vendors make great efforts to  precisely tune blocking factors in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 2. BACKGROUND AND RELATED  W O RK 21
their im plem entations of these libraries, and the ATLAS project [89, 90] provides software 
to  autom atically tune these factors. Cache-oblivious algorithm s [30, 15], such as the one 
employed by the popular FF T W  library [29], operate using a divide-and-conquer approach 
th a t splits a problem  into subproblems th a t fit into the cache; this approach elim inates the 
need to  tune hardware-specific param eters. A lthough these algorithm s have been proposed 
w ith cache-optimality in m ind, these techniques are also beneficial in out-of-core settings 
[20].
Despite their similarities, there are im portant differences in the design of out-of-core 
versus cache-optimized algorithm s. These stem  from two fundam ental differences between 
the levels of the memory hierarchy th a t the two types of algorithm s target. F irst, the 
m ain-m em ory/disk bandw idth  ratio  is usually much higher th an  the cache/m ain-m em ory 
bandw idth ratio  in a com puter system. This means th a t modifications th a t increase CPU 
operations but improve out-of-core performance by reducing disk I /O  may perform  worse 
th an  unmodified versions when executed in-core, despite possibly higher cache efficiency. 
Second, cache memory usually has lim ited associativity— th at is, blocks brought from m ain 
memory cannot be placed anywhere in the cache, bu t m ust be placed in a lim ited num ber of 
specific locations in the cache. Many cache-optim ization techniques are designed to reduce 
so-called “conflict” misses th a t stem  from lim ited cache associativity. Because main-memory 
is fully associative (i.e., blocks from disk may be placed anywhere), such optim izations do 
not improve out-of-core performance.
Out-of-core algorithm s have been in use for decades to  allow com puters to  solve prob­
lems too big for their physical memory. These codes have been designed to operate w ith a 
fixed am ount of memory, however, and are unsuitable to deal w ith the transient memory 
shortages encountered on m ultiprogram m ed systems. An out-of-core code could certainly 
be used to  avoid memory pressure, bu t the u tility  of such an approach is dubious: Though 
the algorithm  would work efficiently when memory is scarce and an in-core algorithm  would 
thrash , when memory is plentiful the out-of-core algorithm  will not take advantage, contin­
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 22
uing to  use the slow disk instead of the much faster m ain memory.
2 .2 .2  V ir tu a l m em o ry  s y s te m  m o d ifica tio n s
A virtual memory (VM) system  th a t can transparently  handle a high degree of memory 
pressure is highly desirable, as th is could free the application program m er from having to 
worry about memory shortages. T ransparent VM systems fall short, however, because no 
generic page replacem ent policy th a t is appropriate for the memory access pa tterns of all 
applications is known. Most systems employ an LRU-like policy. Under such a policy, an 
application th a t repeatedly loops through a d a ta  set, for example, will incur thrashing even 
if memory shortage is slight. Recently proposed general-purpose replacem ent algorithm s, 
such as LRFU [54] and LIRS [40] and the related CLOCK -Pro algorithm  [41], elim inate 
some of the shortcomings of LRU at the expense of introducing some additional complexity. 
None of these algorithm s has yet made its way into a production operating system.
In addition to replacem ent policy problems, the generic prefetching policy and the fixed 
page size employed by most VM systems may be inappropriate for the granularity  of d a ta  
access in an application. This is especially true  when memory pressure forces pages to 
be w ritten  to  swap space, which can be highly fragm ented and generally does not support 
prefetching. The shortcomings of generic VM systems have led many researchers to propose 
operating system  modifications th a t expose portions of the VM subsystem  to applications. 
The Mach microkernel [1] provided an external pager interface th a t allowed applications 
to control the movement of pages between m ain memory and secondary storage. It did 
not export control over page replacem ent policy, however. McNamee and Arm strong [60] 
extended M ach’s external pager interface to allow application program m ers to  control page 
replacement. H arty and Cheriton [35] designed and implem ented an external page-cache 
management scheme in the V + -1- kernel. In  their scheme, the VM system  provides a  cache 
of physical pages to an application, and it is the responsibility of the application to  manage 
th a t physical memory. Krueger et al. [49] devised a similar scheme, bu t gave it a m od­
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 23
ular, object-oriented design to allow program m ers to use application-defined replacem ent 
policies w ithout having to  re-implem ent large portions of the VM system. Cao et al. [18] 
implem ented a system  to give applications control over file caching, prefetching and disk 
scheduling.
A lthough application-specific virtual memory m anagement schemes are useful in some 
situations, they suffer some shortcomings. These schemes have generally not been designed 
w ith m ultiprogram m ed environm ents in mind, and assume a fixed am ount of physical pages 
will be available to the process a t runtim e. Additionally, such schemes are not useful to 
the typical com puter user precisely because they require significant modifications to an 
operating system  kernel. This precludes their use in most com puting settings because the 
operating system  does not support such a scheme and cannot be modified to  do so. Even 
if such a system  is available to  a code developer, writing an application to  use such a 
system  incurs severe portability  penalties. Memory pressure solutions th a t do not require 
OS m odification are more desirable to  the typical application program mer.
Nikolopoulos [64] developed a runtim e system  th a t transparently  adapts the memory 
footprint of running processes as memory availability fluctuates. This solution requires 
no m odification of the operating system  and is targeted specifically a t m ultiprogram m ed 
environments. This approach bears many sim ilarities to  the m em ory-adaptation framework 
we present in C hapter 5, although our framework is even more application-centric: memory 
adaptation  decisions are based solely on m easurem ents local to  the application, and we use 
application-specific replacem ent policies and units of d a ta  transfer.
2 .2 .3  M e m o r y -a d a p tiv e  a lg o r ith m s
Not a great deal of literature  on m em ory-adaptive algorithm s exists. Most of it comes from 
the field of database management, in which dealing w ith huge d a ta  sets is commonplace. 
Pang et al. [67] conducted one of the earliest memory adap tation  studies, investigating 
external sorting. External (viz., out-of-core) sorts are frequently used in database systems
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 24
to order query results and to  perform  certain join operations. An external sort is conducted 
in two phases, a “sp lit” phase and a  “merge” phase. In the split phase, an in-core algorithm  
is used to  arrange unsorted d a ta  into a set of sorted runs. In the merge phase, the runs 
are merged into a single sorted run. Pang et al. discussed how the algorithm s used for the 
split phase could adapt to  memory shortage by writing out a run  of sorted tuples to reduce 
buffer usage. To respond to  a memory increase, more relations should be brought into 
the buffer for sorting. They also introduced an adaptation  strategy for the merge phase, 
which they term ed dynamic splitting: If memory shortage arises during a merge, it should 
be split into smaller steps th a t can be performed w ith the available memory. If there is 
a memory surplus, then  more merges are run  at once. Pang et al. did not implem ent any 
of their proposed strategies, however, choosing to  only utilize sim ulation in their study. 
A later study  by Zhang and Larson [92] also addressed memory adap tation  for external 
mergesort; it included experim ental evaluation of their strategy. Their im plem entation was 
designed only to  address memory pressure arising from contention between concurrent sorts. 
Because com peting sorts all obtain memory through the same buffer manager, determ ining 
the am ount of available memory is not a problem. In a general-purpose tim e-shared system, 
however, one of the chief difficulties in im plem enting m em ory-adaptive algorithm s is devising 
an effective way to determ ine the am ount of available memory.
Barve and V itter [9] have developed a theoretical framework for proving lower bounds 
on the resource consum ption of m em ory-adaptive algorithms. Their m otivation comes from 
database m anagement systems, bu t they also prove bounds for standard  m atrix-m atrix  
m ultiplication, fast Fourier transform  com putation, and LU factorization. Their model 
assumes th a t a m em ory-adaptive algorithm  is aware of the am ount of memory available to 
it a t all times. Although operating systems do not provide such inform ation, in C hapter 
5 we present an algorithm  th a t uses a “gray-box” approach [4] to closely approxim ate this 
inform ation w ith minimal reliance on system -provided information.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 25
2.3 A p p lica tion -cen tric  ad ap tive schedu ling
Schedulers of parallel program s in non-dedicated environm ents m ust consider current and 
future system  loads in an a ttem pt to  m atch a job w ith a set of resources th a t will meet 
its requirem ents. This is a rem arkably difficult task, however, and conventional, system- 
oriented schedulers usually yield poor results in such settings. In response to  the lim itations 
of such systems, newer, more application centric approaches have emerged. Though very 
diverse, these new approaches follow a common th read  of interaction between systems and 
applications to dynamically adapt their schedules according to resource availability.
Polychronopoulos and Nikolopoulos [65] have presented extensions to  the kernel sched­
uler combined w ith a user-level scheduling m odule th a t coordinates the scheduling of com­
m unicating threads of parallel jobs and also prevents thrashing by suspending threads a t 
memory-allocation points if memory becomes overcommitted. Such a scheduler could also 
be used w ith m em ory-adaptive applications (such as those presented in C hapter 7) th a t 
could modify their memory requirem ents instead of suspending when an allocation request 
cannot be met.
Projects such as AppLeS [13, 14] and Active Harmony [46] present complete frameworks 
for m atching applications to appropriate resources and updating  application schedules as 
resource availability changes during execution. A p p L e S  (A pplication  Level Schedulers) 
use a completely application-level methodology. An AppLeS-enabled application has its 
own, specific application-level scheduler th a t discovers and identifies viable resource sets, 
determ ines candidate schedules, selects a schedule th a t best satisfies the user’s performance 
criteria, and executes the application using th a t schedule. For long-running applications, 
the AppLeS agent may refine and re-deploy the schedule as resource availability changes. 
AppLeS does not provide a custom  runtim e system; instead, it relies on the existing infras­
tructu re  (e.g. Globus [27], NetSolve/GridSolve [3], M PI [79]) for discovering and utilizing 
compute resources, along w ith the Network W eather Service [91] to gauge current and pre­
dict future resource utilization. Because the AppLeS m ethodology is entirely application-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 26
driven, it is especially suitable to  non-centrally controlled com puting environments.
Like AppLeS, Active Harmony m atches applications to appropriate  resources and re­
schedules them  as resource availability fluctuates. In Active Harmony, however, decisions 
about application scheduling are made by a centralized resource m anager ra ther th an  the 
applications themselves. Centralized decision m aking is employed in an effort to  maximize 
overall efficiency and throughput. Active Harmony provides a “tuning interface” through 
which applications export inform ation about possible application configurations. A manager 
then  dynam ically tailors the application configuration according to  resource availability. 
For example, a hybrid client-server database th a t can process queries on either the client or 
server side could be configured to  process queries at the server when there are few clients, 
bu t be reconfigured to  process queries on client nodes when many clients are using the 
system.
Systems like AppLeS or Active Harmony can place quite a burden on the user, as 
often very detailed resource-utilization inform ation m ust be provided. For instance, Active 
Harmony requires th a t the user specify the resource requirem ents of different options, the 
granularity a t which configuration changes can be performed, the cost of switching between 
the configurations, and (optionally) a way in which the response tim e of a given configuration 
can be calculated. Chang and Karam cheti [19] have developed a resource m anagement 
framework th a t eases the burden on the user by providing a v irtual execution environm ent 
which perm its autom atic, off-line generation of resource-utilization information.
The adaptive scheduling approaches we have ju st cited and the techniques for dynamic 
adaptation  to CPU and memory load presented in th is dissertation are complementary. 
Systems such as AppLeS and Active Harmony a ttem p t to  identify the “best” schedule for 
an application, and to  refine th a t schedule as resource availability changes. O ur techniques, 
on the other hand, provide applications with a degree of m alleability th a t enables them  to 
cope w ith a set of resources th a t may at times be far from “best” due to fluctuations in 
resource utilization. Using a system  such as AppLeS to select the optim al schedule, and
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 2. BACKGROUND AND RELATED  W O RK 27
techniques such as ours to  optimize the use of the scheduled resources makes sense. This 
seems especially true when one considers the overheads involved in re-scheduling onto a 
different set of resources. Applications th a t can achieve acceptable performance on less- 
than-ideal resource sets can reduce the need for re-scheduling and thus avoid some of the 
associated costs.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 3
Load balancing flexible-phase 
iterations
In the first phase of this work, we have investigated a novel dynam ic load-balancing tech­
nique for parallel inner-outer iterative m ethods th a t execute inner iterations locally on each 
processor. Our approach takes advantage of what we term  the flexibility  of the inner phase, 
using the fact th a t the am ount of work done by the processors during the inner phase can 
vary (w ithin bounds) w ithout detrim ent to the end result: we limit all processors to  the 
same tim e in the inner phase (which performs the bulk of the work), v irtually  elim inat­
ing load imbalance there. We explore this idea in the context of Krylov-type m ethods, 
which form the core of a num ber of scientific and engineering applications. We experim en­
tally  dem onstrate th a t our scheme can cope w ith severe load imbalance in a coarse-grained 
Jacobi-Davidson eigensolver th a t applies local corrections to each block. Furtherm ore, in 
a version of the eigensolver code th a t employs m ultigrain parallelism , we show th a t our 
scheme allows effective use of heterogeneous networks of clusters. Finally, we also demon­
stra te  the u tility  of our technique for sm oothing load balance in a linear system  solver th a t 
employs a dom ain-decom position preconditioner.
28
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 29
3.1 F lex ib le-p h ase  itera tion
A common algorithm ic paradigm  for parallel program s is th a t of synchronous iteration, in 
which processors perform  local work to  complete an iteration of a loop, and then  synchronize 
before proceeding to  the next iteration. The following illustrates synchronous iteration  on 
a single program , multiple da ta  (SPMD) com puter where the unique ID of each processor 
is denoted by myjrank:
w h ile  {target state has not been reached) { 
body(m y_rank);
Synchronous interaction(s)
}
We are interested in a very specific bu t im portan t case of synchronous iteration: one in 
which the am ount of work completed by each processor during an execution of the body 
may be varied arb itrarily  w ithout detrim ent to  the end result. W ith  smaller am ount of work 
per iteration, the target can still be reached, bu t more iterations are required. A common 
example is inexact Newton iteration [47, pp. 95-112], where solving the Jacobian system  
to  a lower accuracy may require more Newton steps, although eventually the same result 
will be reached. We decompose algorithm s in th is category into two phases: 1) A control 
phase, during which synchronous interactions update  global knowledge of the current state, 
allowing each processor to make better decisions later. 2) A flexible phase, during which 
local execution of the body occurs. It is “flexible” insofar as each processor can vary the 
am ount of work th a t it does during th is phase. We designate this type of parallelism  flexible 
phase iteration.
Although it is very specific, several im portant algorithm s used in the sciences and en­
gineering fit th is model. One class of such algorithm s includes stochastic search optim izers 
such as genetic algorithm s and sim ulated annealing. In  their synchronous formulations, 
processors independently pertu rb  an initial set of configurations to  search for more optim al
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERATIO N S 30
ones. After a certain  num ber of successes are obtained by each processor, they synchronize 
to  decide upon new configurations, and then  continue their search. Because synchronization 
could occur before each processor has had a certain  num ber of successes, the search phase 
is a flexible phase. The decision phase is the control phase.
A nother im portan t class of algorithm s th a t are amenable to  a flexible phase iteration 
structure  are Krylov-like iterative m ethods [77]. These m ethods are widely employed to 
solve systems of linear equations, eigenvalue problems, and even non-linear systems. Their 
main iteration involves vector updates and synchronous dot-products, which allow for little 
flexibility. However, flexibility can be introduced w ith preconditioning. At each outer, 
Krylov iteration, preconditioning improves the current solution of the m ethod by finding 
an approxim ate solution to  a correction equation. Flexible variants of Krylov m ethods 
can solve this equation iteratively [75]. In parallel im plem entations, th is allows different 
processors to  apply different num ber of iterations on a local correction equation [73]. Thus, 
the inner, preconditioning step is a flexible phase, and the outer (Krylov) iteration, where 
the processors update  their corrections, is a control phase.
Since each processor need not do the same am ount of work as its peers, near-perfect 
load balance can be achieved during the flexible phase: If all processors are lim ited to  the 
same tim e T  for an execution of the flexible phase, any differences in the speed of the 
processors—even due to  changes in the external load—cannot cause load imbalance. We 
have developed an object-based C library, L B lib , to  facilitate the use of this load balancing 
approach. A ppendix A describes its basic functionality and illustrates its usage.
3.2 C oarse-grain  Jacob i-D av id son  im p lem en ta tion
Any iterative m ethod which employs independent, local preconditioners on each node is 
a candidate for the load balancing approach th a t this work investigates. For the bulk 
of our experim ents, we have adapted a recent im plem entation of a coarse-grain paral­
lel, block Jacobi-Davidson solver for eigenvalue problems, JDcg [80]. JDcg employs the
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 31
Jacobi-Davidson (JD) m ethod to approxim ate a few extreme eigenpairs (X *,x*) satisfying 
Ax* = \*x* , where A is a large, sparse m atrix. For simplicity of presentation, we assume A  
to  be symm etric. The JD  m ethod is a Krylov-like m ethod th a t s ta rts  w ith an initial vector 
and successively builds a basis V  for a space K  from which approxim ations (A,, ) for the
eigenpairs are obtained, usually w ith the Rayleigh-Ritz procedure [26]. At each step, the 
residual =  {A — X-lI ) x l is com puted, and the solution et of the correction equation
(.I  -  X i x J ) ( A  -  \ i l ) ( l  -  X i x f ) e i  = r{ (3.1)
for some approxim ate eigenpair (Ritz pair) (Aj, Xj) is obtained and then used to expand the 
basis. A block version of JD proceeds in a similar m anner, bu t begins w ith k  in itial vectors 
and builds the basis by a block of k vectors a t a time, w ith each vector being the solution 
of one of k  correction equations, each for a different Ritz pair.
In our im plem entation, we use the iterative m ethod BCGSTAB to  solve the correction 
equation (3.1) because of the indefiniteness of the systems, and because BCGSTAB utilizes 
a recurrence w ith only a small num ber of term s, allowing us to  avoid having to  store a 
large num ber of vectors. In addition, available preconditioners can be used in solving the 
correction equations.
The JD  m ethod can be easily implem ented using d a ta  parallelism. Each block vector is 
broken into subsets of rows, and each processor keeps one of those subsets. This requires 
a global reduction (sum m ation) to  be performed for each inner product. A parallel m a­
trix  vector m ultiplication routine and preconditioning operations are provided by the user. 
Figure 3.1 outlines the d a ta  parallel JD  algorithm .
In our coarse-grain Jacobi-Davidson code, JDcg, the m ethod is not implem ented in a 
data-parallel fashion. Instead, it employs a hybrid coarse/fine-grained ( “m ultigrain” ) ap­
proach [80, 81] th a t reduces comm unication costs and, coincidentally, introduces flexibility 
into the inner phase. This approach was m otivated by the observation th a t, in m any situa­
tions, every processor can have access to  the entire m atrix  A. This is often possible because
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 32
JD
V  sta rting  w ith k  tria l vectors, and let W  = A V  
W hile not converged do:
1. H  = V T W  (local contributions)
2. G lobaLSum (iJ) over all processors.
3. Solve H yi = \ y i ,  i = 1 : k  (all procs)
4. Xi = V yi, Zi = W y i,i  =  1 : k  (local rows)
5. ri = Zi — XiXi, i =  1 : k  (local rows)
6. Correction equation Solve eq. (3.1) for each e;
7. Orthogonalize e ;,i =  1 : k. Add in V
8. M atrix-vector W) =  AV), i =  1 : k 
end while
F ig u re  3.1: The data parallel Jacobi-Davidson algorithm.
of the large m ain memory sizes available today (e.g., a 12th order, 3-D finite difference 
m atrix  of 1 million unknowns can be stored on a PC  w ith 512 MB) or, more im portantly, 
because many applications do not explicitly store the m atrix, providing instead a function 
for com puting the m atrix  vector product. In JDcg, the m atrix  A  is available on all proces­
sors, bu t the basis vectors V  are still partitioned by rows. The m ain JD  step (projection  
phase), or outer loop, is applied in the trad itional d a ta  parallel (fine grain) fashion, ig­
noring the m atrix  redundancy. The inner loop (correction phase) is then  performed in a 
control parallel (coarse grain) m anner, w ith each node gathering all the rows of one of the 
block vectors and performing the m atrix-vector product and preconditioner in BCGSTAB 
independently of the other processors. The following pseudocode describes the coarse-grain 
interface:
Coarse grain correction equation  
A ll-to-all: send local pieces of to  proc i, 
receive a piece for x myid , r myid from proc i 
Apply m  steps of (preconditioned) BCGSTAB on eq. (3.1) w ith the gathered x myid, r myid 
A ll-to-all: send the *-th piece of emyid to proc i , 
receive a piece for e% from proc i
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 33
Although an expensive all-to-all comm unication is required, the coarse grain interface allows 
us to arb itrarily  improve the parallel speedup of the m ethod because of the independent 
solution of the correction equations. By increasing the num ber of BCGSTAB iterations 
(m), we can shift work from the outer, projection phase to the inner, correction phase, thus 
increasing the am ount of work performed between comm unication operations. Larger values 
of m  and block size k typically reduce the num ber of outer JDcg iterations but increase 
the to ta l num ber of m atrix-vector products, so the algorithm  is not work conserving. Large 
values of m, however, are required for the solution of numerically difficult problems, and the 
coarse granularity presents an ideal opportunity  to hide the large com m unication latencies 
of slow networks.
3.3  Load balancing  J D cg
JDcg fits the flexible-phase iteration  model: The corrections et need not be com puted to 
the same accuracy, so the correction phase is flexible. The highly-synchronous projection 
phase is the control phase. Thus we can load-balance JDcg by restricting each processor to 
a fixed tim e T  in the correction phase [63]. Even though imbalances will persist during the 
brief projection phase, overall load imbalance is v irtually  elim inated, because the correction 
phase dom inates the execution time. Also, some vectors e* may be com puted to  lower 
accuracy, bu t this only increases the num ber of outer iterations and often decreases the 
am ount of to ta l work.
To determ ine an appropriate T, we follow the commonly used guideline th a t BCGSTAB 
be iterated  to a convergence threshold of 2~lteT, where iter  is the num ber of outer iterations 
th a t have been performed [26]. Using classical convergence bounds for Conjugate G radi­
ent [77], we determ ine heuristically an “optim al” num ber of iterations m  th a t corresponds 
to the 2~lter threshold. To avoid hurting convergence by too large an m, we set a maxi­
mum bound m a x its  for m . T  is then  the tim e required by the fastest processor to  complete 
m  BCGSTAB steps. The algorithm  for the load-balanced correction phase proceeds as follows:
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 34
Load-balanced correction phase o f the JD cg
1. In the first JDcg iteration, do no load balancing. Each processor performs m a x its  
BCGSTAB iterations, calculates the ra te  a t which it performed them , and communicates 
its ra te  to  all o ther processors.
2. In subsequent JDcg iterations, use the ra te  m easured in the previous iteration  to rank 
the processors from fastest to  slowest. In the all-to-all comm unication of step 8 of the 
JDcg, faster processors gather the extreme-most eigenpairs and residuals.
3. Use the fastest rate  to  determ ine T, and then  iterate on the correction equation for 
this time.
In the load balanced correction phase, faster processors gather the extrem e-m ost eigen­
pairs and residuals, and the tim e T  is calculated using the ra te  of the fastest processor. 
This ensures th a t the extrem e eigenpairs get most of the work, and is necessary because 
Krylov-like m ethods naturally  find extreme eigenpairs first; tim e spent on interior eigenpairs 
benefits convergence relatively little compared to  tim e spent on exterior ones.
3 .3 .1  E x p e r im e n ta l ev a lu a tio n
We have run  the load balanced code on machines loaded with additional “dum m y” jobs 
which introduce external sources of static  and dynamic load imbalance by doing m eaning­
less bu t CPU  intensive calculations. Experim ents were run using four Sun Ultra-5 Model 
333 machines w ith 256 MB of RAM, connected via switched fast E thernet. Two large, 
sparse, sym m etric test m atrices available from M atrixM arket1 were used. The first m atrix  
is NASASRB, of dimension 54870 with 2677324 non zero elements. In the experim ents us­
ing NASASRB, we seek its lowest eigenvalue. Because the lowest eigenvalues of NASASRB 
are closely clustered, this is a difficult problem  th a t benefits from using a large num ber of
xh t t p : / / m a t h . n i s t . gov/MatrixMarket
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PHASE ITERA TIO N S 35
correction equation iterations (we use m a x its  = 150). The second m atrix  is S3DKQ4M2, 
of dimension 90449 with 2455670 non zero elements. We use JDcg to  find the largest eigen­
value of S3DKQ4M2, which is an easier problem  requiring only a small num ber of correction 
phase iterations (we use m a x its  = 20). For bo th  cases, BCGSTAB is preconditioned w ith a 
local ILUT(0,20) preconditioner [76]. To aid our experim ental evaluation of the load bal­
ancing scheme, we instrum ented JDcg w ith profiling code th a t tim e-stam ps the beginning 
and end of each comm unication operation. Synchronizing the initial tim estam ps allows us 
to determ ine precisely where each JDcg process is spending its tim e, and thus to  precisely 
quantify the am ount of load imbalance: for each comm unication th a t occurs, the tim e bi 
wasted by processor i due to  load imbalance is the tim e the processor took to complete the 
comm unication minus the minimum  of the tim es taken by all processors.
Tables 3.1a and 3.1b summarize the performance of JDcg when run  against a sta tic  set 
of dum my jobs. The results for bo th  m atrices show th a t the load balancing scheme performs 
quite well, cu tting  execution times roughly in half, even for the most heavily loaded cases.
We also ran  our JDcg codes against dynamic workloads. Each processor was loaded 
w ith an additional dum my job which executes an endless loop th a t sleeps for a random  
am ount of tim e and then performs register based com putations for a random  am ount of 
time. The duration  of the sleep and com putation phases are sampled from exponential 
d istributions w ith means A and p. seconds, respectively. Tables 3.2a and 3.2b summ arize 
our results for m atrices NASASRB and S3DKQ4M2, respectively. Since the behavior of 
the dummy jobs varies between experiments, ten  trials were run  for each set of param eters, 
and we report average tim ing and standard  deviations. The load balancing scheme works 
well in the presence of a dynamic load imbalance. Performance of the scheme generally 
worsens as A and fi decrease. This is not surprising, because as the average duration  of the 
com putation done by the dum my jobs decreases, the ability to forecast the speed of a node 
based on its performance during the previous correction phase is lessened. Note, however, 
th a t because our scheme forces each processor to  spend the same am ount of tim e T  in the
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERATIO N S 36
correction phase, this lessened ability does not result in poor load balancing, bu t only worse 
overall convergence due to  poorer estim ation of the optim al value of T.
A
B
T able 3.1: Performance when using JDcg to find the smallest eigenpair of NASASRB (A) and 
S3DKQ4M2 (B) on a system subject to various static external loads. The “Load” column indicates 
the number of jobs (including JDcg) running on each node. For example, the notation “2211” 
indicates that two processors are running an additional job and the two other processors execute 
only the JDcg job. “Its” denotes the number of outer iterations, “Mvecs” the total amount of matrix- 
vector multiplications, “Time” the total wall-clock time in seconds, and “% Imbal” the percentage 
of time wasted due to load imbalance over the lifetime of the the application.
3.4 Load balancing a fu lly  m ultigra in  Jacob i-D av id son  solver
3 .4 .1  B e n e fits  o f  m u ltig ra in  p a ra lle lism
The hybrid coarse/fine grain parallelism  employed by JDcg is a special case of w hat we 
term  “m ultigrain” parallelism. The projection phase is carried out using a d a ta  parallel al­
gorithm  suitable for fine-grained parallel architectures, while the correction phase is carried 
out in a coarse-grained fashion, w ith each processor working independently to com pute its 
own correction. It may not be appropriate (or even possible), however, for each processor to 
compute a correction: If we wish to  use many processors, it becomes inappropriate  because 
the block size m ust equal the num ber of processors, and large block sizes are algorithm i­
cally inefficient, dram atically increasing the num ber of m atrix-vector products th a t must
W ithout load balancing W ith  load balancing
Load Its Mvecs Time % Imbal Its Mvecs Time % Imbal
1111 19 9813 1745 4.3 18 9205 1666 3.8
2111 19 9813 3509 38.5 19 8517 1783 5.6
2211 19 9813 3526 26.4 19 7381 1788 4.4
2221 19 9813 3599 15.5 21 6941 1997 3.6
W ithout load balancing W ith load balancing
Load Its Mvecs Time % Imbal Its Mvecs Time % Imbal
1111 24 1959 601 4.0 25 1921 564 5.5
2111 24 1959 1096 38.0 25 1720 601 8.3
2211 24 1959 1120 27.4 28 1686 712 10.4
2221 24 1959 1121 15.5 35 1706 892 10.4
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 37
A
W ithout load balancing W ith  load balancing
A t* Time % Imbal Its Mvecs Time % Imbal
400 400 3135 (141) 23.6 (3.1) 21.0 (1.3) 8586 (486) 2094 (192) 5.4 (1.0)
300 300 3256 (165) 25.9 (2.3) 21.8 (1.1) 8800 (568) 2161 (160) 5.2 (0.5)
200 200 3094 (107) 25.4 (1.7) 21.5 (1.4) 8873 (531) 2122 (189) 5.5 (0.8)
200 100 2778 (115) 23.5 (1.5) 20.9 (1.1) 9250 (569) 2010 (124) 4.8 (0.4)
100 200 3356 (113) 21.1 (1.4) 22.7 (0.7) 9019 (328) 2539 (191) 6.4 (0.9)
100 100 2977 (185) 21.6 (1.8 ) 22.6 (0.9) 9448 (486) 2313 ( 86) 5.5 (0.6)
B
W ithout load balancing W ith  load balancing
A V Time % Imbal Its Mvecs Time % Imbal
200 200 1074 (38) 25.2 (3.0) 28.6 (1.3) 1731 ( 37) 732 ( 46) 10.3 (0.9)
100 100 1028 ( 58) 25.6 (3.4) 28.5 (2.8) 1735 ( 58) 733 (100) 9.9 (1.9)
100 50 866 (67) 24.0 (3.8) 27.8 (1.6) 1800 ( 86) 684 ( 42) 10.3 (1.4)
60 60 1018 (47) 25.2 (1.8) 29.3 (1.2) 1775 ( 59) 751 ( 43) 10.1 (1.3)
50 100 1089 (32) 21.0 (1.8) 30.2 (3.0) 1792 (141) 857 (100) 10.3 (1.7)
30 30 974 (45) 23.6 (1.7) 30.4 (4.3) 1861 (196) 776 (132) 10.4 (2.3)
T able 3.2: Performance averages and their standard deviations (in parenthesis) for 10 different runs 
on NASASRB (A) and S3DKQ4M2 (B) with come-and-go dummy jobs on each node. Dummy jobs 
execute a loop in which they sleep for 7 seconds and then perform computations for £ seconds. 7 ’s 
and <f’s are sampled from exponential probability distributions with averages A and /r. respectively. In 
cases without load balancing, 19 outer iterations and 9813 matrix-vector multiplications a performed 
for NASASRB; for S3DKQ4M2, 24 iterations and 1959 mat-vecs.
be calculated. In other cases, the m atrix  may simply be too large to  fit in memory on a 
single compute node. W hen it is not practical for each processor to com pute a correction, 
fully m ultigrain parallelism  should be employed, in which a subset of processors—a “solve 
group”—rather th an  a single processor, are used in parallel to solve a correction equation. 
The only real difference from JDcg is th a t a d a ta  parallel algorithm  is used w ithin each solve 
group to  solve the correction equation corresponding to the block vector gathered onto it. 
The fully m ultigrain algorithm  is implem ented in JD m g [58, 59] and is advantageous in sev­
eral ways: F irst, it allows the num ber of processors P  to  exceed the block size k. Second, 
it eliminates the requirem ent th a t processors be able to  hold the entire m atrix  in prim ary 
storage. T hird, it facilitates use of com putational resources connected by heterogeneous 
networks. For example, if two clusters bo th  utilize M yrinet internally, bu t the interconnect
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PHASE ITERATIO N S 38
between the clusters uses much slower Fast E thernet, a block size of 2 could be used w ith 
each cluster forming a solve group. The brief projection phase would be lim ited by the slow 
connection between the two clusters, bu t during the correction phase each cluster could 
take advantage of its fast internal network.
3 .4 .2  L oad im b a la n ce  in  th e  fu ll m u ltig ra in  case
Though the m ultigrain algorithm  has several benefits, it can accentuate or even introduce 
load imbalance. For instance, if three identical processors are used to run  JD m g w ith a 
block size of two, one solve group will contain two processors and the other only one. Since 
the la tte r group is roughly only half as fast as the former, the load imbalance is now 33%! 
Clearly the u tility  of the m ultigrain scheme is lim ited if not used in conjunction w ith a 
load balancing scheme. Fortunately, like the coarse grain version, JD m g fits the flexible- 
phase iteration model and can be load balanced in the same m anner [57]. The only real 
difference is th a t solve groups, ra ther th an  individual processors, are the entities th a t work 
independently during the correction phase. Processor 0 of each solve group is responsible 
for coordinating the load balancing: those processors track the speed of the subgroups, 
determ ine the tim e T  to iterate on the correction equations, and inform the other members 
of their solve group when they should halt the correction phase.
We used L B lib  to  load balance JD m g and tested the performance of the scheme using 
SciClone, a heterogeneous cluster of workstations a t the College of W illiam & Mary. Figure
3.2 details its architecture a t the tim e of the experim ents. To enable m easurem ent of load 
imbalance in the m ulti-grain experiments, we tim estam p synchronous comm unication calls 
in JDmg, much as we did w ith JDcg. We do not tim estam p comm unications internal to 
the solve groups during the correction phase: there is no need, as load imbalance does 
not arise because the machines in a solve group are homogeneous. Table 3.3 summarizes 
the performance of the m ultigrain code, w ith and w ithout load balancing, on a variety 
of solve group configurations. W ithout load balancing, load imbalance is quite severe in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S 39
MHz Mem Cache
UltraS 333 256 2MB
Ultra60 360 512 2MB
Ultra420 450 4GB 4MB
SciClone Cluster
12-port Gigabit Ethernet
Typhoon
36-port Fast Ethernet Switch
T}  ... 32 SUN UltraSs ... [Jjj
36-port Fast Ethernet Switch
[|]  ... 32 SUN Ultra5s ... J
(B)
Tornado
36-port Fast Ethernet Switch
E 32 Dual processor I -  SUN Ultra60s *" □©
Hurricane
12-port Gigabit Ethernet
5 4 Quad processor _L ™ Ultra 420Rs | |(d)
F ig u re  3.2: SciClone: The William and Mary heterogeneous cluster of three homogeneous clusters: 
Typhoon, Tornado (also called C), and Hurricane (also called D). We distinguish between A and B, 
the subclusters of Typhoon, because their intercommunication passes through the Gigabit switch. 
There are three levels of heterogeneity: node architecture, number of nodes, and networks.
many of the configurations. Employing load balancing significantly improves performance, 
sometimes doing better th an  halving execution times. Load imbalance is alm ost always 
held below a tolerable level of 10%.
3.5 A  load b alanced  A dd itive-Schw arz p recon d ition er for 
F G M R E S
So far we have applied our load balancing m ethod only to the coarse/m ulti-grain Jacobi- 
Davidson eigenvalue solver. In this section we dem onstrate its applicability in a different 
setting. The Generalized M inimal Residual (GMRES) m ethod [72] is one of the most pop­
ular iterative m ethods for solving unsym m etric linear systems. A variant known as Flexible 
GMRES (FGM RES) [75] allows a different preconditioner to be used a t each step. In par­
ticular, an iterative m ethod may be used. We have applied our load balancing scheme to 
a parallel FGM RES solver th a t employs a  one-level additive Schwarz m ethod (ASM) pre­
conditioner. ASM is an iterative m ethod, the basic idea of which is to  break the problem  
dom ain into (possibly overlapping) subdom ains and to  approxim ately solve for a correction 
vector on th a t dom ain while ignoring dependencies upon the o ther subdom ains. The correc-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PH ASE ITERA TIO N S  40
W ithout load balancing W ith load balancing
Nodes Time Mvecs %imbal Time Mvecs %imbal
AD 3265 13058 36.17 1746 10515 4.47
Ai6Ai6DgD8 4022 16910 38.96 1692 11208 5.14
C32D 3282 13058 39.57 1631 10478 5.01
A16A16C 16C 16 1405 12424 11.02 1546 14698 5.24
CieCieDgDg 4037 16910 41.71 1544 10711 6.22
AC32 1585 12730 9.46 1450 12833 2.05
CD 3495 13996 52.37 1381 9608 7.68
C32C32D8D8 3132 13124 58.32 1284 11202 9.94
(AB)C 1198 12656 11.97 1214 13653 5.98
(AB)D 3500 13996 55.42 1126 8870 8.97
ABC32C 32 981 12240 21.00 991 14167 8.99
ABD8D8 3152 13124 61.58 941 8680 11.78
(AB)C32C 32D 1870 14534 52.64 724 9481 15.05
Table 3.3: Performance of the multigrain JD running on different node configurations, with and 
without load balancing. “Time” is wall-clock time in seconds and “Mvecs” is the number of matrix- 
vector products computed. “%imbal” is the percentage of time wasted due to load imbalance. Strings 
within the “Nodes” column specify what nodes are used for an experiment: For each subcluster that 
is utilized, its letter is given. If a subscript n is appended to that letter, it indicates that only 
n  processors of the subcluster are utilized; if no subscript is present, all processors are utilized. 
For instance, “C” means that all 64 processors of cluster C are used, while C32D indicates that 32 
processors from cluster C are used together with all the processors from cluster D. For each subcluster 
utilized, its letter is given. When multiple subclusters are assigned to one block vector, they are 
grouped together with parentheses. E.g., “(AB)” indicates that subclusters A and B work together 
on the same block vector (are in the same solve group), whereas “AB” indicates that subclusters A 
and B work on different block vectors (each composing their own solve group).
tions are then added to  the current approxim ation to  the solution to  obtain  the next iterate. 
Because the components of each subdom ain are not updated  until the end of an iteration, 
the subdom ain solves can be performed independently, and therefore in parallel. Because 
the solves can proceed independently and need not be performed to the same accuracy, 
ASM fits the flexible phase iteration model.
W hen one step of ASM is used to  precondition an FGM RES iteration, the application 
of the preconditioner can be load balanced by restricting all processors to  spend the same 
am ount of tim e on their subdom ain solves. Using this approach is considerably more com­
plicated for ASM th an  for Jacobi-Davidson, however. For Jacobi-Davidson, we select the
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PHASE ITERA TIO N S 41
tim e T  for the correction phase by identifying an optim al num ber of iterations and then 
estim ating the tim e required for the fastest processor to  complete th a t num ber. If slower 
processors complete significantly fewer iterations, this does not prevent convergence. (In­
deed, in the proceeding chapter we dem onstrate how we cope w ith memory pressure on a 
node by having it do no iterations.) Such an approach is not possible for ASM precon­
ditioned FGM RES. Convergence may stagnate if each subdom ain solve does not reach a 
certain level of accuracy. Furtherm ore, it is difficult to predict how many iterations are 
needed to reach a given accuracy: some subdom ains are much “harder” th an  others, and on 
the same subdom ain the num ber of required inner iterations may vary considerably from 
outer iteration to  outer iteration. For these reasons, and to enable b e tte r response to  dy­
namic load variations, we determ ine the tim e T  dynamically by building a consensus: The 
inner phase ends when all processors have reached a m inimum  accuracy for their subdo­
m ain solve. This incurs some comm unication overhead, bu t because dom ain decomposition 
m ethods lose their effectiveness as more subdom ains are added, the num ber of subdom ains 
should not become great enough to  cause problems.
We added functions to  L B l i b  to  support determ ining T  by consensus and used the 
library to  write a load-balanced ASM preconditioner for the PET Sc framework [5, 6] de­
veloped at Argonne National Laboratory. The load balancing functionality is completely 
encapsulated w ithin the preconditioner and can be used by any application th a t uses a 
PETSc solver th a t supports an iterative preconditioner.
Table 3.4 summarizes the performance of FGM RES when used with the load bal­
anced ASM preconditioner to  solve a two-dimensional convection-diffusion problem  with 
a Reynolds num ber of 1000. Differences in the conditioning of the subdom ains cause load 
imbalance even when no external jobs are present; in this case use of the load balancing 
scheme reduces execution tim e by an average of 18%. In the worst-loaded case, execution 
tim e is reduced by 24%. Such gains are significant, bu t it is clear th a t the load balancing 
scheme cannot fully correct any substantial load imbalance. This is because increasing the
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 3. LOAD BALANCING  FLEXIBLE-PHASE ITERA TIO N S 42
accuracy of a subdom ain solve yields dim inishing returns: once the solve reaches a certain  
accuracy, going beyond th a t does nothing to speed convergence of the outer m ethod. For 
dom ain decomposition m ethods, it appears th a t repartitioning is the only way to correct 
substantial external load imbalance. Because repartitioning is costly and cannot be done 
frequently, our load balancing approach could play the role of reducing load imbalance in 
between repartitionings. We also note th a t our m ethod is useful for sm oothing out load im­
balances due to differences in the condition num ber of the subdomains; partitioning m ethods 
th a t can ensure th a t all domains are similarly conditioned are not known.
non-lb lb
Case Its Time Its Time
1111 56 77 45.0 63.6
2111 56 137 41.3 105.7
1211 56 118 44.0 102.7
2211 56 142 43.0 122.6
1112 56 104 40.7 83.9
Domain Min its M ax its Avg its
0 2 34 19.7 (7.7)
1 3 29 16.5 (6.2)
2 4 32 12.3 (5.2)
3 3 25 14.6 (4.5)
Table 3.4: Performance of FGMRES when preconditioned with ASM without load balancing (“non­
lb”) and with load balancing (“lb”). Four processors from the Typhoon subcluster (A in figure 3.2) 
were used to solve a convection-diffusion problem with Reynolds number 1000, discretized with a 
5-point centered-difference scheme. Subdomains are solved to a relative tolerance of le-1 without 
load balancing and to a higher accuracy on lightly loaded machines when load balancing is used. 
Some subdomains are harder to solve than others, leading to natural load imbalance. The right table 
shows the minimum, maximum, and average number (std. dev. in parentheses) of iterations required 
to solve each subdomain problem to a relative tolerance of le-1. The left table depicts performance 
for different external load cases; the notation used is as in table 3.1. For example, “2111” means 
that the node 0, which has the hardest domain, is loaded with a dummy job. We ran experiments 
for the cases which result in the most load imbalance. Because execution time is variable when load 
balancing is used, we report average times for the load balanced code.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 4
Load balancing under memory 
pressure
The load balancing scheme we described in the preceding chapter works well when external 
load is CPU intensive in nature. In  many cases, however, external jobs compete for scarce 
memory resources as well. If the memory requirem ents of the processes on a node far 
exceed the available physical memory, a sizeable num ber of CPU  cycles will be wasted as 
the system swaps pages to  disk. Furtherm ore, if the page replacem ent policy of the v irtual 
memory system  is ill-suited to  the memory access pa tterns of the jobs, the m ajority  of CPU 
cycles will be wasted as the system  thrashes. Such memory pressure on a node can cause a 
synchronous parallel job to  grind to  a standstill under trem endous load imbalance. In  this 
chapter, we explore how our load balancing scheme can be modified to  improve performance 
under memory pressure.
4.1 L oad-balanced  J D cg  under m em ory pressure
To determ ine the robustness of our load balancing scheme under memory pressure, we 
ran  some experim ents in which we loaded nodes w ith a dum my program  requiring a large 
am ount of memory. The dum my job continuously performs m atrix-vector m ultiplications 
w ith a m atrix  of size M  MB. Our initial experim ents were conducted on machines running 
Solaris 7, bu t we found th a t the operating system  would prevent thrashing by essentially
43
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 44
“starving” either JDcg or the dum my job, scheduling one job consistently in favor of another. 
This policy is docum ented in [56]. Because of this behavior, we instead conducted our 
experiments on a system  of four 1 GHz Pentium  III machines w ith 128 MB of RAM running 
RedHat Linux 6.2, which utilizes a version of the Linux 2.2 kernel. The machines were 
interconnected via fast E thernet.
We found th a t a dum my job of size M  = 80 MB was sufficient to  consistently cause 
excessive swapping. Furtherm ore, increasing M  did not increase the am ount of swapping 
observed, so we used th a t value in our experiments. Table 4.1 displays our results for 
NASASRB (S3DKQ4M2 is too large to  use on these systems). For load balanced JDcg, 
the execution tim e becomes considerably worse as the external load is increased. This is 
largely due to the fact th a t in these experiments, a node th a t is running a dum m y job is 
so much slower th an  the fastest node(s) th a t it only completes one b c g s t a b  iteration each 
tim e through the correction phase (and even with it performing only one iteration, the other 
nodes still m ust wait for it). However, when compared w ith the performance of non-load 
balanced JDcg, the num bers are very encouraging. W ithout load balancing, JDcg fares very 
poorly when m ultiple nodes are burdened with external jobs (2-2-1-1 and 2-2-2-1 cases)— 
much worse th an  when only a single node is burdened (2-1-1-1 case). This is somewhat 
surprising: since non-load balanced JDcg is lim ited by the  speed of the slowest processor, 
one might expect the difference between its performance under loads of 2-1-1-1 and 2-2-1-1 
to  be minimal, as it is in the experiments described in section 5.1. This is perhaps due to 
ex tra  load imbalance th a t is introduced because the tim ing of swapping activity may not 
coincide w ith th a t of other nodes subject to  external loads.
4.2 A vo id in g  th rash in g  from  w ith in  th e  a lgorith m
O ur experim ents dem onstrate th a t the load balancing scheme significantly improves per­
formance under memory pressure. It is clear, however, th a t even when using the scheme, a 
great deal of tim e is still lost to disk swapping and the resultant load imbalance.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 45
W ithout load balancing W ith  load balancing
Load Time % Imbal Its Mvecs Time % Imbal
l - l - l - l 823 5.4 20 10312 843 4.3
2-1-1-1 3288 56.4 21 8464 1566 32.7
2-2-1-1 6677 51.7 20 5225 2067 34.0
2-2-2-1 7500 36.7 25 3535 2719 23.9
T able 4.1: Results for NASASRB with some nodes externally loaded by static, memory intensive 
dummy jobs. Externally loaded nodes run a dummy job that performs repeated matrix-vector 
multiplications using a matrix of size 80 MB. In the cases without load balancing, 19 outer iterations 
and 9959 matrix-vector multiplications are always performed.
In an a ttem pt to improve the performance of load balanced JDcg in the presence of 
memory-intensive external loads, we have im plem ented a heuristic th a t recedes JDcg during 
the correction phase on a node th a t is thrashing. The idea is to  let the com peting job 
utilize 100% of the CPU and memory resources, hopefully speeding its completion and 
hence relinquishm ent of resources.
Obviously, this strategy will only be effective if the lifetime of the com peting job is 
less than  th a t of JDcg. Performance prediction models [13, 91] could be used to estim ate 
these lifetimes, bu t obtaining useful estim ates may be difficult. In the current work, we use 
no such predictive models and instead use a na tu ra l upper lim it on how long JDcg stays 
backed off before it resumes: the tim e T  th a t each node is given to perform  the correction 
phase. If the lifetime of the com peting job is longer th an  th a t of JDcg, this means th a t the 
anti-thrashing scheme will “starve” the correction phase on a node th a t is thrashing. The 
load balancing scheme will, however, ensure th a t JDcg still converges because o ther nodes 
will shoulder the load.
Upon entering the correction phase (excluding the first tim e, in which no load balancing 
is done), the anti-thrashing algorithm  checks to see if the thrashing was occurring during 
the ju st completed projection phase. If so, it recedes for Twait seconds. The choice of Twait is 
discussed later. After this period, it checks to  see if it is appropriate to  resume com putation 
or to continue waiting. If the application decides to resume com putation, it checks to see 
if the node is thrashing after the completion of each BCGSTAB iteration. If so, it recedes
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 46
once again for Twait seconds. The following pseudocode describes the basic steps of the 
algorithm:
A nti-thrashing m odification o f th e correction phase
Reset the clock Teiapse(i =  0
answers to  two tests: (1) should it recede if currently  running and (2) should it resume 
if currently sleeping. T h a t is, it m ust be able to  accurately identify when the system  
is thrashing and to  predict whether or not resum ing com putation will cause thrashing. 
W hile it may be better to  answer these tests using performance m easurem ent/prediction 
middleware, our experim ents have shown th a t one can obtain  useful results using some 
simple performance measures th a t the operating system  makes available.
For the first test (test for receding), JDcg checks the memory page swap-out ra te  swapout 
and recedes if it exceeds a given threshold sw ap-threshold. We also considered using such 
inform ation as page fault rates and page swap-in rate, bu t we found the swap-out ra te  to 
be a  very reliable indicator. Additionally, it may be desirable to check to  see if the CPU  
idling tim e exceeds some tolerable threshold, since if there is some paging but little  CPU 
idling, there is little  benefit to  receding. We did try  using an idle tim e threshold, bu t, again, 
we found the swap-out ra te  alone to  be a very reliable indicator so we did not resort to  a 
more com plicated criterion. In  a following section we describe how we choose an acceptable 
swap-out threshold. The pseudocode below describes the first test:
LI: If then
repeat sleep {Twait)
until  ^ Test for resum ing 
e n d if
w h ile  (T >  Teiapsed +  e s tim a te d  t im e  to  co m p le te  a b c g s t a b  iter a tio n )
p erform  o n e  BCGSTAB ite r a tio n
if then  goto LI
end
The performance of the algorithm  depends on its ability to  correctly determ ine the
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 47
Test for receding
O btain  the swap-out ra te  for the ju st completed com putation phase: swapout
If ( swapout > swap-threshold ) 
then  re tu rn  TRUE
Once JDcg has receded, it m ust be able to determ ine whether beginning/resum ing 
BCGSTAB iterations (and thus swapping JDcg back into memory) will cause thrashing. 
This may be as easy as determ ining th a t the am ount of free memory on a node is enough 
for JDcg to  run. Unfortunately, lower priority processes or the buffer cache may utilize 
large am ounts of memory th a t they would free if asked by a higher priority job, so the free 
memory size reported by the system  usually greatly underestim ates the am ount of memory 
th a t is actually available to  JDcg. Therefore, in addition to checking the am ount of free 
memory, JDcg also checks the idle tim e of the system: if the memory-intensive external job 
has finished and there is no other CPU intensive job running on the system, the idle tim e 
should be very high because JDcg is recessed. (Note th a t JDcg considers tim e spent by 
the CPU running nice’d jobs as idle tim e, so low priority jobs th a t are using lots of CPU 
tim e will not interfere w ith this strategy.) The following pseudocode describes the test for 
resuming:
Test for resum ing
If (Relapsed T  'Bwan >  T) then
re tu rn  TRUE and exit the correction phase
else
O btain  the CPU  idling percentage over the waiting period: idle 
O btain  the current available free memory: free-mem  
O btain  the current resident memory for JDcg: JD-res-mem  
E stim ate the memory JDcg requires to  avoid thrashing: JD-req-mem  
If ( idle > idle-sufficient ) OR ( JDjreqjmem — JD-res-mem  <  free-mem  ) 
then  re tu rn  TRUE
endif
Note th a t the first line of the above algorithm  checks to  see if waiting for Twan seconds 
will cause JDcg to  spend more th an  the allotted T  seconds in the correction phase. If so, 
JDcg exits to the correction phase to ensure th a t this does not happen.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 48
4 .2 .1  C h o o s in g  th e  p a ra m eters
The sleep tim e Twau  m ust be short enough to  allow JDcg to  quickly respond when the 
system  becomes available, bu t it should also not be so short th a t an excessive am ount of 
CPU tim e is spent checking the sta tus of the node. We choose the longest Twait th a t yields 
a tolerable ratio  Twait/T ,  which is the upper bound on the am ount of tim e th a t may be 
wasted by a node when the com peting memory-intensive job finishes while the JDcg process 
is sleeping.
Page swap-out activity comes in bursts, w ith the frequency depending on the par­
ticular memory access patterns of the program s th a t are running. This means th a t the 
sw ap-threshold  is sensitive to the tim e interval over which swapout is sampled. Ideally the 
com putation interval between the two m easurem ents used to calculate swapout should be 
longer th an  the interval between bursts. It is not feasible, however, to  adap t JDcg to  ensure 
th a t the com putation intervals exceed the burst intervals. Alternatively, the burst interval 
tb could be estim ated and then  a program  th read  could accum ulate the swapout ra te  over 
the past tb seconds. In our experim ents we found th a t simply m easuring swapout over 
the com putation interval was adequate. Note th a t determ ining an appropriate  value of the 
sw ap-threshold  is not difficult, since even low levels of swap-out activity signify memory 
contention: a very small sw ap-threshold  is sufficient.
The value of idle-sufficient should be large enough to  ensure th a t a non-idle system  is 
not m istaken as an idle one, bu t m ust also be small enough so th a t jobs w ith low CPU  
utilization are ignored. We have found th a t a value between 80 and 90% works well.
The swap-out rate, idle percentage, and current am ounts of free and JDcg resident 
memory can all be calculated from statistics provided by system  counters th a t are accessible 
on most Unix-like systems through the /p ro c  pseudo-filesystem.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 49
4 .2 .2  E x p e r im e n ta l resu lts
As sta ted  previously, we have observed on the Sun platform s th a t the Solaris scheduler 
prevents thrashing by arb itrarily  “starving” one of the processes. This increases the overall 
th roughput on a node but can be very detrim ental to parallel jobs. In the case of JDcg, 
bo th  the projection and correction phases are starved, im peding progress on all of the nodes 
the JDcg is using.
Because of this behavior, we conducted our experim ents w ith the anti-thrashing scheme 
using a system  of four 1 GHz Pentium  III machines interconnected via switched Fast E th ­
ernet and running RedH at Linux 6.2. O ur test m atrix  is NASASRB. JDcg requires about 
67 MB of memory when applied to  NASASRB, so we chose J D jreq jm e m  — 66 MB, ju st 
below th a t value. We observed th a t, for a m easurem ent interval >  tb, average swap-out 
rates never dipped below 150 K B /s, so we used this value for sw ap-threshold. Twan  was set 
to  a sizable 20 seconds. The values of sw ap-threshold  and Twail are fairly conservative and 
may result in JDcg receding fewer times th an  optim al, bu t are unlikely to  result in JDcg 
receding erroneously.
All of our experim ents were run  on four nodes, w ith node 0 running a dum my job th a t 
alternates between sleeping and performing m atrix-vector com putations. We have found 
th a t using a  dum my job of size 80 MB is sufficient to consistently cause excessive swapping 
when the dum my job and JDcg are running. Furtherm ore, we found th a t increasing the 
size of the dum m y job did not increase the am ount of swapping because the memory access 
pa tte rn  does not change. For these reasons, we have chosen to  use a  dum my job size of 80 
MB in our experim ents. The dum m y job is described by the following pseudocode:
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 50
C ode for dum m y job:
W ait 60 seconds to allow JDcg to initialize.
Create a vector v  and a random , dense m atrix  A  of size 80 MB.
For i =  1 to  lim  do
For j = 1 to  nm v  do 
b =  A v  
End for
Sleep for A seconds.
End for
The dum my job can be viewed as sim ulating a series of jobs th a t do a specified am ount 
of work and then  term inate. Each of the lim  passes through the outer loop can be viewed 
as the execution of one of these “jobs” . The inner loop executes the n m v  m atrix-vector 
m ultiplications th a t each “job” performs. The inter-arrival tim e between the “jobs” is A 
seconds. Unlike in the experim ents described in previous sections, it is necessary to  use a 
dum my job th a t does a specified am ount of work and then stops, since the idea of the an ti­
thrashing scheme is to  speed up JDcg by speeding the completion of the memory intensive 
jobs th a t compete w ith it. Therefore, we chose values of lim  and n m v  so th a t the (actual) 
dum my job can complete all of its work w ithin the lifetime of JDcg (w ithout the an ti­
thrashing scheme). Table 4.2 summarizes our results. Because there is some variability in 
the performance obtained using a particular set of param eters, we report averages over five 
trials and their standard  deviations in parentheses.
W ithout anti-thrashing W ith  anti-thrashing
A nmv lim Its Mvecs Time Its Mvecs Tim e
n /a 46 1 19.4 (0.9) 8681 (714) 1125 (175) 19.6 (0.5) 9625 (287) 913 (27)
n /a 93 1 19.6 (0.5) 8202 (319) 1256 ( 92) 19.2 (0.8) 9317 (516) 928 (39)
30 3 4 19.6 (0.9) 9011 (654) 1004 ( 89) 19.0 (0.0) 8820 (108) 961 (46)
60 7 3 20.4 (0.9) 9088 (649) 1136 ( 98) 19.6 (0.9) 9088 (222) 943 (15)
60 15 2 20.6 (1.3) 9136 (812) 1157 (131) 19.6 (0.5) 9199 (416) 946 (49)
T able 4.2: Performance of the anti-thrashing scheme for NASASRB with a come-and-go memory 
intensive dummy job on one node. The tests for receding/resuming in the anti-thrashing algorithm  
are performed through system measurements. The dummy job performs lim repetitions of nmv 
matrix vector multiplications with an 80MB matrix, sleeping for exactly A seconds between each 
pass.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 51
Our experim ents show th a t the anti-thrashing scheme yields an appreciable performance 
gain, often reducing execution tim es by 20% compared to load balanced JDcg w ithout 
anti-thrashing. To put these num bers in perspective, it should be noted th a t the original 
load balancing scheme often mimics the anti-thrashing scheme, usually completing only one 
BCGSTAB iteration  per correction phase on a node th a t is thrashing because the performance 
of the node is so poor. Obviously, the non-load balanced code is simply not viable for such 
situations.
The performance of the scheme is dependent on its ability to  correctly determ ine when 
to recede and when to resume. Because our scheme cannot make these decisions perfectly, 
we obtained an upper bound on the scheme’s performance by running some experim ents 
in which we allowed our application to  “cheat.” R ather than  using system  statistics to 
determ ine when to  recede and resume, we modified the dummy job to create a specific file 
when it begins a com putation phase and to  delete it when it ends. JDcg could then check 
for the existence of the file to  make its decisions, which would be made perfectly. Our 
results are sum m arized in Table 4.3. Com paring these ideal results to those in Table 4.2, 
we can see th a t the performance is essentially the same.
W ith  anti-thrashing
A nmv lim Its Mvecs Time
n /a 46 1 20.2 (1.1) 9936 (651) 942 (71)
n /a 93 1 19.4 (0.5) 9233 (313) 920 (39)
30 3 4 19.5 (0.6) 9346 (323) 970 (48)
60 7 3 19.6 (0.9) 9414 (461) 917 (42 )
60 15 2 20.0 (0.7) 9774 (440) 961 (45)
T able 4.3: Performance of the “ideal” anti-thrashing scheme for NASASRB with a come-and-go 
memory intensive job on one node. The JDcg knows exactly when the system is thrashing by 
communicating directly with the dummy job.
In addition to  shortening the execution tim e of JDcg, the scheme also has the added 
benefit of increasing to ta l system  throughput. Table 4.4 shows th a t the reduction in the 
execution tim e of the dum my job can be dram atic.
One might be concerned w ith how our scheme performs when the com peting memory-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 52
w ithout anti-thrashing w ith anti-thrashing w ith ideal testing
A nmv lim Time Time Time
n /a 46 1 736 (409) 113 (17) 140 (15)
n /a 93 1 1055 (129) 187 (22) 202 (26)
30 3 4 554 ( 66) 223 (73) 171 (40)
60 7 3 808 ( 97) 152 (50) 141 (19)
60 15 2 864 (150) 158 (20) 141 (22)
T able 4.4: Improvements in execution time of the dummy job because of increased processor 
throughput due to our anti-thrashing scheme.
intensive application outlives JDcg. In th is case, the scheme cannot reduce the runtim e of 
JDcg, and could conceivably even degrade performance, since a node th a t is thrashing will 
never do any BCGSTAB iterations. We ran  some experim ents w ith a dum my job th a t outlives 
JDcg and verified th a t, regardless of whether or not the memory-intensive job  outlives JDcg, 
the scheme does not result in any m easurable performance degradation. As noted previously, 
when JDcg runs w ithout the anti-thrashing scheme on a node th a t is thrashing, typically 
the node is so slow th a t it only completes one BCGSTAB iteration per correction phase. 
Reducing this num ber to  zero w ith the anti-thrashing scheme does not have a noticeable 
effect on convergence. The anti-thrashing scheme does, however, significantly improve the 
overall system  throughput.
4.3  L im ita tion s o f  th e  m em ory-ba lan cing  approach
We have presented some favorable experim ental results using our “memory balancing” 
scheme th a t dem onstrate the practicality of im plem enting program s th a t dynam ically adap t 
to  memory pressure. However, we have ultim ately found the scheme to possess significant 
lim itations th a t we cannot remedy.
One of the  biggest technical shortcomings is th a t the tests for deciding when to recede 
or resume are easily fooled. They work in our controlled experim ents, bu t have proven 
to  be im practical in more realistic situations. The test for receding is fairly reliable, al­
though swap-out activity does not necessarily indicate th a t our process is experiencing
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 53
memory pressure—a lower-priority process or one th a t has been long idle may have pages 
swapped out while our pages rem ain untouched, for example. The test for resuming, how­
ever, has serious shortcomings. A recessed JDcg process will resume correction iterations if 
either idle > id lesu ffic ien t or J D jreq jm e m  — J D jre s jm e m  <  fre e -m e m .  Unfortunately, 
the value of fre e -m e m  is usually meaningless: free memory is alm ost always close to zero 
because any memory th a t would otherwise be free is usually consumed by the buffer cache. 
The test for resuming, therefore, generally reduces to checking to  see if the CPU  idle per­
centage exceeds id lesu fficien t. This works in our controlled experiments, bu t in a real-world 
situation, there is no guarantee th a t the CPU will stand  idle after a com peting job has fin­
ished its memory-intensive work. Thus our inability to  determ ine the actual am ount of 
memory available to JDcg is a serious obstacle to  the practical application of our scheme. 
Unfortunately, operating systems provide no inform ation th a t can be used to reliably esti­
m ate this quantity.
Another problem  stem s from the inability of the scheme to rem edy load imbalance during 
the projection phase. W hen load imbalance is due solely to external CPU load, this is of 
little  concern, as the solver spends little  tim e in this phase. W hen a system  is thrashing 
however, the tim e spent in the projection phase can tu rn  into an eternity. In the worst case 
scenario, which we have observed under Solaris 7, the entire parallel job can grind to  a halt 
as the memory scheduler swaps out the process during the projection phase.
Perhaps the most fundam ental shortcoming is th a t our scheme is applicable only to 
a class of algorithm s far narrower th an  th a t to  which the load balancing scheme is itself 
applicable. The flexible phase m ust be extremely flexible: a node m ust be able to  abdicate 
all responsibility for performing com putations during the phase. This is the case for JDcg, 
bu t neglecting one of the subdom ains in our load-balanced additive Schwarz preconditioner 
can result in stagnation of convergence: o ther processors may continue to work, bu t the 
solve cannot progress w ithout the preconditioner completing some work on the subdom ain.
For the above reasons and others, we have pursued another, less restrictive approach in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 4. LOAD BALANCING  UNDER M EM O RY PRESSURE 54
which applications gracefully degrade their memory requirem ents and performance in the 
face of memory pressure, ra ther th an  completely receding their com putations. We describe 
this approach in the following chapters.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 5
A dynam ic m em ory adaptation  
framework
In our earliest memory adap tation  work, we worked w ith JDcg and exploited its flexibility. 
If memory shortage was detected on a node, the node would perform  no correction phase 
iterations. This would allow a competing, mem ory-hungry job to  utilize 100% of the CPU 
and memory resources, hopefully speeding its completion and relinquishm ent of resources. 
O ur experiments yielded an appreciable performance gain, often reducing execution tim es by 
20% compared to load balanced JDcg w ithout memory adaptation. The m ethod suffers from 
several shortcomings, however, the biggest of which is th a t it is applicable only to  a  lim ited 
subset of flexible-phase algorithm s. Additionally, the mechanisms we used for identifying 
when to recede and resume correction phase iterations are not reliable in general.
In more recent work [61, 62] we have developed a memory adap ta tion  framework which 
is widely applicable and highly portable. We describe it in this chapter.
5.1 A  p ortab le  fram ew ork for m em ory  a d a p tiv ity
Many scientific algorithm s, such as iterative m ethods for linear and nonlinear systems, dense 
m atrix  m ethods, and Monte Carlo m ethods, operate on large d a ta  sets in a  predictable, 
repetitive fashion. To best utilize hierarchical memory, applications often operate in a 
block-wise fashion to  increase locality of memory access. Algorithms designed to  run  in-
55
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 56
core are blocked to  effectively utilize LI and L2 caches, while out-of-core algorithm s employ 
a  similar strategy to  best utilize DRAM. In the lingo of out-of-core algorithm s, blocks are 
sometimes referred to as panels to  distinguish from disk blocks; we adopt th a t term inology 
here. W ith  d a ta  partitioned into P  panels, the processing p a tte rn  of a blocked algorithm  
can be represented schematically as 
for i =  1 :P
Get panel pi from lower level of the memory hierarchy
Work on pi
W rite results back and evict pi to the lower level of the memory hierarchy
end
The above structu re  suggests a simple mechanism for memory adaptation: control the 
resident set size by varying the num ber of panels cached in core. If the program  has enough 
memory, it caches its entire da ta  set in core and runs as fast as a standard  in-core algorithm . 
If main memory is scarce, the num ber of panels cached is reduced, and an application- 
specific cache replacem ent policy is used. The m agnitude of the reduction varies according 
to  the memory shortage; so performance should degrade gracefully as memory becomes 
scarce (provided an appropriate cache-replacement policy is used). Thus if the am ount 
of physical memory available is only slightly less th an  the size of the adaptive program ’s 
d a ta  set, its performance should be close to  its in-core performance. A non-adaptive in- 
core program , on the other hand, may th rash  under the same conditions if the replacem ent 
policy of the VM system  is inappropriate for its access pattern . This is often the case for 
scientific applications, which commonly access large da ta  sets in a cyclic fashion. For such 
access patterns, a most recently used (MRU) replacem ent policy should be used, bu t the 
generic replacem ent policy used by the operating system  is usually an approxim ation of 
least recently used (LRU) replacement
The software required to  support the above memory adap tation  strategy can be easily 
encapsulated into a software library. An existing blocked code can then  be easily modified 
by the insertion of a few library calls. Essentially, all th a t needs to  be done is to make the 
code call a  library function th a t returns a pointer to  panel pt before working w ith it. The
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 57
function call should handle all memory m anagement in a way th a t is completely transparent 
to the application program mer. We have w ritten  such a library, which we call MMLIB (for 
“memory m alleability library”).
5.2 E lem en ts o f  th e  im p lem en ta tion
To use our adap ta tion  strategy we m ust be able to  grow and shrink the memory space of an 
application. Most im plem entations of m a l lo c ( ) / f r e e ( )  are not appropriate  because they 
do not provide a mechanism to release memory back to  the operating system. A successful 
m a llo c Q  call can grow a program ’s heap. W hen f r e e O  is used to deallocate memory, 
however, the heap size may not be decreased, so freed memory is not returned  to the 
operating system. Our solution is to use memory mapping, which is universally available in 
m odern operating systems and provides more explicit control over memory usage. Memory 
obtained via a memory m ap can be easily returned  to  the operating system; additionally, 
many operating systems allow program mers to  provide hints to the OS via m advise () about 
how a m apped region will be used. Using nam ed m appings to files (viz., m em ory-m apped 
I/O ) confers o ther advantages as well. Because I /O  is handled transparently, codes can be 
simplified: explicit I /O  calls are not needed, and an adaptive code can greatly resemble 
and in-core one. I /O  traffic is optim ized because non-dirty pages in m apped regions can 
be freed w ithout a write, while pages th a t are d irty  will be w ritten, bu t not to  the swap 
device. W riting d irty  pages to  swap space incurs software overhead and results in poor 
d a ta  placement on disk, since pages th a t are part of a contiguous array may be scattered 
between several non-contiguous blocks on the swap device.
In our im plem entation, each panel is w ritten  to a disk file. A panel to  be cached in 
core is m apped via an mmapO call, and the VM system  is asked to  prefetch th a t panel via 
m ad v ise ( ) .  A panel to be evicted is unm apped via an munmapO. A memory adap tation  
decision is made each tim e the program  calls our library function to  fetch a new panel. Based 
on its current estim ates of memory availability, it chooses to  increase, decrease, or m aintain
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 58
the num ber of panels cached in core. If panels m ust be evicted, victim  panels are selected 
using a user-specified, application specific policy. This allows an optim al replacem ent policy 
to  be used, in contrast to  the general policy employed by a VM system, which may be highly 
suboptim al for a given application.
5.3 A lgorith m s for ad ap tin g  to  m em ory availab ility
An adaptive application should quickly decrease its memory usage when shortage is de­
tected, while still caching as many panels as possible. W hen surplus memory becomes 
available the application should quickly m ap additional panels in order to utilize th a t mem­
ory. The m ain challenge involved in im plem enting our memory adap ta tion  framework is 
devising a simple, reliable mechanism for answering two basic questions: (1) Is there a mem­
ory shortage? and (2) If not, is there a memory surplus? Because many systems export 
very lim ited inform ation about memory usage, these questions should be answered using as 
little  system  inform ation as possible to ensure portability. We have found ways to  answer 
bo th  questions using only the resident set size (RSS) of the program.
D etecting memory shortage is the easier of the two tasks. M emory shortage can be 
inferred from a decrease in the program ’s RSS th a t occurs w ithout any unm apping on the 
p art of the program. The m agnitude can be ascertained by com paring the current RSS 
to what the program  estim ates it should be if all panels it is currently try ing to  cache are 
actually resident in memory.
Detecting surplus memory is more challenging. Ideally the system  would provide an 
estim ate of the am ount of available memory, and the program  would use this to determ ine 
when there is enough room  to cache additional panels. The best th a t most operating systems 
provide, however, is the am ount of free memory, which typically greatly underestim ates the 
am ount of additional memory actually available. O n many system s, the  am ount of free 
memory is usually very close to  zero, because any memory th a t is not needed by running 
processes is used by the file cache. The system  might still service a  large memory request,
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 59
however, by reducing the file cache size. Because the operating system  simply does not 
provide much information, the most reliable way to determ ine if a quantity  of memory is 
actually available is to try  to use it and see if it can be m aintained in the resident set.
5 .3 .1  D e te c t in g  m em o ry  sh o rta g e
Consider an application whose memory is managed by our library. We denote by P anels_ in  
the num ber of panels th a t are cached in memory. We can calculate what the RSS should be 
if all panels th a t we are a ttem pting  to  cache are actually resident. We refer to th is quantity  
as desired RSS and denote it by dRSS: 
dRSS =  Panels_in * Panel_size +  sRSS, 
where sRSS indicates the size of what we term  the static memory, th a t is, memory not 
managed by our mem ory-m alleability library. (We will explain how we determ ine sRSS in 
section 5.4.1.) By definition, the program  is under (additional) memory pressure when it 
cannot m aintain RSS equal to  this desired RSS. This suggests the following scheme:
if ( RSS <  dRSS ) then
diff =  (dRSS-RSS) /  PaneLsize 
Unm ap diff panels 
P ane lsJn  =  Panels_in — diff 
dRSS =  dRSS — diff * Panel_size
This “solution” is not sound, however, because when the program  unm aps panels in re­
sponse to  memory pressure, the victim  panels may not correspond to the memory paged 
out by the operating system. Using the above scheme can lead to  a cascade of unm appings 
all the way down to P anels_in  =  1. Consider an example (illustrated in Figure 5.1) where 
the replacem ent policy is MRU and the program ’s d a ta  set is broken into five panels, all of 
which are currently m apped (Panels_in  =  5). A memory shortage has caused the system 
to evict portions of panels 1 and 2 from memory, b u t there is enough memory available to 
keep four panels m apped ( d i f f  =  1). W hen the program  accesses panel 4, the condition 
(RSS < dRSS) holds, so P anels_in  is decreased to 4 by evicting the MRU panel 3. However, 
panel 3 was fully resident, so its unm apping causes RSS to  reduce even further by exactly
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FR AM EW O RK 60
P a n e l_ s iz e .  T h is is th e  sam e am ount th e  dRSS is reduced, so w hen th e  program  tries to  
access panel 5, th e  con d ition  (RSS <  dRSS) still ho lds— d esp ite th e  availab ility  o f  mem ory. 
T h e above process repeats until all panels but one are unm apped.
1111 ll l l dRSS = 5 
RSS = 4
B
1 2
i l l l l l l l
1 2
B ill
dRSS = 4 
RSS = 3
dRSS = 3 
RSS = 2
t
F igu re 5.1: Detecting and responding to memory shortage with the naive (and incorrect) “solution” . 
A program runs with 5 panels; each represented with a box at times A, B, and C. White regions 
of a panel are mapped and resident, while shading represents memory that is mapped but not 
resident. Black represents a panel that is not mapped. At time A, portions of panels 1 and 2 have 
been evicted by the operating system, but there is enough memory to keep four panels mapped. 
Panel 4 is accessed, and since (RSS <  dRSS) holds, P an els_ in  is decreased by one panel (since 
d i f f  =  dRSS - RSS = 1) to the new value of 4 by evicting the MRU panel 3. Because panel 3 was 
fully resident, unmapping it causes RSS to reduce even further by exactly P a n e l_ size . At time B, 
we access panel 5. The condition (RSS <  dRSS) still holds, so we unmap the MRU panel 4—even 
though there is actually enough memory to keep four panels mapped. At time C, we access panel 
1. Again, RSS <  dRSS, so panel 5 will be unmapped, further reducing RSS by P a n e l_ size . This 
pattern will continue until all but one panel has been unmapped, despite the availability of enough 
memory to cache four panels. Clearly this is undesirable behavior!
T h e problem  responsib le for th e  cascade o f  unnecessary m appings described  above is 
th at dRSS is not an accurate lower bound (it is to o  high) on  w hat RSS can be w ith ou t  
in d icatin g  a m em ory shortage. To fix th is we introduce th e  variable la s tR S S , w hich tracks 
th e  value o f RSS at th e  last panel access. W hen  a panel is unm apped  due to  m em ory  
shortage (i.e., w hen P a n e ls _ in  is reduced), la s tR S S  is set to  th e  value o f  RSS im m ed iately
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 61
after the eviction. This accounts for the fact th a t RSS may drop below the new dRSS. At 
panel accesses th a t are not accompanied by a decrease in Panels_in , if RSS > lastR SS  
then  we set lastR SS =  RSS. M aintaining lastR SS in this fashion, we can conclude th a t 
there is a memory shortage whenever RSS < lastRSS. Figure 5.2 summarizes the test for 
memory shortage, and Figure 5.3 illustrates its use. Note th a t no explicit reference to  dRSS 
is required: If there is enough memory to keep P anels_in  panels resident, then  lastR SS  
will eventually grow to m atch dRSS once all pages evicted by the operating system  during 
the last memory shortage have been touched again.
A lgorithm : D etect shortage
RSS =  Get current RSS 
if ( RSS < lastRSS )
diff =  (lastRSS-RSS) /  PaneLsize 
Unm ap diff panels
Panels_in =  Panels-in — diff, dRSS =  dRSS — diff * PaneLsize 
endif
lastRSS =  Get current RSS 
F igure  5.2: The algorithm for detecting and responding to memory shortage.
We note th a t some operating systems employ a page-fault frequency (PFF) strategy for 
preventing thrashing; Windows NT is an example. An OS employing a P F F  strategy will 
sometimes reclaim  pages from a process under no memory pressure. If the page-fault ra te  
of the process falls below a certain threshold, the OS removes page frames from the process; 
if the process then  increases its page fault rate, the OS will give frames back. The idea is 
to  prevent thrashing by ensuring th a t processes do not consume more memory th an  they 
require. Our algorithm  may in terpret th is sort of probing by the OS as memory shortage 
and evict a panel. The page-fault frequency of the program  will increase, prom pting the 
OS to allocate more page frames. Panels_ in  will eventually re tu rn  to  its original level, bu t 
some performance penalty  will have been incurred by the unnecessary eviction. We believe 
this problem  could be elim inated by requiring th a t (lastR SS -  RSS) exceed a threshold 
before action is taken, bu t we have not yet had the opportunity  to  test our algorithm  under
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 62
B
1 2JlJjl mu
1 2* Bll
1 2
dRSS = 5 
RSS = 4 
lastRSS = 5
dRSS=4 
RSS = 3 
lastRSS = 3
dRSS = 4 
RSS = 3.5 
lastRSS = 3.5
t
dRSS = 4 
RSS = 4 
lastRSS = 4
Figure 5.3: An example of detecting and responding to memory shortage using the algorithm in 
Figure 5.2. A program runs with 5 panels; each represented with a box at times A, B, and C, D. 
W hite regions of a panel are mapped and resident, while shading represents memory that is mapped 
but not resident. Black represents a panel that is not mapped. At time A, portions of panels 1 
and 2 have been evicted by the operating system, but there is enough memory to keep four panels 
mapped. We access panel 4. RSS <  lastR SS, so we unmap MRU panel 3. At time B, we access 
panel 5. RSS <  dRSS, but RSS =  lastR SS, so we don’t unmap. At time C, we access panel 1. By 
touching panel 1, the portion of it that was swapped out by the OS is brought back into memory. 
Now RSS =  3.5, and lastR SS <  RSS so we set lastR SS =  RSS =  3.5. At time D, we access panel 2, 
bringing its contents fully into memory and increasing RSS to 4. We set lastR SS  =  RSS =  4. Note 
that now lastR SS matches dRSS.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 63
a P F F  system.
5 .3 .2  D e te c t in g  m em o ry  su rp lu s
Because the operating system provides no mechanism for determ ining memory availability, 
we m ust employ an invasive approach. We periodically probe the system, a ttem pting  to 
increase memory usage by one panel. If enough memory is available, RSS should grow by 
one panel. If memory is not available, then  RSS will rem ain constant, or decrease as the 
operating system  responds to memory pressure by evicting pages.
The question is w h en  sh ou ld  w e probe? We should not probe for more memory if 
RSS < dRSS. This condition indicates th a t parts of m apped panels have been paged out 
by the system. If memory is available, RSS will grow as panels are touched and pages are 
brought back into memory. W hen RSS =  dRSS, and if there are additional panels to map, 
then we may probe, performing the next m apping of a  panel w ithout replacem ent. If the 
new dRSS cannot be m aintained, RSS will eventually decrease below lastR SS and the Detect 
Shortage algorithm  will take memory usage back to a safer level.
5 .3 .2 .1  P ro b lem s w ith  overly  aggressive  prob ing
The simplest policy is to  a ttem p t to increase P anels_ in  whenever RSS =  dRSS. This policy 
is too aggressive, however. It continually pushes P anels_in  above a safe level, incurring a 
significant performance penalty  each tim e this happens. Figure 5.4a depicts experim ental 
results th a t illustrate this. In the experim ent, there is room  to keep 40% of the panels 
in memory. O ur program  is able to tem porarily  obtain  enough memory to  hold up to 
60% of the panels. Quickly, however, the operating system  senses a memory shortage and 
begins reclaiming pages from the program , sometimes reducing RSS significantly below 40% 
of the panels. The program  adapts by decreasing Panels_in  back to  the safe value of 
40%. Eventually all m apped pages come back into resident memory, and the cycle repeats. 
Continual page reclam ation by the OS keeps average memory utilization of the program
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N FRAM EW O RK 64
significantly below the desired 40%.
5.3.2.2 Balancing perform ance penalties w ith  a dynam ic probing policy
We can reduce the aggressiveness of our policy by delaying growth of P a n e ls_ in  for a tim e 
after it has been reduced by the Detect Shortage algorithm . Choosing an appropriate  delay 
is a balancing act between two sources of performance penalties. If a probe is unsuccess­
ful, th is induces w hat we call an “incursion” penalty because it will induce paging and a 
subsequent performance decrease. On the other hand, if the program ’s memory usage stays 
below the am ount of memory available, it suffers an “inaction” penalty  because some pan­
els will be loaded from secondary storage when they could instead reside in m ain memory. 
We assume a simple model in which the tim e T  to  fetch M  words from disk depends only 
on the bandw idth  B w of the disk. Despite its simplicity, the model is appropriate in our 
case because we access large, contiguous blocks of data; seek tim es are largely hidden by 
prefetching. Define raaxRSS to be the m axim um  am ount of memory currently available to 
our program. If the program  stays a t RSS, then for each iteration, or cycle through all 
panels, (maxRSS - RSS) of d a ta  which could have been kept in-core will be brought from 
disk, incurring an inaction penalty  of (maxRSS - RSS)/ B w seconds. If the program  probes 
beyond maxRSS, the operating system  responds by decreasing RSS. As figure 5.4a shows, in 
the worst case P a n e ls_ in  may be reduced all the way down to  1. The incursion penalty  
then  is roughly (maxRSS/!?„,), because all of the evicted panels will have to be brought back 
in.
We a ttem p t to choose a delay th a t balances the two penalties. This suggests th a t we 
consider the quantity
Rpen =  m axR SS/(m axRSS - RSS),
which is the ratio  of the incursion and inaction penalties. W hen RSS is zero, the inaction 
penalty is as great as the incursion penalty, so we have nothing to  lose by probing for more 
memory; thus when R pen =  1 we should probe as soon as possible. W hen the  ratio  is greater
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 65
RSS (solid line) and desired RSS (dashed line) for simple algorithm
60
Average iteration time: 17.5 sec55
50
■r-
40
|  35 \  _
25
100 150
Seconds elapsed
RSS (solid line) and desired RSS (dashed line) for dynamic delay with ratio
60
Average iteration time: 4.4 sec
50
45
co
2
25
100 150
Seconds elapsed
F ig u re  5.4: Graphs depicting the necessity of a dynamic delay parameter in the algorithm for 
detecting memory availability. RSS (solid line) and dRSS (dashed line) versus time are shown for 
two versions of a memory-adaptive program that performs dense matrix-vector multiplications with 
a 70 MB matrix broken into 10 panels, each consisting of consecutive rows. Circles denote the 
completion of a matrix-vector multiplication. The program runs against a 70 MB dummy job on a 
1 GHz Pentium III machine with 128 MB of DRAM and an IDE hard disk. The top graph (a) uses 
the original adaptation algorithm with no delay. It is too aggressive, continually pushing against 
the memory limit. In response, the operating system evicts pages from the program, causing a 
significant performance penalty. The bottom graph (b) utilizes a dynamically determined delay to 
reduce this penalty: after a memory shortage is detected, attempts to grow memory usage must 
wait until the delay has elapsed. Using the dynamic delay, the algorithm settles at what is close to 
the optimal value for dRSS (dashed line) and diminishes RSS (solid line) fluctuations.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 66
th an  unity, it indicates th a t the possible incursion penalty  outweighs the possible inaction 
penalty by th a t ratio; thus suggests we should wait R pen times as long as we would in the 
Rpen — 1 case before probing. Given a base delay tim e, then, we can scale it by R pen to 
determ ine the delay:
delay =  (base delay) * m axRSS/(m axRSS - RSS)
We have noted th a t when RSS is close to  0, we should probe for memory as soon as possible. 
Since we never probe unless RSS =  dRSS, after Detect Shortage causes an unm apping the 
program  may have to wait for a full iteration (cycle through the panels) for RSS to  grow to 
dRSS. Thus the tim e for an iteration provides a reasonable approxim ation for the minimum  
delay, and thus is a na tu ra l value for the base delay. We m aintain a queue of tim estam ps for 
the last P  panel accesses to determ ine the base delay. This delay is dynam ic in character, 
constantly changing to  reflect system  conditions. For instance, if a com peting memory­
intensive job exits, the delay will decrease as the system  uses the ex tra  space to  hold more 
panels in the buffer cache; as a result, a m em ory-adaptive job will decrease the tim e it waits 
to probe for more memory.
Unfortunately, we do not know maxRSS—if we did, we wouldn’t need the algorithm  
for detecting memory surplus—so we must approxim ate it somehow. We use peakRSS, 
which is the most recently observed peak in RSS. To determ ine it, we introduce the variable 
max_since _ last_su rp lu s. This variable assumes the value of zero upon initialization. Then 
it is updated  in the following manner: W henever a panel is fetched, m ax_since_last_surplus  
is set to  m ax(m ax_since_last_surplus, RSS). Then, if shortage is detected, peakRSS is as­
signed the current value of m ax_since_last_surplus. Otherwise, if surplus is detected (i.e., 
if the algorithm  decides to  probe for more memory) we set m ax_since_last_surplus to  the 
current value of RSS. This simple scheme, illustrated  in Figure 5.5, effectively identifies the 
most recent peak in the RSS versus tim e curve at negligible expense.
Figure 5.4b dem onstrates how the introduction of a dynamic delay param eter d ram ati­
cally improves performance by greatly dim inishing RSS fluctuations while m aintaining dRSS
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FR AM EW O RK  67
m
m, p m, pm
mm m
m
m
Time
F ig u re  5.5: An example of how peakRSS is determined. Shown is a cartoon curve depicting RSS 
versus time. Points where m ax .sin ceJL ast.su rp lu s is updated are marked with an “m”; points 
where peakRSS is updated are marked with a “p”. An astute reader may notice that at the point 
marked A, RSS grows to exceed peakRSS, but because shortage is not yet detected, peakRSS is not 
updated to match RSS. This is by design; if RSS has grown beyond peakRSS, this means that until 
the next shortage, probing will no longer be delayed, and the value of peakRSS is therefore irrelevant. 
The reason will become clear in section 5.4.2, where the calculation of Rpen is explained fully.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 68
at close to  th e  op tim al value. T h e com bined algorithm  for d etectin g  m em ory shortage and  
surplus is sum m arized in  figure 5.6.
A lgorithm : A dapting to  m em ory variability
RSS =  Get current RSS
m ax-sinceJast_surplus =  max (max _s ince Ja s  t _s ur pi us, RSS)
/ /  Check for memory shortage 
if ( RSS <  lastRSS ) & ( Panels_in >  1 ) 
diff =  (lastRSS-RSS) /  Panel-size
unm ap diff panels, Panels J n  — =  diff, dRSS — =  diff * PaneLsize 
peakRSS =  m ax_sinceJast_surplus
/ /  Check for memory surplus
else if ( dRSS = =  RSS ) & ( Panels J n  <  P  )
base_delay =  Time to access the last P  panels 
if (Time since last shortage >  basc.delay * R pen )
Panels J n  + +  
dRSS + =  Panel-size 
m ax_sinceJast_surplus =  RSS 
endif 
endif
lastRSS =  Get current RSS
F ig u re  5.6: The combined algorithm for detecting and adapting to memory shortage or surplus. It 
is executed whenever a panel is requested. It is parameter free and requires no system information be­
yond the resident set size (RSS) of the program, making it highly portable, m a x .s in c e .la s t .su r p lu s  
is initialized to zero. The initial value of peakRSS is immaterial; it is not needed for calculating R pen 
until after the first shortage is detected, at which point peakRSS will be updated. We have not given 
the details of how R pen is calculated; that will be explained in section 5.4.2.
5.4 Further d eta ils  o f  th e  a d ap ta tion  a lgorith m s
T h e preceding section  presented the basic ideas b eh ind  the algorithm s th at enable our 
m em ory-m alleab ility  approach, but d id  not d iscuss som e im portant d eta ils. F irst, we have 
not explained  how th e  sta tic  m em ory size sRSS, w hich is required for ca lcu la tin g  dRSS, 
can be determ ined. Second, we have not fu lly  exp la ined  how the ratio Rpen— crucial in  
determ ining the frequency o f probes for surplus m em ory— is calcu lated . In fact, R pen can
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 69
be com puted in some different ways, each appropriate for different situations. We fully 
discuss bo th  of these issues in this section.
5 .4 .1  E s tim a tin g  th e  s ta t ic  m em o ry  s ize
In  the description of our algorithm  for adapting to  memory surplus, we have assumed th a t 
the program  has an accurate estim ate of the size of its static memory, i.e., the memory not 
associated with objects managed by our memory-malleability library. This inform ation is 
necessary for calculating the desired RSS and determ ining the portion of the observed RSS 
th a t belongs to  the managed objects. Unfortunately, this size may not be easily com puted 
or even available to the program  if a large am ount of the static memory is allocated w ithin 
linked, th ird  party  libraries. Moreover, for some program s the static  memory may fluctuate 
between different program  phases, and even during a pass through the panels. A more 
serious problem  arises when the static  memory is not accessed during a com putation phase: 
Under memory pressure, most operating systems will consider the static  memory as least 
recently used and slowly swap it out of DRAM. This can cause a false detection of memory 
shortage, and the unm apping of as many panels as the size of the swapped static  memory.
Our solution to  this problem  relies on a system  call named m incoreO  under Unix- 
like systems and VirtualQ uery () under Windows. The call retu rns inform ation about 
which pages from a certain  memory segment of a process are resident in core. Because our 
adap tation  algorithm  has no knowledge of d a ta  th a t it does not manage (i.e., it only knows 
about the panels), it uses m incoreO  to  com pute the “managed RSS” , mRSS, which is the 
am ount of d a ta  w ithin all the panels th a t is actually in-core (resident). This is precisely 
w hat is needed—from this we can calculate an accurate estim ate of the static  memory: 
sRSS =  RSS -  mRSS, with RSS obtained from the system. W ith  sRSS, we can now calculate 
dRSS =  P anels_in  * P an el_size  +  sRSS. Furtherm ore, if we use lastR SS to  track mRSS, 
instead of RSS, we can use mRSS to  determ ine if a detected “shortage” is real or simply 
an artifact of fluctuations in sRSS. (We will likewise use peakRSS to  track the most recent
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 70
peak in mRSS, ra ther th an  RSS, and so on for any other variables th a t track a resident set 
size value. We wish to  isolate our algorithm  from fluctuations in sRSS, because it does not 
manage the static  memory.)
There is a small bu t non-negligible overhead (see Figure 5.7) associated with checking 
the residency of the pages of all m apped panels, so using the above technique every tim e 
a panel is fetched is im practical. Instead, we adopt a more feasible but equally effective 
strategy: We m easure mRSS (and hence sRSS) at the beginning of each com putation phase 
involving the panels. As long as no memory shortage is detected, we use the calculated sRSS 
during th a t phase. W hen shortage is detected, we recom pute mRSS and sRSS to  ensure th a t 
a detected shortage is actual and not simply due to  variation in sRSS. Since shortage is 
not a frequent occurrence the overall m incore overhead is tiny, especially compared to the 
slow down the code experiences under memory pressure. In practical term s, reading 128 
MB from a 40 M B /s disk takes approxim ately 3.2 seconds, bu t it takes only 0.028 seconds 
to  check the residency of a  128 MB region. The overhead of checking the residency of the 
pages a few tim es during a com putation phase is not a concern.
5 .4 .2  C a lcu la tin g  R pen
5.4.2.1 A low frequency probing approach
To prevent the adap ta tion  algorithm  from probing too aggressively for memory, we scale a 
base delay (the tim e for a sweep through the panels) by the ratio  Rpen, which approxim ates 
the ratio of the penalty  for unsuccessfully probing and the penalty  for failing to  utilize all 
available memory. Ideally we would calculate this ratio  as R pen — maxRSS /  (maxRSS - mRSS), 
bu t because we have no way of knowing maxRSS we instead approxim ate it w ith peakRSS. 
There is a potential problem  here, because peakRSS may significantly underestim ate the 
actual value of maxRSS (and it becomes more likely to do so as we move further in tim e 
from the last peak). For example, say th a t peakRSS =  5 panels, bu t a com peting job th a t 
was causing memory pressure on the node has recently exited. O ur memory adaptive job
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 71
mincore() overhead
0.12
co 0.1DO0)
X
<D
O  0.08
ooc
f  0.06
o
05-acoo
<15
05
0.04
<DE 0.02
l-
0
0 100 200 300 400 500
Size of region (MB)
F ig u re  5.7: Overhead as a function of size of the region upon which m incoreO  is called; the 
overheads are small but non-negligible. The data pictured were collected on a Sun Enterprise 420R  
running Solaris 7, and are similar to what is observed on other systems.
begins growing its memory usage, bu t when it reaches RSS =  5 panels, the denom inator 
in the ratio  R pen becomes zero and our calculated delay before probing goes to infinity. 
This behavior is correct if peakRSS really does equal maxRSS, since then  there is nothing 
to  gain from probing for more memory, as RSS is a t its maximum. B ut because peakRSS 
in general is not equal to maxRSS, we m ust introduce a mechanism to ensure th a t RSS does 
not become forever fixed at the level of the last observed peak. The sim plest solution is to 
set a m axim um  R pen-max which the value of R pen cannot exceed: th a t is, we calculate
R pen — 111i II ( Ilp( j) _,ria;r • peakRSS/(peakR SS -  RSS)). (5.1)
The value of R pen-max should be high enough to  prevent overly aggressive probing, bu t not 
so high as to  cause an unreasonable delay in probing when RSS approaches peakRSS. We
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 72
have found th a t once R pen-max becomes sufficiently large, performance becomes relatively 
insensitive to  further increases in its value. This is in part due to  the ability of the buffer 
cache to  use available memory to  help speed up access to panels th a t the memory adap tation  
algorithm  does not keep m apped. We have found R penjmax =  10 to  work well in practice. 
In C hapter 7 we present some experim ents th a t examine the choice of R pen.max-
5.4.2.2 Probing w ith  higher frequency
Although the strategy using R pen-max we have ju st described adjusts to  memory availability 
quite well when external memory load is not highly variable, it has the drawback of being 
very slow to ad just to  large jum ps in memory availability: In  general, a t least R pen-max 
sweeps will be required, which can be a very long tim e if a large am ount of com putation 
occurs over a sweep. Although we have noted th a t the bulfer cache helps m itigate this 
problem, b e tte r responsiveness to  the appearance of surplus memory is desirable because it 
is be tter for the memory to  be managed by our application-specific policy, ra ther th an  the 
generic OS one. We can achieve b e tte r responsiveness by making some changes th a t reduce 
the conservativeness of our scheme. The first (relatively minor) change is based on the obser­
vation th a t maxRSS (and hence peakRSS) usually overestimates the incursion penalty  for an 
unsuccessful probe. We have found th a t unsuccessful probes typically drop memory usage 
a sizeable am ount, bu t usually come nowhere near dropping RSS to zero unless it is already 
at a very low level. To remedy th is problem, we introduce another variable, troughRSS, 
th a t tracks the most recent trough or dip in the RSS (or mRSS, more accurately) versus tim e 
curve. troughRSS is set to  zero a t program  initialization and is then  m aintained in the fol­
lowing manner: W hen a panel is fetched, we update  troughRSS =  m in(troughRSS, mRSS). 
W henever memory shortage is detected, troughRSS is set to the current value of mRSS. 
Figure 5.8 illustrates how troughRSS is m aintained. Using troughRSS, we now calculate
Rpen — ini II ( lipfn J . (peakRSS — troughR SS)/(peakR SS — mRSS))
The in troduction of troughRSS to be tte r approxim ate the incursion penalty  enables
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FR AM EW O RK 73
P
t
Tim e
F ig u re  5.8: An example of how troughRSS is determined. Shown is a cartoon curve (the same as 
in Figure 5.5) depicting RSS versus time. Points where troughRSS is updated are marked with a 
“t” . Also shown are the points where peakRSS is updated, marked by a “p”.
us to respond slightly faster to  memory surplus, bu t in practice this change m atters little  
because we still face the problem  of the denom inator (peakRSS — RSS) going to  zero. To 
address this problem  we employ the following reasoning:
peakRSS is likely a very good approxim ation to  maxRSS im m ediately after the peak is hit, 
bu t as we move further in tim e from the peak w ithout detecting shortage, the probability 
th a t peakRSS is significantly wrong increases. We cannot quantify how this probability 
increases, bu t we can say th a t if RSS becomes close to  or exceeds peakRSS and stays there 
w ithout causing shortage, then  it is likely th a t peakRSS is no longer a  good approxim ation 
for maxRSS. This is because peakRSS is actually the most recent known point a t which the 
adaptation  algorithm  probed too high (recall th a t peakRSS is only updated  im m ediately 
after a detected shortage). W hen RSS can be m aintained close to  or above peakRSS, then, 
this suggests th a t the am ount of available memory has increased, because the last tim e RSS 
h it peakRSS, shortage ensued. Since peakRSS is likely invalid, we replace it by the  only other 
upper bound for maxRSS th a t we we have: the value mRSS takes if all panels are m apped
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FRAM EW O RK 74
and resident, which we call MAXRSS. So, our heuristic does the following: Once RSS and 
peakRSS become very close, we then  reset peakRSS =  MAXRSS and recalculate R pen- (We 
take “very close” to  mean th a t the ratio  of penalties evaluates to  greater th an  R p en_max-) 
W hen the next shortage is encountered, peakRSS will be reset to  m ax_since_last_surplus. 
Note th a t, using th is heuristic, if RSS is high (and hence close to  MAXRSS), then  probing will 
still be quite conservative, bu t if RSS is small, then  probing becomes more aggressive. This 
makes intuitive sense.
We face a dilem m a in choosing between the m ethod for calculating R pen we have ju st 
described (which we denote m ethod II, or “high frequency probing” ) and the m ethod shown 
in equation 5.1 (m ethod I, “low frequency probing” ). High frequency probing is indeed 
quicker to  respond to memory surplus, bu t because it generally results in more frequent 
probing, it will result in somewhat worse performance than  low frequency probing when 
running under long-lived, static  memory pressure (and perhaps under o ther loads as well). 
Because which m ethod is most appropriate depends on the characteristics of a system  and 
the workload it experiences, we cannot categorically sta te  th a t one m ethod is superior 
to  another. We discuss the effects of the choice of m ethods for calculating Rpen in an 
experim ental context in C hapter 7.
We depict our complete algorithm  for adapting  to memory surplus using low frequency 
probing in Figure 5.9. All of the details we have discussed in section 5.4 are included. The 
complete algorithm  using high frequency probing is shown in Figure 5.10.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FR AM EW O RK 75
A lgorithm : A dapting to  m em ory variability w ith  low frequency probing
panel_access_count+ +
RSS =  Get current RSS
if (panel_access_count modulo P  = — 0) then  sRSS =  gauge_static_size() 
mRSS =  RSS -  sRSS
dRSS =  ( Total size of all m apped panels ) +  sRSS 
m ax_sinceJast.surplus =  max(max-since_last_surplus, mRSS)
/ /  Make sure th a t any “shortage” is not due to  variations in sRSS. 
if ( mRSS <  lastRSS )
sRSS =  gauge_static_size() 
mRSS =  RSS -  sRSS
/ /  Check for memory shortage 
if ( mRSS <  lastRSS ) 8z ( Panels_in >  1 ) 
diff =  (lastRSS - mRSS) /  PaneLsize
unm ap diff panels, Panels J n  — =  diff, dRSS — =  diff * Panel_size 
peakRSS =  m ax_sinceJast_surplus
/  /  Check for memory surplus
else if ( dRSS = =  RSS ) Sz ( P an e lsJn  <  P  )
base_delay =  Time to access the last P  panels 
Hpen =  mi 11 ( Rj)cn m(ls. peakRSS/(peakR SS — mRSS)). 
if ( Shortage has never occurred
OR Tim e since last shortage >  base_delay * Rpen )
Panels J n + +
dRSS + =  PaneLsize
m ax_sinceJast .surplus =  mRSS
RSS =  Get current RSS 
lastRSS =  RSS -  sRSS
F ig u re  5.9: The final, complete algorithm for detecting and adapting to memory short­
age or surplus, using low frequency probing. It is executed whenever a panel is requested, 
m ax_sin ce_ last-su rp lu s is initialized to zero. The initial value of peakRSS is immaterial because 
Rpen is not used until after the first shortage is detected, at which point peakRSS will be updated. 
The gauge.static-size() function uses a system call such as mincore() to calculate the size of the 
“managed RSS”, mRSS, and returns sRSS =  RSS — mRSS. Because the overhead of gaugej3tatic_size() 
means that it cannot be called whenever mRSS is desired, mRSS is sometimes approximated in the 
algorithm by calculating mRSS =  RSS — sRSS using an older value of sRSS.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 5. A  D YN AM IC M EM O RY AD APTATIO N  FR AM EW O RK 76
A lgorithm : A dapting to  m em ory variability w ith  high frequency probing
panel_access_count+ +
RSS =  Get current RSS
if (paneLaccess_count m odulo P  ==  0) then sRSS =  gauge_static_size() 
mRSS =  RSS -  sRSS
dRSS =  ( Total size of all m apped panels ) +  sRSS 
max_since_last_surplus =  m ax(m ax_sinceJast_surplus, mRSS) 
troughRSS =  m in(troughRSS, mRSS)
/ /  Make sure th a t any “shortage” is not due to variations in sRSS. 
if ( mRSS <  lastRSS )
sRSS =  gauge_static_size() 
mRSS =  RSS -  sRSS
/ /  Check for memory shortage 
if ( mRSS <  lastRSS ) &; ( Panelsdn >  1 ) 
diff =  (lastRSS - mRSS) /  Panel-size
unm ap diff panels, Panels_in — =  diff, dRSS — =  diff * Panel_size 
peakRSS =  max_since_last .surplus 
troughRSS =  mRSS
/ /  Check for memory surplus
else if ( dRSS = =  RSS ) & ( Panels-in <  P  )
base.delay =  Time to  access the last P  panels 
Rpen =  (peakRSS — troughRSS) /  (peakRSS — mRSS)
if Rpen ^  Rpen-max
peakRSS =  MAXRSS
Rpen = TQin(Rpen-maxi (peakRSS — troughR SS)/(peakR SS — mRSS)). 
if ( Shortage has never occurred
O R Time since last shortage >  base.delay * R pen )
Panels_in++
dRSS + =  Panel_size
max_sincc_last .surplus =  mRSS
RSS =  Get current RSS 
lastRSS =  RSS -  sRSS
Figure 5.10: The final, complete algorithm for detecting and adapting to memory shortage or 
surplus, using high frequency probing. It is executed whenever a panel is requested. The algorithm  
is essentially identical to that shown in Figure 5.9, except for the following differences: 1) troughRSS 
is used in addition to peakRSS to calculate Rpen, and 2) when the ratio used to calculate Rpen exceeds 
Rpen-max, peakRSS is set to MAXRSS, the value that mRSS would assume if all panels are fully resident, 
m ax.sin ceJL ast.su rp lu s is initialized to zero.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 6
Design of MMlib: 
A m em ory-m alleability library
Regardless of its other m erits, if our m em ory-adaptation framework is to  see use in ap­
plication codes, it m ust be simple to  embed in an actual code. To facilitate use of our 
m em ory-adaptation strategy, we have developed an object-based C-library, M M lib , th a t 
hides bookkeeping and m em ory-adaptation decisions from the program m er. We present an 
overview of the interface design and some of the technical details in this chapter.
6.1 T h e general fram ew ork su p p orted  by M M lib
The framework we have described in section 5.1, w ith repeated, exhaustive passes through 
one object broken into P  panels, is overly simplistic, bu t served for elucidating our adap ta­
tion strategy. In reality, scientific (and other) applications are much more complex, often 
working w ith many large memory objects a t a tim e, w ith various access pa tterns for each 
object, sometimes working w ith only portions of panels, sometimes working persistently 
w ith only a few panels, etc. M M LIB is designed around a more general framework th a t can 
accom modate these applications; Figure 6.1 depicts this framework.
77
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y 78
Identify memory objects M \ , M 2, . . . ,  M& needed during this phase 
for i =  [ Iteration  Space for all Objects ]
for j =  [ all Objects needed for iteration  i ]
for panellD  =  [ all panels in accessPattern(M j, i) ] 
paneL ptr =  mmlib_get_panel(Mj, panellD) 
endfor  
endfor
Work on required panels or subpanels 
for j=  [a ll  Objects needed for iteration  i ]
for panellD  =  [ all panels in accessPattern(M j, i) ] 
if  panel panellD  was modified 
W rite Back(panellD) 
if  panel panellD  not needed persistently 
m m lib_release_panel(M j, panellD)
endfor
endfor
endfor
Figure 6.1: Extended framework modeling the memory access needs of a wide variety of scientific 
applications. Note that although write-backs are represented explicitly, in our implementation writes 
are made to named memory-maps, and thus are handled transparently by the virtual memory system  
unless the user wishes to force a write-back.
6.2 D a ta  stru ctu res
A d ata  set to be m anaged by M M l ib  is broken into a num ber of panels, for which a backing 
store is created on disk. Access to  panels occur through mmapO system  calls executed by 
M M LIB. Each d a ta  set is associated w ith an MMS (“memory m alleability s tru c t” ) object, 
which contains all bookkeeping inform ation required for memory-management of the d a ta  
set, and through which all access to the d a ta  set occurs. Because an application may require 
multiple MMS objects associated w ith different d a ta  sets, and because global knowledge of 
all managed objects is required to make m em ory-adaptation decisions, M M l ib  m aintains 
a global registry (type MMReg) of all MMS objects.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y  79
6.3 C ore in terface and fu n ction a lity
T h o u g h  M M l ib  is  a  co m p lic a te d  p iece  o f  so ftw are c o n s is t in g  o f  th o u sa n d s  o f  lin es  o f  co d e , 
th e  core o f  it s  in terface  a n d  fu n c tio n a lity  ca n  b e  d esc r ib ed  b y  th e  fo llo w in g  few  fu n ction s:  
void m m lib Jn itia lize()
Initializes M M LIB, creating the registry th a t will track all of the bookkeeping information.
M M S m m lib_new _m m struct(int typ e, char *filenam e, int P, void *context) 
Creates a new MMS object of a given type (such as MMLIB_TYPE_MATDENSE for a 
dense two-dimensional array or M M LIB_TYPE_VECTORS for a one-dimensional array). 
The filename specifies the name of the backing store, which m ust already exist. (To allow 
maximum  flexibility, the MMS constructor does not create the backing store itself. We do 
provide functions to  aid in its creation.) P is the num ber of panels into which the d a ta  set 
associated w ith the MMS will be broken. The final, optional, argum ent allows the user to 
provide a pointer to a user-defined context containing private d a ta  to  be made available to 
the function im plem enting the user-specified replacem ent policy.
void *m m lib_get_panel(M M S m m s, int p)
This function is the basic building block of the library. It returns a pointer to the beginning 
of panel p. Additionally, it transparently  handles adap tation  decisions, updating  the prior­
ity queues used to  make replacem ent decisions, checking for memory shortage or surplus, 
updating  the num ber of panels cached, and performing requisite cache evictions.
void m m lib_release_panel(M M S m m s, int p)
A pointer returned  by mmlib_get_panel() is guaranteed to  rem ain valid until the panel is re­
leased by mmlib_release_panel(). This is necessary because some applications require certain 
panels to  persist throughout the m apping and unm apping of other panels. For example, if an 
application m ust com pute the interaction of a panel X m w ith panels A ,, * =  { l , . . . ,m  — 1}, 
we must ensure th a t the call to mmlib_get_panel() does not unm ap panel X rn when it fetches 
one of the panels A*. In this case, the panel A, m ust rem ain “locked” during the compu­
tation  to  ensure its persistence, and afterwards can be released. Note th a t the release does
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY  L IB R A R Y 80
n o t e v ic t  a  p a n e l from  th e  cache; it m ere ly  u n lo ck s it so  th a t  M M l ib  ca n  d o  so  if  d eem ed  
n ecessary .
void m m lib_set_update_queue(void (*func) (M M R eg, M M S, in t))
The M M l ib  registry contains a global priority queue th a t orders the panels of all MMS 
objects according to  the panel-cache eviction policy. W hen a given am ount of space 
must be freed, the M M l ib  eviction function evicts panels according to  their ordering in 
the queue until enough space has been freed. The priority queue is updated  each tim e 
th a t mmlib_get_panel() is called, inserting the newly accessed panel in the proper place. 
mmlib_set_update_queue() allows the user to specify the function th a t should be called to 
perform  this update  and m aintain any other d a ta  structures th a t may be required to im­
plement the eviction policy, such as queues local to  each MMS object. M M l ib  defaults to 
Most Recently Used (MRU) replacem ent, as this is suited to the cyclic the access patterns 
of many scientific applications.
We should note th a t to  provide m aximum  flexibility, M M LIB also provides an interface 
for the user to  specify the function th a t performs panel evictions. The preferred m ethod for 
specifying an eviction policy is to  use mmlib_set_update-queue() when possible, however.
6.4  O p tim ization s and ad d ition a l fu n ction a lity
6 .4 .1  M em o r y  a ccess  a t su b -p a n e l g ra n u la r ity
Blocking d a ta  objects into a set of panels is central to achieving good performance from 
M M LIB. Fairly large panel sizes are preferred to  keep overheads low. However, some 
program s naturally  employ a small granularity of memory access (for example, accessing 
small vectors instead of a sub m atrix), and if an application does not already (or cannot) 
use a blocked access pattern , then  injecting memory adaptiv ity  by the get_panel() semantic 
may be awkward.
More problematically, when get.panelQ  is called directly by the user, M M l ib  has no
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y 81
way of knowing what portion of the panel will be accessed. It can only assume th a t all 
of the panel will be touched, which means th a t the calculated desired resident set size 
dRSS =  P a n e ls_ in  * P a n e l_ s iz e  will be an overestim ate if in fact only a  portion of the 
panel is used. Consequently, the program  cannot probe for more memory even if all of the 
pages th a t are actually in use are resident. This can also cause “shortage” to be detected 
when in fact none exists!
For example, consider a program  w ith P a n e ls_ in  =  5 and mRSS =  5 panels. W hen 
the next panel is requested, assume there is no shortage detected, bu t we do not probe 
for memory either. At the end of the adap tation  algorithm , we set lastR S S  =  mRSS =  5. 
Because we are not probing for more memory, if the requested panel is not already m apped, 
we will have to  evict a panel from the cache to make room. Assume th a t this is the case. We 
evict a panel, bringing mRSS down to 4 panels, and then  m ap the new panel. Say th a t only 
25% of the new panel is needed. We touch th a t portion, and then  call mmlib_get_panel() 
to  fetch the next panel. There is a problem  now: Because we only touched 25% of the new 
panel, RSS =  4.25, bu t lastR S S  =  5. The adap ta tion  algorithm  will incorrectly conclude 
th a t there is memory shortage. Figure 6.2 depicts an example of th is problem  in the ISING 
application described in C hapter 7.
We solve these problems by introducing some higher-level M M LIB functions th a t serve to 
decouple the unit of memory access required by the user from the panel size used internally 
by M M lib . The user still specifies the panel size, bu t does not have to  access the d a ta  
via mmlib_get_panel() calls. Instead, they can define a smaller unit of d a ta  access via the 
function
void *m m lib_define_unit(M M S m m s, int num _units, int N , int data_type)
This function informs M M l ib  th a t the specified MMS object is to  be broken into num m nits 
logical units. Each unit will consist of N elements of type d a ta_ ty p e , where d a ta_ ty p e  
is a defined constant such as M M L IB .T Y P E JN T , MMLIB_TYPE_DOUBLE, or MM- 
LIB_TYPE_OTHER. The logical units are accessed via the following functions:
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y 82
ising_mem vs. transient (30 sec) 60 MB job
70000
redundant get_panel() calls 
get_unit() calls
60000
50000
m*
<13
N’</)
40000
o5
03
c
<13■D
03
CDoc
30000
20000
10000
0 50 100 150 200 250
Time elapsed (seconds)
Figure 6.2: Benefits of using smaller memory transfer units: RSS vs. time is shown for two im­
plementations of the ISING code described in Chapter 7. The first (red lower), uses get panel calls 
without fully using all of the panels, and thus erroneously detects memory shortage and reduces its 
RSS to a minimum. The second implementation uses mmlib_get_unit() calls to access the data in 
sub-panel sized units and avoids this problem.
void *m m lib_get_unit (M M S m m s, int u) 
void m m lib_release_unit(M M S m m s, int u)
mmlib_get_unit() returns a pointer to unit u  and places a lock on th a t unit (note th a t locks 
from m ultiple gets of the same unit do stack). A pointer to  a unit is guaranteed to rem ain 
valid until all locks on the unit are released by calls to  mmlib_release-unit().
Internally, mmlib_get_unit() calls mmlib_get_panel() when the panel in which u resides 
is not currently m apped (cached), or when the num ber of accesses to  th a t parent panel 
exceeds a given threshold. The threshold criterion ensures th a t mmlib_get_panel() is called 
periodically to allow memory adap tation  decisions to be made even if all panels are m apped. 
W ithout it, once all panels were m apped, memory shortage would never be detected because
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y 83
fetching any unit would not require an m m lib_get.panel() call.
The problems w ith dRSS and false shortages are solved by m aintaining an M M l ib  reg­
istry  variable pages_ in , which tracks the num ber of pages we are attem pting  to cache. 
W henever a unit is fetched via mmlib_get_unit(), p ag es_ in  is increm ented by the num ber 
of pages in th a t unit; when a panel is evicted, pages_ in  is decremented by the to ta l num ber 
of pages in all of the units in th a t panel th a t were accessed via mmlib_get_unit(). M aintain­
ing pages_ in  in th is m anner enables accurate calculation of dRSS =  pages_ in  * p ag e_ size  
even though regions of m apped panels may rem ain untouched, so long as all of each re­
quested unit is touched. W ith  an accurate value of dRSS, the problem  of detecting false 
shortages can also be eliminated. In  the example described above, the algorithm  will still 
detect RSS <  lastR SS . However, it will then  calculate mRSS by checking the residency of 
all pages in the m apped panels. Only if mRSS < dRSS will it conclude th a t the detected 
“shortage” is real. Figure 6.2 illustrates the effectiveness of th is strategy.
6 .4 .2  E v ic t io n  o f  “m o st-m iss in g ” p an els
One of our design goals for M M LIB is to preem pt the page replacement policy of the v irtual 
memory system  by holding the RSS of an application below the level a t which the system  
begins to  swap out its pages. This is not always possible, however, and under high memory 
pressure the v irtual memory system  can be quite antagonistic, paging out the d a ta  th a t 
M M l ib  strives to keep resident. In th is case, it may be beneficial to  “concede defeat” and 
lim it our losses by evicting those panels th a t have had most of their pages swapped out, 
ra ther th an  evicting according to the application-specific policy, say MRU. The rationale 
is thus: if the system  has evicted LRU pages, we will have to load those pages again when 
we access the parent panels, regardless of whether we keep those panels m apped or not. 
Therefore, it is preferable to evict those panels th a t have been m ostly swapped out. The 
inform ation provided by the m in co reO  calls used to  estim ate the static  memory size (see 
section 5.4.1) can be used to  implement such a “most-missing” eviction policy. This strategy
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y 84
is not a t odds w ith the user-specified replacem ent policy, as it is not used for routine panel 
replacements; it is only used to reduce the panel cache size when shortage is detected (and 
antagonism  with the VM system  policy can occur). In fact, following the user-specified 
policy and evicting MRU panels in th is situation would make things worse: we would have 
to  load the LRU pages th a t have been paged out, as well as the MRU pages th a t were 
resident until we evicted their panels.
We have implem ented support for the most-missing eviction strategy in M M LIB (a 
user-set flag specifies whether or not th is policy should be used). Some results indicate 
th a t the strategy can have clear benefits. In the scenario shown in Figure 6.3, the job th a t 
uses strict MRU eviction exhibits very slow performance initially after external memory 
pressure begins, because it m ust load the pages evicted by the operating system  as well as 
the MRU panels th a t M M LIB has evicted, which usually do not coincide w ith the panels 
from which the system has taken pages. The job th a t employs the “most-missing” policy 
adapts more nimbly to the sudden increase in memory pressure, because it does not make 
the mistake of autom atically throwing out many panels th a t have been untouched by the 
V M  system. Note th a t after the application adapts to  the sudden decrease in available 
memory, it autom atically  reverts from most-missing to  MRU replacem ent when no further 
shortage is detected.
A lthough the most-missing policy does show promise, we do not use it in the experim ents 
we report in C hapter 7. The chief reason is th a t we do not wish to further complicate the 
behavior of M M lib in our experiments; it is difficult enough to analyze its behavior w ithout 
introducing this additional complexity.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 6. DESIGN OF M M L IB : A  M EM O RY-M ALLEABILITY L IB R A R Y  85
Performance with evict_mru() vs. evict_most_missing()
24
evict_mru()
evict_most_missing()
22
20
18
16
14
12
10
8
6
4
2
0 1 2 3 4 5 6 7 8 9
Iteration number
F ig u re  6.3: Benefits of evicting partially swapped out panels. The chart shows the times for 
the first 10 sweeps of two versions of the memory-adaptive ISING code described in Chapter 7. 
(Note that sweep 0 is an initialization sweep that runs more quickly than subsequent sweeps, which 
perform Monte-Carlo steps.) In each case, the code runs with a 60 MB lattice on a Linux machine 
with 128 MB of RAM. After 12 seconds, a dummy job that randomly writes to 60 MB of memory 
begins. The memory-adaptive job that evicts panels with the largest number of missing pages, 
labeled ev ic t_m ost-m issin g , has lower and less variable times than the job employing strict MRU 
replacement, labeled evict_m ru. For both cases, the number of panels P  =  40 and low frequency 
probing is used with Rpen-max =  10.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 7
Experim ental evaluation of 
memory- adapt at ion
7.1 A p p lica tion  kernels
We have used M M l ib  to  test our memory adap ta tion  technique in three scientific application 
kernels which together represent a range of scientific applications. All of these application 
kernels exhibit memory access patterns for which MRU replacem ent is most appropriate, 
pu tting  them  at odds w ith the LRU-like policy employed by the operating system. (See also 
section 7.7, where we modify the CG code to  use an LRU-friendly pattern .) In  other respects, 
however, the memory access pa tterns differ between the applications. In  this section, we 
describe these applications and their memory access characteristics.
7 .1 .1  C o n ju g a te  G rad ien t (CG)
The first application is a conjugate gradient (CG) linear system  solver using the CG routine 
provided in SPARSKIT [74], The com putational s tructu re  of CG is common among many 
iterative m ethods, and these m ethods constitu te the basic kernel in a wide range of scientific 
applications. (This is why CG is one of the SPEC benchmarks.) Each iteration  requires 
calculation of one sparse m atrix-vector product and a few inner products and vector updates. 
The bulk of the memory dem and comes from the sparse coefficient m atrix, which we manage 
w ith M M l i b . CG also requires storage for four work vectors plus the solution and right-
86
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 87
hand side vectors. In principal, these vectors can be m anaged with MMLIB, and in fact 
th is might be necessary if large and very sparse m atrices are used, in which case the vectors 
make up a greater portion of the memory requirem ents. In our experim ents, however, we 
do not manage these vectors w ith MMLIB (so they are “sta tic” memory) as they occupy 
far less memory th an  the m atrices we use. Still, the size of the  work vectors is significant, 
as they represent around 13.5% of the storage requirem ents in our CG experim ents using a 
70 MB m atrix.
Our test code, CG, does not construct a m atrix, bu t loads from disk a pre-generated 
sparse m atrix  in diagonal form at. Figure 7.1 depicts the algorithm  for m atrix-vector mul­
tiplication of a  sparse m atrix  in this format; this is the algorithm  in which M M l ib  will be 
used to  enable memory adaptivity. The m atrices used in our experim ents are generated from 
a three-dimensional, eighth order finite-difference discretization of the Laplacian operator 
on the unit cube using a regular grid and Dirichlet boundary conditions. In the memory- 
adaptive code, they are partitioned row-wise into panels of consecutive rows. M atrix-vector 
m ultiplications sweep through each panel in typew riter fashion, left to right and top to 
bottom . We note th a t, as far as M M l ib  is concerned, CG is a read-only application: writes 
do not occur to  the m atrix  managed by the library.
7 .1 .2  M o d ified  G ra m -S ch m id t (MGS)
The second application is a code th a t progressively builds a set of orthonorm al vectors. It 
uses the modified Gram -Schm idt (MGS) procedure to  orthogonalize the vectors, a very com­
mon kernel in scientific computing. An example of a very memory dem anding application 
for MGS comes from m aterials science, where large eigenvalue problems of dimension on 
the order of one million m ust be solved for about 500-2000 eigenvectors [82]. For such prob­
lems, the storage and orthogonalization of the eigenvectors are the lim iting com putational 
factors. MGS sees very common use in im plem entations of the GM RES algorithm  [72], 
which progressively builds a vector basis from which solutions are extracted. At each step,
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 88
A lgorithm : D iagonal form at sparse m atrix-vector m ultip lication
current =  1 
for row =  1 to N do 
y[row] =  0
for i =  1 to nonzeros_per_row do 
col =  row +  offset [i] 
if  (0 <  col <  N) then
y[row] =  y[row] +  A [current] * x[col] 
endif
current =  current +  1 
enddo  
enddo
Figure 7.1: Matrix-vector multiplication algorithm for a sparse matrix of dimension N consisting 
of a number of diagonals, x is the input vector and y is the output vector. The array A[] consists of 
the elements from the first row, followed by the elements from the seconds row, and so on. Note that 
all rows consume the same number of entries in A[], so some entries will not be used: for example, 
the first row of the matrix does not contain any elements from diagonals below the main diagonal, 
so some empty elements will be “stored” in A[], The offset[] array stores the offset of each of the 
diagonals with respect to the main diagonal.
an additional basis vector is generated, which m ust be orthonorm alized against the other 
basis vectors. If the vectors are very large (as required to  solve PD Es on high resolution 
meshes, for instance) or if many vectors m ust be retained to ensure convergence, then  the 
memory requirem ents for the MGS orthogonalization may make up the bulk of the memory 
demand. O ur test code, MGS, simulates the behavior of a GM RES-type solver (such as 
Jacobi-Davidson), though it generates new vectors random ly rather th an  through a m atrix- 
vector m ultiplication, because our goal is to  focus solely on the memory dem ands imposed 
by the vectors. At each step, a new vector is generated, orthogonalized against previously 
generated vectors, normalized, and then  appended to  them . The code allows a  “restart 
size” m ax-basissize  to be specified: th a t is, once the basis has grown to m ax-basissize  vec­
tors, it discards all bu t m in-basissize  vectors from the basis and begins building a new set. 
R estarting  is commonly employed w ith GMRES and related solvers because as the basis 
grows, memory and com putational costs may become prohibitive. A rem edy is to  restart 
the algorithm , retaining the current approxim ate solution vector and discarding the basis
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 89
vectors. We note th a t MGS is the only one of our test codes whose memory requirem ents 
vary considerably throughout its lifetime, as the basis grows or is discarded. Figure 7.2 
depicts the algorithm  executed by the MGS code.
A lgorithm : M GS test code
for j = l  to min_basis_size do 
Vj =  random (N) 
enddo
for restart =  0 to  num _restarts do
for j =  min_basis_size to  max_basis_size do 
w  =  random (N)
for i =  1 to j do / /  The MGS orthogonalization 
h =  w  ■ Vj 
w =  w  — h ■ Vj 
enddo
v i+ 1  =  w / l l w ll2
enddo  
enddo
Figure 7.2: The algorithm executed by our MGS test code, which simulates the behavior of a 
GMRES-type solver, generating random vectors of dimension N which are added to an orthonormal 
basis via modified Gram-Schmidt. After the basis size grows to a set maximum, the basis is dis­
carded and the computation is “restarted”. To ensure that a minimum level of memory pressure is 
maintained, one can specify a minimum basis size, below which the size of the basis never drops.
7 .1 .3  M o n te -C a r lo  Isin g  m o d e l s im u la tio n  (ISING)
O ur th ird  application is a M onte-Carlo sim ulation of the two-dimensional Ising model [33, 
pp. 550-556], which is used to model m agnetism  in ferromagnetic m aterials, liquid-gas 
transitions, and other similar physical processes. We chose the Ising model for several 
reasons: Though rem arkably simple, it captures many aspects of phase-transitions, and 
after being introduced in 1928 it is still studied by physicists today. Com putationally, it is 
very sim ilar to a great variety of lattice “spin” models used to  study  phenom ena as diverse 
as molecular-scale fluid flow and order-disorder transitions in binary alloys [33, pp. 589-593]. 
And because laboratory-scale systems consist of a t least 1018 spins, people are interested in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 90
studying very large lattices, which puts memory at a premium.
The ISING code works w ith a two-dimensional lattice of atom s, each of which has a 
spin of either up (1) or down (-1). It uses the M etropolis algorithm  [33, pp. 566-568] to 
generate a series of configurations th a t represent therm al equilibrium. The memory accesses 
are based on a simple 5-stencil com putation common to many other scientific applications. 
Figure 7.3 presents a  pseudocode sum m ary of the operation of our ISING code. For each 
iteration the code sweeps through the lattice and each lattice site is tested to  determ ine 
whether its spin should flip. If the flip will cause a negative potential energy change, A E , 
the flip is autom atically accepted. Otherwise, the spin flips w ith a probability equal to 
exp , where k  is the Boltzm ann constant and T  is the am bient tem perature. Note 
th a t this means th a t the higher the tem perature, the more spins are flipped (equivalent 
to  a “m elting” of m agnetization or evaporation of liquids). In  com putational term s, T  
determines the frequency of writes to the lattice sites a t every iteration. The memory- 
adaptive version of the application partitions the lattice row-wise into panels to  simplify 
the panel access pattern , although other partitionings could also be easily implemented. 
To calculate the energy change at panel boundaries, the code needs the last row of the 
above neighboring panel and the first row of the below neighboring panel. Note th a t unlike 
CG, which performs no writes to  the panels, and MGS, which writes only when a vector is 
added to  the basis, ISING performs frequent writes when higher values of T  are used. Also, 
ISING is the only code th a t requires m ultiple panels to  be active simultaneously, so th a t 
interactions across panel boundaries can be computed.
We should note th a t the version of the ISING code we use here is w ritten  in a m anner 
th a t allows us to  avoid the problems w ith sub-panel access size shown in Figure 6.2 of 
the preceding chapter. This involves a bit of awkward bookkeeping to  ensure th a t we do 
not release panels before using all of them . The code can be w ritten  in a much more 
natural m anner (with very little im pact on performance) using the M M lib functions for 
fetching sub-panel sized units, bu t we choose to avoid th a t in our experim ents here to  avoid
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H A P T E R  7. E X P E R IM E N T A L  E V A L U A T IO N  OF M E M O R Y -A D A P T A T IO N  91 
c o m p lic a tin g  ou r a n a ly s is  o f  th e  p erform an ce  o f  M M l i b .
A lgorithm : M etropolis Ising m odel sweep
for row =  1 to L do 
for col =  1 to  L do
up =  spin[i-l, j]; down =  sp in [i+ l, j] 
left =  spin[i, j-1]; right =  spin[i, j + 1]
A E  =  2 ■ spin[i, j] • (left +  right +  up +  down) 
if  random () <  w[AI? +  8] then  
spin[i, j] =  -spin[i, j]
E  =  E  + A E  
M  = M  +  2 • spin[i,j] 
endif 
enddo  
enddo
Figure 7.3: The algorithm for executing a Metropolis sweep through the LxL spin lattice of the Ising 
model. Sites in the lattice possess either spin up (+1) or down (-1). Periodic boundary conditions 
are used to calculate the spins (up, down, left, right) of the four nearest neighbors. The array 
w[] is a lookup table of Boltzmann probability ratios; these ratios are dependent on the ambient 
temperature in the simulation. The total energy E  and the magnetization M  are scalar quantities 
that track some macroscopic observables of interest; they do not factor into the computations. Note 
that for each lattice site, we always generate a random number to determine whether the spin should 
flip. This could actually be avoided by automatically accepting a spin flip whenever AE  < 0, but 
by always generating the random number we ensure that the amount of computation is the same 
at any temperature. This allows us to ensure that performance differences observed at different 
temperatures are solely due to differences in frequency of writes to memory.
7.2 E xp erim en ta l en v iron m en ts and m eth o d o lo g y
T he experim ental results we present were obtained using two different Unix/Unix-like op­
erating systems, Linux 2.4 and Solaris 9, which bo th  employ global, LRU-like page replace­
ment algorithm s. Experim ents under Linux were performed using Linux 2.4.22-xfs on a 1.2 
GHz AMD D uron system  with a 30GB, 5400 rpm  IDE hard disk, 128 MB of RAM (some 
of which is shared w ith the video subsystem ), and a 128 KB LI and a 64 KB L2 cache. The 
Linux system  ran  the KDE graphical desktop environm ent during the experim ents. Results 
for Solaris were obtained using the typhoon  nodes of the SciClone cluster a t W illiam  & 
M ary (see Figure 3.2). The typhoon nodes are Sun U ltra 5 machines w ith a 333 MHz Ul-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 92
traSparc Hi processor, w ith 256 MB of DRAM, 2MB of external cache, and a 9.1 GB, 7200 
rpm  EIDE internal hard  disk. Access to the nodes occurs through a PBS batch system; no 
memory is consumed by running a graphical user-interface. We note th a t, by current stan ­
dards, the am ount of memory in bo th  our Linux and Solaris machines is small. However, 
running some experim ents on a Mac OS X 10.1 system  with 768 MB of RAM using 500 
MB m em ory-adaptive jobs yielded results qualitatively and quantitatively sim ilar to those 
obtained on our smaller Linux and Solaris systems, suggesting th a t the m em ory-adaptation 
strategy scales well to  larger memory sizes. The smaller available memories in the systems 
we use here have the benefit of allowing us to perform  experim ents under high levels of 
memory pressure in a  reasonable tim e frame.
We induce memory pressure in our experim ents in a variety of ways, bu t when we wish 
to  report statistics, we have mostly used our memlock code running on our Linux system  (on 
which we have the root access privileges required by the code), memlock uses the m lockO  
system  call to pin pages in-core, allowing us to precisely control the am ount of memory 
pressure experienced by an application. This level of control is im portan t because when 
memory resources are oversubscribed by multiple jobs, the behavior of the v irtual memory 
system  can be highly chaotic and unpredictable. For example, Figure 7.4 shows results 
from two runs of the same experim ent on a Linux 2.4 system  w ith 128 MB DRAM, in 
which a m em ory-adaptive CG job runs w ith a 70 MB m atrix  against a dum m y job th a t 
allocates 70 MB and repeatedly accesses random  pages. The dum m y job sta rts  first. In one 
instance the dum m y job keeps its entire working set and CG utilizes the rem aining memory; 
in the other, the system  eventually gives CG enough memory to  cache its entire working set. 
Both runs were conducted under identical settings, bu t our CG code experiences completely 
different levels of memory pressure in the two instances, although in bo th  cases overall CPU 
utilization rem ains high and m em ory-adaptive CG makes good progress. We have found no 
way to predict or control which case CG experiences, and this makes compiling meaningful 
performance statistics very difficult.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 93
Time per iteration versus iteration number
- x -  CG gets all of its memory
-Q - Competing job gets all of its memory
25
20
Q.
0 5 10 15 20 25
Iteration number
Figure 7.4: Two different modes observed under the same experimental settings. Execution times 
for each of the first 25 iterations of memory-adaptive CG show two possible modes of behavior: in 
the first mode, the external load keeps its entire working set while CG utilizes the remaining memory. 
In the second mode, the scenario is reversed. Resource utilization is high in either case.
Using memlock to  precisely control the am ount of memory available to  an application 
simplifies our statistical analysis, although even when we use memlock the variance in the 
d istribu tion  of the tim es to  complete an iteration or sweep through the panels is quite high. 
As figure 7.5 illustrates, these distributions tend to have a significant right skew, w ith some 
iterations taking a very long tim e to  complete. Because of this skewness, when we wish 
to  report confidence intervals for the average tim e per iteration for a program  under given 
conditions, we run  the program  under those conditions for 50 iterations and m easure the 
tim e for each iteration; those 50 m easurem ents comprise one sample. We repeat the run 
of 50 iterations m ultiple times to obtain  m ultiple samples, and then calculate the average 
iteration tim e as the arithm etic mean of the arithm etic means of the samples. The sample 
size of 50 is more th an  enough to  ensure norm ality of the d istribu tion  of sample means, 
allowing us to  utilize the Central Limit Theorem  and S tudent’s 1-distribution to compute 
95% confidence intervals.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N  94
Distribution of CG iteration times under memory pressure
0.14
0.12
0.1
c3Oo
T 3
0
N
0.08
0.06
0.04
0.02
r r i
6 10 12 14 16 18 202 4 8
Time for iteration in seconds
F ig u re  7.5: Relative frequencies of iteration times for memory-adaptive CG running on a Linux 2.4 
machine with 128 MB RAM against a memlock job consuming 70 MB. The distribution displays 
significant right-skewness, with iterations occasionally taking a very long time to complete.
7.3 C h oosin g  th e  M M lib  param eters
We have endeavored to  keep M M lib  as free as possible of param eters th a t m ust be tuned. 
The only quantities th a t m ust be specified are the panel size and the quantity  Rpen-max 
th a t lim its the delay between detection of memory shortage and probing for more memory. 
Additionally, one m ust choose between the two types of probing (low or high frequency) 
discussed in section 5.4.2. In this section we discuss the selection of appropriate values 
for these param eters and experim entally study the sensitivity of M M l ib  to  the choice of 
param eters. Under different workloads th an  those used in our experim ents, the optim al 
choice of param eters might differ. However, because our experim ents reveal M M lib  to 
be fairly insensitive to the choice of param eters, we th ink th a t following the guidelines we
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 95
present here will result in reasonable (i.e., not far from optim al) perform ance under a variety 
of workloads.
7 .3 .1  P a n e l s ize
The choice of panel size involves a trade-off: If a d a ta  set is broken into a large num ber 
of panels, M M lib  can more exactly tune the level of memory usage, and can cope w ith 
higher levels of memory pressure. On the other hand, using many panels results in higher 
bookkeeping and I /O  overheads. Figure 7.6 depicts the performance of mem ory-adaptive CG 
against different levels of memory pressure, using different values of P , the num ber of panels 
into which the 70 MB d a ta  set is broken. Figure 7.7 depicts a sim ilar performance curve for 
m em ory-adaptive ISING. C onstant memory pressure is applied via our memlock code, which 
runs as ro o t  and uses the m lockQ  system  call to lock a portion of memory in-core, disabling 
paging for th a t region. The performance curves for CG show th a t the choice of P  =  20 (3.5 
MB panels) provides the best performance. W ith  P  — 10, the overhead associated w ith 
managing panels is small, bu t overall performance is worse because of inability to  finely 
adapt the level of memory usage. Also notice th a t a t the highest level of memory pressure, 
the P  = 10 case is m arkedly worse because the panel size is too large to cope w ith such a 
small am ount of available memory. For P  = 40, performance worsens because of increased 
panel-management overheads.
The performance trends for ISING are similar, except th a t the P  = 20 curves and 
P  = 40 curves are essentially identical except under the highest level of memory pressure, 
where the larger panels of the P  — 20 case become a bit too large to  allow ISING to reduce 
its memory usage to  the optim al am ount. We believe th a t the ex tra  overhead required 
to  manage 40, ra ther th an  20 panels, is not noticeable in ISING because the am ount of 
com putation per sweep through the panels is relatively high: under no memory pressure, 
ISING requires about 4.2 seconds for a sweep through the panels, compared to about 0.4 
seconds for CG. Choosing too large a panel size (see the P  =  10 case) still results in a
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER  7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 96
performance hit because the coarseness of the panels lim its the ability to  tune memory 
usage to  the appropriate level.
The interaction of many factors, including characteristics of the machine, application, 
and workload determines w hat panel size is best for a given situation, and it is impossible 
to  conclude th a t a certain panel size will always be best. As a rule of thum b, however, we 
suggest using a value around P  — 20 (i.e. a panel size th a t is 5% of the m anaged da ta  set), 
unless a need to adapt to  very small levels of available memory necessitates more panels. 
In practice we have found P  = 20 to offer a good compromise between panel-management 
overheads and tunability  of memory usage. We note, however, a user need not agonize too 
much over the choice of panel size. The goal of M M lib  is to allow applications to perform  
better under memory pressure than  if they simply relied on the v irtual memory system, and 
as Figures 7.6b and 7.7b illustrate, w ith any reasonable panel size the performance with 
M M lib  is far superior.
Performance of CG-MEM with different panel si Performance of CG-MEM with different panel sizes
16
14
12|
|
10&£©o 8
IIs
I
6
4
2
0
10 20 so30 40 60 70 60
40
35
30
8
8
251a
o
o  20
I 15
&s
I 10
5
0
10 20 30 40 50 60 70 80
Size of locked region (MB) S z e  of locked region (MB)
(a) (b)
F igu re 7.6: Effect of panel size on performance of MMLiB-enabled CG. Chart (a) depicts perfor­
mance profiles for memory-adaptive CG for some different values of P, the number of panels into 
which the data set (the coefficient matrix) is broken. CG runs with a 70 MB matrix against static 
memory load provided by memlock on a Linux 2.4 machine with 128 MB of RAM. Chart (b) depicts 
the same profiles, but includes the performance profile of in-core CG for comparison. Low frequency 
probing with R pen.max =  10 is used.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 97
Performance of ISING-MEM with different panel sizes Performance of ISING-MEM with different panel si
P .  20
Size of locked region (MS)
(a)
40
35
30I
Iga
25
208
|
15
S
I 10
5
0
30 40 50 60 70 SO20
Size ol locked region (MB)
(b)
F ig u re  7.7: Effect of panel size on performance of MMLlB-enabled ISING. Chart (a) depicts perfor­
mance of memory-adaptive ISING for some different values of P, the number of panels into which the 
data set (the spin lattice) is broken. ISING runs with a 70 MB spin lattice against a static memory 
load provided by memlock on a Linux 2.4 machine with 128 MB of RAM. Chart (b) depicts the 
same profiles, but includes the performance profile of in-core ISING for comparison. The ambient
temperature in the simulation is T  — 2. Low frequency probing with R,■pen-max =  10 is used.
7 .3 .2  R pen-m ax
In section  5.4.2 we exp la ined  th at we m ust cap the m axim um  value th a t Rpen, th e  ratio  
used  in  determ in ing  th e  probing delay, can  assum e; otherw ise M M lib  m ay fail to  ever probe  
for more m em ory after d etectin g  a shortage. Here we discuss the choice o f  Rpen-max and  
M M lib ’s sen sitiv ity  to  its  value.
R'perumax in  essence specifies th e  largest am ount o f  tim e th at we are w illing  to  w ait to
probe for m ore m em ory after a shortage has been  detected: if  th e  con d ition  RSS =  dRSS 
is satisfied , we w ill w ait at m ost Rpen-max sw eeps through th e  M M LlB-m anaged d ata  set 
before probing. Rpen_max should  be large enough to  prevent significant perform ance loss 
due to probing too  aggressively for m emory, but should  not be so large as to  w aste surplus 
m em ory by n eg lectin g  to  try  to  use it. T h e challenge is to  in te lligen tly  pick a value th at  
balances th ese  requirem ents.
Fortunately, we have found in practice th at under sta tic  loads M M lib  becom es insen­
sitive to  th e  value o f  R'pen.max once it becom es sufficiently large, w hich happens at fairly
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 98
low values. Figure 7.8a depicts performance of m em ory-adaptive CG versus static  memory 
pressure, w ith different curves corresponding to  different values of R pen-max ■ Performance is 
clearly inferior for values of R pen-max below ten, bu t once R pen-max — 10, little performance 
is gained by increasing its value: the  performance curve for R pen-max =  100 is nearly iden­
tical, w ith its 95% confidence intervals overlapping w ith those of the Rpen-max = 10 curve. 
Figure 7.8b depicts curves sim ilar to  those in Figure 7.8a, bu t for m em ory-adaptive ISING. 
ISING is much less sensitive to  the value of R pen-max  than  is CG, perhaps because of the 
greater am ount of work per panel th a t it performs. It is clear th a t a value of Rpen-max =  10 
is sufficient for bo th  codes; in the experim ents we report here, we use a value of 10 unless 
otherwise specified.
Although we have discussed selection of an appropriate value for Rpen-max-, we have 
not yet said anything about the choice between the two types of probing (viz., m ethods 
of calculating R pen) described in section 5.4.2. O ur assertion th a t the value of R pen-m ax  
should be between 5 and 10 is valid in either case; though Figure 7.8 depicts results using 
low frequency probing, the curves are nearly identical if high frequency probing is used.
7 .3 .3  C h o ice  o f  p ro b in g  ty p e
The figures we have presented so far depict performance obtained using low frequency 
probing (R pen calculated by m ethod I described in section 5.4.2). Here we present some 
results th a t compare the two types of probing under some different circumstances.
7.3.3.1 S tatic m em ory pressure
Under static  memory pressure, we have seen th a t probing too frequently for surplus mem­
ory can incur significant performance penalties. Thus, one would expect high frequency 
probing to  yield worse performance th an  low frequency probing under sta tic  memory loads. 
Figure 7.9 compares performance between bo th  types of probing on a system  subjected to 
memory pressure from memlock. Figure 7.9a dem onstrates th a t our expectation is justi-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATION OF M EM O RY-AD APTATIO N 99
Performance of CG for different values of Rpen .m ax
7
6.5
6
co 5.5
IS
4) 5
O
O
4.5
0)
E
4)
o> 4
2?
4)
5 3.5
3
2.5
2
i i i i i i
_...............- .... ....:
.
-
Rpen_max = 3 ^ j
i- Rpen_max = 5 - -
Rpen_m ax-  10
Rpen max = 100 ^i i i i i n i
30 35 40 45 50 55
Size of locked region (MB)
60 65
(a) Effect of Rp on CG performance
Performance of ISING for different values of Rpen_max
13
12
11
10
9
8
7
6
Rpen_max = 3 
Rpen_max -  5 
Rpen_max - 1 0 0
5
30 35 40 45 50 55 60 65 70
Size of locked region (MB)
(b) Effect of Rp on ISING performance
F ig u re  7.8: Effects of Rpen.max on M M lib  performance. The charts depict performance profiles 
for a memory-adaptive job running with different values of Rpen_max against static memory pressure 
from memlock on a Linux 2.4 system with 128 MB of RAM. The job in the top chart (a) is CG; in 
the bottom chart (b) the job is ISING. Error bars represent 95% confidence intervals. In all cases, 
P  =  20 and low frequency probing is used. T =  2 in the Ising model simulation.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N  100
fied for m em ory-adaptive CG: The performance curves for the different probing types are 
qualitatively very similar, bu t the average tim e per iteration is consistently higher for high 
frequency probing. Somewhat surprisingly, however, when we compare the performance 
of the two types of probing in memory adaptive ISING (Figure 7.9b), we can discern no 
difference between the two m ethods. We suspect th a t the d isparity  between what we see 
for CG and ISING is due to  two key differences between the codes: F irst, ISING does sig­
nificantly more work per panel th an  CG. Second, CG has a significant am ount of “s ta tic” 
memory th a t is not managed by M M l ib , while ISING does not. More frequent probing 
may have a bigger effect on CG because it causes the static  memory to be paged out by the 
operating system; eviction (and subsequent reloading) of the sta tic  memory is less efficient 
th an  th a t of memory handled by M M l ib . Therefore, it seems safer in general to conclude 
th a t under constant, long-lived memory pressure, low frequency probing should be used, as 
high frequency probing offers no performance gains and sometimes (we believe most of the 
time) results in worse performance. E ither m ethod, however, offers far b e tte r performance 
th an  th a t obtained using conventional in-core codes.
7.3.3.2 H ighly variable m em ory pressure
High frequency probing was proposed w ith highly variable memory loads in mind, and here 
we compare the two probing types under such loads. Table 7.1 presents some experim ental 
results com paring the two in m em ory-adaptive CG on our Linux 2.4 system. Variable memory 
pressure is created using our s te p lo c k  code, which works like memlock bu t can increase and 
decrease the am ount of locked memory in discrete steps. We run  CG against a s te p lo c k  job 
th a t cycles ten  times through a sequence in which it locks m \  MB of RAM for t\  seconds, 
then  locks m 2 MB for £2 seconds. We choose the m  and t  values to  create short periods 
of intense memory usage, followed by longer periods of less intense usage. This is the 
type of load th a t low frequency probing should have trouble adjusting to: after dropping 
its memory usage during intense memory pressure, several iterations may be required to
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N  101
Effect of probing type on CG
1 1 
low frequency probing 
high frequency probing
— • - ■ r ........... .......t  i i
-
■ > ,
/
. ................ ~
>. ■"
,  ' i - " "
X
. X  
•' X  
. . -X '
1 1 i i i i
-
10 20 30 40 50 60 70 
Size of locked region (MB)
80
(a) Effect of probing type on CG performance
Effect of probing type on ISING
i
low frequency probing i 
high frequency probing -
i “ i i
-
t
-
...J '" ''
- -
-
s ? ' '  i
-
/ X
X  ,  1
x "
X .  '
.. X
x '..
i i i
-
30 40 50 60 70 
Size of locked region (MB)
80
10
(3O
© 6
£  4
<
13
12
(b) Effect o f probing type on ISING performance
Figure 7.9: Comparison of performance using the two methods of calculating R pen. Memory- 
adaptive jobs run on a Linux 2.4 system with 128 MB RAM against static memory pressure from 
memlock. Rpen-max =  10 and P = 20 in all cases. Chart (a) depicts performance of memory adaptive 
CG, while chart (b) depicts performance of ISING running with T = 2. Error bars represent 95% 
confidence intervals. Low-frequency probing is clearly superior for CG, but the two types of probing 
are indistinguishable for ISING.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 102
increase its num ber of cached panels to  take full advantage of the memory available during 
a period of lesser memory pressure.
Somewhat surprisingly, Table 7.1 shows th a t low frequency probing delivers equivalent 
or better performance th an  high frequency probing in our experiments. As Figure 7.10 
illustrates, high frequency probing does tend to  take be tte r advantage of jum ps in memory 
availability, bu t it also pushes memory usage too high, prom pting much more page reclam a­
tion by the operating system  and negating any performance benefits gained by the higher 
memory utilization. W ith  low frequency probing, CG spends less tim e fighting w ith the sys­
tem  for memory, and the lower memory utilization makes little performance difference in 
practice because the buffer cache speeds up panel “misses” by caching some of those panels 
th a t M M l ib  does not keep m apped.
W hile we hesitate to  categorically sta te  th a t high frequency probing should never be 
used, it has become clear to  us th a t w ith the systems and workloads used in our experiments, 
low frequency probing is preferable, as it seems to provide performance equal to  or be tter 
th an  th a t of high frequency probing. In some other situations, the quick response to  memory 
availability th a t high frequency probing offers might be beneficial. For example, under older 
versions of Solaris (unpatched pre-2.7 versions), the file cache could interfere w ith running 
applications and cause them  to thrash; to prevent such problems, file caching might be 
disabled on some file systems. If panels reside on a disk w ithout file caching, then  quick 
adjustm ent to  memory surplus becomes considerably more im portant. This is an unusual 
case, however, and we recommend th a t low frequency probing be used in most cases.
7.4 G racefu l d egrad ation  o f perform ance
Our goal is to allow applications to degrade their performance gracefully as memory re­
sources become scarce. Figure 7.11 depicts the performance degradation of each of our 
three applications under increasing levels of memory pressure. We show one graph per 
application; each showing the performance of three versions of th a t application under dif-
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 103
Type m i ti m 2 t2 Avg. its 95% low 95% high stddev
high 70 5 0 5 19.7 17.4 22.0 3.3
low 70 5 0 5 20.2 17.0 23.4 4.4
high 70 5 30 25 59.3 55.7 62.9 5.1
low 70 5 30 25 74.1 72.0 76.2 2.9
high 70 5 50 25 46.0 39.4 52.6 9.4
low 70 5 50 25 60.6 58.1 63.1 3.5
Table 7.1: Comparison of the two types of probing under highly variable memory pressure. 
Memory-adaptive CG runs against a step lock  job that locks m\ MB of RAM for t\ seconds, and 
then locks m 2 MB of RAM for £2 seconds, repeating the cycle a total of ten times. CG runs with a 70 
MB matrix on a Linux 2.4 system with 128 MB RAM; P  — 20 and R pen.max = 10- We report the 
average number of iterations that CG is able to complete during the time that s tep lock  runs, as well 
as the upper and lower 95% confidence intervals for that average, and the standard deviation. In the 
“type” column, “high” denotes high frequency probing, and “low” denotes low frequency probing. 
Low frequency probing equals or out-performs high frequency probing in all cases.
ferent levels of constant memory pressure applied via memlock. For each application we test 
a conventional in-core version (blue top curve), a m em ory-adaptive version using M M lib 
w ith  low frequency probing (red lower curve), and an ideal version of the m em ory-adaptive 
code th a t fixes the num ber of panels cached at an optim al value provided by an oracle. For 
all three applications, the m em ory-adaptive version performs significantly and consistently 
better th an  the in-core im plem entation. Additionally, the performance of the adaptive code 
is very close to the best-case performance, even though the adaptive code has no foreknowl­
edge of memory availability.
7 .4 .1  E ffects  o f  p a n e l w r ite -fre q u en cy
One might question whether M M lib  becomes ineffectual when applications write to  the 
panels frequently, as many d irty  pages m ust be flushed to disk before a panel can be un­
m apped. In the case of CG the panels are never updated  and thus never need their contents 
flushed. MGS does write to  panels, bu t infrequently, doing so only when an orthonorm al­
ized vector is added to the basis. ISING, however, updates the panels with each sweep 
through them , and it does so frequently if the ambient tem perature T  of the sim ulation 
is high. To determ ine if frequent writes to the panels negatively affect performance, we
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N  104
55000 
50000 
45000 
40000
S'
*  35000
a>
N
$ 30000
c
S 25000 
a>oc
20000 
15000 
10000 
5000
Figure 7.10: RSS versus time for the two types of probing in memory adaptive CG under highly 
variable memory load. The plot depicts behavior from the (mi, ti, m2, £2) =  (70,5,30,25) case from 
Table 7.1. Although high frequency probing is able to utilize higher amounts of memory, it also 
prompts more page reclamation by the operating system.
tested MMLlB-enabled ISING under constant memory pressure for different values of T. 
Figure 7.12 displays performance curves generated under static memory pressure on Linux 
2.4 for tem peratures T  = 0, T  =  2, and T  = 50. At T  = 0, the sim ulation quickly reaches 
equilibrium  after a few sweeps through the m atrix, and afterwards never flips any spins. At 
the other extreme, T  =  50, the lattice is in a highly disordered state , w ith an average of 
97.0% of the spins flipping during one sweep through the panels. At T  = 2, a more modest 
16.2% of spins are flipped during a sweep through the panels. The ISING code performs 
the same am ount of CPU  work for each sweep, regardless of the sim ulation tem perature. 
However, we see th a t as memory pressure increases, ISING at T  =  0 performs much better 
th an  in the T  =  2 or T  =  50 cases. This confirms th a t, not surprisingly, frequent writes
RSS vs. time under highly variable memory load
t--------------------1------------------- 1 1 1 r
low frequency probing 
high frequency probing
__________________ 1__________________ 1__________________ 1__________________ 1__________________ 1__________________ 1__
0 50 100 150 200 250 300
Time elapsed (seconds)
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 105
Average CG iteration time vs. memory pressure
.9 40
CO
0  35
8  30
c l  25 
v>
1 20o
<£, 15o>
.1 10
20 30 40 50 60 70 80
Size of locked region (MB)
Memory-adaptive
panels in fixed at optimal value
* In-core
*................* -
.... • ...■ ;
Average MGS time for last 10 vectors vs. memory pressure
Memory-adaptive
p a n e ls jn  fixed at optimal value
* In-core
10 20 30 40 50 60 70 80
Size of locked region (MB)
40
35
30
25
20
15
10
5
0
20
Average Ising sw eep time vs. memory pressure
' Memory-adaptive
p a n e ls jn  fixed at optimal value 
In-core
30 40 50 60 70
Size of locked region (MB)
80
F ig u re  7.11: Performance under constant memory pressure. The charts show the average time 
per iteration of a memory-adaptive job running against static memory pressure from memlock on 
a Linux 2.4 node with 128 MB RAM. Low frequency probing with Rpen =  10 is used in all cases. 
The top chart shows the average time per iteration of CG with a 70 MB matrix (P  =  20), which 
requires a total of 81 MB of RAM including memory for the work vectors. The middle chart shows 
the time to generate and orthogonalize via modified Gram-Schmidt the last 10 vectors of a 30 vector 
set. Approximately 80 MB are required to store all 30 vectors, which are stored one vector to a 
panel (P  =  30). The bottom  chart shows the time required for an Ising code running at T  =  2 to 
sweep through a 70 MB lattice (P  =  20).
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 106
to  the panels do increase execution times, as there is no avoiding flushing dirty  pages to 
disk when panels are unm apped. Note, however, th a t even for extrem ely frequent writes, 
graceful performance degradation is observed.
One may notice th a t despite the much higher frequency of writes in the T  =  50 case, 
the observed performance is essentially the same as in the T  = 2 case. This makes sense 
because although only around 16% of the flips are accepted in the T  = 2 case, those flips 
are d istribu ted  widely throughout the spin lattice. Because one page can contain over a 
thousand spins, it is likely th a t w ith 16% acceptance, almost every page will be updated  
and therefore m ust be flushed to disk. To the memory subsystem, there is essentially no 
difference between 16% and 97% acceptance.
The im pact of frequent writes on performance explains, at least partially, why in Figure 
7.11, M M lib seems to  confer less benefit to ISING th an  to  CG or MGS in the sense th a t lower 
speedups over the in-core version are observed. The MMLIB-enabled CG and MGS spend very 
little tim e writing to disk, which gives them  an  advantage over the in-core versions which 
m ust write to the swap device. M emory-adaptive ISING on the o ther hand, running at 
T  =  2, m ust devote considerable tim e to  such writes, so it loses some of its advantage 
over the in-core code. (We could make m em ory-adaptive ISING show better performance 
in Figure 7.11 by running at T  =  0, bu t this tem perature  is of no scientific interest, so we 
use T  =  2, a tem perature  th a t physicists might actually wish to simulate.)
7.5 Q uick resp on se to  tran sien t m em ory pressure
Our goal is for MMLIB-enabled applications to  not only exhibit graceful performance degra­
dation under memory pressure, bu t to also respond quickly to changes in memory availabil­
ity. Figure 7.13 illustrates the quick response of our m em ory-adaptive ISING code (using low 
frequency probing) to  transient memory pressure. Depicted is the resident set size versus 
tim e for ISING running w ith a 150 MB lattice on a Solaris 9 system  w ith 256 MB of RAM. 
After running for 120 seconds, a dum my job th a t random ly writes to a 100 MB region of
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N  107
16
14
VJ
O J
I  10 
, 2  
<D
E
o><
6
4
20 30 40 50 60 70 80
Size of locked region (MB)
F igure 7.12: Effects of write frequency on M M l i b  performance in ISING. ISING runs with a 70 
MB spin lattice against static memory pressure applied via memlock on a Linux 2.4 system with 
128 MB RAM. P  =  20, R pen.max = 10, and low frequency probing is used. Performance curves are 
shown for temperatures T  = 0, T  =  2, and T  =  50, which correspond to acceptance probabilities of 
0, 16.2, and 97.0 percent. Error bars represent 95% confidence intervals. Performance is markedly 
better in the T  — 0 case because no time is spent flushing panels to disk.
memory begins and runs for 30 seconds. As soon as the dum my job introduces memory 
pressure, M M lib responds by quickly dropping the ISING memory usage to a safe level; 
such quick response is crucial to avoid thrashing the system. Soon after the dum m y job 
exits, M M lib begins to  adjust the memory usage back upwards, decreasing the frequency 
of probes for memory as more memory is obtained. A lthough it takes considerable tim e to 
finally reach the full memory usage, RSS does grow significantly soon after memory pressure 
disappears. We note th a t in practical term s the long tim e to reach full memory usage has 
little  effect, as the buffer cache of the system  caches many of the panels th a t M M lib  does 
not—hence the fast iteration times observed before full memory utilization is resumed.
Effect of write frequency on MMLIB-ISING perform ance
1 1 
i— i— i T = 0 (0% accep tance)
1 '  T = 2 (16% accep tance) 
T = 50 (97% accep tance)
---  1 1 1
-
-
T ____ ____
_________ _— -I----- ----
C ~— " " "
I I 1 1 1
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 108
150 MB ISING-MEM vs. transient (30 sec) 100 MB dummy job
160000
140000
120000
m 100000
ID
N
01
80000a>0)
c<1)-g<n<Dtr 60000
40000
20000
0 100 300 400200 500 600
Time elapsed (seconds)
Figure 7.13: Quick response by M M l i b  to transient memory pressure. Memory-adaptive ISING 
runs at T  = 2 with a 150 MB lattice on a Solaris 9 system with 256 MB of RAM. After 12 seconds, a 
dummy job that randomly writes to a 100 MB region of RAM begins and runs for 30 seconds. M M l i b  
quickly drops memory usage when the dummy job begins, and then increases memory usage in a 
controlled manner after the job exits. P  = 40, R pen-max = 10 with low frequency probing. Circles 
denote the beginning of a lattice sweep. (One might notice that the second sweep completes very 
quickly: only a simple calculation of the initial energy of the spin system is computed during that 
sweep.)
7.6 A d a p tiv e  versus ad ap tive  job s
For our memory adap tation  technique to  be actually useful, m ultiple instances of jobs em­
ploying it m ust be able to coexist on a machine w ithout causing thrashing on, monopolizing, 
or otherwise d isrupting the system. Our research has shown th a t m em ory-adaptive jobs run 
harm oniously w ith each other, reaching a dynamic equilibrium  where system  memory u ti­
lization is high and good CPU throughput is sustained.
Figure 7.14 depicts the resident set size (RSS) over tim e for various configurations of 
two m em ory-adaptive ISING jobs running sim ultaneously on a Solaris 9 machine w ith 256
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 109
MB of RAM. We show results from three configurations, all of which have a to ta l memory 
requirem ent of 300 MB: a 150 MB versus a 150 MB job, a 200 MB versus a 100 MB job, 
and a 100 MB versus a 200 MB job. In each case, the second job begins 100 seconds after 
the first job has started , in order to  allow the first job tim e to touch all of its panels first.
In all cases, the com peting jobs quickly reach a sort of dynamic equilibrium  in which the 
to ta l memory usage of the two jobs stays in the region of 150-200 MB. As one job probes for 
more memory, the other job drops its memory usage by a roughly corresponding am ount. 
In  all cases, CPU  utilization is high (80% or better is typical) and most of the available 
memory is used: although there are 256 MB available on a node, the param eter settings for 
the Solaris 9 operating system  are somewhat conservative and limit the to ta l am ount of file 
cache memory to  about 200 MB. Because any nam ed memory m apped regions occupy space 
in the file cache, when the to ta l utilization of our two jobs is a t 200 MB, they are using all 
memory available to them  on the system. (Note th a t our experim ents on the Solaris nodes 
also indicate th a t they will only allow about 205 MB of anonymous memory to  be utilized 
by user-level processes.)
The results in Figure 7.14 were obtained using low frequency probing. We have also run  
the adaptive-versus-adaptive experim ents using high frequency probing, to see if they cor­
roborate our results from section 7.3.3. Memory pressure arising from com peting memory- 
adaptive jobs is very different from pressure created by our dum my jobs or by conventional, 
in-core codes, and it seems worthwhile to investigate the effects of the more aggressive 
probing type in this situation. Figure 7.15 depicts RSS versus tim e using high frequency 
probing for the three test configurations used in Figure 7.14. Com paring the figures for 
the two probing types, one can see th a t the more aggressive probing in the high frequency 
case results in much more frequent fluctuations in the RSS of the jobs, as expected. In all 
cases, using high frequency probing results in somewhat higher overall memory utilization 
on the node. This is most noticeable in the 100 MB versus 200 MB case, where using high 
frequency probing results in the second, 200 MB job receiving a fairer share of memory
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER  7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 110
compared to  the low frequency case. However, the higher memory usage when using high 
frequency probing seems to be offset by the increased give-and-take between the com peting 
jobs: In the 100 versus 200 MB experiments, while the 100 MB job runs, the 200 MB job 
averages 131.79 seconds per sweep (standard  deviation 10.78 seconds) w ith low frequency 
probing, compared to  134.53 seconds per sweep (standard  deviation 10.63 seconds) using 
high frequency probing. This is despite the fact th a t the 200 MB job consistently gets 20-30 
MB more memory in the high frequency case. In the 200 versus 100 MB experim ents, for 
bo th  types of probing the 200 MB job completes 99 iterations before the 100 MB job ends. 
Looking a t the tim e th a t th is occurs shows th a t clearly the high frequency jobs fare worse. 
Similarly, in the 150 MB versus 150 MB experiments, the jobs employing low frequency 
probing finish roughly 2500 seconds before the high frequency jobs, even though all of the 
jobs run  for 240 iterations. Clearly, in our adaptive-versus-adaptive experim ents, low fre­
quency probing results in be tte r performance, ju st as in the experim ents in section 7.3.3. 
High frequency probing does provide much quicker response to memory surplus, bu t this is 
offset by the increased give-and-take between the com peting jobs.
7.7  P erform ance w ith  L R U -frien d ly  access p a ttern s
All three of the applications we have tested employ memory access pa tterns for which 
MRU-replacement is most appropriate, and are therefore at extrem e odds w ith the LRU- 
like algorithm s employed by the operating systems. The results we have presented so far 
show th a t the MMLIB-enabled codes yield superior performance compared to conventional 
in-core versions th a t rely on the v irtual memory system  to  handle memory pressure. The 
question remains, however, whether these performance gains are due solely to M M l ib  
enabling MRU replacem ent to  be used, or are a ttribu tab le  to  o ther characteristics of the 
adaptation  strategy, such as avoidance of writes to the swap device or pre-fetching of panels.
To answer this question, we have developed a version of the CG code th a t utilizes an 
LRU-friendly access pattern . Normally, CG performs m atrix-vector m ultiplications starting
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 111
ISING- 160 MB MB
160000
First job 
Second fob
I
§1
ISING-M04: 200 MB versus 100 MB
(a) 150 MB vs. 150 MB, Rpen type I
10000 15000 20000 25000
Time elapsed {seconds)
180000 
160000 
140000 
120000 
100000 
80000 
60000 
40000 ■
" W S S i .
uL
0 2000 4000 6000 6000 10000 12000 14000
Time elapsed (seconds)
(b) 200 MB vs. 100 MB, Rpen type I
ISING-MEM: 100 MB versus 200 MB
200000 r  
160000 • 
160000 - 
140000 - 
120000 -
0  2000 4000 6000 6000 10000 12000 14000
Time elapsed (seconds)
(c) 100 MB vs. 200 MB, Rpen type I
Figure 7.14: Profiles of RSS vs. time for two instances of memory adaptive ISING jobs running 
simultaneously, using low frequency probing. In each experiment, the second job starts 100 seconds 
after the first job. Chart (a) depicts the 150 MB vs. 150 MB case, chart (b) 200MB vs. 100MB, and 
chart (c) 100 MB vs. 200 MB. Circles denote the beginning of lattice sweeps: horizontal distance 
between consecutive circles along a curve indicate the time required to complete the sweep. In all 
cases, the simulated temperature is T  =  2, P  =  20, and Rven-max =  10.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 112
ISING-MEM: 150 MB versus 150 MB
160000
First job 
Second job
ISING-MEM: 200 MB versus 100 MB
160000
160000
140000
S~  t 20000 
a
j j  100000 
|  60000 
60000 
40000 
20000
10000 15000 20000 25000
Time elapsed (seconds)
2000 4000 6000 6000 10000 12000 14000
Time elapsed (seconds)
(a) 150 MB vs. 150 MB, Rven type II (b) 200 MB vs. 100 MB, Rpen type II
ISING-MEM: 100 MB versus 200 MB
I
1
ML
0 2000 4000 6000
Time elapsed (seconds)
8000 12000 1400010000
(c) 100 MB vs. 200 MB, Rpen type II
Figure 7.15: Profiles of RSS vs. time for two instances of memory adaptive ISING jobs running 
simultaneously, using high frequency probing. The charts depict the same experimental cases as 
shown in Figure 7.14, but using high frequency probing instead of low frequency probing. Chart
(a) depicts the 150 MB vs. 150 MB case, chart (b) 200MB vs. 100MB, and chart (c) 100 MB vs. 
200 MB. Circles denote the beginning of lattice sweeps: horizontal distance between consecutive 
circles along a curve indicate the time required to complete the sweep. In all cases, the simulated 
temperature is T  =  2, P  =  20, and Rpen-max =  10.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
C H APTER  7. EXPERIM ENTAL EVALUATIO N OF M EM O RY-AD APTATIO N 113
with the first row of the m atrix  and proceeding down the rows until the last row is processed. 
T ha t is, a t each CG iteration, a  sweep from the top to  the bo ttom  of the m atrix  (and, for 
the m em ory-adaptive case, from the first to the last panel) is performed. We can make CG 
LRU-friendly, however, by employing an alternating  access pattern : during one iteration, 
sweep through the m atrix  from top to  bottom , and then  during the next iteration, sweep 
from bo ttom  to top. In our LRU-friendly im plem entations, CG-LRU, the conventional in- 
core code performs the  backsweeps row by row, while the MMLlB-enabled version sweeps 
backwards through its set of panels, bu t processes the rows of each panel in the usual top to 
bottom  fashion. In this m anner the MMLIB version utilizes an LRU-friendly access pa tte rn  
while still taking advantage of the pre-fetching th a t the large units (panels) of d a ta  access 
enable.
Figure 7.16 compares the in-core and MMLlB-enabled versions of CG-LRU w ith the in- 
core CG th a t uses the conventional, M RU-friendly access pattern . The MMLIB-enabled 
CG-LRU performs well (as expected) under memory pressure. B ut somewhat surprisingly, 
the in-core CG-LRU performs only m arginally be tte r th an  conventional in-core CG, despite the 
improved access pa tte rn  used by CG-LRU. Because the activity th a t occurs during thrashing 
can be quite complicated, it is difficult to  say why CG-LRU fails to  exhibit significant perfor­
mance improvements. O f course, in-core CG-LRU does not avoid VM -system overheads and 
write-backs to the swap device, and it cannot take advantage of pre-fetching from backing 
store like MMLIB-enabled codes can, bu t it seems unlikely th a t these factors account en­
tirely for the poor performance we see. It may be th a t CG-LRU loses significant tim e due 
to  the work vectors being paged out: W hile CG-LRU uses an LRU-friendly p a tte rn  to  access 
the m atrix, this m odification does not extend to  the work vectors. The MMLIB-enabled 
CG codes can protect the work vectors from page reclam ation by controlling the num ber of 
panels cached, while the in-core CG-LRU cannot.
To further isolate the effects of using optim al replacem ent from the other improvements 
th a t M M lib affords, we conducted experim ents w ith m em ory-adaptive CG in which we used
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 114
Co
03
o
40 -
35 -
30 -
o  25 hO
<D
E
20
O) 1*;co lo
0
1
10 -
5 -
10
Performance of CG with LRU-friendly access pattern under memory pressure
20
— T "
30 40 50
Size of locked region (MB)
-  + i CG-MEM-LRU 
- CG-INCORE-LRU 
CG-INCORE-MRLJ
60 70 80
F ig u re  7.16: Performance of in-core and memory adaptive versions of CG-LRU, which uses an LRU- 
friendly access pattern. Jobs run under static memory pressure provided by memlock on a Linux 2.4 
system with 128 MB of RAM. The performance profile for in-core, MRU-friendly CG is also depicted 
for comparison. P  =  20 and low frequency probing is used with Rpen-max =  10. Error bars represent 
95% confidence intervals.
the normal, M RU-appropriate access pattern , bu t we instructed  M M lib  to use LRU re­
placement. This ensures th a t under any significant memory pressure, all mmlib_get-panel() 
calls will result in a panel miss, so any benefits conferred by M M lib  will not be due to 
the replacement strategy. Figure 7.17 compares the CG code using the wrong replacem ent 
policy (CG-LRU-WR0NG) with the CG correctly employing MRU. The version using LRU re­
placement performs significantly worse, requiring roughly twice the am ount of tim e required 
by the MRU version to  perform  one iteration. However, when compared to  the performance 
of in-core CG under memory pressure, the CG-LRU-WR0NG code still performs iterations in 
roughly half the tim e of in-core CG at lower levels of memory pressure, and at higher levels 
performs even better.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 7. EXPERIM ENTAL EVALU ATIO N OF M EM O RY-AD APTATIO N 115
Performance of MMLIB-CG using wrong access pattern
40
MMLIB-CG-LRU-wrong — i— 
MMLIB-CG-MRU-correct 
CG-incore
o0)w
co
<5k_
£
CD
°  20
o
Q)
E
<D
———K
10 20 30 40 50 60 70 80
Size of locked region (MB)
Figure 7.17: Performance of memory-adaptive CG versus memory pressure when using the wrong 
panel replacement policy. The appropriate replacement policy is MRU, but LRU is used instead; 
any observed performance gains are not due to the ability of M M lib to allow application-specific 
replacement. Jobs run under static memory pressure provided by memlock on a Linux 2.4 system 
with 128 MB of RAM. P  =  20, and low frequency probing is used with R pen.max = 10.
All of the experim ents w ith LRU replacem ent dem onstrate th a t the value of M M lib  
does not lie solely in its ability to allow use of application-specific replacem ent policies. The 
larger granularity of memory access and preem ption of the virtual memory system  and its 
associated overheads and possible antagonism  also play a vital role.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Chapter 8
Conclusions and future work
We contend th a t in order to  effectively utilize m odern, shared com puting environments, 
applications should m onitor and adap t to changing system conditions. This research has 
pursued two distinct bu t complem entary mechanisms to  enable some classes of scientific 
applications to  adapt to variations in CPU and memory availability.
8.1 F lex ib le-p h ase  load balancing
The first mechanism we studied is a strategy for dynamic load balancing of a class of itera­
tive algorithm s th a t possess what we term  a flexible phase, during which the am ount of local 
work completed can vary w ithout detrim ent to  the end result. The technique can dynam i­
cally correct load imbalance from bo th  internal and external sources and requires minimal 
overhead. A lthough lim ited to a narrow class of algorithm s, this class includes some popular 
num erical algorithm s th a t form the com putational kernels of many scientific applications. 
We have experim entally dem onstrated the u tility  of our m ethod in two Krylov-like methods: 
a  m ulti-grain form ulation of a block Jacobi-Davidson eigensolver and an FGM RES linear 
solver employing additive-Schwarz preconditioning. O ur m ethod has proven able to  cope 
with severe load imbalance in the Jacobi-Davidson eigensolver, and it makes practical the 
full-m ultigrain parallel form ulation th a t can hide network latencies bu t may accentuate load 
imbalance. In the additive-Schwarz preconditioned FGM RES solver, perform ance gains are 
significant bu t more m odest due to some intrinsic properties of the num erical m ethod. We
116
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 8. CONCLUSIONS AND FU TURE W O RK 117
also note th a t the m ethod can sm ooth load imbalances due to  differences in conditioning of 
the subdom ains, which cannot be fixed by repartitioning.
Our load balancing strategy is able to  rectify large load imbalances in the m ulti-grain, 
block eigensolver we tested, bu t is more lim ited in the dom ain-decom position linear system 
solver. In future work, it may be worthwhile to  see if we can achieve b e tte r results w ith a 
linear solver by working with a m ulti-grain block Krylov-type solver, such as block FGM ­
RES. Additionally, our strategy may show b e tte r performance in a dom ain-decom position 
context involving coarser subproblems, such as an inexact Newton m ethod employing non­
linear additive Schwarz preconditioning [17, 16]. In such a m ethod, a nonlinear problem  
is solved on each subdom ain, involving much more com putation than  the simple, linear 
subdom ain solves of the additive-Schwarz preconditioner for FGM RES.
8.2 A  d yn am ic m em ory-ad ap tation  fram ew ork
The second idea we have explored is a general framework th a t allows scientific and other 
data-intensive applications to  dynam ically adap t their memory requirem ents, facilitating 
graceful degradation of performance as memory pressure arises. It does so w ith m inim al 
reliance on operating system  provided information, and can be embedded in many scientific 
kernels w ith little  coding effort using the supporting library, M M l ib , we have developed. 
The key technical hurdle in realizing our framework is the lack of inform ation about memory 
availability provided by operating systems. We have developed a relatively simple algorithm  
th a t judges memory availability effectively using a single metric, the resident set size of 
the application. The m em ory-adaptation framework is especially suited to  non-centrally 
adm inistered, open systems, and supports the following functionality:
• m anagement of m ultiple memory objects with multiple active panels a t the same tim e
•  memory objects th a t can be modified or vary in size
• access to  memory objects a t user-defined granularities
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 8. CONCLUSIONS AND FU TURE W O RK 118
•  autom atic estim ation of the size of memory not managed w ithin the framework
• user-defined panel replacem ent policies, which can optionally be coordinated w ith the 
system  policy through a “most-missing” alternative.
We have used our M M lib  library to embed memory adaptiv ity  into three scientific ap­
plications: a conjugate gradient linear system  solver, a modified Gram -Schm idt orthogonal- 
ization routine, and a M onte-Carlo Ising model sim ulation. These applications or kernels, 
though simple, are representative of elements of a wide range of scientific and engineer­
ing codes. Experim ental evaluation of the applications in Linux and Solaris environm ents 
dem onstrate th a t the m em ory-adaptation framework yields considerable performance ad­
vantages over reliance on the v irtual memory system  to handle memory pressure. These 
advantages stem  not only from enabling the use of application-specific replacem ent policies, 
bu t from avoidance of v irtual memory system  overheads and antagonism. Im portantly, mul­
tiple jobs employing the m em ory-adaptation framework can coexist on a  system  w ithout 
thrashing, monopolizing, or otherwise d isrupting the system.
We believe th a t our results have dem onstrated the u tility  of our m em ory-adaptation 
framework, bu t many topics rem ain for future work. We have briefly investigated the use 
of a “most-missing” replacem ent policy th a t is coordinated w ith the page reclam ation of 
the operating system. Our results using this policy to evict panels th a t have been mostly 
evicted by the operating system  show promise, bu t there are some questions th a t m ust 
be answered: We suspect th a t we should always circumvent the application specific policy 
during memory shortage, bu t we m ust verify th a t this is the case or whether it is sometimes 
appropriate to strike a balance between the two policies. If a  balance should be struck, we 
m ust determ ine the thresholds th a t d ictate when the MRU policy should be followed for a 
page th a t is targeted by the most-missing policy.
We have designed M M lib  to  require as little operating system  provided inform ation as 
possible, in part to  facilitate portability. We have tested M M lib  extensively under Linux 
2.4 and Solaris 9, and to a lesser extent under Linux 2.2 and Solaris 7 and 8. We have also
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
CH APTER 8. CONCLUSIONS AND FU TURE W O RK 119
tested it cursorily under MacOS X  10.1. It is im portant to evaluate M M lib  under other 
operating systems, whose memory-management strategies may differ. For instance, we have 
not evaluated M M lib  under systems th a t employ a page-fault frequency (PFF) strategy 
for thrashing prevention.
We have dem onstrated w ith M M lib  th a t it is possible to hide m em ory-adaptation deci­
sions and bookkeeping behind a simple library interface. The current version of the library 
is still somewhat rudim entary, however, and work should be done to  further refine it. For 
example, more functions im plem enting common panel replacem ent policies should be added, 
and interfaces to facilitate m apping more complex d a ta  structures to  panels should be pro­
vided. Developing a more “production ready” version of M M l ib  will make easier what is 
perhaps the most im portan t work to  be pursued: using M M lib  in more types of and more 
complex applications. Our eventual goal is to use M M l ib  to implem ent memory adaptiv ity  
in a widely used scientific toolkit such as PETSc. By providing features such as a memory- 
adaptive sparse m atrix  class th a t supports the standard  PETSc m atrix  m ethods, we can 
enable scientists to embed, w ith minimal effort, m em ory-adaptivity into their existing appli­
cation codes. Such ease of use is vital if we wish to  see adoption of our m em ory-adaptivity 
techniques in real-world applications.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Appendix A
LBlib: A flexible-phase load 
balancing library
To facilitate general use of our load balancing strategy, we have w ritten  an object-based 
C library, L B l ib , th a t hides much of the required bookkeeping from the application pro­
grammer. To simplify da ta  m anagement and provide inform ation hiding, d a ta  required 
for resource balancing are stored in a variable of the defined type LBS. An application 
program m er can access the d a ta  w ithin an LBS structu re  through L B lib functions. D ata 
encapsulation is ensured by appropriate use of v o id  * pointers in the internal implem en­
tation. A unique LBS structu re  is explicitly associated with a group of processors, and 
implicitly w ith a particular flexible section th a t these processors m ust load balance. Below, 
a subset of the basic functionality of the library is described:
LBS lb _ n ew _ lb s tru c t (MPI_Comm com m unicator)
This is the LBS constructor function, where com m unicator is the M PI com m unicator w ith 
which the LBS is to  be associated. Note th a t a node of an M PI job could participate in the 
load balancing of nested, bu t different, flexible sections.
v o id  lb _ se c tio n _ s ta r t(L B S  lb s )
double  lb_section_end(L B S l b s ,  doub le  ops_com pleted)
These functions designate the beginning and end of the flexible section associated with
120
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPEND IX A. L B L IB : A  FLEXIBLE-PH ASE LOAD BALANCING  L IB R A R Y 121
the lb s  structure, which is to  be load balanced. lb _ s e c t io n _ s ta r t () begins tim ing the 
execution of the section and, if memory balancing is used, it also begins tracking the am ount 
of page swapping and CPU idling. Once all operations w ithin the flexible section are 
completed, lb _ se c tio n _ e n d ()  stops the tim ing and calculates the rate  a t which the local 
processor performed operations since lb _ s e c t io n _ s ta r t  ( ) . This rate  is returned  by the 
function for convenience, though it is also stored in the lb s . ops_com pleted is the num ber 
of operations th a t were completed during the flexible section. It is up to  the application 
program m er to  decide on a suitable way of counting the num ber of operations. In  our 
eigensolver application, we use the num ber of iterations performed by the linear system 
solver during the correction phase.
double  lb_decide(LB S l b s ,  doub le  o p s , i n t  m ethod, [ fu n c t io n ] )
Before entering a section we m ust determ ine the tim e T  th a t all processors will spend in 
it. This is accomplished by calling lb _ d e c id e () , where ops is the num ber of operations 
th a t we ideally want each processor to  complete. lb _ d e c id e ()  is the only synchronous 
L B lib  call th a t causes all processors in the com m unicator associated w ith lb s  to  gather 
the com putation rates observed by each processor during the most recent execution of the 
flexible section. The rates are then  sorted, and are used to  determ ine T  using the m ethod 
specified by the m ethod argum ent. In  the current version of L B lib , the possible values of 
m ethod and the corresponding procedure for determ ining T  are as follows:
•  LB_USE_FASTEST: T  is the predicted tim e for the fastest processor to com pute ops 
operations.
•  LB_USE_SLOWEST: T  is the predicted tim e for the slowest processor to  com pute ops 
operations.
•  LB_USER_DEFINED: A pointer to  a user-defined function for calculating T  is passed as a 
fourth argum ent. lb _ d e c id e ()  calls this function, passing ops and the array of sorted
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPEND IX A. L B L IB : A FLEXIBLE-PH ASE LOAD BALANCING  L IB R A R Y 122
rates to  it.
lb _d ec id e() returns the tim e T, bu t it also stores it in lb s  so th a t the application program ­
mer does not need to  know it. Note th a t before the first call to  lb _ d e c id e () , one execution 
of the flexible section should have already been tim ed through calls to  lb _ s e c t io n _ sta r t( )  
and lb _ sec tio n _ en d () . Otherwise, no com putation rates are available for lb _ d ec id e ()  to 
determ ine the tim e T.
in t  lb_continue(LBS lb s ,  double ops_com pleted, double opsm eeded)
This function allows the program m er to  check whether the allo tted  tim e T  is about to be 
exceeded, in which case the program  m ust exit the flexible section. ops_completed is the 
num ber of operations completed so far during the current flexible section and ops_needed 
is the num ber of operations th a t will be completed before another call to lb _ co n tin u e()  
can be made. lb _con tin u e() calculates the execution rate  of the local processor during 
the current flexible section, and then  uses this ra te  to  predict the tim e required to do 
ops_needed more operations. If this tim e falls w ithin the allotted tim e, then  lb_continue () 
returns 1, indicating th a t execution of the flexible section should continue. Otherwise it 
returns 0, indicating th a t we should exit the flexible section to  avoid load imbalance. In 
our eigensolver application, the value of ops_needed is one, because we call lb _ co n tin u e()  
after each iteration  of the linear system  solver.
double * lb_get_rates (LBS lb s )  
in t  *lb_get_index(LBS lb s )
It is often necessary for the application program m er to  know the ordered rates of the 
processors, so th a t the critical tasks can be assigned to the most appropriate  processors. 
lb _ g e tjr a te s () returns a pointer to an array containing the rates of the processors, in as­
cending order. The processor indices corresponding to each of these rates can be retrieved 
through a pointer to an array returned  by lb _ g e t_ r a tes () .
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPEND IX A. L B L IB : A  FLEXIBLE-PH ASE LOAD BALANCING  L IB R A R Y 123
The above functions significantly simplify the load balancing of codes th a t abide by 
the general model of section 3.1 . Figure A .l dem onstrates the use of L B l ib  to  bal­
ance CPU load in our coarse-grain JDcg eigensolver. Note th a t, the first tim e through 
the flexible section, the for-loop should execute a constant num ber of iterations in or­
der to collect the first set of performance data. The semantics of the flexible section 
and its associated d a ta  d ictate  certain  rules when calling some functions of the L B l ib  
library. Following the l b _ d e c i d e ( )  semantics, l b _ c o n t i n u e ( )  m ust be placed between 
an l b _ s e c t i o n _ s t a r t () and an l b _ s e c t i o n _ e n d ( ) . Also, l b _ g e t _ r a t e s ( )  m ust be called 
only after a  call to  l b _ d e c i d e ( )  has been made, so th a t the rates have been exchanged 
between processors. Finally, between a call to  l b _ s e c t i o n _ s t a r t ( l b s l )  and a call to 
l b _ s e c t i o n _ e n d ( l b s l ) ,  we cannot issue another call to  l b _ s e c t i o n _ s t a r t ( l b s l ) . How­
ever, a call to l b _ s e c t i o n _ s t a r t ( l b s 2 ) ,  for a different LBS structure, is permissible.
In ord er to  en force  su ch  ru les , an  LBS k eep s track  o f  tw o  fields: s c o p e  a n d  c o n t e x t ,  
s c o p e  k eep s track  o f  w h e th er  p rogram  e x e c u t io n  is cu rren tly  w ith in  th e  flex ib le  se c tio n , 
w h ile  c o n t e x t  track s th e  a v a ila b ility  o f  d a ta  for fu n c t io n s  su ch  as l b _ d e c i d e ( ) .  In  ca se  
o f  in a p p ro p r ia te  sc o p e  or c o n te x t , L B l ib  fu n c tio n s  se t  an  error c o d e  in  th e  LBS, w h ich  th e  
p rogram m er ca n  check  th r o u g h  v o i d  lb _ c h k e r r (L B S  l b s ) .
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
APPEND IX A. L B L IB : A FLEXIBLE-PH ASE LOAD BALANCING  L IB R A R Y 124
Algorithm: Load balanced JDcg
lb s  = lb_new_lbstruct(MPI_COMM_WORLD);
U ntil convergence do:
/ /  Control phase
Perform  projection phase, steps 1-7 of JDcg 
Determ ine optim al num ber of iterations optits  to perform. 
lb_decide ( l b s , o p tits , LB_USE_FASTEST) ; 
ordering  = lb _ g et_ in d ex (lb s) ;
Perform  all-to-all: faster processors receive more critical residuals
/ /  Flexible Phase 
lb _ s e c t io n _ s t a r t ( lb s ) ;
for (ops =  0; lb_continue ( lb s  ,ops, 1) ; ops++)
Perform  one BCGSTAB step on eq. (3.1) 
lb _sec tion _en d (lb s  ,o p s ) ;
end do
F igure A .l: Pseudocode depicting how the L B lib  library can be used to load balance the JDcg 
solver.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Bibliography
[1] M. A c c e t t a ,  R . B a r o n ,  W . B o l o s k y ,  D . B . G o l u b ,  R . R a s h i d ,  A . T e v a n i a n ,  
AND M . YOUNG. Mach: A  new kernel foundation for Unix development. In Proceedings 
of the Sum m er 1986 U SEN IX  Conference, pages 93-112, 1986.
[2] E . A n d e r s o n , Z. B a i , C. B i s c h o f , J . D e m m e l , J. D u  C r o z , A. G r e e n b a u m , 
S. H a m m a r l in g , A M c K e n n e y , S. O s t r o u c h o v , a n d  D . S o r e n s e n . L A P A C K  
User’s Guide, Third Edition. SIAM, 1999.
[3] D. A r n o l d , S. A g r a w a l , S. B l a c k f o r d , J . D o n g a r r a , M. M i l l e r , K. S e y ­
m o u r , K. S a g i , Z. S h i , a n d  S. V a d h i y a r . Users’ Guide to  NetSolve V l.4.1. In­
novative Com puting Dept. Technical Report ICL-UT-02-05, University of Tennessee, 
Knoxville, TN, June 2002.
[4] A n d r e a  C. A r p a c i- D u s s e a u  a n d  R e m z i  H . A r p a c i- D u s s e a u . Inform ation and 
control in gray-box systems. In  Proceedings o f the 18th A C M  Symposium on Operating 
System s Principles (SO SP ’01), pages 43-56, 2001.
[5] S a t is h  B a l a y , K r is  B u s c h e l m a n , V i c t o r  E i j k h o u t , W il l ia m  D. G r o p p , D i- 
n e s h  K a u s h i k , M a t t h e w  G. K n e p l e y , L o is  C u r f m a n  M c I n n e s , B a r r y  F. 
S m i t h , a n d  H o n g  Z h a n g . PETSc users manual. Technical Report ANL-95/11 - 
Revision 2.1.5, Argonne National Laboratory, 2004.
[6] S a t i s h  B a l a y ,  V i c t o r  E i j k h o u t ,  W i l l i a m  D . G r o p p ,  L o is  C u r f m a n  M c I n n e s ,  
AND B a r r y  F . S m it h .  Efficient m anagement of parallelism  in object oriented num er­
ical software libraries. In Modern Software Tools in Scientific Computing, E. Arge, 
A. M. Bruaset, and H. P. Langtangen, editors, pages 163-202. Birkhauser Press, 1997.
[7] K. B a r k e r , A. C h e r n i k o v , N. C h r is o c h o id e s , a n d  K. P i n g a l i . A lo a d  b a la n c in g  
fram ew ork  for a d a p tiv e  a n d  a sy n ch ro n o s a p p lic a t io n s . IE E E  Transasctions on Parallel 
and Distributed Systems, 14(12):183-192, 2004.
[8] K. J . B a r k e r  a n d  N. C h r i s o c h o i d e s . An evaluation of a framework for the dynamic 
load balancing of highy adaptive and irregular applications. In  Proceedings o f the 
IE E E /A C M  Supercomputing Conference, 2003.
[9] R a k e s h  D. B a r v e  a n d  J e f f r e y  S c o t t  V i t t e r . A theoretical framework for 
m em ory-adaptive algorithms. In  IE E E  Symposium on Foundations o f Computer Sci­
ence, pages 273-284, 1999.
125
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 126
[10] M. B e r g e r  a n d  S . B o k h a r i . A  partitioning strategy for nonuniform  problem s on 
multiprocessors. IE E E  Transactions on Computers, C-36(5):570-580, 1987.
[11] M. J . B ER G ER  AND P . C o l e l l a .  Local adaptive mesh refinement for shock hydro­
dynamics. Journal o f Computational Physics, 82:64-84, 1989.
[12] M. J . B e r g e r  a n d  J . O l i g e r . A daptive mesh refinement for hyperbolic partial 
differential equations. Journal o f Computational Physics, 53:484-512, 1984.
[13] F . B e r m a n , R. W o l s k i , S. F i g u e i r a , J . S c h o p f , a n d  G. S h a o . Application level 
scheduling on d istribu ted  heterogeneous networks. In  Supercomputing 1996, Fall 1996.
[14] F r a n c in e  B e r m a n , R ic h a r d  W o l s k i , H e n r i  C a s a n o v a , W a l f r e d o  C i r n e , 
H o l l y  D a i l , M a r c io  F a e r m a n , S il v ia  F i g u e i r a , J im  H a y e s , G r a z ia n o  
O b e r t e l l i , J e n n i f e r  S c h o p f , G a r y  S h a o , S h a v a  S m a l l e n , N e il  S p r i n g , A l a n  
S u , a n d  D m it r ii  Z a g o r o d n o v . A d a p tiv e  c o m p u tin g  o n  th e  grid  u s in g  a p p le s . IE E E  
Trans. Parallel Distrib. Syst., 14(4):369-382, 2003.
[15] R . D . B l u m o f e , M . F r i g o , C. F . J o e r g , C. E . L e i s e r s o n , a n d  K . H. R a n d a l l . 
An analysis of dag-consistent d istribu ted  shared-memory algorithm s. In Proc. o f the 
8th A C M  Annual Symp. on Parallel Algorithms and Architectures (S P A A ’96), pages 
2 9 7 -3 0 8 , 1996.
[16] X .-C . C a i,  D. E . K e y e s ,  AND L. M a r c i n k o w s k i .  Nonlinear additive Schwarz pre­
conditioners and applications in com putational fluid dynamics. International Journal 
of Numerical Methods in Fluids, 40:1463-1470, 2002.
[17] X ia o - C h u a n  C a i  a n d  D a v id  K e y e s . N o n lin e a r ly  p rec o n d itio n ed  in e x a c t  N e w to n  
a lg o r ith m s. S IA M  J. Sci. Comput., 24:183-200, 2002.
[18] P .  C a o , E . F e l t e n , A . K a r l i n , a n d  K . L i . Im p le m e n ta t io n  an d  p erfo rm a n ce  o f  
in te g ra ted  a p p lica tio n -co n tro lled  file ca ch in g , p refe tch in g , a n d  d isk  sc h e d u lin g . A C M  
Transactions on Computer System s , 1 4 (4 ) :311—343, N o v em b er 1996.
[19] F . CHANG a n d  V. K a r AMCHETI. A utom atic configuration and run-tim e adap tation  
of d istribu ted  applications. In Proceedings o f HPD C ’00, pages 11-20. IEEE CS Press,
2000 .
[20] Y .-J . CHIANG. Out-of-core isosurface extraction of tim e-varying fields over irregular 
grids. In  Proceedings o f IE E E  Visualization ’03, pages 217-224, 2003.
[21] G. C y b e n k o . D y n a m ic  lo a d  b a la n c in g  for d is tr ib u te d  m em o ry  m u ltip ro c esso r s . Jour­
nal o f Parallel and Distributed Computing, 7(2):279-301, 1989.
[22] K. D e v i n e , B. H e n d r i c k s o n , E. B o m a n , M. S t . J o h n , a n d  C. V a u g h a n . Zoltan: 
A dynamic load-balancing library for parallel applications; user’s guide. Technical 
Report Tech. Rep. SAND99-1377, Sandia National Laboratories, Albuquerque, NM, 
1999.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 127
[23] J . D o n g a r r a , J . D u C r o u z , I. D u f f , a n d  S. H a m m a r l in g . A set of level 3 basic 
linear algebra subprograms. A C M  Trans. Math. Soft., 16:1-17, 1990.
[24] J . D o n g a r r a , J . D u C r o u z , S. H a m m a r l i n g , a n d  R. H a n s o n . An extended set 
of FORTRAN basic linear algebra subprogram s. A C M  Trans. Math. Soft., 14:1-32, 
1988.
[25] J . D o n g a r r a , S. H a m m a r l i n g , a n d  D. W a l k e r . Key concepts for parallel out-of- 
core LU factorization. Parallel Computing, 23(l-2):49-70, April 1997.
[26] D . R . F o k k e m a , G . L. G . S l e i j p e n , a n d  H. A. v a n  d e r  V o r s t . Jacobi-Davidson 
style Q R  and QZ algorithm s for the partial reduction of m atrix  pencils. S IA M  J. Sci. 
Comput., 20(1), 1998.
[27] I. F o s t e r  a n d  C. K e s s e l m a n . Globus: A m etacom puting infrastructure toolkit. The 
International Journal o f Supercomputer Applications and High Performance Comput­
ing, 11(2):115—128, Summer 1997.
[28] G e o f f r e y  C. F o x , R o y  D. W il l ia m s , a n d  P a u l  C. M e s s i n a . Parallel Computing 
Works. M organ Kaufm ann Publishers, 1994.
[29] M a t t e o  F r ig o  a n d  S t e v e n  G. J o h n s o n . FFTW : An adaptive software architecture 
for the FFT . In Proc. 1998 IE E E  Intl. Conf. Acoustics Speech and Signal Processing, 
volume 3, pages 1381-1384. IEEE, 1998.
[30] M a t t e o  F r i g o ,  C h a r l e s  E . L e i s e r s o n ,  H a r a l d  P r o k o p ,  a n d  S r i d h a r  R a -  
MACHANDRAN. Cache-oblivious algorithms. In  Proc. 40th Ann. Symp. on Foundations 
of Comp. Sci. (FOCS), pages 285-297. IE E E  Com put. Soc., 1999.
[31] A. G e o r g e  a n d  J . Liu. Computer Solution o f Large Sparse Positive Definite Systems. 
Prentice-Hall, 1981.
[32] J o h n  R. G i l b e r t , G a r y  L. M il l e r , a n d  S h a n g - H u a  T e n g . Geometric mesh 
partitioning: Im plem entation and experiments. S IA M  J. Sci. Comput., 19(6):2091- 
2110, 1998.
[33] H a r v e y  G o u l d  a n d  J a n  T o b o c h n ik . A n introduction to computer simulation m eth­
ods: applications to physical systems. Addison-Wesley Publishing Company, 1996.
[34] K . G r e m b a n , G a r y  L. M il l e r , a n d  S h e n g - H u a  T e n g . M oments of inertia  and 
graph separators. In Proceedings o f the Fifth S IA M  Symposium on Discrete Algorithms, 
1994.
[35] KlERAN H a r t y  a n d  D a v id  R. C h e r i t o n .  Application-controlled physical memory 
using external page-cache management. In Proceedings o f the 5th International Con­
ference on Architectural Support for Programming Languages and Operating System  
(ASPLO S), volume 27:9, pages 187-197, New York, NY, 1992. ACM Press.
[36] B. H e n d r i c k s o n  a n d  R. L e l a n d . The Chaco user’s guide — version 2.0. Technical 
Report SAND94-2692, Sandia National Laboratories, 1994.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 128
[37] B r u c e  H e n d r i c k s o n  a n d  R o b e r t  W . L e l a n d . A multi-level algorithm  for p a rti­
tioning graphs. In  Proceedings o f Supercomputing ’95, 1995.
[38] Y. F . H u, R . J . BLAKE, a n d  D .R . E m e r s o n . An optim al m igration algorithm  for 
dynamic load balancing. Concurrency: practice and experience, 10:467-483, 1998.
[39] S. F . H u m m e l , E. S c h o n b e r g , a n d  L. E . F l y n n . Factoring: a  practical and robust 
m ethod for scheduling parallel loops. In Proceedings of Supercomputing 91, pages 610- 
619. IEEE CS Press, 1991.
[40] S. JlANG AND X. Zh a n g . LIRS: An efficient low inter-reference recency set replace­
ment policy to  improve buffer cache performance. In Proceedings o f A C M  SIG M ET- 
R IC S  Conference on Measuring and Modeling o f Computer Systems, pages 31-42, 2002.
[41] SONG J i a n g . Efficient caching algorithms for memory managem ent in computer sys­
tems. PhD  thesis, D epartm ent of Com puter Science, College of W illiam  and Mary, 
2004.
[42] M. T . JONES AND P .E . P l a s s m a n n . Com putational results for parallel unstructured  
mesh com putations. Computing Systems in Engineering, 5:297-309, 1994.
[43] G e o r g e  K a r y p is  a n d  V ip in  K u m a r . M E TIS: Unstructured Graph Partitioning and 
Sparse M atrix Ordering System, Version 2.0, 1995.
[44] G e o r g e  K a r y p is  a n d  V ip in  K u m a r . A coarse-grain parallel form ulation of a  mul­
tilevel /c-way graph-partitioning algorithm . In Proc. 8th S IA M  Conference on Parallel 
Processing fo r  Scientific Computing, 1997.
[45] G e o r g e  K a r y p is  a n d  V ip in  K u m a r . A fast and high quality multilevel scheme for 
partitioning irregular graphs. S IA M  Journal on Scientific Computing, 20(l):359-392, 
1998.
[46] P . J . K e l e h e r , J . K. H o l l in g s w o r t h , a n d  D. P e r k o v i c . Exposing application 
alternatives. In Proceedings o f IC D C S ’99, pages 384-392. IEEE CS Press, 1999.
[47] C. T . KELLEY. Iterative methods for linear and nonlinear equations. SIAM, 1995.
[48] B. K e r n ig h a n  a n d  S. L i n . An efficient heuristic procedure for partitioning graphs. 
Bell System  Technical Journal, 29:291-307, 1970.
[49] K e it h  K r u e g e r , D a v id  L o f t e s n e s s , A m in  V a h d a t , a n d  T h o m a s  A n d e r s o n . 
Tools for the development of application-specific v irtual memory m anagement. In 
Proceedings o f the OOPSLA ’93 Conference on Object-oriented Programming Systems, 
Languages and Applications, pages 48-64, 1993.
[50] C l y d e  P . K r u s k a l  a n d  A l a n  W e i s s . Allocating independent subtasks on parallel 
processors. IE E E  Trans. Softw. Eng., 11 (10):1001—1016, 1985.
[51] V. K u m a r , A. Y. G r a m a , a n d  N a g e s h w a r a  R a o  V e m p a t y . Scalable load balanc­
ing techniques for parallel computers. Journal o f Parallel and Distributed Computing, 
22(l):60-79, 1994.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 129
[52] V i p in  K u m a r  a n d  V. N a g e s h w a r a  R a o . Parallel depth-first search on m ultipro­
cessors part I: Analysis. International Journal o f Parallel Processing, 16(6) :501—519, 
1987.
[53] C. L a w s o n , R. H a n s o n , D. K i n c a i d , a n d  F. K r o g h . Basic linear algebra subpro­
grams for FORTRAN usage. A C M  Trans. Math. Soft., 5:308-325, 1979.
[54] D. L e e , J . C h o i , J .-H . K im , S. H. N o h , S. L. M i n , Y. C h o , a n d  C. S. K im . LRFU: 
A spectrum  of policies th a t subsumes the least recently used and least frequently used 
policies. IE E E  Transactions on Computers, 50(12):1352-1360, 2001.
[55] B e n o it  B. M a n d e l b r o t . The Fractal Geometry o f Nature. W. H. Freeman, 1982.
[56] J im  M a u r o  a n d  R ic h a r d  M c D o u g a l l . SO L A R IS  Internals, Core Kernel Archi­
tecture. Prentice Hall P T R , 2001.
[57] J . R . M c C o m b s , R. T . M i l l s , a n d  A. S t a t h o p o u l o s . Dynamic load balancing 
of an iterative eigensolver on networks of heterogeneous clusters. In  Proceedings o f the 
International Parallel and Distributed Processing Symposium (IPD PS 2003), 2003.
[58] J . R . M c C o m b s  a n d  A. S t a t h o p o u l o s . M ultigrain parallelism  for eigenvalue com­
putations on networks of clusters. In Proceedings o f the 11th IE E E  International Sym ­
posium on High Performance Distributed Computing (HPD C 02), pages 143-149, 2002.
[59] J . R . M c C o m b s  a n d  A. S t a t h o p o u l o s . Parallel, m ultigrain iterative solvers for 
hiding network latencies on M PPs and networks of clusters. Parallel Computing, 
29(9):1237-1259, 2003.
[60] D. M c N a m e e  a n d  K. A r m s t r o n g . Extending the M ach external pager interface 
to  accom m odate user-level page replacem ent policies. Technical Report TR-90-09-05, 
Carnegie Mellon University School of Com puter Science, 1990.
[61] R . T . M il l s , A. S t a t h o p o u l o s , a n d  D . S . N i k o l o p o u l o s . A dapting to memory 
pressure from w ithin scientific applications on m ultiprogram m ed COW s. In Proceedings 
of the International Parallel and Distributed Processing Symposium (IP D P S 2004), 
2004.
[62] R. T . M il l s , A. S t a t h o p o u l o s , a n d  D. S . N i k o l o p o u l o s . Runtim e support for 
memory adap tation  of scientific applications on shared com putational resources. 2004. 
To be subm itted.
[63] R . T . M il l s , A. S t a t h o p o u l o s , a n d  E . S m i r n i . Algorithmic modifications to  the 
Jacobi-Davidson parallel eigensolver to  dynam ically balance external CPU  and memory 
load. In  2001 International Conference on Supercomputing, pages 454-463. ACM Press,
2001 .
[64] D. NIKOLOPOULOS. Malleable memory mapping: User-level control of memory bounds 
for effective program  adaptation. In Proc. o f the 17th IE E E /A C M  International Par­
allel and Distributed Processing Symposium (IP D P S’03), Nice, France, April 2003.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 130
[65] D im it r io s  S. N ik o l o p o u l o s  a n d  C o n s t a n t i n e  D . P o l y c h r o n o p o u l o s . A daptive 
scheduling under memory constraints on non-dedicated com putational farms. Future 
Gener. Comput. Syst., 1 9 (4 ):5 0 5 -5 1 9 , 2003 .
[66] L e o n i d  O l ik e r  a n d  R u p a k  B is w a s . PLUM: parallel load balancing for adaptive 
unstructured  meshes. Journal o f Parallel Distributed Computing, 52(2):150-177, 1998.
[67] H w e e H w a  P a n g , M i c h a e l  J. C a r e y , a n d  M ir o n  L i v n y . M emory-adaptive ex­
ternal sorting. In  19th International Conference on Very Large Data Bases, August 
24-27, 1993, Dublin, Ireland, Proceedings, Rakesh Agrawal, Sean Baker, and David A. 
Bell, editors, pages 618-629. Morgan Kaufmann, 1993.
[68] C. D. P o l y c h r o n o p o u l o s  a n d  D. J . K u c k . Guided self-scheduling: A practical 
scheduling scheme for parallel supercom puters. IE E E  Transactions on Computers, 
C-36(12): 1425-1439, 1987.
[69] A. POTHEN, H. SlMON, , AND K -P . LlOU. Partition ing  sparse m atrices with eigen­
vectors of graphs. S IA M  Journal o f M atrix Analysis, ll(3):430-452, 1990.
[70] V . N a g e s h w a r a  R a o  a n d  V i p in  K u m a r . Parallel depth-first search on m ultiproces­
sors part I: Im plem entation. International Journal of Parallel Processing, 1 6 (6 ):4 7 9 -  
4 9 9 , 1987.
[71] E. R o t h b e r g  a n d  R. S c h r e i b e r . Efficient m ethods for out-of-core sparse Cholesky 
factorization. S IA M  Journal on Scientific Computing, 21 (1): 129—144, January  2000.
[72] Y. SAAD a n d  M. H. S c h u l t z . GMRES: A generalized minimal residual algorithm  
for solving nonsym m etric linear systems. S IA M  J. Sci. Statist. Comput., 7:856-869, 
1986.
[73] Y. SAAD AND M . SOSONKINA. N on-standard parallel solution strategies for d istribu ted  
sparse linear systems. In Parallel Computation: Proc. o f A C P C ’99, A. Uhl P. Zinterhof, 
M. Vajtersic, editor, Lecture Notes in Com puter Science, Berlin, 1999. Springer-Verlag.
[74] Y o u s e f  SAAD. SPARSKIT: A basic toolkit for sparse m atrix  com putations. Tech­
nical Report 90-20, Research Institu te  for Advanced Com puter Science, NASA 
Ames Research Center, Moffet Field, CA, 1990. Software currently available a t 
< ftp ://ftp .c s .u m n .e d u /d ep t/sp a rse />.
[75] Y o u s e f  SAAD. A flexible inner-outer preconditioned GM RES algorithm . S IA M  J. 
Sci. Comput., 14(2):461-469, M arch 1993.
[76] Y o u s e f  S a a d . ILUT: a dual threshold incomplete ILU factorization. Numerical 
Linear Algebra with Applications, 1:387-402, 1994. Technical Report 92-38, M innesota 
Supercom puter Institu te, University of M innesota, 1992.
[77] Y o u s e f  S a a d . Iterative methods for sparse linear systems. PW S Publishing Company, 
1996.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y 131
[78] H o r s t  D. S i m o n . Partition ing  of unstructured  problems for parallel processing. Com­
puting System s in Engineering, 2:135-148, 1994.
[79] M a r c  S n i r , S t e v e  W . O t t o , D av id  W . W a l k e r , J a c k  D o n g a r r a , a n d  S t e v e n  
H u s s - L e d e r m a n . M PI: The Complete Reference. M IT Press, 1995.
[80] A. STATHOPOULOS AND J . R. M c C o m b s . A parallel, block, Jacobi-Davidson im­
plem entation for solving large eigenproblems on coarse grain environments. In 1999 
International Conference on Parallel and Distributed Processing Techniques and A p ­
plications, pages 2920-2926. CSREA Press, 1999.
[81] A. S t a t h o p o u l o s  a n d  J . R . M c C o m b s . Parallel, m ulti-grain eigensolvers w ith 
applications to  m aterials science. In First S IA M  Conference on Computational Science 
and Engineering, 2000.
[82] A. S t a t h o p o u l o s , S e r d a r  O g u t , Y. S a a d , J . R. C h e l i k o w s k y , a n d  H a n c h u l  
K im . Parallel m ethods and tools for predicting m aterial properties. Computing in 
Science and Engineering, 2(4): 19-32, 2000.
[83] V . E . T a y l o r  a n d  B. N o u r - O m i d . A study  of the factorization fill-in for a paral­
lel im plem entation of the finite element m ethod. International Journal o f Numerical 
Methods in Engineering, 37:3809-3823, 1994.
[84] SlVAN T o l e d o . A survey of out-of-core algorithm s in num erical linear algebra. In 
External M emory Algorithms and Visualization, Jam es Abello and Jeffrey Scott V itter, 
editors, pages 161-180. American M athem atical Society Press, Providence, RI, 1999.
[85] T . H. T z e n  AND L. M . N i. T ra p ezo id  se lf-sch ed u lin g : a  p ra c tic a l sc h e d u lin g  
sch em e for p a ra lle l co m p u ters . IE E E  Transactions on Parallel and Distributed Sys­
tems, 4 ( l) : 8 7 - 8 9 ,  1993.
[86] J e f f r e y  S c o t t  V i t t e r . External memory algorithm s and d a ta  structures: dealing 
w ith massive data. A C M  Computing Surveys (CSUR), 33(2):209-271, 2001.
[87] C. WALSHAW. Jostle: G raph partitioning and load balancing software, h t t p : / /  
s ta f fw e b . c m s .g re . a c .u k /~ c .w a ls h a w /jo s t le / .
[88] M ic h a e l  S. W a r r e n  a n d  J o h n  K. S a l m o n . A parallel hashed oct-tree n-body 
algorithm . In Proceedings o f Supercomputing ’93, pages 12-21, 1993.
[89] R . W h a l e y  a n d  J . D o n g a r r a . Autom atically Tuned Linear Algebra Software (AT­
LAS). In  Proceedings o f the IE E E /A C M  Supercomputing Conference (S C ’98), O r­
lando,FL, November 1998.
[90] R. W h a l e y , A. P e t i t e t , a n d  J . D o n g a r r a . A utom ated em pirical optim izations 
of software and the  ATLAS project. Parallel Computing, pages 3-25, January  2001.
[91] R. W o l s k i , N. S p r i n g , a n d  J . H a y e s . The network weather service: A d istribu ted  
resource performance forecasting service for m etacom puting. Journal o f Future Gen­
eration Computing Systems, 15(5-6):757-768, 1999.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BIBLIO G RAPH Y  132
[92] W e iy e  Z h a n g  a n d  P e r - A k e  L a r s o n . Dynamic memory adjustm ent for external 
mergesort. In The VLDB Journal, pages 376-385, 1997.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
133
VITA 
Richard Tran Mills
Richard Tran Mills was born on August 16, 1978 in Cookeville, Tennessee. His interest in 
com puting began at the age of six when he was presented w ith a Commodore-64 personal 
com puter for Christm as. After dropping out of Cookeville High School in 1995, he a ttended  
the University of Tennessee, Knoxville, as a Chancellor’s Scholar, where he studied geology 
and physics and conducted research in com putational geophysics. After graduating summ a  
cum laude w ith the B.A. degree in May of 1999, he enrolled in the graduate program  in 
com puter science at the College of W illiam & Mary, specializing in com putational science. 
He received the M.S. degree in May of 2001. In addition to  his research at W illiam  & Mary, 
he completed his DOE Com putational Science G raduate Fellowship practicum  experience 
at Los Alamos National Laboratory, where he worked with Dr. Peter Lichtner on developing 
codes for m odeling multi-phase, subsurface fluid flow on parallel computers.
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
