Lightweight Task Analysis for Cache-Aware Scheduling on
Heterogeneous Clusters (PDPTA’08)
Xavier Gréhant
CERN openlab, Geneva and ENST, Paris
CH-1211 Genève 23, Switzerland
xavier.grehant@gmail.com
Sverre Jarp
CERN openlab, Geneva
CH-1211 Genève 23, Switzerland
sverre.jarp@cern.ch
Abstract
We present a novel characterization of how a pro-
gram stresses cache. This characterization permits
fast performance prediction in order to simulate
and assist task scheduling on heterogeneous clus-
ters. It is based on the estimation of stack distance
probability distributions. The analysis requires the
observation of a very small subset of memory ac-
cesses, and yields a reasonable to very accurate pre-
diction in constant time.
1 Introduction
Heterogeneous resources in clusters create opportunities
for workload placement optimization [14, 18].
Cache is a core resource: the behavior of a program
relative to cache largely determines its performance
on a given server. However, cache misses are
difficult to predict. To enhance schedulers by taking
cache resources into account, programs must be
analyzed quickly: the overhead of program analysis
must not exceed the gain in scheduling efficiency.
This work is a first step towards cache-aware
scheduling in heterogeneous clusters. It consists of
the design and evaluation of a new program
characterization. This characterization permits fast
prediction of cache misses and provides the performance
required for use in schedulers. It is based on the
estimation of stack distance probability distributions.
Related characterizations are presented in sec-
tion 2. The scope of this work is defined in section
3. The design of the new characterization is ex-
plained in section 4 and evaluated in section 5.
2 Related work
Cycle-accurate simulators return a cache event in
response to each instruction. They require a han-
dle on the application being executed [17] or an ex-
haustive trace of the execution [16, 19]. Although
trace compression methods exist, these simulators
are slow compared to other predictors [7].
How well a program behaves relative to cache
has been explained in the literature with the no-
tions of program locality [13, 11, 2]. Program lo-
cality has a variety of descriptions. Reducing the
description size has always been a challenge for per-
formance prediction. Programs can be decomposed
into building blocks [9, 22, 23]. Resulting descrip-
tions are still substantial and they do not apply to
all kinds of caches.
Monte Carlo performance models represent a
program as inter-dependent statistical generators
of stall conditions [8, 20, 21]. These models are
fast, and the average number of cache misses they
predict for a run is correct even for complex
processors. However, the cache miss generators used
in these works are still specific to a cache configuration.
Fast cross-platform cache analysis is usually done
using stack distances. A stack distance is the
number of different memory addresses accessed between
two consecutive accesses to the same address.
Stack distances are suited to evaluating fully
associative caches with a Least Recently Used (LRU)
replacement policy and cache lines of one element.
In these cases, and in the absence of pre-fetching,
cache misses occur for stack distances greater than
the cache size. In addition, stack distances have been
shown to extend accurately to set-associative caches with
various cache line sizes and replacement policies
[15, 19, 6, 3, 1].
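The definition above can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names are hypothetical, the trace is a plain list of addresses, and the distance counts only the other distinct addresses touched since the previous access to the same address. Under that counting convention (an assumption), a reuse misses in a fully associative LRU cache of C one-element lines when its distance is at least C; if the reused address itself is counted, the boundary becomes "greater than C".

```python
def stack_distances(trace):
    """Return one stack distance per access.

    None marks a first-time access (infinite distance, compulsory miss).
    """
    stack = []       # LRU stack, most recently used address last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Distinct addresses touched since the previous access to addr.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_misses(trace, cache_lines):
    """Count misses of a fully associative LRU cache with one-element lines."""
    return sum(1 for d in stack_distances(trace)
               if d is None or d >= cache_lines)


trace = ["a", "b", "c", "a", "b", "b"]
print(stack_distances(trace))
print(lru_misses(trace, 2), lru_misses(trace, 3))
```

The linear search in the list keeps the sketch short; a production profiler would use a tree or a sampled approximation instead.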
For prediction, stack distances are usually
recorded in a stack distance histogram. The precision
of a histogram (i.e. the width of its bins)
is usually the size of a cache line. A stack distance
histogram thus contains the number of cache misses for
every cache size. Stack distance histograms are
widely used for cross-platform performance prediction
[10, 14, 4]. They are lighter than application
traces when the cache line size is known. However,
their size is still substantial and the whole trace
still needs to be collected.
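As a rough illustration of why a histogram holds the miss count for every cache size, the sketch below accumulates a histogram from the largest distance down. The function name is hypothetical, and distances are assumed to be expressed in cache lines, with None for first-time accesses and the same counting convention as above (a reuse misses when at least as many distinct other lines intervened as the cache holds).

```python
from collections import Counter


def miss_counts_by_cache_size(distances, max_lines):
    """Return a list where entry c is the number of misses in a c-line cache."""
    hist = Counter(d for d in distances if d is not None)
    cold = sum(1 for d in distances if d is None)
    misses = [0] * (max_lines + 1)
    # Accesses at distance >= max_lines miss for every size considered here.
    running = cold + sum(n for d, n in hist.items() if d >= max_lines)
    misses[max_lines] = running
    # Shrinking the cache by one line adds the accesses at that distance.
    for c in range(max_lines - 1, -1, -1):
        running += hist.get(c, 0)
        misses[c] = running
    return misses


print(miss_counts_by_cache_size([None, None, None, 2, 2, 0], 3))
```

One pass over the bins is enough, but the histogram itself still has to be built from the whole trace, which is the cost the text points out.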
3 Scope of this contribution
This section explains the limitations and novelty of
the characterization.
Limitations. This characterization aims to pre-
dict the number of cache misses. The cost of a
cache miss and the impact of pre-fetching are not
studied here, although they are important to simu-
late and assist cache-aware scheduling. They must
be addressed separately.
Cost of a cache miss. In modern processor ar-
chitectures the cost of a cache miss on the process
execution time depends on memory latency and
bandwidth, the number of hardware threads, the
quality of branch prediction, other platform char-
acteristics, and on whether it occurs during direct
or speculative execution. Evaluating the cost of a
cache miss is not the concern of this work, which
focuses on their number.
Pre-fetching. Modern processors use pre-fetching,
a strategy that consists of loading data into
cache before the program requires it.
Pre-fetching takes advantage of spatial locality.
Along with efficient branch prediction, pre-fetching
dramatically reduces the number of cache misses.
However, pre-fetching is scheduled externally by
the processor and is not part of the cache
configuration. The evaluation of how well it filters out
cache misses can be done separately, as in [21, 8].
Compulsory cache misses correspond to first-
time accessed memory addresses, that is, to infinite
stack distances. Compulsory instruction misses
are given by the binary size and compulsory data
misses are given by the data size. The characteriza-
tion predicts capacity and conflict misses according
to the standard taxonomy [5].
Novelty. We propose a new characterization of
how a program stresses cache. This characterization
outperforms current methods in description
size, analysis speed and prediction speed. Prediction
has constant complexity, and the analysis is the
fastest because only small subsets of the application
trace need to be extracted. It permits cross-platform
cache performance prediction with reasonable
to very good accuracy.
This level of performance is required to provide
on-the-fly performance prediction in order to simulate
and assist task scheduling on heterogeneous
clusters.
4 The characterization
We propose a characterization based on the estima-
tion of the stack distance probability distribution.
Stack distance is seen as a random variable X. It
is fitted to a combination of well known probability
distributions. The obtained distribution has a cu-
mulative distribution function cdf(x) = P(X ≤ x).
If the estimation is correct, the cache misses ratio
is P(X > cs/ls) = 1 − cdf(cs/ls) where cs is the
cache size and ls the line size. The prediction is
thus of constant complexity.
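To make the constant-time claim concrete, here is a minimal Python sketch that evaluates 1 − cdf(cs/ls) for one possible fitted family, the generalized Pareto distribution, whose cdf has a closed form. The function names, the zero default location and the parameter values in the usage lines are assumptions made for illustration; they are not fitted values from this work.

```python
import math


def gpd_survival(x, shape, scale, loc=0.0):
    """P(X > x) for a generalized Pareto distribution."""
    if x <= loc:
        return 1.0
    z = (x - loc) / scale
    if shape == 0.0:
        return math.exp(-z)          # exponential special case
    base = 1.0 + shape * z
    if base <= 0.0:
        return 0.0                   # beyond the upper bound when shape < 0
    return base ** (-1.0 / shape)


def predicted_miss_ratio(cache_size, line_size, shape, scale, loc=0.0):
    """Miss ratio 1 - cdf(cs/ls): a single closed-form evaluation."""
    return gpd_survival(cache_size / line_size, shape, scale, loc)


# Hypothetical parameters, two cache sizes, same characterization.
print(predicted_miss_ratio(64_000, 64, shape=0.5, scale=1000.0))
print(predicted_miss_ratio(256_000, 64, shape=0.5, scale=1000.0))
```

The cost does not depend on the trace length or on the cache size, only on evaluating the closed form once per cache configuration.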
In addition we propose a method to refine a simple
fit. Cache miss prediction only requires a correct
fit of the upper values of random variable X,
because prediction is only useful for realistic cache
sizes. If no cache is smaller than a minimal cache
size ms, then the distribution must fit X correctly
only for values greater than ms/ls,
where ls is the line size.
The refinement algorithm is as follows. X is a
random variable, in fact a list of samples. dist rep-
resents the parameters of a distribution, i.e. the
result of a random variable fit. Function fit is a
regular fit. Function fit′ is the refined fit.
function bias(X, dist, ms/ls):
    for each s in X such that s < ms/ls
        do
            s' := randomly generated from dist
        loop until s' < ms/ls
        s := s'
    end for
    return X
end function

function fit'(X, ms/ls):
    dist := fit(X)
    X' := X
    for each refinement
        X' := bias(X', dist, ms/ls)
        dist := fit(X')
    end for
    return dist
end function
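A runnable Python sketch of the two functions above follows. Several choices are assumptions made to keep it self-contained: the regular fit is a Method of Moments fit of a Gamma distribution, sampling uses the standard library generator, and the rejection loop is capped by a hypothetical max_tries parameter so that it terminates even when the fitted distribution has almost no mass below the threshold.

```python
import random
import statistics


def fit_gamma(samples):
    """Method of Moments fit of a Gamma distribution: (shape, scale)."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples)
    return mean * mean / var, var / mean


def bias(samples, dist, threshold, rng, max_tries=10_000):
    """Replace samples below threshold with draws from dist below threshold."""
    shape, scale = dist
    out = []
    for s in samples:
        if s < threshold:
            for _ in range(max_tries):
                candidate = rng.gammavariate(shape, scale)
                if candidate < threshold:
                    s = candidate
                    break
            # If no draw qualified, the original sample is kept.
        out.append(s)
    return out


def refined_fit(samples, threshold, refinements=3, seed=0):
    """fit' from the text: alternate biasing the lower samples and refitting."""
    rng = random.Random(seed)
    dist = fit_gamma(samples)
    current = list(samples)
    for _ in range(refinements):
        current = bias(current, dist, threshold, rng)
        dist = fit_gamma(current)
    return dist


data_rng = random.Random(1)
data = [data_rng.gammavariate(2.0, 100.0) for _ in range(500)]
print(refined_fit(data, threshold=150.0))
```

Here threshold plays the role of ms/ls. Samples at or above it are never modified, so only the lower part of the data drifts towards the current estimate.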
At each refinement, randomly generated values
based on the previous estimation replace the lower
samples. Suppose that an estimation minimizes the
Mean Squared Error $\epsilon$. $\epsilon_{i,j}$ is the error of the estimation
at the $i$th refinement on the data at the $j$th refinement. $\epsilon^{down}$
and $\epsilon^{up}$ are the contributions of the lower and upper
samples to the error. $\epsilon_{n+1,n+1} \le \epsilon_{n,n+1}$ because
estimation $n+1$ minimizes the error on data $n+1$. It
yields $\epsilon^{up}_{n+1,n+1} + \epsilon^{down}_{n+1,n+1} \le \epsilon^{up}_{n,n+1} + \epsilon^{down}_{n,n+1}$. Since
$\epsilon^{down}_{n,n+1} = 0$ and $\epsilon^{up}_{n,n+1} = \epsilon^{up}_{n,n}$ by construction, it
yields $\epsilon^{up}_{n+1,n+1} \le \epsilon^{up}_{n,n}$. The upper samples are
better fitted after each refinement.
The remainder of this paper is an evaluation of
the characterization based on the analysis of SPEC
CPU2006 benchmarks.
5 Evaluation
We instrumented SPEC binaries with PIN to obtain
instruction and data load traces [12]. Stack
distances are extracted using the trace profiling
algorithm [4]. We developed a few tools in Java.
These tools include trace analysis, estimators based
on the Method of Moments for a large spectrum of
distributions (Discrete, Uniform, Gamma, Generalized
Pareto (GP) and Half Normal (HN)), random
number generators for each of these distributions,
and estimation refinement (see footnote 1).
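For readers unfamiliar with the Method of Moments, the following Python sketch shows the idea for two of the listed families. It is not the authors' Java code: the function names are hypothetical and both distributions are assumed to start at zero. The formulas come from the standard moments: a half-normal with parameter sigma has mean sigma·sqrt(2/pi), and a generalized Pareto with shape xi < 1/2 and scale sigma has mean sigma/(1 − xi) and variance sigma²/((1 − xi)²(1 − 2·xi)).

```python
import math


def sample_moments(samples):
    """Return the sample mean and the (population) sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance


def half_normal_scale(mean):
    """Solve mean = sigma * sqrt(2 / pi) for sigma."""
    return mean * math.sqrt(math.pi / 2.0)


def gpd_from_moments(mean, variance):
    """Solve the mean and variance equations of a zero-location GPD.

    Only meaningful when the resulting shape is below 1/2,
    otherwise the variance of the distribution does not exist.
    """
    ratio = mean * mean / variance       # equals 1 - 2 * shape
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (1.0 + ratio)   # equals mean * (1 - shape)
    return shape, scale


mean, variance = sample_moments([120.0, 340.0, 95.0, 810.0, 260.0])
print(half_normal_scale(mean))
print(gpd_from_moments(mean, variance))
```

Each estimator needs only running sums over the samples, which is consistent with the small description size and fast analysis targeted here.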
The evaluation is presented in three steps. The
first step shows how well stack distances fit a prob-
ability distribution. The second step shows the ef-
fect of collecting a limited number of stack distance
samples. The third step is a discussion on using the
analysis to predict cache misses of other parts of the
program and with other input data.
1 All tools and data developed and collected for this work
are available at http://code.google.com/p/mtc-project
5.1 Stack distance distribution fit
[Figure 1: eight panels plotting stack distance (y axis) against memory reference index (x axis): bzip2 data, dealII data, perlbench data, specrand data, gobmk instructions, leslie3d instructions, libquantum instructions, tonto instructions.]
Figure 1: Stack distances visualization.
Figure 1 shows stack distances on a representa-
tive cross-section of SPEC CPU2006 benchmarks.
Left-hand-side figures show stack distances between
data accesses, and right-hand-side figures show
stack distances between instruction accesses. Light
dots are stack distances in chronological order.
Dashed lines are the outlines. They are made with
the same values in descending order. The plots of
figure 1 differ from histograms. On a histogram,
values are on the x axis and the y axis measures
the number of occurrences. On figure 1 the x axis
is a list of memory accesses and the y axis measures
corresponding stack distances.
In general, outlines are composed of curves and
straight segments. An outline composed exclusively
of straight segments indicates that the variable
perfectly fits a discrete distribution. In this
case the characterization is equivalent to estimating
the histogram. It results in a compressed
histogram where empty bins are removed [14].
Conversely, a curve indicates that a histogram would
require a high number of bins. When curves exist,
fitting a continuous distribution dramatically
reduces the characterization size, since a continuous
distribution is typically determined by two or three
parameters. Among the 28 SPEC CPU2006 benchmarks,
11 have discrete instruction stack distances,
and three (gromacs, lbm and libquantum) have
discrete data stack distances. In general, a stack
distance distribution is the sum of a discrete
distribution and continuous distributions.
[Figure 2 plot, titled "Data stack distance histogram: GemsFDTD 10000 samples": stack distance (y axis, 0 to 1400) against memory access index (x axis, 0 to 5000). Series: Actual sorted, MC GPD, MC Gamma, MC Half-Normal, MC Uniform.]
Figure 2: A problematic fit.
Figure 2 illustrates the analysis of GemsFDTD.
The outline is shown along with Monte Carlo
simulations based on different analyses. For analysis,
discrete parts are filtered out and fitted separately.
The remaining samples are fitted to a continuous
distribution. HN fits the upper part of the curve
well, and GPD the lower part. However, no single
distribution fits the whole curve.
Gamma and Uniform average the trends.
The characterization does not accurately account
for all stack distances in a program whose outline
has an inflexion point. 8 data traces out of
the 28 benchmarks fall into this category. In these
worst cases, the refined fit makes it possible to
concentrate on the higher stack distances that
account for cache misses.
[Figure 3: panels plotting stack distance (y axis) against memory access index (x axis).]
Figure 3: Fit quality in various scenarios.
Figure 3 illustrates six representative analysis scenarios.
For each benchmark the best distribution
is selected, and the fit is refined. The selection can
be done automatically by picking the distribution
that yields the smallest estimation error. For
the first three benchmarks, the estimation is quite
accurate. Refined estimations give the best results.
Indeed, outlines have long tails that bias the
estimation of upper values in the absence of refinement.
The precision on the fourth benchmark is impaired
by the precision of the discrete fit. The last two
benchmarks are the worst cases, one because of its
heavy tail and the other because of its irregular
outline.
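The automatic selection mentioned above could be sketched as follows in Python. This is only one possible reading: the estimation error is assumed to be the mean squared difference between the sorted samples (the outline) and the quantiles of each candidate, the function names are hypothetical, and the candidate set is reduced to three families with closed-form quantiles (the exponential stands in for a generalized Pareto with zero shape).

```python
import math
import statistics


def outline_mse(samples, quantile):
    """Mean squared error between sorted samples and a quantile function."""
    ordered = sorted(samples)
    n = len(ordered)
    return sum((x - quantile((i + 0.5) / n)) ** 2
               for i, x in enumerate(ordered)) / n


def select_distribution(samples):
    """Fit each candidate by simple moments and keep the lowest error."""
    mean = statistics.fmean(samples)
    lo, hi = min(samples), max(samples)
    sigma = mean * math.sqrt(math.pi / 2.0)   # half-normal scale from the mean
    std_normal = statistics.NormalDist()
    candidates = {
        "uniform": lambda p: lo + p * (hi - lo),
        "exponential": lambda p: -mean * math.log(1.0 - p),
        "half-normal": lambda p: sigma * std_normal.inv_cdf((1.0 + p) / 2.0),
    }
    errors = {name: outline_mse(samples, q) for name, q in candidates.items()}
    return min(errors, key=errors.get), errors


best, errors = select_distribution([i + 0.5 for i in range(100)])
print(best)
```

The same loop would work with refined fits by restricting the error to samples above ms/ls, which matches the emphasis on upper values.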
In conclusion, the characterization is accurate for
most SPEC CPU2006 benchmarks: stack distances
predicted in the Monte Carlo simulations closely
match actual values. However, there are impractical
scenarios where the prediction is accurate only to
within an order of magnitude.
5.2 Analysis speed and prediction
accuracy
In section 5.1 we considered the ability of a
probability distribution to accurately reproduce stack
distances and thus predict cache misses for any kind
of cache. The difference between actual stack
distances and the best distribution is a first
contribution to the prediction error. In this section we
evaluate the number of samples required to fit such
a distribution. The limited number of stack distance
samples introduces another contribution to
the prediction error. There must be just enough
samples to obtain an estimation that is as close to the
best estimation as the best estimation is to the real data.
In the following, this number of samples is called
adequate.
[Figure 4 plot, titled "Data cache misses prediction, soplex": cache misses ratio over 20 estimations (y axis, 0 to 0.12) against the number of samples out of 13583364 (x axis, powers of two from 2^19 down to 8). Series: 80kB actual, 80kB GPD, 80kB biased 3 iterations, 80kB biased 50 iterations, 320kB actual, 320kB GPD, 320kB biased 3 iterations.]
Figure 4: Data cache misses prediction: soplex.
[Figure 5 plot, titled "Instruction cache misses prediction, dealII": cache misses ratio over 20 estimations (y axis, 0 to 0.014) against the number of samples out of 26012828 (x axis, powers of two from 2^20 down to 8). Series: 8kB actual, 8kB GPD, 8kB biased 3 iterations, 16kB actual, 16kB GPD, 16kB biased 3 iterations.]
Figure 5: Inst. cache misses prediction: dealII.
In figures 4 and 5, two benchmarks are examined.
Actual cache miss ratios with two different
caches are compared to predictions based on different
sample sets. The same characterization is used
to predict cache misses for the two cache sizes.
Although refined fits are better in general, they
are not better for all cache sizes. For soplex data
miss prediction with refined fits, the adequate
number of samples is around 2^8. One sample must
be collected every 50,000 data accesses in memory.
This number yields a prediction accuracy of 99%.
For dealII instruction miss prediction, the adequate
number of samples is around 2^11. One sample
must be collected every 13,000 instructions, for
an accuracy of 99.6%.
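The sampling intervals quoted above follow from dividing the trace lengths shown on the x axes of figures 4 and 5 by the adequate numbers of samples. A short Python check of that arithmetic, with a hypothetical helper name:

```python
def sampling_interval(total_accesses, num_samples):
    """Average number of accesses between two collected samples."""
    return total_accesses / num_samples


soplex = sampling_interval(13_583_364, 2 ** 8)    # data accesses per sample
dealii = sampling_interval(26_012_828, 2 ** 11)   # instructions per sample
print(round(soplex), round(dealii))
```

Both values round to the figures given in the text: roughly one sample every 50,000 data accesses for soplex and every 13,000 instructions for dealII.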
In conclusion, accurate predictions are obtained
with fast analysis. Less accurate predictions can be
obtained even faster.
5.3 Prediction robustness
In this section we briefly discuss how accurately
the characterization predicts cache misses later in
the execution and with different input data.
[Figure 6, top panels: stack distance (y axis) against memory access index in chronological order (x axis), one panel per sample set. Bottom plot, titled "Data stack distance: bzip2 10000 samples": stack distance (y axis, ticks from 1 to 100000) against memory access index (x axis, 0 to 10000). Series: Sample set 1, Sample set 2.]
Figure 6: Two sample sets of bzip2.
Figure 6 illustrates the variation of the bzip2 stack
distance outline. Samples are taken from millions
of consecutive memory accesses, and the two sample
sets are separated by a few seconds. The samples
are represented in chronological order on separate
figures (top). The outlines are represented on
the same picture (bottom). With bzip2 the outlines
are roughly similar, but a precise prediction is not
possible for future cache misses.
[Figure 7: panels plotting stack distance (y axis) against memory reference index (x axis).]
Figure 7: Different inputs and observation segments.
Figure 7 shows the evolution of the outline for
four benchmarks. sjeng does not change its memory
access pattern over time. gobmk does not change
with different input data. In contrast, the astar
instruction access pattern changes over time, as does
the wrf data access pattern.
In conclusion, future behaviors can be predicted
only if the program is known to follow a certain
regularity. For example, scientific computations often
involve the repetitive execution of the same
routines [23]. In some cases, as with gobmk, different
input data do not change memory access patterns.
The analysis, whose overhead is unnoticeable on the
first run of a routine or on the first input data,
provides at worst a rough indication of the cache
miss ratio, useful for cache-aware scheduling. In
other cases it predicts future cache misses with very
high precision.
6 Conclusion
We presented a novel characterization of how a pro-
gram stresses cache, in terms of the stack distance
fit to a probability distribution. The characteriza-
tion has a very small size and provides cache misses
predictions in constant time. Its evaluation distin-
guishes three contributions to the prediction error.
One is relative to the appropriateness of a proba-
bility distribution to describe stack distances. The
second is relative to the number of samples used
for the fit, and the third is relative to the changes
in program behavior. Even the worst cases yield
reasonable accuracy, sufficient to simulate or assist
scheduling systems. Many application behaviors are very
accurately described by probability distributions and
have enough regularity for the prediction to apply
under different circumstances. Fitting a distribu-
tion requires the extraction of a very small subset of
the trace. This makes the analysis extremely fast,
which is needed to simulate and assist scheduling
systems.
References
[1] M. Brehob and R. Enbody. An analytical
model of locality and caching. Technical re-
port, Michigan State University, Dept of Com-
puter Science and Engineering, 1999.
[2] R. Fonseca, V. Almeida, M. Crovella, and
B. Abrahao. On the intrinsic locality prop-
erties of web reference streams. INFOCOM
2003. Twenty-Second Annual Joint Confer-
ence of the IEEE Computer and Communica-
tions Societies. IEEE, 1:448–458, 30 March-3
April 2003.
[3] K. Grimsrud, J. Archibald, R. Frost, and
B. Nelson. On the accuracy of memory ref-
erence models. In Proceedings of the 7th inter-
national conference on Computer performance
evaluation : modelling techniques and tools,
pages 369–388, Secaucus, NJ, USA, 1994.
Springer-Verlag New York, Inc.
[4] R. Hassan, A. Harris, N. Topham, and
A. Efthymiou. Synthetic trace-driven simula-
tion of cache memory. In AINAW ’07: Pro-
ceedings of the 21st International Conference
on Advanced Information Networking and Ap-
plications Workshops, pages 764–771, Wash-
ington, DC, USA, 2007. IEEE Computer So-
ciety.
[5] J. Hennessy and D. Patterson. Computer Ar-
chitecture - A Quantitative Approach. Morgan
Kaufmann, 1990,1996,2003,2006.
[6] M. D. Hill and A. J. Smith. Evaluating asso-
ciativity in cpu caches. IEEE Trans. Comput.,
38(12):1612–1630, 1989.
[7] A. Janapsatya, A. Ignjatovic,
S. Parameswaran, and J. Henkel. Instruction
trace compression for rapid instruction cache
simulation. In DATE ’07: Proceedings of the
conference on Design, automation and test in
Europe, pages 803–808, San Jose, CA, USA,
2007. EDA Consortium.
[8] T. S. Karkhanis and J. E. Smith. A first-order
superscalar processor model. SIGARCH Com-
put. Archit. News, 32(2):338, 2004.
[9] Y.-T. S. Li, S. Malik, and A. Wolfe. Perfor-
mance estimation of embedded software with
instruction cache modeling. ACM Trans. Des.
Autom. Electron. Syst., 4(3):257–279, 1999.
[10] G. Marin and J. Mellor-Crummey. Cross-
architecture performance predictions for scien-
tific applications using parameterized models.
SIGMETRICS Perform. Eval. Rev., 32(1):2–
13, 2004.
[11] V. Milutinovic. Caching in distributed sys-
tems. Concurrency, IEEE [see also IEEE
Parallel & Distributed Technology], 8(3):14–
15, Jul-Sep 2000.
[12] H. Pan, K. Asanović, R. Cohn, and C.-K. Luk.
Controlling program execution through binary
instrumentation. SIGARCH Comput. Archit.
News, 33(5):45–50, 2005.
[13] V. Phalke and B. Gopinath. An interreference
gap model for temporal locality in program
behavior. In Proceedings of the 1995
ACM SIGMETRICS Conference, pages 291–300,
1995.
[14] J. J. Pieper, A. Mellan, J. M. Paul, D. E.
Thomas, and F. Karim. High level cache sim-
ulation for heterogeneous multiprocessors. In
DAC ’04: Proceedings of the 41st annual con-
ference on Design automation, pages 287–292,
New York, NY, USA, 2004. ACM.
[15] B. R. Rau. Properties and applications of the
least-recently-used stack model. Technical re-
port, Stanford University, Stanford, CA, USA,
1977.
[16] H. Rotithor. On the effective use of a cache
memory simulator in a computer architecture
course. Education, IEEE Transactions on,
38(4):357–360, Nov 1995.
[17] F. Schintke, J. Simon, and A. Reinefeld. A
cache simulator for shared memory systems.
In ICCS ’01: Proceedings of the International
Conference on Computational Science-Part II,
pages 569–578, London, UK, 2001. Springer-
Verlag.
[18] H. Shan, L. Oliker, R. Biswas, and W. Smith.
Job scheduling in heterogeneous grid environ-
ment. In ADCOM2004: International Confer-
ence on Advanced Computing and Communi-
cation., 2004.
[19] A. J. Smith. Cache memories. ACM Comput.
Surv., 14(3):473–530, 1982.
[20] R. Srinivasan, J. Cook, and O. Lubeck. Perfor-
mance modeling using monte carlo simulation.
IEEE Computer Architecture Letters, 5(1):38–
41, 2006.
[21] R. Srinivasan, J. Cook, and O. Lubeck. Ultra-
fast cpu performance prediction: Extending
the monte carlo approach. In SBAC-PAD ’06:
Proceedings of the 18th International Sympo-
sium on Computer Architecture and High Per-
formance Computing, pages 107–116, Wash-
ington, DC, USA, 2006. IEEE Computer So-
ciety.
[22] F. Wolf and R. Ernst. Data flow based
cache prediction using local simulation. In
HLDVT ’00: Proceedings of the IEEE Interna-
tional High-Level Validation and Test Work-
shop (HLDVT’00), page 155, Washington,
DC, USA, 2000. IEEE Computer Society.
[23] L. T. Yang, X. Ma, and F. Mueller. Cross-
platform performance prediction of parallel
applications using partial execution. In SC
’05: Proceedings of the 2005 ACM/IEEE con-
ference on Supercomputing, page 40, Washing-
ton, DC, USA, 2005. IEEE Computer Society.
