Optimization of Discrete-parameter Multiprocessor Systems using a Novel
  Ergodic Interpolation Technique by Karanjkar, Neha V. & Desai, Madhav P.
Neha V. Karanjkar and Madhav P. Desai
Department of Electrical Engineering,
Indian Institute of Technology Bombay
email: {nehak,madhav}@ee.iitb.ac.in
Abstract—Modern multi-core systems have a large number of
design parameters, most of which are discrete-valued, and this
number is likely to keep increasing as chip complexity rises.
Further, the accurate evaluation of a potential design choice
is computationally expensive because it requires detailed cycle-
accurate system simulation. If the discrete parameter space can
be embedded into a larger continuous parameter space, then
continuous space techniques can, in principle, be applied to the
system optimization problem. Such continuous space techniques
often scale well with the number of parameters. We propose a
novel technique for embedding the discrete parameter space into
an extended continuous space so that continuous space techniques
can be applied to the embedded problem using cycle accurate
simulation for evaluating the objective function. This embedding
is implemented using simulation-based ergodic interpolation,
which, unlike spatial interpolation, produces the interpolated
value within a single simulation run irrespective of the number
of parameters. We have implemented this interpolation scheme
in a cycle-based system simulator. In a characterization study, we
observe that the interpolated performance curves are continuous,
piece-wise smooth, and have low statistical error. We use the
ergodic interpolation-based approach to solve a large multi-core
design optimization problem with 31 design parameters. Our
results indicate that continuous space optimization using ergodic
interpolation-based embedding can be a viable approach for large
multi-core design optimization problems (see footnote 1).
Index Terms—Design Space Exploration, Discrete Optimiza-
tion, Multi-core processors
I. INTRODUCTION
Modern multi-core systems have complex architectures,
containing multiple components such as cores, caches and
interconnects interacting with each other in intricate ways.
A system can have tens to hundreds of design parameters,
most of which are discrete valued (for example, size and
associativity of caches, latency and throughput of components,
buffer sizes and issue width of cores), and arriving at a
configuration that optimizes cost/performance measures such
as execution time or energy consumption under given constraints
is a non-trivial task. Design Space Exploration (DSE)
refers to a systematic process for identifying good designs
prior to implementation [1]. The set of all possible values
that system parameters can take is referred to as the design
space or parameter space. This is a multi-dimensional space
with each dimension corresponding to a design parameter.
[Footnote 1: A short version of this paper will be published in the proceedings of IEEE
MASCOTS 2015.]
Cost/performance measures to be optimized over the design
space constitute the objective function. The optimization pro-
cess is non-trivial for two reasons:
1) Cost/performance measures cannot be expressed as a
function of design parameters accurately using sim-
ple analytical expressions. Simulation of representative
benchmark programs on a cycle-accurate model of the
system is typically used to evaluate these measures with
reasonable accuracy. Evaluating each design option is
thus computationally expensive.
2) There are a large number of design parameters. The
number of possible design configurations grows expo-
nentially with the number of dimensions.
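To make the exponential growth concrete, the following minimal sketch counts the configurations of the 31-parameter problem whose ranges are listed in Table II. It assumes, for illustration only, that every integer between Min and Max is a legal setting for each parameter; the helper name `design_space_size` is hypothetical.

```python
import math

# (min, max) ranges for the 31 parameters, copied from Table II.
RANGES = [
    (1, 4), (1, 4), (1, 4), (1, 4), (1, 4),        # N: L1I, L1D, L2, L3, mem
    (1, 4), (1, 4), (8, 16), (16, 32), (64, 128),  # D: L1I, L1D, L2, L3, mem
    (1, 4), (1, 4), (1, 4),                        # X1: L, CinQ, CoutQ
    (4, 8), (1, 4), (1, 4),                        # X2: L, CinQ, CoutQ
    (1, 4), (1, 4), (1, 16), (1, 16), (1, 32),     # CinQ: L1I, L1D, L2, L3, mem
    (1, 4), (1, 4), (2, 16), (4, 16), (4, 32),     # CoutQ: L1I, L1D, L2, L3, mem
    (1, 8), (16, 64), (32, 64), (1, 16), (1, 16),  # X3: CinQ, L local, L remote, CoutQ local, CoutQ remote
]

def design_space_size(ranges):
    """Number of configurations, assuming every integer in [min, max] is legal."""
    return math.prod(hi - lo + 1 for lo, hi in ranges)

size = design_space_size(RANGES)
print(f"{len(RANGES)} parameters, about 10^{math.log10(size):.1f} configurations")
```

Even at one simulation per second, enumerating a space of this size would be out of reach, which is why exhaustive search does not scale to this problem.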
Techniques for design space exploration aim to find good
solutions whilst minimizing the computational expense of
finding them. The computational expense for evaluating a
single design option is determined by the level of abstraction
of the system model chosen. Hardware prototypes or FPGA
implementations provide performance measures with high
accuracy but involve very long implementation time, while
purely analytical models allow faster evaluation but lose out
on accuracy. Simulation based evaluation lies between the two
extremes. Our work focusses on exploring the design space
efficiently, assuming that the objective function is evaluated
using cycle-accurate simulations. Existing techniques for ex-
ploring the design space can be broadly classified as follows:
• Exhaustive enumeration: exhaustive search based methods
[2], [3] yield globally optimal solutions, but the number
of evaluations becomes prohibitive for a large number
of parameters.
• Design of experiments (DoE): the number of evaluations can
be reduced by carefully selecting a subset of points
in the design space to be evaluated, using a design of
experiments (DoE) approach [4], [5]. However, effectively
using DoE approaches other than full-factorial requires
prior knowledge about the effect of system parameters on
performance.
• Search over discrete parameter space: randomized search
methods such as simulated annealing [6]–[8], evolution-
ary algorithms [9]–[11] and heuristic-based local search
methods such as hill climbing [12] and Tabu search
[13], [14] have been applied to cope with the large
dimensionality of the design space.
• Meta-model based search: using systematic sampling, a
meta-model of the system is constructed. The meta-model
may take the form of, for example, an artificial neural network,
a linear regression model, or a polynomial or spline interpolation.
This meta-model is then used in an interleaved manner
with simulations to prune the design space or guide the search
during optimization [15]–[17].
A. Main Contributions
Existing DSE techniques search for the optimum either
directly over the discrete parameter space, or search over a
meta-model which may be defined over a continuous domain.
We investigate a new approach to this optimization problem
which is based on embedding the discrete parameter space
into an extended continuous space, and applying continuous
optimization techniques directly over the embedded simulation
model for finding local optima efficiently (see footnote 2). The main motivation
behind this approach is to achieve better scalability with
respect to the number of design parameters. Using continuous
optimization offers the following advantages:
1) Continuous optimization methods can handle large di-
mensionality of the design space better than exhaustive
search or design of experiments-based methods. The
number of function evaluations is weakly dependent on
the number of parameters.
2) They make use of gradient information and are thus
more efficient as compared to randomized search meth-
ods for finding local minima.
3) Continuous space offers more pathways to reach the
solution as compared to a discrete space. Continuous
optimization techniques can recognize diagonal ridges
in the objective function unlike local search methods in
discrete space such as hill climbing [1].
4) The approach does not involve use of a meta-model, thus
each function evaluation is as accurate as the detailed
simulation model.
However, descent-based continuous optimization techniques
find local minima, and random restarts are required to search
for the global optimum. Further, in order to convert the
continuous space solution back to discrete space, rounding
needs to be employed with care. The idea of applying contin-
uous optimization techniques to solve a discrete optimization
problem has been described in the past in chemistry [18] and
applied mathematics [19]. To our knowledge, this approach
has not been investigated for system-level design exploration.
We propose a technique for embedding discrete parameters
into a continuous space by using a simulation-based ergodic
interpolation method, which, unlike spatial interpolation techniques,
can produce the interpolated result within a single
simulation run irrespective of the number of parameters.
[Footnote 2: Note that this optimization approach is distinct from building an
interpolation-based meta model of the system by sampling the discrete
parameter space.]
The basic idea behind ergodic interpolation is to replace each
discrete parameter in the cycle-accurate model with a discrete
random variable whose value changes over time, such that the
set of values of all parameters averaged over time within a
single simulation run approaches a given point in the extended
continuous parameter space at which we wish to evaluate the
interpolated performance value. While ergodic interpolation
can be applied, in principle, to a variety of discrete parameters,
we define and demonstrate the embedding for four types of
parameters:
1) buffer sizes
2) component throughputs
3) component latencies (in units of clock cycles)
4) number of pipelined stages in interconnect links
(where component can be a core, cache, memory module or an
interconnect). Our primary motivation was to understand the
viability of our techniques for this relatively simple parameter
set. Based on our experience with this parameter set (summa-
rized in Section IV), we are investigating the application of
the ergodic interpolation technique to other discrete parameters
such as cache size and associativity, core issue width etc. We
first describe the generic system model used in this study and
define the discrete parameters which are subject to embedding
in Section II. We then describe the ergodic interpolation
technique in detail in Section III.
We characterize the interpolated performance function ob-
tained using our ergodic interpolation technique on a problem
instance with 12 parameters. The objective function (total
execution time for a parallel workload) is evaluated at closely
spaced points along random straight lines passing through
the 12-dimensional continuous parameter space. We observe
that the interpolated function has low statistical error and
is continuous and piece-wise smooth along each line. This
indicates that continuous space optimization techniques can
be used with this embedding. We present the characterization
study in Section III-C.
Next, we apply the ergodic interpolation based approach
to find optimal configurations in a large DSE problem with
31 discrete parameters. The parameters are embedded into
a continuous space using ergodic interpolation. A variety of
continuous space optimization techniques can be applied over
this embedding. We choose COBYLA [20], an algorithm that
does not require expensive gradient computations. We use
an implementation of COBYLA from Python’s SciPy library.
The objective function to be minimized is a weighted sum of
performance and cost components where weights are varied
to obtain cost-performance trade-off curves. Performance is
defined as the sum of execution times for four NAS benchmark
kernels and cost is represented by a synthetic cost function.
For each set of weights, we perform multiple optimization
runs starting from random initial points in the design space.
We find that across all optimization runs (given a limit of 300
function evaluations per run) the solution converges to a local
optimum in all cases and the improvement in objective ranges
from 1.3X to 12.2X over the initial guess. Further, the spread
in objective values at the optimum across multiple runs is low
for most cases. We compare the quality of locally optimum
solutions found by COBYLA runs with a global optimum
reported by an Adaptive Simulated Annealing (ASA) search
in discrete space and find that most of the COBYLA runs
produce a solution that is close to (within 10% of) the global
optimum reported by ASA. The results (presented in Section
IV) indicate that continuous space optimization applied over
an ergodic interpolation based embedding is a viable approach
for solving discrete optimization problems in design space
exploration of multi-core systems.
[Fig. 1. System model]
II. SYSTEM MODEL
We use a parametrized cycle accurate model of a multi-core
system (shown in Figure 1) that is representative of current
NUMA (Non Uniform Memory Access) architectures such as
those based on the Intel QPI [21] or AMD HyperTransport
[22] standards. The system consists of m processors, with n
cores per processor (where m and n are model parameters).
The processors are connected to m memory modules, forming
an m-way NUMA configuration. Each core implements the
Sparc V8 instruction set. Timing of load/store accesses flowing
through the memory subsystem is modeled in detail. All
other instructions are assumed to execute in one cycle. The
cache subsystem comprises per-core split L1 and unified
L2 caches and a shared L3 cache. Coherency is maintained
using a hierarchical directory-based MESI protocol which
is implemented by generalizing the protocol described in
[23, Ch. 8.3.2, p.152] to an arbitrary number of levels in
memory hierarchy. Interconnect between successive levels in
the memory hierarchy is a full-crossbar with parametrized link
delays. The NUMA effect is modeled by assigning different
delays to links connecting a processor to its local and remote
memory nodes. The model is built using the SiTAR modeling
framework [24] and is capable of running user-level programs
compiled for Sparc V8. Parallel applications can be ported to it
using a library of synchronization routines. For all simulations
reported in this study, the number of processors (m) and
the number of cores per processor (n) were fixed at 2 and 4
respectively.
A. Embedded Parameters
We present a precise definition of the discrete parameters
which are subject to embedding. Although we use functional
models for all components in the system, each component
can be classified into one of three basic types: modules, wires
and queues. Modules represent behavioral components in the
system such as caches, cores, memory modules and inter-
connect schedulers, wires represent interconnect links with
a parametrizable number of pipelined stages, and queues are
used to represent buffering at various places in the system. The
system can thus be thought of as being constructed using an
interconnection of modules (representing the processor cores
for example), wires (which are pipelined) and buffers/queues.
The activity in the system (for example, memory accesses
and coherence requests) is modeled by jobs and movement
of data-tokens. A job represents a behavioural action by a
module which can consume and produce data-tokens. Data-
tokens are used to encapsulate information. Jobs are triggered
inside modules by the availability of the necessary data-tokens.
The completion of a job may produce new data-tokens that can
be transported out of the module. Queues are used to buffer
data-tokens. Each module has its own input and output queues
for storing data-tokens to be used by or generated from jobs
that it executes. All modules, wires and queues in the system
are parametrized as follows:
• queue capacity C(q) : A queue q has a single parameter
C(q). In each cycle, the queue q will accept new data-
tokens as long as the total number of data-tokens in the
queue is ≤ C(q).
• module throughput N(m) : a module m can accept
new jobs each cycle as long as the number of jobs being
processed by it is ≤ N(m).
• module delay D(j,m) : If j is a job that is accepted by
module m, then D(j,m) is the number of cycles it takes
for the module to execute the job. The job is removed
from the input queue as soon as it is accepted by the
module and place is reserved for its output in the output
queues. Its output becomes visible in the output queues
only after D(j,m) cycles.
• wire latency L(w) : A wire w has a single parameter
L(w) that represents the number of register stages in
the wire. A wire can accept at most one data-token
each cycle, and each token takes L(w) cycles to pass
through the wire. The wire can accept a token only after
reserving a place for it in the output queue to which it
is connected. The token becomes visible in the output
queue after L(w) cycles.
To summarize: our cycle accurate simulation model has the
discrete parameters C(q), N(m), D(j,m) and L(w) for all
queues q, modules m and wires w that make up the architec-
tural components in the system. We define an embedding for
these parameters in the following section.
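As an illustration of the queue and wire definitions above, here is a minimal, simplified sketch of the two component types with purely discrete parameters. It is not the SiTAR model used in this study: the class and method names are hypothetical, modules and jobs are omitted, and a slot is assumed to be free while the number of stored plus reserved tokens is strictly below C(q).

```python
from collections import deque

class Queue:
    """Bounded buffer with capacity C(q); slots can be reserved before delivery."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()
        self.reserved = 0

    def try_reserve(self):
        # Assumption: a slot is free while stored + reserved tokens < C(q).
        if len(self.tokens) + self.reserved < self.capacity:
            self.reserved += 1
            return True
        return False

    def deliver(self, token):
        # The token becomes visible in the slot that was reserved for it.
        self.reserved -= 1
        self.tokens.append(token)

class Wire:
    """Pipelined link with L(w) register stages feeding an output queue."""
    def __init__(self, latency, out_queue):
        self.latency = latency
        self.out = out_queue
        self.in_flight = deque()      # (arrival_cycle, token), in send order
        self.last_send_cycle = None

    def try_send(self, token, cycle):
        if self.last_send_cycle == cycle:
            return False              # at most one data-token per cycle
        if not self.out.try_reserve():
            return False              # no place available in the output queue
        self.last_send_cycle = cycle
        self.in_flight.append((cycle + self.latency, token))
        return True

    def tick(self, cycle):
        # Deliver every token whose L(w) cycles have elapsed.
        while self.in_flight and self.in_flight[0][0] <= cycle:
            _, token = self.in_flight.popleft()
            self.out.deliver(token)
```

A token sent at cycle t on a wire with L(w) = 3 becomes visible in the output queue when `tick(t + 3)` is called, and a send is refused if the output queue has no unreserved slot.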
III. EMBEDDING THE DISCRETE PARAMETER SPACE INTO
CONTINUOUS SPACE
An embedding of the discrete parameter space into con-
tinuous space requires us to extend each discrete param-
eter to a continuous one. Based on this, we can extend
cost/performance functions from the discrete space to the
extended continuous space by using interpolation. The em-
bedding should ideally be implemented in such a way that
the behaviour of the interpolated cost/performance functions
is suitable for the application of continuous optimization
algorithms.
Let X = {x1, x2, ...xn} be a vector of values of discrete-
valued design parameters in the model. X ∈ ΩD where ΩD
is the discrete parameter space. Our cycle based simulation
model allows us to evaluate some objective function
f : ΩD → R.
The function f needs to be optimized. We construct an
extension of f to produce a continuous function f̂:
$$\hat{f} : \Omega_C \to \mathbb{R}, \qquad \Omega_D \subset \Omega_C \subseteq \mathbb{R}^n$$
where ΩC is a continuous space extension of ΩD. The
extension f̂ must satisfy
$$\hat{f}(Y) = \begin{cases} f(Y) & \text{when } Y \in \Omega_D \\ \theta(Y, \Omega_D) & \text{otherwise} \end{cases} \tag{1}$$
That is, f̂ must be continuous in ΩC and must agree with f
on ΩD. Thus, we need a suitable interpolator θ(Y, ΩD).
Spatial interpolation (performed using standard multivariate
interpolation methods such as Lagrange interpolation [25],
Simplex interpolation [26] or Monte Carlo interpolation [27, p.
143]) is an obvious candidate for θ. In spatial interpolation,
for each Y ∈ ΩC , we identify a set of nearest neighbours
X1(Y ), X2(Y ), . . . Xk(Y ) of Y such that Xi(Y ) ∈ ΩD for
each i. Then,
θ(Y,ΩD) = I(X1(Y ), X2(Y ), . . . Xk(Y ))
where I is some interpolation function. The interpolated value
at a single point Y is then computed in terms of the function
values of a set of neighbour points, which have to be com-
puted using expensive simulations. Thus, spatial interpolation
as a means of embedding is computationally inefficient for
simulation-based optimization.
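The cost of spatial interpolation can be illustrated with a short sketch. The example below uses multilinear interpolation over the corners of the enclosing integer cell, which is chosen here only because it is easy to state; it is not one of the methods cited above. The function name `multilinear_interpolate` is hypothetical, and `f` stands in for an expensive simulation.

```python
import itertools
import math

def multilinear_interpolate(f, y):
    """Interpolate f at the real point y from the integer corners of its enclosing cell.

    Returns (interpolated value, number of evaluations of f).
    """
    lo = [math.floor(v) for v in y]
    frac = [v - l for v, l in zip(y, lo)]
    total = 0.0
    evaluations = 0
    for corner in itertools.product((0, 1), repeat=len(y)):
        weight = 1.0
        for bit, t in zip(corner, frac):
            weight *= t if bit else (1.0 - t)
        if weight == 0.0:
            continue  # this corner does not contribute (coordinate is integral)
        x = tuple(l + b for l, b in zip(lo, corner))
        total += weight * f(x)  # each call would be one full simulation
        evaluations += 1
    return total, evaluations

value, runs = multilinear_interpolate(lambda x: sum(x), (1.5, 2.25, 3.75))
print(value, runs)
```

For a point with n non-integer coordinates this scheme needs 2^n evaluations of f, so a single interpolated value in a 12-parameter space would take 4096 simulations.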
Instead, we introduce an ergodic interpolation method
which relies on a randomization of the simulation model in
order to construct the function θ. Using this, the value θ(Y )
can be produced by a single simulation run. The ergodic
interpolation method builds on a sensitivity measurement
technique described in [28] for producing small (real-valued)
perturbations to discrete-valued parameters in a simulation
model.
A. Ergodic Interpolation using a Randomized Simulation
Model
The basic idea behind ergodic interpolation is to approxi-
mate the result of spatial interpolation with averaging in time.
Each discrete parameter i in the model is replaced by a discrete
random variable whose value changes over time within a single
simulation run, such that its average value over the simulation
run approaches a real number vi ∈ V , where the set of average
values of all parameters V = {v1, v2, ...vn} represents a point
in the extended continuous space at which we wish to evaluate
the interpolated performance value.
Several choices exist in implementing such an embedding:
for instance, whether the value of the parameter is changed
every cycle or once in an interval consisting of multiple cycles,
and the exact behavior of a component when it transitions from
one set of values for its parameters to another. We present one
possible definition of embedding for the C(q), N(m), D(j,m)
and L(w) parameters introduced in Section II-A. We observe
that this definition leads to interpolated performance functions
that are smooth and suitable for continuous space optimization.
In order to construct the ergodic interpolator, we first
randomize the cycle-based simulation model introduced in
Section II. If 0 ≤ p ≤ 1 and if x is a real number, then
we define
$$\gamma(p, x) = \begin{cases} \lceil x \rceil & \text{with probability } p \\ \lfloor x \rfloor & \text{with probability } 1 - p \end{cases}$$
Thus, γ(p, x) is an integer-valued random variable. If x is an
integer, then γ(p, x) = x. For fixed real x, the expected value
of γ(p, x) is p⌈x⌉ + (1 − p)⌊x⌋. It follows that for fixed real
x, the expected value of γ(x − ⌊x⌋, x) is x.
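The random variable γ can be sketched in a few lines of Python. This is an illustrative implementation, not the one in the simulator; the helper name `ergodic_value` and the use of Python's `random` module are assumptions.

```python
import math
import random

def gamma(p, x, rng=random):
    """Return ceil(x) with probability p and floor(x) with probability 1 - p."""
    return math.ceil(x) if rng.random() < p else math.floor(x)

def ergodic_value(x, rng=random):
    """Draw one integer sample whose expected value is x, using p = x - floor(x)."""
    return gamma(x - math.floor(x), x, rng)

rng = random.Random(1)
samples = [ergodic_value(10.3, rng) for _ in range(100_000)]
# The sample mean is close to 10.3; the exact value depends on the seed.
print(sum(samples) / len(samples))
```

Every sample is either 10 or 11, so a component only ever sees integer parameter values, while the average over many draws approaches the real value 10.3.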
Suppose the parameters C(q), D(j,m), N(m), L(w) are
real numbers. Then the component behaviour in the simulation
model is randomized as follows:
• For a queue q with real parameter C(q): Let p = C(q) − ⌊C(q)⌋.
At every cycle, accept data-tokens into the queue
as long as the total number of tokens in the queue is
≤ γ(C(q) − ⌊C(q)⌋, C(q)). For example, if C(q) = 10.3,
then p = 0.3. At each cycle γ(C(q) − ⌊C(q)⌋, C(q)) will
be 11 with probability 0.3 and 10 with probability 0.7, so
that during the simulation, the queue will have capacity
10 for 70% of the time and capacity 11 for the remaining
30% of the time.
• For a module with real parameter N(m): At every
cycle, start a new job in the module only if the total
number of active jobs in the module is less than
γ(N(m) − ⌊N(m)⌋, N(m)).
• For a module m and parameter D(j,m) for some job j:
At every cycle, if the job j is started successfully, assign
a latency of γ(D(j,m) − ⌊D(j,m)⌋, D(j,m)) to the job.
• For a wire w with real parameter L(w): At every cycle,
for a data-token that enters the wire in this cycle, assign
a transport latency of γ(L(w) − ⌊L(w)⌋, L(w)) to the
token.
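The queue rule above can be sketched with a toy cycle-based loop. This is a deliberately small stand-in for the real model: arrivals and services are Bernoulli events with hypothetical probabilities, the function name `simulate_queue` is an assumption, a token is accepted only while the occupancy is strictly below the sampled capacity, and tokens already in the queue are kept when the sampled capacity drops.

```python
import math
import random

def gamma(p, x, rng):
    """ceil(x) with probability p, floor(x) with probability 1 - p."""
    return math.ceil(x) if rng.random() < p else math.floor(x)

def simulate_queue(capacity, cycles, arrival_prob, service_prob, seed):
    """Toy queue with real-valued capacity; returns (throughput, mean sampled capacity)."""
    rng = random.Random(seed)
    p = capacity - math.floor(capacity)
    occupancy = 0
    accepted = 0
    capacity_sum = 0
    for _ in range(cycles):
        cap = gamma(p, capacity, rng)      # integer capacity used in this cycle
        capacity_sum += cap
        if rng.random() < service_prob and occupancy > 0:
            occupancy -= 1                 # one token leaves the queue
        if rng.random() < arrival_prob and occupancy < cap:
            occupancy += 1                 # one token is accepted
            accepted += 1
    return accepted / cycles, capacity_sum / cycles

throughput, mean_capacity = simulate_queue(10.3, 100_000, 0.9, 0.5, seed=0)
print(throughput, mean_capacity)
```

A single run yields both a performance figure for the fractional capacity and a time-averaged capacity close to the requested real value, which is the essence of the ergodic interpolation.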
This randomization effectively ensures that the average
value of each parameter can be a real number, while the
simulation model continues to be discrete parameter and cycle-
based. The net effect is that each parameter in the simulation
model can be treated as a discrete valued Bernoulli random
variable whose time-average value is the desired continuous
value at which the function is to be computed. We call this
an ergodic interpolation because the time-average in a single
simulation run gives the interpolated value. The rest of the
embedding is easy. We embed ΩD into a box ΩC as follows:
for each parameter pi in the parameter space, we define a
minimum possible value mi and a maximum possible value
Mi. Then
ΩC = {(x1, x2, ..., xn) : mi ≤ xi ≤ Mi, i = 1, 2, ..., n}
For each point Y ∈ ΩC, the ergodic interpolation θ(Y, ΩD)
is produced by the randomized simulation
model described above. This technique gives a well-defined
interpolation. However, two questions remain:
1) What is the amount of statistical error in the interpolated
value?
2) Is the interpolation well-behaved? That is, is the inter-
polated function smooth enough for us to be able to use
continuous optimization techniques?
We address these questions in the following subsections.
B. Statistical Error in Ergodic Interpolation
Statistical error in the interpolated value can be controlled
by increasing the number of samples of parameter values. This
can be done by averaging results from multiple simulation
runs, or by using a single long simulation run. For the bench-
mark programs used as workload in our design exploration
experiment (listed in Table I), we estimated the standard
deviation of the interpolated performance value at a few points
in the design space by generating multiple samples. For these
medium sized benchmarks (spanning 8 to 20 million simulated
cycles) we find the standard deviation relative to the mean to
be between 0.009% and 0.019%. These error values are small,
and thus, for long enough benchmarks, a single simulation run
is sufficient for obtaining the interpolated performance value
at a single point.
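The measurement procedure described here (repeat the run with different seeds and report the standard deviation relative to the mean) can be sketched as follows. The "simulation" is a toy stand-in that sums randomized job latencies, so the printed numbers are not comparable to the figures reported above; `toy_execution_time` and `relative_std` are hypothetical names.

```python
import math
import random
import statistics

def gamma(p, x, rng):
    """ceil(x) with probability p, floor(x) with probability 1 - p."""
    return math.ceil(x) if rng.random() < p else math.floor(x)

def toy_execution_time(latency, num_jobs, seed):
    """Toy stand-in for one simulation run: total time of num_jobs sequential jobs."""
    rng = random.Random(seed)
    p = latency - math.floor(latency)
    return sum(gamma(p, latency, rng) for _ in range(num_jobs))

def relative_std(latency, num_jobs, num_runs):
    """Sample standard deviation divided by the mean over runs with distinct seeds."""
    samples = [toy_execution_time(latency, num_jobs, seed) for seed in range(num_runs)]
    return statistics.stdev(samples) / statistics.mean(samples)

for num_jobs in (100, 10_000):
    print(num_jobs, f"{100 * relative_std(10.3, num_jobs, 30):.4f}%")
```

In this toy setting the relative error shrinks as the run gets longer, which matches the observation that a single long simulation run can be sufficient.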
C. Well-behavedness of Ergodic Interpolation
We check whether the interpolated performance function
is smooth, so that continuous optimization techniques can
be applied to it. We do this by evaluating the function at
closely-spaced points along random straight lines passing
through the extended continuous parameter space. Parameters
for this experiment are D, N and C (for output buffers), as
introduced in Section II-A, for the L1, L2 and L3 caches and main
memory. With three parameters for each of these four components,
the extended continuous parameter space has
12 dimensions. The interpolated function f̂ is the total time
to execute a parallel memory test workload. The workload
involves each core accessing non-overlapping but interleaved
memory locations, and is chosen to stress the memory system
sufficiently.
We consider random straight lines passing through the 12-
dimensional continuous parameter space. Each line is sampled
at 200 uniformly spaced points, and the objective function f̂
(total execution time) is evaluated at each of these points using
cycle accurate simulations of the model described in Section
II. We perform multiple simulation runs at each point with
distinct randomization seeds in order to measure the mean and
standard error values. In Figure 2, we show the interpolated
performance function values (mean f̂) evaluated at uniformly
spaced points along ten randomly chosen straight lines passing
through the continuous parameter space. We observe that along
each line, the interpolated function is continuous and piece-
wise smooth. The measured relative standard error values are
less than 0.01%. Thus the interpolated function obtained using
our ergodic interpolation technique seems to be well-behaved
and suitable for the application of continuous optimization
techniques.
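The sampling step of this characterization can be sketched as below. The text does not state how the random lines were chosen, so this sketch assumes a segment between two points drawn uniformly from the box ΩC; the bounds of 1 to 16 per dimension and the name `sample_line` are likewise hypothetical.

```python
import random

def sample_line(lower, upper, num_points, rng):
    """Return num_points uniformly spaced points on a random segment inside the box."""
    a = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    b = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    points = []
    for k in range(num_points):
        t = k / (num_points - 1)   # t runs from 0 to 1 inclusive
        points.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return points

# Hypothetical bounds for a 12-dimensional parameter space.
LOWER = [1.0] * 12
UPPER = [16.0] * 12

line = sample_line(LOWER, UPPER, 200, random.Random(0))
print(len(line), len(line[0]))
```

Each of the 200 points would then be passed to the randomized simulator, several times with distinct seeds, to obtain the mean and standard error plotted for that line.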
IV. RESULTS OF CONTINUOUS OPTIMIZATION USING
ERGODIC INTERPOLATION
We apply the ergodic interpolation based approach for
finding optimal configurations in a multiprocessor design
exploration problem with 31 discrete parameters. The discrete
parameters and their ranges are listed in Table II. The pa-
rameters are embedded into continuous space using ergodic
interpolation over a cycle accurate model described in Section
II. Four kernels from the NAS parallel benchmark suite (NPB)
[29] are used as workload. We have ported an OpenMP+C
version of NPB v2.3 developed by the Omni project [30] to
our model. The kernels and their problem sizes are listed in
Table I.
TABLE I
NAS KERNELS AND THEIR PROBLEM SIZES
Kernel                          Problem Size
Embarrassingly Parallel (EP)    2¹⁶
Multigrid (MG)                  16³
3-D FFT PDE solver (FT)         16³
Integer Sort (IS)               2¹⁶
Continuous optimization is performed over the embedding
using an implementation of a derivative-free continuous op-
timization algorithm COBYLA [20] from Python’s SciPy
library. We define the objective function for this optimization
experiment as follows:
A. The Objective Function
Let Y ∈ Rⁿ be a vector of parameter values in the
extended continuous space and f̂(Y) denote the objective
function we wish to minimize. We construct the objective
function as a weighted sum of performance and cost measures.
The performance measure execution time(Y ) is the sum of
execution times for four benchmark kernels (listed in Table I).
We represent cost using a synthetic function cost(Y ) which
increases as each parameter is varied in the direction of
improving performance. The cost function is defined as
$$\mathrm{cost}(Y) = \sum_i x_i + \sum_j \frac{100}{d_j}$$
where dj and xi are, respectively, delay and
non-delay parameters normalized to lie in the range [1, 100].
This synthetic cost function is sufficient to demonstrate the
validity of our technique and can be replaced with other cost
functions such as energy consumption as appropriate for a
particular design exploration problem. The objective function
is:
$$\hat{f}(Y) = \text{execution time}(Y) + \alpha \times \mathrm{cost}(Y)$$
where α is a weighting factor which is varied to obtain
cost/performance trade-offs.
Fig. 2. Interpolated objective function (total execution time) values plotted along ten random straight lines passing through the multi-dimensional continuous
parameter space. Each line is sampled at 200 uniformly spaced points.
[Panels: line 1 to line 10; horizontal axis: sample index along the line (0 to 200); vertical axis: execution time, scale ×10⁷.]
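The cost and objective functions of this subsection can be written out as a short sketch. The text says that parameters are normalized to the range [1, 100] but does not give the mapping, so the linear `normalize` helper below is an assumption, as are all function names; `execution_time` stands for the value returned by the simulator.

```python
def normalize(value, lo, hi):
    """Linearly map value from [lo, hi] to [1, 100] (assumed mapping)."""
    return 1.0 + 99.0 * (value - lo) / (hi - lo)

def cost(non_delay, delay):
    """Synthetic cost: sum of non-delay parameters plus 100/d for each delay parameter."""
    return sum(non_delay) + sum(100.0 / d for d in delay)

def objective(execution_time, non_delay, delay, alpha):
    """Weighted sum of the performance and cost measures."""
    return execution_time + alpha * cost(non_delay, delay)

# Two normalized non-delay parameters and two normalized delay parameters.
print(cost([1.0, 100.0], [100.0, 50.0]))
```

Larger buffers and throughputs raise the first sum, while shorter delays raise the second, so the cost grows whenever a parameter moves in the direction that improves performance.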
B. Results
We study the convergence properties of the COBYLA
algorithm applied to the objective function obtained by our er-
godic interpolation technique. Optimization is performed with
multiple values of the weight factor α ∈ {0, 10⁴, 10⁵, 10⁶}
to get cost/performance trade-off curves. Further, we perform
eight optimization runs starting from distinct randomly chosen
points in the parameter space for each value of α. A single
optimization run is allowed to make at most 300 function
evaluations.
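The experimental setup described here (eight COBYLA runs from random starting points, each limited to 300 function evaluations) can be sketched with SciPy as follows. The objective is a cheap quadratic stand-in for the simulator, the three-parameter box is hypothetical, and expressing the box as inequality constraints is one possible choice rather than a description of the setup actually used.

```python
import random
from scipy.optimize import minimize

# Hypothetical three-parameter box; the real problem has 31 parameters.
LOWER = [1.0, 1.0, 8.0]
UPPER = [4.0, 16.0, 16.0]

def toy_objective(y):
    """Cheap stand-in for the simulated objective, minimized at (2.5, 6, 10)."""
    targets = (2.5, 6.0, 10.0)
    return sum((v - t) ** 2 for v, t in zip(y, targets))

def box_constraints(lower, upper):
    """Express lower <= y <= upper as the inequality constraints COBYLA accepts."""
    cons = []
    for i, (lo, hi) in enumerate(zip(lower, upper)):
        cons.append({"type": "ineq", "fun": lambda y, i=i, lo=lo: y[i] - lo})
        cons.append({"type": "ineq", "fun": lambda y, i=i, hi=hi: hi - y[i]})
    return cons

def multi_start(objective, lower, upper, num_runs=8, max_evals=300, seed=0):
    """Run COBYLA from several random initial points and return the best result."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_runs):
        x0 = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        res = minimize(objective, x0, method="COBYLA",
                       constraints=box_constraints(lower, upper),
                       options={"maxiter": max_evals, "rhobeg": 1.0})
        results.append(res)
    return min(results, key=lambda r: r.fun)

best = multi_start(toy_objective, LOWER, UPPER)
print(best.x, best.fun)
```

In the real setting each call to the objective is a full cycle-accurate simulation, so the evaluation limit directly bounds the cost of one optimization run.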
In Figure 3, we show the evolution of the objective function
with the number of function evaluations for all optimization
runs. In almost all cases, we observe that the objective function
values converge to a local optimum within 200 function
evaluations. Further, the spread of values obtained across
different initial points is small for most cases as summarized
in Table III. Improvements over the initial guess range from
1.3X to 12.2X as listed in Table IV.
TABLE II
DESIGN PARAMETERS AND THEIR RANGES
(’OPT’ LISTS PARAMETER VALUES AT THE OPTIMUM FOR α = 10⁵)
Parameter     Min  Max  Opt    | Parameter            Min  Max  Opt
N(L1I)        1    4    2.95   | CinQ(L1I)            1    4    1.00
N(L1D)        1    4    1.93   | CinQ(L1D)            1    4    1.06
N(L2)         1    4    1.27   | CinQ(L2)             1    16   1.12
N(L3)         1    4    1.22   | CinQ(L3)             1    16   2.22
N(mem)        1    4    1.02   | CinQ(mem)            1    32   1.00
D(L1I)        1    4    1.70   | CoutQ(L1I)           1    4    1.06
D(L1D)        1    4    3.17   | CoutQ(L1D)           1    4    1.00
D(L2)         8    16   9.33   | CoutQ(L2)            2    16   2.00
D(L3)         16   32   21.90  | CoutQ(L3)            4    16   4.00
D(mem)        64   128  80.61  | CoutQ(mem)           4    32   4.00
L(X1)         1    4    2.18   | CinQ(X3)             1    8    1.00
CinQ(X1)      1    4    1.00   | L(X3) local          16   64   62.53
CoutQ(X1)     1    4    1.00   | L(X3) remote         32   64   55.27
L(X2)         4    8    5.41   | CoutQ(X3) local      1    16   1.00
CinQ(X2)      1    4    1.00   | CoutQ(X3) remote     1    16   1.02
CoutQ(X2)     1    4    1.02   |
• N, D, L and C are parameters as described in Section II-A.
• Components L1I, L1D, L2 and L3 are caches as depicted in
Figure 1, mem refers to a main memory bank, and X1, X2,
and X3 refer to interconnects between L1 to L2, L2 to L3
and L3 to main memory respectively.
• Links in X3 connecting a processor to its local and remote
memory modules have different L and CoutQ values to model
the NUMA effect.
TABLE III
OBJECTIVE FUNCTION VALUES AT THE OPTIMUM ACROSS EIGHT
COBYLA RUNS FOR EACH VALUE OF α
                   α = 0         α = 10⁴       α = 10⁵       α = 10⁶
best               2.916 × 10⁷   3.697 × 10⁷   5.753 × 10⁷   1.218 × 10⁸
worst              2.922 × 10⁷   4.020 × 10⁷   7.401 × 10⁷   2.836 × 10⁸
mean               2.917 × 10⁷   3.836 × 10⁷   6.305 × 10⁷   1.608 × 10⁸
relative std dev   0.08%         3.08%         8.59%         35.66%
TABLE IV
IMPROVEMENTS IN OBJECTIVE FUNCTION VALUE OVER THE INITIAL
GUESS
         α = 0   α = 10⁴   α = 10⁵   α = 10⁶
best     2.8x    2.5x      3.3x      12.2x
worst    1.4x    1.3x      2.1x      4.4x
mean     1.8x    1.7x      2.8x      8.8x
Fig. 3. Objective function values versus number of function evaluations. For
each value of the weighting factor α, eight optimization runs are performed
starting from distinct initial points.
[Panels: α = 0, α = 10⁴, α = 10⁵, α = 10⁶; horizontal axis: number of function evaluations (0 to 300); vertical axis: objective function value.]
In Figure 4, we plot cost and performance values at the
optimum for each of the optimization runs (as α and initial
points are varied). Each point in the plot represents the result
of a single optimization run with (x,y) coordinates showing
(cost, performance) values at the optimum. The plot shows
a clear knee which can be used to select the optimal system
configuration for maximum performance. Parameter values at
the best solution among all COBYLA runs (for α = 10⁵ at
the knee) are listed in Table II. The solution indicated in Table
II is interesting. For example, it indicates that the throughput
parameter N for the L1 I-cache should be approximately 3,
while that for the L1 D-cache should be approximately 2.
Further, the delay D of the L1 D-cache can be considerably
larger than that of the L1 I-cache (3.17 versus 1.70).
Since the COBYLA algorithm produces a local optimum,
we are also interested in understanding the quality of the local
optimum across random initial starting points. We compare
the quality of solutions generated by COBYLA to those
generated by an Adaptive Simulated Annealing (ASA) search
over discrete parameter space. We use a Python binding [31]
of a well-established ASA implementation [32]. In Table V,
we show the objective function values obtained after running
ASA with a limit of 1000 function evaluations. We observe
that most of the COBYLA runs produce solutions that are
close to (within 10% of) the global optimum reported by ASA,
as summarized in Table VI.
[Figure 4: scatter plot titled "cost/performance tradeoff". x-axis: cost; y-axis: performance (execution time in cycles, ×10^7); one marker series per weighting factor (α = 0, α = 10^4, α = 10^5, α = 10^6).]
Fig. 4. Performance and cost values at the optimum for multiple optimization
runs (as α and initial points are varied). Each point represents the result of
a single optimization run with (x,y) coordinates showing (cost, performance)
values at the optimum.
TABLE V
OBJECTIVE FUNCTION VALUES AT THE OPTIMUM REPORTED BY ASA

α = 0          α = 10^4       α = 10^5       α = 10^6
2.916 × 10^7   3.694 × 10^7   6.063 × 10^7   1.247 × 10^8
TABLE VI
NUMBER OF COBYLA RUNS (OUT OF 8) THAT YIELD SOLUTIONS WITHIN 10% OF THE GLOBAL OPTIMUM REPORTED BY ASA

α = 0   α = 10^4   α = 10^5   α = 10^6
8       8          6          5
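As a quick cross-check of the tables, the sketch below compares the best and worst COBYLA values from Table III against the ASA values from Table V. It assumes "within 10%" means a relative difference of at most 0.10 with respect to the ASA value; the paper does not spell out the exact criterion. Only the table entries are taken from the paper; the helper `within` is hypothetical.

```python
# Values copied from Tables III, V and VI, keyed by the weighting factor alpha.
COBYLA_BEST = {0: 2.916e7, 10**4: 3.697e7, 10**5: 5.753e7, 10**6: 1.218e8}
COBYLA_WORST = {0: 2.922e7, 10**4: 4.020e7, 10**5: 7.401e7, 10**6: 2.836e8}
ASA = {0: 2.916e7, 10**4: 3.694e7, 10**5: 6.063e7, 10**6: 1.247e8}
RUNS_WITHIN_10PCT = {0: 8, 10**4: 8, 10**5: 6, 10**6: 5}


def within(value, reference, tol=0.10):
    """Assumed criterion: relative difference to the reference is <= tol."""
    return abs(value - reference) / reference <= tol


if __name__ == "__main__":
    for alpha in ASA:
        best_ratio = COBYLA_BEST[alpha] / ASA[alpha]
        worst_ratio = COBYLA_WORST[alpha] / ASA[alpha]
        print(
            f"alpha={alpha:g}  best/ASA={best_ratio:.3f}  "
            f"worst/ASA={worst_ratio:.3f}  "
            f"worst within 10%: {within(COBYLA_WORST[alpha], ASA[alpha])}"
        )
```

Under this reading, the worst COBYLA run lies within 10% of the ASA value exactly for the two values of α where Table VI reports 8 out of 8 runs. The tables also show that for α = 10^5 and α = 10^6 the best COBYLA value is lower than the value ASA reached within its 1000-evaluation limit.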
V. CONCLUSIONS
We have described a technique by which discrete-parameter
multi-core systems can be optimized using continuous space
optimization schemes. The technique relies on
a novel ergodic interpolation scheme based on randomizing
the discrete parameter cycle-accurate simulation model of the
multi-core system. The interpolated performance function is
continuous, has low statistical error, and was observed to be
piece-wise smooth. Using this ergodic interpolation technique,
we have applied a standard continuous space optimization
algorithm to find optimal designs for a 31-parameter multipro-
cessor system exercised with a subset of the NAS benchmarks.
The optimization algorithm converged to a local optimum
within 200 function evaluations and produced substantial
improvements ranging from 1.3X to 12.2X over the initial
guess in the cases that we have tried. Cost-performance
curves can also be generated using different weightings of the
performance and cost components in the objective function.
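The idea of randomizing a discrete parameter so that a single run yields an interpolated value can be sketched on a toy model. The sketch assumes that a continuous setting x is realized by drawing either floor(x) or floor(x) + 1 at each use, with probabilities chosen so that the expected value is x. It is a minimal illustration, not the authors' simulator; `sample_discrete` and `simulate_mean_delay` are hypothetical names, and the "simulation" is just an average over per-access delays.

```python
import math
import random


def sample_discrete(x, rng):
    """Return floor(x) or floor(x) + 1 such that the expected value is x."""
    lo = math.floor(x)
    frac = x - lo  # probability of picking the upper neighbour
    return lo + 1 if rng.random() < frac else lo


def simulate_mean_delay(delay, accesses, seed=0):
    """Toy stand-in for a simulation run: the integer delay is re-drawn at
    every access, and the average delay over the whole run is returned."""
    rng = random.Random(seed)
    total = 0
    for _ in range(accesses):
        total += sample_discrete(delay, rng)
    return total / accesses


if __name__ == "__main__":
    # 1.70 and 3.17 are the L1 I-cache and D-cache delays quoted in the text.
    for d in (1.70, 3.17):
        print(d, simulate_mean_delay(d, 100_000))
```

Each access sees a legal integer delay, yet the long-run average approaches the continuous setting within one run. A real cycle-accurate model would report execution time rather than mean delay, so this shows only the randomization step.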
More work is needed to fully characterize the impact
of rounding on the quality of the results obtained,
and to apply the ergodic interpolation technique
to other discrete parameters such as cache size/associativity
and processor core issue-width/clock-frequency. However, our
preliminary investigations indicate that ergodic interpolation
based optimization can be an effective and practical approach
for the design space exploration of multi-core systems.
ACKNOWLEDGMENT
Part of this work was funded by an IBM Faculty Award.
Simulations in Section III-C were run on CDAC’s PARAM
Yuva II cluster. The authors wish to thank Prof. Virendra
Sule for granting us the use of a 48-core cluster and Prof.
Sachin Sapatnekar for his suggestions and feedback during
initial stages of this work.
REFERENCES
[1] M. Gries, “Methods for Evaluating and Covering the Design Space dur-
ing Early Design Development,” Integration, the VLSI journal, vol. 38,
no. 2, 2004.
[2] T. Givargis, J. Henkel, and F. Vahid, “Interface and Cache Power Explo-
ration for Core-based Embedded System Design,” in 1999 IEEE/ACM
International Conference on Computer-Aided Design. Digest of Techni-
cal Papers. IEEE, 1999.
[3] J. Kin, C. Lee, W. H. Mangione-Smith, and M. Potkonjak, “Power
Efficient Mediaprocessors: Design Space Exploration.”
[4] D. Sheldon, F. Vahid, and S. Lonardi, “Soft-core Processor Customiza-
tion using the Design of Experiments Paradigm,” in 2007 Design,
Automation & Test in Europe Conference & Exhibition. IEEE, 2007.
[5] J. Yi, D. Lilja, and D. Hawkins, “A Statistically Rigorous Approach
for Improving Simulation Methodology,” in The Ninth International
Symposium on High-Performance Computer Architecture, 2003. HPCA-
9 2003. Proceedings. IEEE Comput. Soc, 2003.
[6] H. Orsila, E. Salminen, M. Hannikainen, and T. D. Hamalainen,
“Evaluation of Heterogeneous Multiprocessor Architectures by Energy
and Performance Optimization,” in 2008 International Symposium on
System-on-Chip. IEEE, 2008.
[7] B. C. Schafer, “Adaptive Simulated Annealer for High Level Synthesis
Design Space Exploration,” in 2009 International Symposium on VLSI
Design, Automation and Test. IEEE, 2009.
[8] V. Srinivasan, S. Radhakrishnan, and R. Vemuri, “Hardware Software
Partitioning with Integrated Hardware Design Space Exploration,” in
Proceedings Design, Automation and Test in Europe. IEEE Comput.
Soc, 1998.
[9] M. Holzer, B. Knerr, and M. Rupp, “Design Space Exploration with
Evolutionary Multi-Objective Optimisation,” in 2007 International Sym-
posium on Industrial Embedded Systems. IEEE, 2007.
[10] M. Palesi and T. Givargis, “Multi-objective Design Space Exploration
using Genetic Algorithms,” in Proceedings of the Tenth International
Symposium on Hardware/Software Codesign. CODES 2002. ACM,
2002.
[11] A. Sengupta, R. Sedaghat, and P. Sarkar, “A multi structure genetic
algorithm for integrated design space exploration of scheduling and allo-
cation in high level synthesis for DSP kernels,” Swarm and Evolutionary
Computation, vol. 7, 2012.
[12] K. Lahiri, A. Raghunathan, and S. Dey, “Efficient Exploration of
the SoC Communication Architecture Design Space,” in IEEE/ACM
International Conference on Computer Aided Design. ICCAD - 2000.
IEEE/ACM Digest of Technical Papers. IEEE, 2000.
[13] P. Eles, Z. Peng, K. Kuchcinski, and A. Doboli, “System Level Hard-
ware/Software Partitioning Based on Simulated Annealing and Tabu
Search,” Design Automation for Embedded Systems, vol. 2, no. 1, 1997.
[14] T. Wiangtong, P. Y. K. Cheung, and W. Luk, “Comparing Three
Heuristic Search Methods for Functional Partitioning in Hardware-
Software Codesign,” Design Automation for Embedded Systems, vol. 6,
no. 4, Jul. 2002.
[15] G. Palermo, C. Silvano, and V. Zaccaria, “An Efficient Design Space
Exploration Methodology for Multiprocessor SoC Architectures based
on Response Surface Methods,” in 2008 International Conference on
Embedded Computer Systems: Architectures, Modeling, and Simulation.
IEEE, Jul. 2008.
[16] R. Piscitelli and A. D. Pimentel, “Design Space Pruning through Hybrid
Analysis in System-level Design Space Exploration,” in 2012 Design,
Automation & Test in Europe Conference & Exhibition. IEEE, Mar.
2012.
[17] E. İpek, S. A. McKee, R. Caruana, B. R. de Supinski, and M. Schulz,
“Efficiently Exploring Architectural Design Spaces via Predictive Mod-
eling,” ACM SIGARCH Computer Architecture News, vol. 34, no. 5, Oct.
2006.
[18] S. K. Koh, G. Ananthasuresh, and S. Vishveshwara, “A Deterministic
Optimization Approach to Protein Sequence Design Using Continuous
Models,” The International Journal of Robotics Research, vol. 24, no.
2-3, Feb. 2005.
[19] H. Wang and B. W. Schmeiser, “Discrete Stochastic Optimization using
Linear Interpolation,” in 2008 Winter Simulation Conference. IEEE,
Dec. 2008.
[20] M. Powell, “On Trust Region Methods for Unconstrained Minimization
without Derivatives,” Mathematical Programming, vol. 97, no. 3, 2003.
[21] D. Ziakas, A. Baum, R. A. Maddox, and R. J. Safranek, “Intel Quick-
Path Interconnect Architectural Features Supporting Scalable System
Architectures,” in 2010 18th IEEE Symposium on High Performance
Interconnects. IEEE, 2010.
[22] C. Keltcher, K. McGrath, A. Ahmed, and P. Conway, “The AMD
Opteron Processor for Multiprocessor Servers,” IEEE Micro, vol. 23,
no. 2, Mar. 2003.
[23] D. Sorin, M. Hill, and D. Wood, A Primer on Memory Consistency and
Cache Coherence. Morgan and Claypool Publishers, 2011.
[24] N. V. Karanjkar and M. P. Desai, “SiTAR: Simulation Tool for
Architectural Research,” Technical report, 2012, unpublished.
[25] T. Sauer and Y. Xu, “On Multivariate Lagrange Interpolation,”
Mathematics of Computation, vol. 64, 1994.
[26] S. Davies, “Multidimensional Triangulation and Interpolation for Rein-
forcement Learning,” 1996.
[27] J. Hammersley and D. Handscomb, Monte Carlo Methods, ser.
Methuen’s monographs on applied probability and statistics. Methuen,
1964.
[28] G. Hazari, M. P. Desai, and G. Srinivas, “Bottleneck Identification
Techniques Leading to Simplified Performance Models for Efficient
Design Space Exploration in VLSI Memory Systems,” in 2010 23rd
International Conference on VLSI Design. IEEE, 2010.
[29] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter,
L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S.
Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga,
“The NAS Parallel Benchmarks Summary and Preliminary Results,” in
Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, ser.
Supercomputing ’91. New York, NY, USA: ACM, 1991.
[30] K. Kusano, S. Satoh, and M. Sato, “Performance Evaluation of the Omni
OpenMP Compiler,” in High Performance Computing, ser. Lecture Notes
in Computer Science. Springer Berlin Heidelberg, 2000, vol. 1940.
[31] J. Robert. Python bindings for the asa code. [Online]. Available:
https://pypi.python.org/pypi/pyasa/
[32] L. Ingber, “Adaptive Simulated Annealing (ASA): Lessons Learned,”
Control and Cybernetics, vol. 25, 1996.
