Learning-based Dynamic Pinning of Parallelized Applications in Many-Core
  Systems by Chasparis, Georgios C. et al.
Learning-based Dynamic Pinning of Parallelized
Applications in Many-Core Systems
Georgios C. Chasparis Vladimir Janjic Michael Rossbory
Abstract—Motivated by the need for adaptive, secure and
responsive scheduling in a great range of computing applications,
including human-centered and time-critical applications, this
paper proposes a scheduling framework that seamlessly adds
resource-awareness to any parallel application. In particular, we
introduce a learning-based framework for dynamic placement of
parallel threads to Non-Uniform Memory Access (NUMA) ar-
chitectures. Decisions are taken independently by each thread in
a decentralized fashion that significantly reduces computational
complexity. The advantage of the proposed learning scheme is
the ability to easily incorporate any multi-objective criterion and
easily adapt to performance variations during runtime. Under the
multi-objective criterion of maximizing total completed instruc-
tions per second (i.e., both computational and memory-access
instructions), we provide analytical guarantees with respect to
the expected performance of the parallel application. We also
compare the performance of the proposed scheme with the Linux
operating system scheduler in an extensive set of applications,
including both computationally and memory intensive ones. We
have observed that performance improvement could be significant
especially under limited availability of resources and under
irregular memory-access patterns.
I. INTRODUCTION
Efficient resource allocation for multi-threaded applications
in NUMA architectures has attracted significant scientific
attention due to a) the involved complexity of the decision-
making process, and b) the need to incorporate alternative
optimization criteria that goes beyond standard maximization
of execution speed. This statement is further reinforced by
the recent advancement of tools for parallelizing complex
applications, that gave birth to non-trivial and highly advanced
parallel and data patterns [2], [3], [4], [5]. In addition, the
nature of an application (e.g., machine-learning, image pro-
cessing, control and optimization) may add additional criteria
that cannot easily be integrated into an OS scheduler. As
expected, the problem of efficiently utilizing resources, while
concurrently optimizing a multi-objective criterion, cannot be
treated by standard heuristic-based techniques.
To this end, this paper proposes and investigates the poten-
tial of a learning- or measurement-based scheduling scheme
that is part of a running application and regularly corrects/im-
proves allocation decisions given the observed application’s
performance. In particular, this paper proposes a distributed
This paper is an extension of an earlier version appeared in the conference
paper [1]. It has been supported by the European Union grant EU H2020-
ICT-2014-1 project RePhrase (No. 644235).
G. C. Chasparis and M. Rossbory are with the Software Competence Center
Hagenberg GmbH, Softwarepark 21, A-4232 Hagenberg, Austria.
V. Janjic is with the School of Computer Science, University of St Andrews,
Scotland, UK.
learning scheme specifically tailored for addressing the prob-
lem of dynamically assigning/pinning threads of a parallelized
application to the available processing units. The proposed
scheme is flexible enough to incorporate any multi-objective
optimization criterion and provides convergence guarantees
to at least suboptimal assignments. Given the fact that it
is measurement-based, it is computationally efficient with a
linear-complexity with the number of threads. Since it is iter-
ative in nature, it also exhibits minimal memory requirements.
It is worth noting that we target an online learning frame-
work where allocation decisions are taken during runtime,
and without requiring any prior application knowledge. Such
feature can make parallel applications more responsive by
reducing their execution time, especially in situations where
computing resources are shared between different applications.
This is also very important for human-centered computing,
where strict timing requirements can be of high importance,
given that they are often computationally intensive, such as
machine-learning or image processing applications. In addi-
tion, the proposed scheduling framework can seamlessly be
attached to any parallel application. These features provide an
easy-to-use and user-friendly supervisory scheduling scheme
that reduces the need for expert and application knowledge.
In our previous work [6], [7], we have proposed a
reinforcement-learning-based distributed scheduling frame-
work (PaRLSched), adapted to Uniform Memory Architec-
tures (UMA). In this paper, our goal is to provide a general-
ized methodology that also extends to Non-Uniform Memory
Architectures (NUMA). Such framework should be considered
as a supervisory scheme that acts on top of any OS scheduling
and performs either low- or high-frequency allocation correc-
tions possibly subject to alternative multi-objective criteria.
For example, when optimizing with respect to both computa-
tional and memory-access instructions completed per second,
the learning scheme should find the right balance between
computing bandwidth and memory affinities. In this paper
though, we are not concerned with memory migrations.
This paper is an extension of an earlier version appeared in
[1]. In this updated version, we provide analytical guarantees
of the performance of the learning-based scheduling frame-
work, and we have extended our experimental evaluation to
applications with memory irregularities.
The paper is organized as follows. Section II discusses
related work and contributions. Section III describes the
problem formulation and objective of the paper. Section IV
presents the main features of the proposed Dynamic Scheduler
(PaRLSched) and Section V provides analytical convergence
ar
X
iv
:1
80
3.
00
35
5v
2 
 [c
s.D
C]
  1
1 J
an
 20
20
guarantees with respect to the application’s performance.
Section VI presents a performance comparison with the
standard Linux scheduler in benchmark applications. Finally,
Section VII presents concluding remarks and future work.
II. RELATED WORK AND CONTRIBUTIONS
Prior work has demonstrated the importance of thread-to-core
bindings in the overall performance of a parallelized appli-
cation [8]. The task of discovering such optimal bindings is
rather complex, given the structure of NUMA architectures [9].
This task becomes even harder given the need for developing
tools that can easily generalize to any architecture and they
are application independent.
For example, reference [10] describes a tool that checks the
performance of each of the available thread-to-core bindings
and searches for an optimal placement. Unfortunately, the
exhaustive-search type of optimization that is implemented
may prohibit runtime implementation. Reference [11] com-
bines the problem of thread scheduling with scheduling hints
related to thread-memory affinity issues. A similar scheduling
policy is also implemented by [12].
At the same time, given that no prior knowledge of the
application’s details is available, a centralized optimization
formulation is prohibitive. Such design restrictions give rise
to learning-based techniques, where scheduling decisions are
taken based only on performance measurements. This need for
learning from data has been recognized in [13], where a ma-
chine learning based mechanism is designed for transactional
applications. In this case, each instance of the application has
to be run and profiled before any learning process is to be
implemented.
Even such learning processes could be computationally
complex given the quite large search space. For this rea-
son, distributed or game-theoretic optimizations have been
attempted in the past for related problems, including coop-
erative game formulation for allocating bandwidth in grid
computing [14], the non-cooperative game formulation in the
problem of medium access protocols in communications [15]
or for allocating resources in cloud computing [16]. These
approaches can significantly reduce the involved computa-
tional complexity and also allow for the development of online
selection rules based on performance measurements. However,
such modeling techniques have not yet been implemented in
the context of pinning of parallelized applications.
Recognizing this need for both learning- and distributed-
based optimization, and contrary to the aforementioned ref-
erences on pinning of parallelized applications, our earlier
work [6], [7] proposed a scheduling scheme for optimally
allocating threads of a parallelized application that com-
bines both a learning- and a distributed-based optimization.
It requires a minimum information exchange, where only
measurements collected from each running thread are needed.
Furthermore, it is flexible enough to accommodate alternative
optimization criteria depending on the available performance
counters. However, one potential drawback was the fact that no
special consideration was taken upon the possible non-uniform
memory access (NUMA) architectures, as it did not distinguish
between moving a thread to a “local” (within the same NUMA
node) and “remote” (from a different NUMA node) core.
This paper extends the scheduling framework of our previ-
ous work [6], [7] with respect to the following contributions:
(C1) We propose a novel two-level scheduling process that
is appropriate for NUMA architectures. At the higher
level, the scheduler decides on which NUMA node each
thread should be assigned, while at the lower level it
decides on which CPU core (within that NUMA node)
to execute the thread.
(C2) We provide analytical convergence guarantees with re-
spect to the resulting performance of the application in
comparison to the optimal performance.
(C3) We demonstrate the efficiency of the proposed approach
on several benchmark applications with different charac-
teristics, including computational- and memory-intensive
applications.
This paper is also an extension of an earlier version appeared
in [1] with respect to contributions (C2) and (C3).
III. PROBLEM FORMULATION AND OBJECTIVE
Let a parallel application comprise n threads, I =
{1, 2, ..., n}. We denote the assignment of a thread i to a set
of available NUMA nodes JNUMA by αi ∈ JNUMA. Within
the selected NUMA node αi, thread i should be assigned
to one of the available CPU cores JCPU(αi), denoted by
βi ∈ JCPU(αi). Let also α = {(αi, βi) , i ∈ I} denote the
overall assignment profile, and let A be the set of all profiles.
The Resource Manager (RM) periodically checks the perfor-
mance of a thread and makes decisions about its assignment
for the next scheduling iteration. For the remainder of the
paper, we will assume that: a) The internal properties and
details of the threads are not known to the RM. Instead, the
RM may only have access to measurements related to their
performances; b) Threads may not be idled or postponed by
the RM. Instead, the goal of the RM is to assign the currently
available resources to the currently running threads (work-
conserving).
1) Static optimization and issues: A possible centralized
objective that we may consider could be to maximize the
average processing speed over all threads, i.e.,
max
α∈A
f(α,w)
.
=
∑n
i=1 ui(α,w)/n, (1)
where, for example, ui may represent the processing speed of
thread i under assignment α ∈ A. In general, ui will depend
on the assignment profile α and exogenous disturbances (e.g.,
other applications) summarized within the parameter w. Any
solution to the optimization problem (1) will correspond to an
efficient/optimal assignment. However, there are two practical
issues when posing an optimization problem in this form,
namely a) the details of the function ui(α,w) are unknown
and it may only be evaluated through measurements, denoted
by u˜i; and, b) w is also unknown and may vary with time.
2) Measurement- or learning-based optimization: We wish
to address a static optimization objective of the form (1)
through a measurement- or learning-based methodology. That
is, the RM reacts to measurements of f(α,w), periodically
collected at time instances k = 1, 2, ... and denoted by
f˜(k). The measured objective may take on the form f˜(k) .=∑n
i=1 u˜i(k)/n. Given these measurements and the current
assignment α(k) of resources, the RM will select the next
assignment of resources α(k+1), so that the measured objec-
tive approaches the true optimum of the unknown performance
function f(α,w).
3) Multi-agent formulation: We further distribute the
decision-making process into a thread-based optimization,
where the RM makes decisions independently for each thread.
Equivalently, we may assume that each thread makes its own
independent decisions as in multi-agent formulations. Such
distribution reduces the complexity of the decision-making
process, since each thread has a reduced number of choices as
compared to the number of choices of the group of threads.
Furthermore, it increases robustness, since any performance
degradation noticed in a group of threads can immediately be
treated by the affected threads, thus avoiding the complexity
of centrally designed assignment corrections.
4) Multi-level decision-making and actuation: Recent work
by the authors [6], [7] has demonstrated the potential of
learning-based optimization in UMA architectures. However,
when an application runs on a NUMA architecture, additional
information can be exploited to enhance scheduling of a
parallelized application. To this end, a multi-level decision-
making and actuation process is considered. We extend the
PaRLSched dynamic scheduler of [6], [7] by introducing two
nested decision processes depicted in Figure 1. At the higher
level (Level 1), the performance of a thread is evaluated with
respect to its own prior history of performances, and decisions
are taken with respect to its NUMA placement. At the lower
level (Level 2), the performance of a thread is evaluated
with respect to its own prior history of performances, and
decisions are taken with respect to its CPU placement (within
the selected NUMA node).
IV. DYNAMIC SCHEDULER
Each one of the two levels of the decision process will take
place at different frequencies and based on different reasoning.
In particular, NUMA-node switching may be costly, especially
when performed with high frequency due primarily to memory
affinities, while CPU-node switching within the same NUMA
node may be costless (with respect to its impact to the
processing speed). For this reason, we have introduced two
measurement-based learning algorithms specifically tailored to
accommodate these different needs (Figure 1):
− (Level 1) Aspiration learning for NUMA-node switch-
ing, that responds only to significant performance vari-
ations and does not require frequent migrations.
− (Level 2) Perturbed learning automata for CPU-core
pinning within a given NUMA node, that allows fre-
quent CPU-core switches.
Fig. 1. Two-level scheduling where the RM decides firstly the NUMA node
and secondly the CPU core at which each thread should be pinned on.
We introduce periodic time instances with period TCPU > 0,
and indexed by k = 1, 2, ..., at which decisions at Level 2
(CPU-core pinning) are revised. Decisions at Level 1 (NUMA-
node switching) are performed less frequently, at periodic time
instances of period TNUMA  TCPU, which will be indexed
by τ = 1, 2, ....
A. Utility Function
A cornerstone in the design of any such multi-agent formu-
lation is the preference criterion or utility function ui for each
thread i ∈ A. The utility function captures the benefit of a
decision maker (thread) resulting from the assignment profile
α, i.e., it represents a function of the form ui : A → R+
(where we restrict it to be a positive number). The action
profile (i.e., the selections of all threads) constitutes a “state”
of the environment that directly determines the performances
of all threads. We are interested in building learning-based
reflex agents that respond only to current measurements in an
effort to “eventually” learn to play efficient assignments.
It is important to note that the utility function ui of each
agent/thread i is subject to design and it is introduced in
order to guide the preferences of each agent. Thus, ui may
not necessarily correspond to a measured quantity, but it
could be a function of available performance counters. For
example, a natural choice for the utility of each thread is its
own execution speed, which can be measured by the number
of executed instructions per unit of time. This may also be
combined with other counters, e.g., the number of memory-
access instructions, the number of cache misses, etc., to give
a better representation of the performance of a thread.
B. Aspiration learning for NUMA-node switching
We developed a novel learning scheme for NUMA-node
switching that is based upon the notions of benchmark ac-
tions/performances and bears similarities with the so-called
aspiration learning [17]. The novelty here lies in the in-
troduction of two benchmark levels in order to handle the
possibility of noisy measurements. Such type of learning
dynamics tries to gradually reach assignment profiles where all
threads perform well. They have the advantage that exploration
(of new assignments) can be performed selectively (e.g., when
a significant reduction in performance is observed). In this
way, a low-frequency NUMA-node switching can be attained.
The specific steps are depicted in Table I.
It is important to note that this learning scheme will react
immediately to a rapid drop in the performance. In particular,
when the performance drops below the lower benchmark, then
with high probability the action will change, while in any
other case, the action will change with a small probability
ζ > 0. The reason for maintaining both an upper and lower
benchmark is in order to minimize the effect of noise in the
decision-making process.
When the thread needs to select a new NUMA node, it
will select among the set of better replies, i.e., nodes at which
other threads perform better so far. Note that a thread may not
have a-priori knowledge of the exact impact an action switch
has on his own utility (until this action switch is performed).
However, we may use prior data of the performances of other
threads, as defined in BRNUMA,i(α). Thus, at step (4a), we
may direct threads that currently do not perform well to the
NUMA nodes where threads perform better.
C. Perturbed Learning Automata for CPU-core pinning
Let us assume that, at Level 1, and for each one of the
running threads i ∈ I, the RM has already selected a NUMA
node αi ∈ JNUMA. Then, at Level 2, the RM needs to decide
which CPU core each thread should be pinned to. Given that
CPU-core switching within the same NUMA node is usually
costless, we have designed a learning algorithm that allows
frequent switching and therefore a faster convergence rate.
To this end, we employ perturbed learning automata [18]
developed by the authors. Such dynamics perform well in the
presence of noise contrary to alternative schemes, as discussed
in [18], and can guarantee convergence to at least locally
optimal assignments.
The basic idea behind learning automata is rather simple.
Each agent i keeps track of a strategy vector that holds its
estimates over the best choice. We denote this strategy by
σi = [σij ]j , where j ∈ JCPU(αi), σij ≥ 0 and
∑
j σij = 1.
To provide an example, consider the case of 3 available CPU
cores, i.e., JCPU(αi) = {1, 2, 3}. In this case, a vector of the
form σi = (0.2, 0.5, 0.3) is a strategy vector, such that 20%
corresponds to the probability of assigning itself to CPU core
1, 50% to CPU core 2 and 30% to CPU core 3. Briefly, the
CPU core selection will be denoted by βi ∈ JCPU(αi). Note
that if σi is a unit vector, say ej , then agent i selects its jth
action with probability one.
In particular, the steps executed in each iteration of the
perturbed learning automata are depicted in Table II. Accord-
ing to this recursion, if currently thread i selected CPU core
βi(k), and measured performance βi(k), then its strategy is
TABLE I
ASPIRATION LEARNING FOR NUMA-NODE SWITCHING
At fixed periodic time instances denoted by τ = 1, 2, ..., with period
TNUMA sec, the following steps are executed recursively for each thread i
in parallel.
(1) Performance measurement. For the currently selected NUMA-node
αi(τ) thread i retrieves its current performance measurement, u˜i(τ).
(2) Aspiration-level update. Given the current performance measurement
u˜i(τ), update the discounted running average performance of the thread, as
follows:
ρi(τ + 1) = ρi(τ) + ν · [u˜i(τ)− ρi(τ)], (2)
where u˜i(τ) is the current measurement of the utility of thread i.
(3) Benchmarks update. Define the upper benchmark performance, bi(τ),
as a performance threshold over which a performance is considered satis-
factory, and the lower benchmark performance, bi(τ), as a performance
threshold under which a performance is considered unsatisfactory, with
bi(τ) < bi(τ). They are updated as follows:
− if ρi(τ + 1) ≥ bi(τ), then
bi(τ + 1) = ρi(τ + 1)
bi(τ + 1) = ρi(τ + 1)/η
− if bi(τ) ≤ ρi(τ + 1) < bi(τ), then
bi(τ + 1) = bi(τ)
bi(τ + 1) = bi(τ)
− if ρi(τ + 1) < bi(τ), then
bi(τ + 1) = η · ρi(τ + 1)
bi(τ + 1) = ρi(τ + 1)
for some constant η > 1.
(4) Action update. A thread i selects actions according to the following
rule:
a) if ρi(τ +1) < bi(τ), i.e., if the updated discounted running average
performance is unsatisfactory, then thread i will perform a random
switch to a better reply, i.e.,
αi(τ + 1) ∈ randunif
[
BRNUMA,i(α)
]
,
where BRNUMA,i(α) denotes the better-reply of thread i to the
assignment α, defined as
BRNUMA,i(α)
.
={
α′i ∈ JNUMA : ρi(τ) < γ
∑
{j∈I:αj(τ)=α′i} ρj(τ)∣∣{j ∈ I : αj(τ) = α′i}∣∣
}
(3)
for some γ ∈ (0, 1). The set {j ∈ I : αj(τ − 1) = α′i} includes all
those threads that selected action α′i in the previous time instance.
In other words, an action α′i ∈ BRNUMA,i(α) if the average of the
threads selecting α′i did better on average than thread i.
If more than one thread has chosen to migrate, then only one thread
(selected at random) is allowed to execute this migration.
b) if ρi(τ +1) ≥ bi(τ) , then each thread i will keep playing the same
action with high probability and experiment with any other action
with a small probability ζ > 0, i.e.,
αi(τ + 1) =
{
αi(τ), w.p. 1− ζ
randunif [BRNUMA,i(α)], w.p. ζ
(4)
If more than one thread has chosen to migrate, then only one thread
(selected at random) is allowed to execute this migration.
going to increase in the direction of the selected action and
proportionally to the observed performance. Informally, the
dynamics reinforce repeated selection and reinforcement is
always proportional to the received reward.
TABLE II
PERTURBED LEARNING AUTOMATA FOR CPU-CORE PINNING
At fixed time instances denoted by k = 1, 2, ..., the following steps are
executed recursively for each thread i in parallel.
(1) Performance measurement. For the currently selected CPU-core βi(k)
thread i retrieves its current performance measurement, u˜i(k).
(2) Strategy update. Given that αi is the current NUMA-node assignment
of thread i, and |JCPU(αi)| is the number of the available CPU cores, the
strategy of thread i with respect to its CPU-core pinning is defined as:
σi(k) = (1− λ)xi(k)− λ|JCPU(αi)|
(5)
where λ > 0 corresponds to a perturbation term (or mutation) and xi(k)
corresponds to the nominal strategy of agent i. The nominal strategy is
updated according to the following update recursion:
xi(k + 1) = xi(k) +  · u˜i(k) · [eβi(k) − xi(k)] (6)
for some constant step-size  > 0.
(3) Action update. The action of each thread i is updated as follows:
βi(k + 1) = randσi [JCPU(αi)] .
V. CONVERGENCE ANALYSIS
The problem of optimally allocating threads into CPU cores
can be formulated as a load-balancing game. Such formulation
can help us provide immediate answer with respect to whether
optimal allocations exist as well as the characteristics of these
allocations. The notion of weak-acyclicity [19] in strategic-
form games can help us provide an answer to these questions.
In the context of load-balancing games, we are given a
set of tasks (or computing threads) that need to be executed
in a multi-core computing system (comprising multiple CPU
cores). An objective may correspond to the minimization of
the makespan, that is the maximum load over all the available
CPU cores. In this case, the computing load of a CPU core
corresponds to the total computing bandwidth requested by all
threads assigned to this core, that is the frequency with which
the CPU core is reserved by all threads.
More formally, there exist m CPU cores with speeds
s1, s2, ..., sm and n threads with weights w1, w2, ..., wm,
where the weight of a thread i characterizes its operation/ser-
vice level (e.g., the computing bandwidth requested). The
speed sj of CPU core j will be defined as the maximum
number of instructions per sec (IPS) that can be executed by
the CPU core. Moreover, the weight wi of a thread i will be
measured by the number of instructions per second that this
thread will require within a unit of available bandwidth.
The speed sj of machine j may not necessarily be known
in advance (usually average over many different types of
threads). Also, the weight wi may also not be available, while
it may change throughout the execution time of a thread. For
now, let us assume that these quantities are constant, but not
necessarily known. As we will see, the explicit knowledge of
these quantities will not be necessary.
We can analyze the problem of allocating threads into CPU
cores within the context of strategic-form games. In strategic-
form games, there exists a set of players/agents I .= {1, ..., n},
which in this case to be the set of threads requesting resources,
T1 T2 Tn· · ·
CPU 1 CPU 2 CPU 3 · · · CPU m
α1 α2 αn
Fig. 2. A sketch of a load-balancing allocation problem in the context of a
multi-core computing system. Each running thread independently pins itself
to a single CPU core. Multiple threads may run on the same CPU core.
and JCPU .= {1, ...,m} to be the set of machines or CPU
cores available. In this setting, each thread may be thought
of as an independent player that can decide independently
with respect to which one of the available cores to run on.
In this context, βi ∈ J corresponds to the action of thread
i, which may be any one of the available cores JCPU, and
β
.
= (β1, ..., βn) corresponds to the action profile over all
threads (or assignment).
This definition of actions naturally fit to the setup of
Perturbed Learning Automata for CPU-core pinning of Sec-
tion IV-C, where each thread i regularly updates its selection
βi so that threads gradually learn the optimal allocation. Can
threads, however, learn to play an optimal allocation? In order
to answer this question, we need to have a closer look on the
structure and properties of their interaction. Such investigation
can be performed in the context of strategic-form games and
it will be described in the following section.
A. Weak-acyclicity and optimal CPU-core pinning
As it is the case in standard operating systems, each thread
may run in either one of the available CPU cores under
no constraints, e.g., all threads may run on the same core.
However, the number of threads running on the same CPU
core influences the speed with which these threads will be
executed (a high number of threads on the same CPU core
will lead to a low processing speed for these threads and vice
versa). In particular, the load of a CPU core j ∈ J under
assignment β will be defined as
`j(β)
.
=
∑
{k∈I:βk=j} wk
sj
> 0. (7)
We will also denote the maximum load under profile β as
L(β)
.
= maxj∈JCPU `j(βj). In other words, L(β) corresponds
to the makespan, cf., [20, Chapter 20].
Although the speed sj of CPU core j and the weight wi
of thread i may not be known in advance, the actual running
speed of a thread on a given core can be measured in real-
time quite accurately (that is the total number of completed
instructions per sec which may include computational or
memory related instructions).
We define the utility of thread i as the number of instructions
completed per sec on core j, which can be expressed as
follows:
ui(βi = j, β−i)
.
=
wi∑
{k∈I:βk=j} wk
sj =
wi
`j(β)
, (8)
where we have assumed that the operating system allocates
fairly the available bandwidth in CPU core j over all threads
and proportionally to their weights. It is important to note that
wi and `j(β) may not be known or easily measured. However,
the utility ui can directly be measured on regular time intervals
and per thread. Thus, it can directly be integrated into the
implementation of the algorithms in Tables I–II. This design
is motivated by the measurement-based optimization approach
for resource allocation introduced in [21]. It also introduces a
slightly different design than the classical treatment of load-
balancing games (see, e.g., [20]), where the cost function of
a thread is defined as the load of the core.
The strategic-form game, characterized by the tuple
〈I,A, {ui}i〉 will be referred to as a load-balancing game.
We are specifically interested in allocations that correspond
to (pure) Nash equilibria, that is allocations β∗ at which no
thread would have the incentive to switch to a different CPU
core. In particular, an allocation β∗ is a Nash equilibrium if
ui(β
′
i, β
∗
−i) ≤ ui(β∗i , β∗−i) for all β′i 6= β∗i .
Let us denote the set of Nash-equilibrium allocations by
BNE. Moreover, let us define the set B∗ of optimal allocations
as
B∗ .= {∀β ∈ B : L(β∗) ≤ L(β)} . (9)
In other words, the set of optimal assignments minimizes the
makespan. Let also denote L∗, the minimum makespan that
can be achieved at the optimal assignments.
Proposition 5.1 (Existence of Nash equilibria): Consider the
load-balancing game characterized by the tuple 〈I,A, {ui}i〉
with a utility function defined by (8). Then, the set of pure
Nash equilibria is non-empty, i.e., BNE 6= ∅.
Proof. Let us consider any allocation profile β which is not
a pure Nash equilibrium. In other words, there exists a thread
i and two available CPU cores j and l, such that, switching
from core j to core l strictly increases the utility of thread i
(i.e., its processing speed). In particular, given that:
ui(βi = j, β−i)− ui(β′i = l, β−i) = wi
`l(β
′)− `j(β)
`l(β′)`j(β)
(10)
we conclude that, if ui(β′) > ui(β) (i.e., β′ is a better reply
to β) then `j(β) > `l(β′). In other words, if thread i strictly
improves its speed by switching from core j to core l, it
implies that the load of core j (when i runs on core j) is
strictly larger than the load of core l (when i runs on core
l). Thus, we conclude that L(β′) ≤ L(β), i.e., under any
better reply, the makespan reduces or remains the same.
Furthermore, the number of threads that have a load which is
equal or higher than `j(β) has now been strictly decreased.
We conclude that this process may only terminate at a state
than no thread can improve its speed any further, i.e., at a
Nash equilibrium. 
The importance of this proposition lies on the fact that there
exists a set of Nash equilibria at which all threads perform
well at least locally. Note that the set of Nash equilibria may
not necessarily coincide with the set of optimal allocations
B∗. In fact, the set of optimal allocations may or may not be
part of the set of Nash equilibria. However, certain guarantees
can be established with respect to the utility achieved at the
worst Nash equilibrium as compared to the utility received at
an optimal allocation. The following proposition provides a
lower bound on the performance of any Nash equilibrium as
compared to the performance of an optimal assignment. We
only investigate the case of identical CPU cores, since this
condition is satisfied by our experimental setup.
Proposition 5.2 (Performance of Nash equilibria): For the
case of identical CPU cores and for any pure Nash equilibrium
assignment β ∈ BNE, the makespan satisfies
L(β) ≤ 2 |JCPU||JCPU|+ 1 · L
∗ (11)
where |JCPU| denotes the number of available CPU cores.
Furthermore, the utility of any thread i ∈ I at any pure Nash
equilibrium assignment β ∈ BNE satisfies
ui(β) ≥ (|JCPU|+ 1)
2 |JCPU| ·
wi
L∗
. (12)
Proof. The proof of the first statement (11) follows the exact
same reasoning with Theorem 20.5 in [20]. The proof of the
second statement (12) follows directly from the definition of
the utility (8) and the first statement (11). In particular, let us
consider any thread i with weight wi. Its speed will satisfy:
ui ≥ wi
L(β)
≥ (|JCPU|+ 1)
2 |JCPU| ·
wi
L∗
.
which concludes the proof. 
The above proposition provides a lower-bound in the utility
that can be achieved at a Nash equilibrium assignment. In
particular, note that the ratio u∗i
.
= wi/L∗ corresponds to
the least maximum speed that a thread can achieve under
an optimal assignment. Thus, in a 10 CPU-core architecture,
condition (12) implies that ui(β) ≥ 11/20u∗i . Such lower
bound is a bit conservative, however it provides a significant
guarantee.
From Equation (12), we may also conclude that:
1
|JCPU|
∑
i∈I
ui ≥ (|JCPU|+ 1)
2 |JCPU| ·
(
1
|JCPU|
∑
i∈I
wi
L∗
)
,
which also establishes a similar lower bound with respect to
our original (desirable) objective of maximizing the average
speed over all threads.
We conclude that if threads settle on a Nash equilibrium
assignment, then there is a certain guarantee with respect to
their average running speed.
B. Convergence analysis of CPU-core pinning
The previous section discussed existence and properties
of assignments that are Nash equilibria of the load balanc-
ing game of the CPU-core assignment problem. Given the
properties of Proposition 5.2, Nash-equilibrium assignments
should be desirable, since they provide certain guarantees
with respect to the overall performance. However, can the
dynamics presented in Section IV of Tables I–II guarantee
convergence to the set of Nash-equilibrium assignments? This
is the question we try to answer in this section.
First, we will investigate the convergence properties of the
dynamics of Table II under the condition of a single NUMA-
node availability. In other words, threads do not have the
opportunity to migrate, and they can only increase their utility
by improving their pinning assignment to the available CPU
cores. The following proposition provides strong guarantees
with respect to the convergence of the dynamics for CPU-core
pinning of Table II.
Proposition 5.3 (Convergence of CPU-pinning): Consider
the update recursion of Table II. The fraction of time that
the discrete-time dynamics spends in an arbitrarily small
neighborhood of the set of pure Nash equilibria goes to one
as the perturbation factor λ ↓ 0, the step-size  ↓ 0 and the
time index k →∞.
Proof. Theorem 3.1 in [18] has shown that as the perturbation
factor λ ↓ 0, the induced Markov chain of the dynamics
of Table II has an invariant probability measure whose
support lies on the pure strategy states (i.e., states at which
for all i, xi assigns probability one to some action). By
Birkhoff’s individual ergodic theorem [22, Theorem 2.3.4],
this implies that the process will spend an arbitrarily
large portion of time on pure-strategy states as λ ↓ 0 and
k → ∞. Furthermore, according to [23, Proposition 3.6],
λ-perturbations of pure Nash equilibria are the unique limit
points of the continuous-time approximation of the dynamics
(6). Thus, according to a straightforward implementation of
[24, Theorem 8.2.1], the fraction of time that the discrete-
time dynamics (6) spends in a small neighborhood of the
set of pure Nash equilibria goes to one as  ↓ 0 and k →∞. 
C. Discussion on combined NUMA and CPU placements
The main motivation for decomposing the decision making
process into NUMA-placement and CPU-pinning in Tables I–
II, respectively, lies on the principle of the two time-scale
dynamics. In particular, the NUMA placement algorithm of
Table I operates at a slow time-scale with a period of TNUMA,
while the CPU-pinning of Table II operates at a faster time-
scale with a period TCPU  TNUMA. The goal is to allow
the dynamics of CPU-pinning to first approach a Nash-
equilibrium assignment (given the convergence guarantees of
Proposition 5.3), before any thread considers migrating to a
different NUMA node. Such design principle also restricts
frequent NUMA-node migrations, since they may be rather
costly (taking into account possible implications to memory
access).
When we select TNUMA/TCPU to be sufficiently large, then
the CPU-core pinning dynamics have already settled in the
set of pure Nash equilibria (according to Proposition 5.3)
before revising the migration of threads to different nodes.
There are two possibilities that a thread decides to migrate.
Under the first condition (4a) of Table 2, thread i is unsatisfied
under the current assignment, and randomly selects among
alternative NUMA nodes where currently threads perform
better on average. By appropriately selecting sufficiently small
γ ∈ (0, 1) in the implementation of the better-reply condition
(3), a migration to a new NUMA node will only result
in an increased processing speed for a thread. This is also
guaranteed by the fact that only one thread is allowed to
migrate at a given time. Under the second condition (4b) of
Table 2, there always exists a small probability ζ > 0 that a
(neither satisfied nor unsatisfied) thread is selected to migrate
at random and given that there are alternative nodes that can
offer a better performance. Thus, under either condition, and
for sufficiently large TNUMA/TCPU, we should expect that
threads may only increase their performance by migrating.
VI. EXPERIMENTS
In this section, we present an experimental study
of the proposed framework. Experiments were conducted
on 20×Intel c©Xeon c©CPU E5-2650 v3 2.30 GHz
running Linux Kernel 64bit 3.13.0-43-generic. The cores are
divided into two NUMA nodes (Node 1: 0-9 CPU cores, Node
2: 10-19 CPU cores).
In all experiments, the utility of each thread is defined as
the total instructions completed per second which incorporates
both the computational and memory-access instructions. This
is a multi-objective criterion and it is expected that the larger
the number of instructions completed, the larger the processing
speed of a thread. We compared the overall performance of
the application (in terms of processing speed of threads and
completion time of an application) with that of the Linux OS
scheduler. We considered a number of parallel applications
under different levels of resource availability (i.e., number of
CPU cores available for the applications) and background-load
settings (i.e., number of threads of other applications running
on the available cores at the same time).
A. Benchmark applications
In particular, we have considered the following benchmark
applications:
− Swaptions (SWA), that uses the Heath-Jarrow-Morton
(HJM) framework to price a portfolio of swaptions. The
HJM framework describes how interest rates evolve for
risk management and asset liability management [25].
The application employs Monte-Carlo simulation to
compute the prices. It is regular in terms of task sizes,
with a low degree of communication between different
threads. It was taken from the Parsec benchmark suite.
− Blackscholes (BLA), that calculates, using differential
equations, how the value of an option changes as the
TABLE III
COMPUTATIONAL/MEMORY INTENSITY OF CASE STUDIES (TOT INS =
TOTAL INSTRUCTIONS, LST INS = LOAD/STORE INSTRUCTIONS, TLB DM
= DATA TRANSLATIONS)
Index BLA SWA ACO CSO
TOT INS / LST INS O(10+7) O(10+6) O(10+5) O(10+2)
TLB DM / LST INS O(10−7) O(10−6) O(10−5) O(10−2)
price of the underlying asset changes; parallel implemen-
tation calculates values for a number of options at the
same time, assigning a thread to each option (or a group
of options). If the options are equally divided between
threads, this results in a regular (in terms of task sizes)
parallel application. In practice, similar calculations are
used by financial houses to price 10-100 thousands of
options. This is computationally intensive application
as depicted in Table III. It was taken from the Parsec
benchmark suite.
− Ant Colony Optimization (ACO) [26] is a metaheuristic
used for solving NP-hard combinatorial optimization
problems. In this paper, we apply ACO to the Single
Machine Total Weighted Tardiness Problem (SMTWTP).
Briefly, this is a scheduling problem of jobs that are
characterized by varying processing times, deadlines
and weights. The objective is to find the schedule that
minimizes the total tardiness. A detailed description of
this use case is provided in [6]. This is computationally
intensive application as depicted in Table III.
− Stochastic-Local-Search for Cutting-Stock Industrial
Optimization (CSO) that optimizes classical bin-packing
and cutting-stock optimization problems using an evo-
lutionary stochastic-local-search (SLS) algorithm. The
use case and the type of parallelization (which is based
on the Fast-Flow parallelization library [27]) has been
described in detail in [28]. In particular, we used the
Scholl 1–3 datasets for classical bin packing problems
provided in [29]. According to the implemented SLS
algorithm, an initial number of candidate solutions (pool)
of a bin-packing/cutting-stock problem, are processed
continuously through a series of heuristic based oper-
ations/modifications (optimization cycle). In each such
cycle, multiple threads are assigned a portion of the
candidate solutions. Since the application usually runs
for a fixed time, the total number of candidate solutions
processed in all optimization cycles completed consti-
tutes an indication of the average processing speed.
This is a memory intensive application as depicted in
Table III, while the computation bandwidth requested
varies significantly with time.
B. Experimental setup
The period of the CPU pinning is fixed to TCPU = 0.05
sec, which is also the interval in which the RM collects
measurements of the total instructions completed per sec
(using the PAPI library [30]) for each one of the threads
TABLE IV
ALGORITHM SETTINGS
Parameter Value
 0.01/108
λ 0.02
TCPU 0.05 sec
ν 0.01
ζ 0.02
γ 0.9
η 0.8
TNUMA 2 sec
separately. In other words, the utility ui of thread i corresponds
to the total instructions completed per sec for thread i.
Pinning of threads to CPU cores is achieved through the
sched.h library. In all experiments, the RM is executed by
the master thread of an application, which is always running
in a fixed CPU core (usually the first available CPU core of
the first NUMA node).
In Table V, we provide an overview of the conducted
experiments. We classify the experiments with respect to the
resource availability and the CPU availability. We classify the
resource availability as small (around 4 application threads
per CPU core), medium (2 threads per CPU core) and high
(1 thread per CPU core). We classify the CPU availability
as uniform, when no background applications are running
and therefore all CPU cores are fully available to the tested
application, non-uniform where 8 threads of a background
application are running on the first 4 CPU cores of the
machine for the whole duration of the running of the tested
application and time-varying, where initially the availability
varies continuously with time in the first 4 CPU cores of the
machine.
Our goal is to investigate the performance of the scheduler
under different set of available resources, and how the dynamic
scheduler adapts to background load.
TABLE V
CLASSIFICATION OF THE EXPERIMENTS.
Exp. Resource availability CPU availability
A.1 Small Uniform
A.2 Small Non-uniform
A.3 Small Time-varying
B.1 Medium Uniform
B.2 Medium Non-uniform
B.3 Medium Time-varying
C.1 Large Uniform
C.2 Large Non-uniform
C.3 Large Time-varying
C. Experimental Results
Tables VI–IX show the execution times of the four chosen
applications under OS and PaRLSched scheduler and under
the experimental scenarios of Table V. Below, we analyze each
application separately.
TABLE VI
COMPLETION TIMES OF OS AND PaRLSched SCHEDULING FOR
SWAPTIONS APPLICATION. WE SHOW THE MEAN EXECUTION TIME OF
THE APPLICATION, THE DEVIATION AND IMPROVEMENT IN EXECUTION
TIME OF PaRLSched OVER OS SCHEDULING
Exp/
Resources
OS PaRLSched Diff. (%)Mean Dev Mean Dev
SWA (A.1) 225.58 1.28 225.27 1.41 +0.13
SWA (A.2) 385.75 17.00 344.53 3.38 +10.69
SWA (A.3) 337.46 14.62 311.17 2.98 +7.79
SWA (B.1) 163.40 0.56 158.10 2.20 +3.25
SWA (B.2) 289.31 5.93 285.68 5.28 +1.26
SWA (B.3) 240.81 5.22 238.05 4.89 +1.15
SWA (C.1) 122.54 0.79 129.85 3.25 −5.96
SWA (C.2) 206.68 1.94 202.85 2.83 +1.85
SWA (C.3) 164.11 1.49 161.54 3.19 +1.57
a) SWA: We observe that the PaRLSched scheduler
exhibits better behavior than the OS under small and medium
availability of resources (i.e., categories A and B) with or with-
out background interference. The improvement varies between
0.13% and 10.69%. In case of large availability of resources
(i.e., category C), the OS outperforms the PaRLSched but
only in the case where there is no background interference.
Note also that the percentages of the deviations are signifi-
cantly smaller than the corresponding performance differences
(except for the A.1 case), thus we may not attribute these
improvements to noise.
b) ACO: In this set of experiments, we see a similar
behavior to the SWA experiments. The PaRLSched outper-
forms the OS in the case of small and medium availability
of resources and in the presence of background interference
(i.e., categories A.2–A.3 and B.2–B.3). The improvement
may reach up to 16.92%. In the absence of any background
interference, the behavior under small availability of resources
(i.e., category A.1) is about equivalent, while in the remaining
categories the OS outperforms the PaRLSched scheduler.
As a side note, we should mention that even under scenarios
where the OS outperformed PaRLSched, such as scenario C.3,
the average speed over all threads is not necessarily smaller,
as Figure 3 demonstrates. In other words, the PaRLSched
does indeed achieve a good level of the average processing
speed, which agrees with its design criterion, but apparently
completion time is not only a matter of average speed. For
example, a large average speed over all threads does not
necessarily guarantee that all threads are running with identical
speeds. Instead, there might be significant differences in the
speeds between threads, which may have an impact on the
overall completion time.
c) BLA: The performance under the Blackscholes ap-
plication is not deviating significantly in comparison with
the conclusions of ACO and SWA applications. In fact, we
observe a constantly better performance of the PaRLSched in
conditions of small resource availability which may reach up
to 4.05% improvement. On the other hand, the performance
under large resource availability has been up to -8.89% worse
than the OS performance.
0 200 400 600 800 1,000 1,200 1,400 1,600 1,800
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(A
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 200 400 600 800 1,000 1,200 1,400 1,600 1,800
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(B
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
0 200 400 600 800 1,000 1,200 1,400 1,600 1,800
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(C
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
Fig. 3. Sample responses for Experiments of category 3 (i.e., under time-
varying CPU availability. The running average speed is measured in (108· #
instructions/sec/thread).
d) CSO: The CSO application is a bit different than the
ones previously considered. It is characterized by scattered
memory pages as Table III reflects. In general, we observe
significant advantage of the PaRLSched scheduler under cat-
egories A and C of resource availability, and a reduced
performance in the case of category B (medium availability).
The rather inconclusive behavior should be attributed to the
irregular memory accesses of the application and the long
idle times of the threads. This large variation in the requested
bandwidth is also demonstrated in Figures 4, 5, and 6 which
show the response of the PaRLSched scheduler under all
scenarios.
D. Discussion
In general, we observed that the PaRLSched scheduler was
able to achieve better performance that the OS scheduler in
limited cases of limited availability of resources (Category A)
and external disturbances. Under such scenarios, we expect
the performance of individual threads to vary due to external
influences and, therefore, it is important to make the correct
remapping decisions. Also, under such scenarios, it is not
possible to predict this variation in the performance solely
based on the characteristics of the application itself. Finally,
in the memory-intensive application (CSO), the scheduler
was able to better adapt to the irregularity in the memory-
TABLE VII
COMPLETION TIMES (CT) AND AVERAGE PROCESSING SPEED (AVG. SPD) OF OS AND PaRLSched SCHEDULING FOR ACO APPLICATION. WE SHOW THE
MEAN EXECUTION TIME OF THE APPLICATION, MEAD DEVIATION (IN SECONDS) AND AVERAGE PROCESSING SPEED PER THREAD (IN 108 INSTRUCTIONS
PER SECOND).
Exp/
Time(s)
OS PaRLSched Diff. CT
(%)
Diff. Avg. Spd
(%)Mean CT Dev CT Avg. Spd. Mean CT Dev CT Avg. Spd
ACO (A.1) 1065.05 7.68 13.37 1075.48 6.45 14.87 −0.9 +11.21
ACO (A.2) 1752.46 14.00 8.54 1455.92 22.8 9.82 +16.92 +14.98
ACO (A.3) 1459.18 9.42 10.29 1402.00 4.06 10.41 +3.91 +1.17
ACO (B.1) 673.09 5.69 21.26 699.16 10.24 22.16 −3.87 +4.23
ACO (B.2) 1106.36 16.71 12.73 1041.33 16.71 14.94 +5.87 +17.36
ACO (B.3) 1066.18 0.88 13.20 1019.11 8.39 14.58 +4.41 +10.45
ACO (C.1) 455.87 5.08 31.90 496.26 5.08 33.46 −8.85 +4.89
ACO (C.2) 659.78 27.45 21.57 688.80 18.66 24.15 −4.39 −12.02
ACO (C.3) 659.35 3.62 21.82 676.03 7.72 23.72 −2.52 +8.70
Average +1.17 +6.77
TABLE VIII
COMPLETION TIMES OF OS AND PaRLSched SCHEDULING FOR
BLACKSCHOLES (BLA) APPLICATION
Exp/
Resources
OS PaRLSched Diff. (%)Mean Dev Mean Dev
BLA (A.1) 193.20 1.89 190.43 0.62 +1.09
BLA (A.2) 322.32 4.98 314.73 8.40 +2.36
BLA (A.3) 285.76 4.17 274.17 7.30 +4.05
BLA (B.1) 129.98 1.09 129.88 1.31 +0.08
BLA (B.2) 236.62 4.18 245.09 2.64 −3.58
BLA (B.3) 192.45 5.16 200.15 4.46 −4.00
BLA (C.1) 98.97 1.11 107.77 1.25 −8.89
BLA (C.2) 166.50 1.46 172.65 3.00 −3.69
BLA (C.3) 130.24 2.13 135.42 3.87 −3.98
access speeds between the two NUMA nodes also under large
availability of resources.
On the other hand, the OS outperformed the PaRLSched
scheduler in most cases of large availability of resources (e.g.,
category C.1). This should be attributed to the fact that the
Linux scheduler is utilizing internal load balancing of threads
between cores, which has notable effect on the execution
time when there is not significant background interference
(in terms of additional running applications). In this case,
performance of the individual threads depends exclusively on
the distribution of threads of the application to cores, so there
is no additional benefit in measuring external interference in
the PaRLSched scheduler. The PaRLSched scheduler applies
rigid pinning of threads to cores, which means that it cannot
utilize any internal load balancing by the Linux scheduler.
Given the rather diverse nature of the considered appli-
cations, the observed improvements constitute a promising
indication. Note that the intention and goal of this work is not
to replace the OS scheduler, but instead to act on a supervisory
level, and possibly under alternative multi-objective criteria.
The notion of the utility function that drives the thread
placement can be designed to accommodate any such multi-
objective criterion, since the only assumption considered is the
positivity constraint.
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(A
.1
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(B
.1
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(C
.1
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
Fig. 4. Sample responses for Experiments of category 3 (i.e., under time-
varying CPU availability. The running average speed is measured in (108· #
instructions/sec/thread).
VII. CONCLUSIONS AND FUTURE WORK
We proposed a measurement- (or performance-) based learn-
ing scheme for addressing the problem of efficient dynamic
pinning of parallelized applications into many-core systems
under a NUMA architecture. According to this scheme, a cen-
tralized objective is decomposed into thread-based objectives,
where each thread is assigned its own utility function. Allo-
cation decisions were organized into a hierarchical decision
structure: at the first level, decisions are taken with respect to
TABLE IX
CANDIDATE SOLUTIONS PROCESSED (CSP) AND AVERAGE PROCESSING SPEED (AVG. SPD) UNDER OS AND PaRLSched SCHEDULING FOR CSO
APPLICATION WITHIN 10MIN SIMULATION TIME. WE SHOW THE MEAN SOLUTIONS PROCESSES, THE DEVIATION, AND AVERAGE PROCESSING SPEED PER
THREAD (IN 108 INSTRUCTIONS PER SECOND).
Exp/
Resources
OS PaRLSched Diff. CSP
(%)
Diff. Avg. Spd
(%)Mean CSP Dev CSP Avg. Spd Mean CSP Dev CSP Avg. Spd
CSO (A.1) 968.80 20.57 11.61 965.60 24.10 9.49 −0.33 −18.26
CSO (A.2) 398.40 12.07 5.32 461.60 10.29 7.09 +15.86 +33.27
CSO (A.3) 572.80 9.39 6.93 577.00 0.00 7.10 +0.73 +2.45
CSO (B.1) 955.80 57.32 11.78 960.80 48.54 12.75 +0.52 +8.23
CSO (B.2) 812.40 129.60 10.88 644.40 33.44 8.24 −20.68 −24.26
CSO (B.3) 955.20 89.04 11.07 769.40 56.00 9.20 −19.45 −16.90
CSO (C.1) 925.80 35.58 10.74 983.20 41.75 12.62 +6.20 +17.50
CSO (C.2) 614.80 14.67 8.74 616.00 8.94 8.76 +0.20 +0.23
CSO (C.3) 746.50 37.47 8.86 876.60 72.05 9.05 +17.43 +2.14
Average +0.06 +0.48
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(A
.2
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(B
.2
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(C
.2
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
Fig. 5. Sample responses for Experiments of category 3 (i.e., under time-
varying CPU availability. The running average speed is measured in (108· #
instructions/sec/thread).
the assigned NUMA node, while at the second level, decisions
are taken with respect to the assigned CPU core (within the
selected NUMA node). The proposed framework is flexible
enough to accommodate any multi-objective criterion, while
it is appropriately designed to handle noisy observations.
We demonstrated the utility of the proposed framework in
the maximization of the running average processing speed
of the threads and we evaluated its performance in four
benchmark parallel applications. We have concluded that the
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(A
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(B
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
0 100 200 300 400 500 600 700
0
20
40
Time (sec)R
u
n
.
A
v
er
.
S
p
ee
d
(C
.3
)
(
1
0
8
·
#
in
s
t
r
u
c
t
io
n
s
/
s
e
c
/
t
h
r
e
a
d
)
OS
PaRLSched
Fig. 6. Sample responses for Experiments of category 3 (i.e., under time-
varying CPU availability. The running average speed is measured in (108· #
instructions/sec/thread).
PaRLSched scheduler can achieve better running speed in
certain cases, especially of small availability of resources or
large background load. These observations should be further
reinforced with additional benchmark tests. In addition, we
plan to identify and generalize the indicators that trigger these
advantageous responses of the PaRLSched scheduler and also
to consider additional utility functions, such as register count
of each thread.
REFERENCES
[1] G. C. Chasparis, M. Rossbory, V. Janjic, and K. Hammond, “Learning-
Based Dynamic Pinning of Parallelized Applications in Many-Core
Systems,” in 2019 27th Euromicro International Conference on Parallel,
Distributed and Network-Based Processing (PDP). Pavia, Italy: IEEE,
Feb. 2019, pp. 1–8.
[2] M. Danelutto, “On skeletons and design patterns,” in Proc. of Intl. ParCo
2001, ser. Parallel Computing: Advances and Current Issues, G. Joubert,
A. Murli, F. Peters, and M. Vanneschi, Eds. Imperial College Press,
2001, pp. 425–432.
[3] M. Aldinucci, G. P. Pezzi, M. Drocco, C. Spampinato, and M. Torquati,
“Parallel visual data restoration on multi-gpgpus using stencil-reduce
pattern,” The International Journal of High Performance Computing
Applications, vol. 29, no. 4, pp. 461–472, 2015.
[4] D. del Rio Astorga, M. F. Dolz, J. Fernndez, and J. D. Garca, “A generic
parallel pattern interface for stream and data processing: A generic
parallel pattern interface for stream and data processing,” Concurrency
and Computation: Practice and Experience, vol. 29, no. 24, Dec. 2017.
[5] V. Janjic, C. Brown, K. Mackenzie, K. Hammond, M. Danelutto,
M. Aldinucci, and J. D. Garcia, “Rpl: A domain-specific language for de-
signing and implementing parallel c++ applications,” in 24th Euromicro
International Conference on Parallel, Distributed, and Network-Based
Processing (PDP), Feb 2016, pp. 288–295.
[6] G. C. Chasparis, M. Rossbory, and V. Janjic, Efficient Dynamic Pinning
of Parallelized Applications by Reinforcement Learning with Applica-
tions, ser. Lecture Notes in Computer Science, F. F. Rivera, T. F. Pena,
and J. C. Cabaleiro, Eds. Springer International Publishing, 2017, vol.
10417.
[7] G. C. Chasparis and M. Rossbory, “Efficient Dynamic Pinning of
Parallelized Applications by Distributed Reinforcement Learning,” Int.
J. Parallel Program., pp. 1–15, 2017.
[8] A. Podzimek, L. Bulej, L. Y. Chen, W. Binder, and P. Tuma, “Analyzing
the Impact of CPU Pinning and Partial CPU Loads on Performance
and Energy Efficiency,” in 15th IEEE/ACM International Symposium
on Cluster, Cloud and Grid Computing, May 2015, pp. 1–10.
[9] B. Goglin, “Managing the topology of heterogeneous cluster nodes
with hardware locality (hwloc),” in International Conference on High
Performance Computing and Simulation (HPCS), 2014, pp. 74–81.
[10] T. Klug, M. Ott, J. Weidendorfer, and C. Trinitis, “autopin - au-
tomated optimization of thread-to-core pinning on multicore systems,”
in Transactions on High-Performance Embedded Architectures and
Compilers III, ser. Lecture Notes in Computer Science, P. Stenstrom,
Ed. Springer Berlin Heidelberg, 2011, vol. 6590, pp. 219–235.
[11] F. Broquedis, N. Furmento, B. Goglin, P.-A. Wacrenier, and R. Namyst,
“ForestGOMP: An efficient OpenMP environment for NUMA architec-
tures,” International Journal Parallel Programming, vol. 38, pp. 418–
439, 2010.
[12] S. Olivier, A. Porterfield, and K. Wheeler, “Scheduling task parallelism
on multi-socket multicore systems,” in ROSS’11, Tuscon, Arizona, USA,
2011, pp. 49–56.
[13] M. Castro, L. F. W. Goes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-
F. Mehaut, “A machine learning-based approach for thread mapping
on transactional memory applications,” in 2011 18th International
Conference on High Performance Computing, 2011, pp. 1–10.
[14] R. Subrata, A. Y. Zomaya, and B. Landfeldt, “A cooperative game
framework for QoS guided job allocation schemes in grids,” IEEE
Transactions on Computers, vol. 57, no. 10, pp. 1413–1422, Oct. 2008.
[15] H. Tembine, E. Altman, R. ElAzouri, and Y. Hayel, “Correlated evolu-
tionary stable strategies in random medium access control,” in Int. Conf.
Game Theory for Networks, 2009, pp. 212–221.
[16] G. Wei, A. V. Vasilakos, Y. Zheng, and N. Xiong, “A game-theoretic
method of fair resource allocation for cloud computing services,” The
Journal of Supercomputing, vol. 54, no. 2, pp. 252–269, Nov. 2010.
[17] G. C. Chasparis, A. Arapostathis, and J. S. Shamma, “Aspiration learning
in coordination games,” SIAM J. Control and Optim., vol. 51, no. 1,
2013.
[18] G. C. Chasparis, “Stochastic Stability of Perturbed Learning Automata
in Positive-Utility Games,” IEEE Transactions on Automatic Control,
vol. 64, no. 11, pp. 4454–4469, Nov. 2019.
[19] A. Fabrikant, A. D. Jaggard, and M. Schapira, “On the Structure of
Weakly Acyclic Games,” Theory Comput Syst, vol. 53, no. 1, pp. 107–
122, Apr. 2013.
[20] B. Vcking, “Selfish Load Balancing,” in Algorithmic Game Theory,
N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Eds. Cam-
bridge: Cambridge University Press, 2007, pp. 517–542.
[21] G. C. Chasparis, “Measurement-based efficient resource allocation with
demand-side adjustments,” Automatica, vol. 106, pp. 274–283, Aug.
2019.
[22] O. Hernandez-Lerma and J. B. Lasserre, Markov Chains and Invariant
Probabilities. Birkhauser Verlag, 2003.
[23] G. Chasparis and J. Shamma, “Distributed dynamic reinforcement of
efficient outcomes in multiagent coordination and network formation,”
Dynamic Games and Applications, vol. 2, no. 1, pp. 18–50, 2012.
[24] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive
Algorithms and Applications, 2nd ed. Springer-Verlag New York, Inc.,
2003.
[25] D. Heath, R. Jarrow, and A. Morton, “Bond pricing and the term
structure of interest rates: A new methodology for contingent claims
valuation,” Econometrica, vol. 60, no. 1, pp. 77–105, Jan. 1992.
[26] M. Dorigo and T. Stu¨tzle, Ant Colony Optimization. Scituate, MA,
USA: Bradford Company, 2004.
[27] M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati,
“Pool Evolution: A Parallel Pattern for Evolutionary and Symbolic
Computing,” International Journal of Parallel Programming, vol. 44,
no. 3, pp. 531–551, June 2016.
[28] G. C. Chasparis, M. Rossbory, and V. Haunschmid, “An evolution-
ary stochastic-local-search framework for one-dimensional cutting-stock
problems,” arXiv, vol. 1707.08776, 2017.
[29] M. Delorme, M. Iori, and S. Martello. (2018) A bin packing problem
library. [Online]. Available: http://or.dei.unibo.it/library/bpplib
[30] P. J. Mucci, S. Browne, C. Deane, and G. Ho, “PAPI: A portable
interface to hardware performance counters,” in Proceedings of the
Department of Defense HPCMP Users Group Conference, 1999, pp.
7–10.
