Inductive-bias-driven Reinforcement Learning for Efficient Schedules in
Heterogeneous Clusters
Subho S. Banerjee 1 Saurabh Jha 1 Zbigniew T. Kalbarczyk 1 Ravishankar K. Iyer 1
Abstract
The problem of scheduling workloads onto heterogeneous processors (e.g., CPUs, GPUs, FPGAs) is of fundamental importance in modern data centers. Current system schedulers rely on application/system-specific heuristics that have to be built on a case-by-case basis. Recent work has demonstrated ML techniques for automating the heuristic search, but these use black-box approaches that require significant training data and time, which makes them challenging to use in practice. This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) a domain-driven Bayesian reinforcement learning (RL) model for scheduling, which inherently models the resource dependencies identified from the system architecture; and (ii) a sampling-based technique to compute the gradients of a Bayesian model without performing full probabilistic inference. Together, these techniques reduce both the amount of training data and the time required to produce scheduling policies that significantly outperform black-box approaches by up to 2.2×.
1. Introduction
The problem of scheduling workloads on heterogeneous processing fabrics (i.e., accelerated datacenters including GPUs, FPGAs, and ASICs, e.g., Asanović (2014); Shao & Brooks (2015)) is at its core an NP-hard problem (Mastrolilli & Svensson, 2008; 2009). System schedulers generally rely on application- and system-specific heuristics with extensive domain-expert-driven tuning of scheduling policies (e.g., Isard et al. (2009); Giceva et al. (2014); Lyerly et al. (2018); Mars et al. (2011); Mars & Tang (2013); Ousterhout et al. (2013); Xu et al. (2018); Yang et al. (2013); Zhang et al. (2014); Zhuravlev et al. (2010); Zaharia et al. (2010)). Such heuristics are difficult to generate, as variations across applications and system configurations mean that significant amounts of time and money must be spent in painstaking heuristic searches. Recent work has demonstrated machine learning (ML) techniques (Delimitrou & Kozyrakis, 2013; 2014; Mao et al., 2016; 2018) for automating heuristic searches, but these rely on black-box approaches that require significant training data and time, making them challenging to use in practice.

1University of Illinois at Urbana-Champaign, USA. Correspondence to: Subho S. Banerjee <ssbaner2@illinois.edu>.
Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

This paper presents Symphony, a scheduling framework that addresses the challenge in two ways: (i) we use a domain-guided Bayesian-model-based partially observable Markov decision process (POMDP) (Astrom, 1965; Kaelbling et al., 1998) to decrease the amount of training data (i.e., sampled trajectories); and (ii) we develop a sampling-based technique that allows one to compute the gradients of a Bayesian model without performing full probabilistic inference. We thus significantly reduce the costs of (i) running a large heterogeneous computing system that uses an efficient scheduling policy; and (ii) training the policy itself.
Reducing Training Data. State-of-the-art methods for
choosing an optimal action in POMDPs rely on training of
neural networks (NNs) (Mnih et al., 2016; Dhariwal et al.,
2017). As these approaches are model-free, training of the
NN requires large quantities of data and time to compute
meaningful policies. In contrast, we provide an inductive
bias for the reinforcement learning (RL) agent by encod-
ing domain knowledge as a Bayesian model that can infer
the latent state from observations, while at the same time
leveraging the scalability of deep learning methods through
end-to-end gradient descent. In the case of scheduling, our
inductive bias is a set of statistical relationships between
measurements from microarchitectural monitors (Dreyer &
Alpert, 1997). To the best of our knowledge, this is the first
paper to exploit those relationships and measurements to
infer resource utilization in the system (i.e., latent state) to
build RL-based scheduling policies.
Reducing Training Time. The addition of the inductive bias, while making the training process less data-hungry (i.e., requiring fewer workload executions to train the model), comes at the cost of additional training time: the cost of performing full Bayesian inference at every training step (Dagum & Luby, 1993; Russell et al., 1995; Binder et al., 1997). It is this cost that makes the use of deep RL techniques in dynamic real-world deployments (which require periodic retraining) prohibitively expensive. To address that issue, we have developed a procedure for computing the gradient of variables in the above Bayesian model without requiring full inference computation, unlike prior work (Russell et al., 1995; Binder et al., 1997). The key is to calculate the gradient by generating samples from the model, which is computationally simpler than inferring the posterior distribution.

[Figure 1. Left: bandwidth (GB/s, axis 0 to 12) versus message size (2^8 to 2^22 bytes) for the "Isolated" and "Contention" cases, with a 1.8x gap annotated. Right: PCIe topology (PCIe root complex, PCIe switches, GPUs, NIC) showing the isolated flow and the contention flow.]
Figure 1. Performance degradation due to PCIe contention between GPU and NIC (averaged over 10 runs).

arXiv:1909.02119v2 [cs.DC] 30 Jun 2020

Need for New Scheduler. Current schedulers prioritize the use of simple generalized heuristics and coarse-grained resource bucketing (e.g., core counts, free memory) to make scheduling decisions. Even though they are perceived to perform well in practice, they do not model complex emergent heterogeneous compute platforms and therefore leave a lot to be desired. Consider the case of a distributed data processing framework that uses two GPUs to perform a halo exchange.1 Fig. 1 shows the performance (here, bandwidth) of the exchange as “isolated” performance. If the application were to concurrently perform distributed network communication, we would observe that the original GPU-to-GPU communication is affected because of PCIe bandwidth contention at shared links (i.e., a “hidden” resource that is not often exposed to the user). Such behavior is shown as “contention” in Fig. 1, and can cause a slowdown of up to 1.8×, depending on the size of the transmitted messages. Traditional approaches would either have such a heuristic manually searched and incorporated into a scheduling policy, or would expect it to be found automatically as part of the training of a black-box ML model; both approaches can require significant effort in profiling or training. In contrast, our approach allows the utilization of architectural resources (in this case, of the PCIe network) to serve as an inductive bias for the RL-agent, thereby allowing the training process to automatically home in on such resources of interest, without having to identify the resource’s importance manually.
1A halo exchange occurs due to communication arising between parallel processors computing overlapping pieces of data, called halo regions, that need to be periodically updated.

Results. The Symphony framework reduces the average job completion time over hand-tuned scheduling heuristics by as much as 32%, and to within 6% of the time taken by an oracle scheduler. It also achieves a training time improvement of 4× compared to full Bayesian inference based on belief propagation. Further, the technique outperforms black-box ML techniques by 2.2× in terms of training time. We believe that Symphony is also representative of RL applied to several other control-related problems (e.g., industrial scheduling, data center network scheduling) where data-driven approaches can be augmented with domain knowledge to build sample-efficient RL-agents.
2. Background
Partially Observable Markov Decision Processes. A POMDP is a stochastic model that describes relationships between an agent and its environment. It is a tuple $(S, A, T, \Omega, O, R, \gamma)$, where $S$ is the state space, $A$ is the action space, and $\Omega$ is the observation space. We use $s_t \in S$ to denote the hidden state at time $t$. When an action $a_t \in A$ is executed, the state changes according to the transition distribution, $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$. Subsequently, the agent receives a noisy or partially occluded observation $o_{t+1} \in \Omega$ according to the distribution $o_{t+1} \sim O(o_{t+1} \mid s_{t+1}, a_t)$, and a reward $r_{t+1} \in \mathbb{R}$ according to the distribution $r_{t+1} \sim R(r_{t+1} \mid s_{t+1}, a_t)$.
An agent acts according to its policy $\pi(a_t \mid s_t)$, which returns the probability of taking action $a_t$ at time $t$. The agent’s goal is to learn a policy $\pi$ that maximizes the expected future reward $J = \mathbb{E}_{\tau \sim p(\tau)}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]$ over trajectories $\tau = (s_0, a_0, \ldots, a_{T-1}, s_T)$ induced by its policy, where $\gamma \in [0, 1)$ is the discount factor. In general, a POMDP agent must infer the belief state $b_t = \Pr(s_t \mid o_1, \ldots, o_t, a_0, \ldots, a_{t-1})$, which is used to calculate $\pi(a_t \mid \hat{s}_t)$ where $\hat{s}_t \sim b_t$. In the remainder of the paper, we will use $\pi(a_t \mid \hat{s}_t)$ and $\pi(a_t \mid b_t)$ interchangeably.
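To make the belief state and the discounted return concrete, the following is a minimal sketch for a small discrete POMDP. It assumes tabular transition and observation models; the array layout, the function names (`belief_update`, `discounted_return`), and the toy probabilities are illustrative assumptions, not part of Symphony.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step: b_t -> b_{t+1} given a_t and o_{t+1}.

    Assumed layout (hypothetical, for illustration only):
      T[a, s, s2] = Pr(s_{t+1} = s2 | s_t = s, a_t = a)
      O[a, s2, o] = Pr(o_{t+1} = o | s_{t+1} = s2, a_t = a)
    """
    predicted = belief @ T[action]                      # sum_s b(s) T(s2 | s, a)
    unnormalized = predicted * O[action][:, observation]
    return unnormalized / unnormalized.sum()

def discounted_return(rewards, gamma):
    """sum_{t=1}^{T} gamma^(t-1) r_t for one sampled trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Toy model with assumed numbers: 2 hidden states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
ret = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

The expectation $J$ would then be approximated by averaging `discounted_return` over trajectories sampled under the policy.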
Related Work. Finding solutions for many POMDPs in-
volves (i) estimating the transition model T and observation
model O, (ii) performing inference under this model, and
(iii) choosing an action based on the inferred belief state.
Prior work in this area has extensively explored the use
of NNs, particularly recurrent NNs (RNNs), as universal
function approximators for (i) and (iii) above because they
can be easily trained and have efficient inference procedures
(e.g., Hausknecht & Stone (2015); Narasimhan et al. (2015);
Mnih et al. (2015); Jaderberg et al. (2016); Foerster et al.
(2016); Karkus et al. (2017); Zhu et al. (2018)). Neural
networks have proven to be extremely effective at learning,
but usually require a lot of data (for RL-agents, sampled tra-
jectories, which may be prohibitively expensive to acquire
for certain classes of applications, such as scheduling). The
ability to incorporate explicit domain knowledge (which in
the case of scheduling, is based on system design invariants)
could significantly reduce the amount of data required. To
that end, other work (Karkus et al., 2017; Silver et al., 2017;
Igl et al., 2018) has advocated the integration of probabilis-
tic models (including Bayesian filter models) for (i) above.
The significant computational cost of learning and inference
in such deep probabilistic models has spurred the use of ap-
proximation techniques for training and inference, including
NN-based approximations of Bayesian inference (Karkus
et al., 2017; Zhu et al., 2018) and variational inference meth-
ods (Igl et al., 2018).
In this paper, we too advocate the use of a domain-driven
probabilistic model for bt that can be trained through end-
to-end back-propagation to compute a policy. Specifically,
the technique handles the gradient descent procedure on a
Bayesian network (BN) with known structure and incom-
plete observations without performing inference on the BN,
only requiring generation of samples from the model. That
approach is different from prior work on learning BNs
using gradient descent (Russell et al., 1995; Binder et al.,
1997) or expectation maximization, both of which require
full posterior inference at every training step.
Actor-Critic Methods. Actor-critic methods (Konda & Tsitsiklis, 2000) have previously been proposed for learning the parameters $\rho$ of an agent’s policy $\pi_\rho(a_t \mid s_t)$. Here (i) the “Critic” estimates the value function $V(s)$, and (ii) the “Actor” updates the policy $\pi(a \mid s)$ in the direction suggested by the Critic. In this paper, we use n-step learning with the asynchronous advantage actor-critic (A3C) method (Mnih et al., 2016). For n-step learning, starting at time $t$, the current policy performs $n_s$ consecutive steps in $n_e$ parallel environments. The gradient updates of $\pi$ and $V$ are based on that mini-batch of size $n_e n_s$. The target for the value function $V_\eta(s_{t+i})$, $i \in [0, n_s)$, parameterized by $\eta$, is the discounted sum of on-policy rewards up until $t + n_s$ and the off-policy bootstrapped value $V^*_\eta(s_{t+n_s})$. If we use an advantage function

$$
A^{t,i}_\eta = \left( \sum_{j=0}^{n_s - i - 1} \gamma^{j} r_{t+i+j} \right) + \gamma^{n_s - i} V^*_\eta(s_{t+n_s}) - V_\eta(s_{t+i}),
$$

the actor and critic losses are

$$
L^A_t(\rho) = -\frac{1}{n_e n_s} \sum_{e=0}^{n_e - 1} \sum_{i=0}^{n_s - 1} \mathbb{E}_{s_{t+i} \sim b_{t+i}} \left[ \log \pi_\rho(a_{t+i} \mid s_{t+i}) \, A^{t,i}_\eta(s_{t+i}, a_{t+i}) \right] \tag{1a}
$$

$$
L^V_t(\eta) = \frac{1}{n_e n_s} \sum_{e=0}^{n_e - 1} \sum_{i=0}^{n_s - 1} \mathbb{E}_{s_{t+i} \sim b_{t+i}} \left[ A^{t,i}_\eta(s_{t+i}, a_{t+i})^2 \right]. \tag{1b}
$$
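The n-step advantage and the two losses can be sketched numerically as follows. This is an illustrative NumPy version under simplifying assumptions: one sampled state per step stands in for the expectation over $b_{t+i}$, there is no automatic differentiation (so the advantage is simply a number in the actor loss), and the function names and toy values are hypothetical.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """A^{t,i} for i in [0, n_s) in a single environment.

    rewards[i]      = r_{t+i}
    values[i]       = V_eta(s_{t+i})
    bootstrap_value = V*_eta(s_{t+n_s}), treated as a constant
    """
    n_s = len(rewards)
    advantages = np.empty(n_s)
    target = bootstrap_value
    # Work backwards so each target reuses the one computed after it.
    for i in reversed(range(n_s)):
        target = rewards[i] + gamma * target
        advantages[i] = target - values[i]
    return advantages

def actor_critic_losses(log_probs, advantages):
    """Sample-based versions of Eqs. (1a) and (1b) over an (n_e, n_s) batch."""
    actor_loss = -np.mean(log_probs * advantages)
    critic_loss = np.mean(advantages ** 2)
    return actor_loss, critic_loss

# Toy batch with assumed numbers: n_e = 1 environment, n_s = 2 steps.
adv = n_step_advantages(rewards=[1.0, 0.0], values=[0.5, 0.5],
                        bootstrap_value=2.0, gamma=0.5)
actor_loss, critic_loss = actor_critic_losses(
    log_probs=np.array([[-0.5, -1.0]]), advantages=adv[np.newaxis, :])
```

In a framework with automatic differentiation, the advantage in the actor loss would typically be held constant so that only the critic loss updates $\eta$.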
3. Training the POMDP RL-Agent with
Back-Propagation
We consider a special case of the POMDP formulation presented above (illustrated in Fig. 2). We assume that the domain knowledge about the environment of the RL-agent is presented as a joint probability distribution $\Pr(s_t, a_{t-1}, o_t; \Theta_{BN})$ that can be factorized as a BN (with parameters $\Theta_{BN}$). A BN is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). We use probabilistic inference on the BN to calculate an estimate of the belief state $\hat{b}_t$. The estimate $\hat{b}_t$ is then used in an NN $f_\pi(\hat{b}_t; \Theta_\pi)$ (with parameters $\Theta_\pi$) to approximate the RL-agent’s policy, and an NN $f_V(\hat{b}_t; \Theta_V)$ (with parameters $\Theta_V$) to approximate the state-based value function. We refer to all the parameters of the model as $\Theta = (\Theta_{BN}, \Theta_\pi, \Theta_V) = (\rho, \eta)$. The model is then trained by propagating the gradient of the total loss $\nabla_\Theta L^{RL}_t = \nabla_\Theta L^A_t(\rho) + \nabla_\Theta L^V_t(\eta)$. Estimating this gradient requires us to compute $\nabla_{\Theta_{BN}} \hat{b}_t$. Traditional methods for computing the gradient require inference computation (Russell et al., 1995; Binder et al., 1997). However, even approximate inference in such models is known to be NP-hard (Dagum & Luby, 1993). Below we describe an algorithm for approximating the gradient without requiring computation of full Bayesian inference. All that is required is the ability to generate samples from the BN. Only the subset of the BN necessary for generation of the samples is expanded. The samples are then used as a representation of the distribution of the BN. As a result, the proposed method decouples the training of the BN from the inference procedure used on it to calculate $\hat{b}_t$.

[Figure 2. Block diagram with the following labels: inputs $a_{t-1}$ and $o_t$; Inference Procedure; BN Model $\Pr(s_t, a_{t-1}, o_t; \Theta_{BN})$; belief estimate $\hat{b}_t$; Actor $f_\pi(\hat{b}_t; \Theta_\pi)$ with output $\pi(a_t \mid \hat{b}_t)$; Critic $f_V(\hat{b}_t; \Theta_V)$ with output $V(\hat{b}_t)$.]
Figure 2. The proposed RL architecture.
3.1. The Bayesian Network & Its Gradient
Let the BN described above be a DAG $(V, E)$, and let $X = \{X_v \mid v \in V\}$ be a set of random variables indexed by $V$. Associated with each node $X$ is a conditional probability density function $\Pr(X \mid \wp(X))$, where $\wp(X)$ are the parents of $X$ in the graph. We assume that we are given (i) an efficient algorithm for sampling values of $X$ given $\wp(X)$, and (ii) a function $f_X(x, y; \theta_X) = \Pr_{\theta_X}(X = x \mid \wp(X) = y)$ whose partial derivative with respect to $\theta_X$ is known and efficiently computable. The BN can also have deterministic relationships between two random variables, under the assumption that the relationship is a differentiable diffeomorphism. That is, for random variables $X$, $Y$, and diffeomorphism $F$ with $Y = F(X)$, $\Pr(Y = y) = \Pr(X = F^{-1}(y)) \, |DF^{-1}(y)|$, where $DF^{-1}$ is the Jacobian of $F^{-1}$ (i.e., the inverse of the Jacobian of $F$) and $|\cdot|$ denotes the absolute value of its determinant.
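The change-of-variables rule for a deterministic relationship can be checked numerically. The sketch below assumes a scalar example chosen purely for illustration, $X \sim \mathcal{N}(0, 1)$ and $F(x) = e^x$; the helper names are hypothetical.

```python
import numpy as np

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def pushforward_pdf(y, inverse, inverse_jacobian, base_pdf):
    """Pr(Y = y) = Pr(X = F^{-1}(y)) * |DF^{-1}(y)| for a scalar F."""
    return base_pdf(inverse(y)) * np.abs(inverse_jacobian(y))

# F(x) = exp(x), so F^{-1}(y) = log(y) and DF^{-1}(y) = 1 / y.
def density_y(y):
    return pushforward_pdf(y, np.log, lambda v: 1.0 / v, normal_pdf)

# Midpoint rule on (0, 1): Pr(Y <= 1) should match Pr(X <= 0) = 0.5.
n = 200_000
midpoints = (np.arange(n) + 0.5) / n
mass_below_one = float(np.sum(density_y(midpoints)) / n)
```

Because $F$ is monotone here, the probability mass that $Y$ places on $(0, 1]$ equals the mass $X$ places on $(-\infty, 0]$, which gives a simple sanity check on the Jacobian term.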
Computing Gradient. For a random variable $X$ in the BN, we define its parents as $\wp(X)$ and its ancestor set as $\Xi(X) = \{Y \mid Y \leadsto X \wedge Y \notin \wp(X)\}$ (where $\leadsto$ represents a directed path in the BN). We now define a procedure to approximately compute the gradient of $X$ with respect to $\Theta_{BN}$. We do so in two parts: (i) $\partial \Pr(X = x \mid \xi = a) / \partial \theta_X$ and (ii) $\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(X = x \mid \xi = a)$ for $\xi \subseteq \Xi(X)$. First,
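$$
\begin{aligned}
\frac{\partial \Pr(X = x \mid \xi = a)}{\partial \theta_X}
&= \frac{\partial}{\partial \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, \Pr(X = x \mid \wp(X) = y, \xi = a) \, dy \\
&= \frac{\partial}{\partial \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, f_X(x, y; \theta_X) \, dy \\
&= \int \Pr(\wp(X) = y \mid \xi = a) \, \frac{\partial f_X(x, y; \theta_X)}{\partial \theta_X} \, dy \\
&\approx \sum_{i=1}^{S} \frac{n_S(a, y_i)}{n_S(a)} \, \frac{\partial f_X(x, y_i; \theta_X)}{\partial \theta_X}.
\end{aligned}
\tag{2}
$$

Here, $S$ samples are drawn from a variable(s) $Z$ such that $n_S(j)$ is the number of times the value $j$ appears in the set of samples $\{z_i\}$, i.e., $n_S(j) = \sum_{i=1}^{S} \mathbb{1}\{z_i = j\}$. Next,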
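$$
\begin{aligned}
\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(X = x \mid \xi = a)
&= \nabla_{\Theta_{BN} \setminus \theta_X} \int \Pr(\wp(X) = y \mid \xi = a) \, \Pr(X = x \mid \wp(X) = y, \xi = a) \, dy \\
&= \int f_X(x, y; \theta_X) \, \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y \mid \xi = a) \, dy \\
&\approx \sum_{i=1}^{S} \frac{n_S(y_i)}{S} \, f_X(x, y_i; \theta_X) \, \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y_i \mid \xi = a).
\end{aligned}
\tag{3}
$$

When $|\wp(X)| > 1$, variables in $\wp(X)$ might not be conditionally independent given $\Xi(X)$. Hence we find a set of nodes $N$ such that $I \perp J \mid \Xi(X) \cup N \;\; \forall I, J \in \wp(X)$. Then,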
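$$
\begin{aligned}
\Pr(\wp(X) = y_i \mid \xi = a)
&= \int \Pr(N = n \mid \xi = a) \, \Pr(\wp(X) = y_i \mid N = n, \xi = a) \, dn \\
&= \int \Pr(N = n \mid \xi = a) \prod_{j=1}^{m} \Pr(P_j = y_{i,j} \mid N = n, \xi = a) \, dn \\
&\approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \prod_{j=1}^{m} \Pr(P_j = y_{i,j} \mid N = n_k, \xi = a),
\end{aligned}
\tag{4}
$$

where $\wp(X) = (P_1, \ldots, P_m)$ and $y_i = (y_{i,1}, \ldots, y_{i,m})$. Thus, we obtain,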
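$$
\begin{aligned}
\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(\wp(X) = y_i \mid \xi = a)
&\approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \, \nabla_{\Theta_{BN} \setminus \theta_X} \prod_{j=1}^{m} \Pr(P_j = y_{i,j} \mid N = n_k, \xi = a) \\
&= \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \sum_{l=1}^{m} \Bigg( \prod_{h=1, h \neq l}^{m} \Pr(P_h = y_{i,h} \mid N = n_k, \xi = a) \Bigg) \nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a) \\
&\approx \sum_{k=1}^{S} \frac{n_S(a, n_k)}{n_S(a)} \sum_{l=1}^{m} \Bigg( \prod_{h=1, h \neq l}^{m} \frac{n_S(y_{i,h}, a, n_k)}{n_S(a, n_k)} \Bigg) \overbrace{\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a)}^{\text{Expand by recursion using Eqns. 2, 3, and 5}}.
\end{aligned}
\tag{5}
$$

The term $\nabla_{\Theta_{BN} \setminus \theta_X} \Pr(P_l = y_{i,l} \mid N = n_k, \xi = a)$ represents the gradient operator on a subset of the original BN, containing only the ancestors (from the BN’s graphical structure) of $X$. Hence that gradient term can be recursively expanded using Eqns. 2, 3, and 5. Repeating that process for all variables in $\hat{b}_t$ allows us to calculate $\nabla_{\Theta_{BN}} \hat{b}_t$.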
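As an illustration of the sample-based estimator in Eqn. 2, the sketch below uses a hypothetical three-node binary chain $A \to P \to X$, so that $\wp(X) = \{P\}$ and $\xi = \{A\}$. It reads the sum in Eqn. 2 as running over the distinct parent values observed among the samples, each weighted by $n_S(a, y) / n_S(a)$; that reading, the logistic parameterization of $f_X$, and all numeric values are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_f_x(x, y, theta_x):
    """d f_X(x, y; theta_X) / d theta_X with f_X(1, y) = sigmoid(theta_X[y])."""
    p = sigmoid(theta_x[y])
    grad = np.zeros_like(theta_x)
    grad[y] = p * (1.0 - p) if x == 1 else -p * (1.0 - p)
    return grad

def sampled_gradient(x, a, theta_x, samples_a, samples_p):
    """Eqn. 2: weight each observed parent value y by n_S(a, y) / n_S(a)."""
    matches_a = samples_a == a
    n_a = matches_a.sum()
    grad = np.zeros_like(theta_x)
    for y in np.unique(samples_p[matches_a]):
        n_ay = np.sum(matches_a & (samples_p == y))
        grad += (n_ay / n_a) * grad_f_x(x, y, theta_x)
    return grad

rng = np.random.default_rng(0)
S = 50_000
prior_a = 0.3                        # Pr(A = 1), assumed
p_given_a = np.array([0.2, 0.9])     # Pr(P = 1 | A = a), assumed
theta_x = np.array([-1.0, 0.5])      # logits of Pr(X = 1 | P = y), assumed

# Ancestral sampling of (A, P); X itself is not needed for Eqn. 2.
samples_a = (rng.random(S) < prior_a).astype(int)
samples_p = (rng.random(S) < p_given_a[samples_a]).astype(int)

estimate = sampled_gradient(1, 1, theta_x, samples_a, samples_p)
# Exact value for comparison, available here only because the toy BN is tiny.
exact = sum(
    (p_given_a[1] if y == 1 else 1.0 - p_given_a[1]) * grad_f_x(1, y, theta_x)
    for y in (0, 1)
)
```

No posterior over $P$ is ever computed by inference in `sampled_gradient`: the conditional $\Pr(\wp(X) = y \mid \xi = a)$ is replaced by counts over forward samples, which is the property the derivation relies on.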
Computational Complexity. The cost of computing Eqns. 2 and 3 is $O(S)$. The cost of computing Eqn. 5 is $O(mS)$. The cost of finding $N$ is $O(|\wp(s_t)|^2 (|V| + |E|))$ (i.e., the cost of running the Bayes ball algorithm (Shachter, 2013) for every pair of nodes in $\wp(X)$). The total computational complexity of the entire procedure hinges on finding the number of times Eqns. 2, 4, and 5 are executed, which we refer to as $Q$. $Q$ depends on the size of $N$ and on the graphical structure of the BN. Hence, the total cost of computing $\nabla_{\Theta_{BN}} \hat{b}_t$ is $O(Q(|\wp(s_t)|^2 (|V| + |E|) + mS))$ (where $|\wp(s_t)| \le |V| - 1$), which is computed $n_s n_e |b_t|$ times during training. Note that for a polytree BN (the graphical structure of the BN we will use in §4), $N = \emptyset$, and $Q \le |V|$. This is still better than belief propagation on the polytree with the gradient computation technique from Russell et al. (1995); Binder et al. (1997), which is $O(|V| \max_{v \in V}(\mathrm{dom}(X_v)))$, where $\mathrm{dom}(X)$ is the size of the domain of $X$, which could be exponentially large.
4. Scheduling Data Center Workloads By
Using Reinforcement Learning
We now demonstrate an application of the POMDP model
and training methodology presented in §3 to the problem
of scheduling tasks on a heterogeneous processing fabric
that includes CPUs, GPUs, and FPGAs. The model inte-
grates real-time performance measurements, prior knowl-
edge about workloads, and system architecture to (i) dy-
namically infer system state (i.e., resource utilization), and
(ii) automatically schedule tasks on a heterogeneous pro-
cessing fabric.
[Figure 3. Block diagram with the following labels: Computer Systems; PMU Measurements $o_t$; Bayesian Inference; BN Model $\Pr(s_t, a_{t-1}, o_t; \Theta_{BN})$; Resource Utilizations $\hat{b}_t$; System Topology; Data Flow Graph; Graph Network (two instances); Neural Network consisting of LSTM and FC layers; Actor $f_\pi(\hat{b}_t; \Theta_\pi)$; Critic $f_V(\hat{b}_t; \Theta_V)$ with output $V(\hat{b}_t)$; Action $\hat{a}_t \sim \pi(a_t \mid \hat{b}_t)$.]
Figure 3. Architecture of the Symphony ML model.
Workload & Programming Model. The system workload
consists of multiple user programs, and each program is
expressed as a data flow graph (DFG). A DFG is a DAG
where the nodes represent computations (which we refer
to as kernels, e.g., matrix multiplication), and edges rep-
resent input-output relationships between the nodes. Prior
work has shown that a large number of applications can
be expressed as compositions of such kernels (Asanović
et al., 2009; Banerjee et al., 2016). Prominent examples of
such compositions include modern data analytics and ML
frameworks that describe workloads as DFGs (Abadi et al.,
2016; Chambers et al., 2010; McCool et al., 2012; Zaharia
et al., 2012). We assume that the kernels are known ahead
of time and have multiple implementations available for
different processors and accelerators. That assumption is
correct for many ML workloads; for other workloads, it is
an area of active research wherein accelerator designers and
architects are trying to decompose larger applications into
smaller pieces. Once trained, our approach can schedule any
composition (DFG) of the kernels, but requires retraining
when the set of available kernels changes.
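The workload model described above can be sketched as a small data structure. This is an illustrative representation only: the class names, the kernel names, and the processor labels are hypothetical and are not taken from the Symphony implementation. It uses the standard-library `graphlib` module (Python 3.9 or later) to order kernels so that producers precede consumers.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Kernel:
    name: str
    implementations: tuple  # processor types that have an implementation

@dataclass
class DataFlowGraph:
    kernels: dict = field(default_factory=dict)  # name -> Kernel
    inputs: dict = field(default_factory=dict)   # name -> set of producer names

    def add_kernel(self, kernel, depends_on=()):
        self.kernels[kernel.name] = kernel
        self.inputs[kernel.name] = set(depends_on)

    def dependency_order(self):
        """Return kernel names so that every producer precedes its consumers."""
        return list(TopologicalSorter(self.inputs).static_order())

# Hypothetical user program: load -> {matmul, fft} -> reduce.
dfg = DataFlowGraph()
dfg.add_kernel(Kernel("load", ("CPU",)))
dfg.add_kernel(Kernel("matmul", ("CPU", "GPU", "FPGA")), depends_on=["load"])
dfg.add_kernel(Kernel("fft", ("CPU", "GPU")), depends_on=["load"])
dfg.add_kernel(Kernel("reduce", ("CPU", "GPU")), depends_on=["matmul", "fft"])
order = dfg.dependency_order()
```

A scheduler would pick, for each kernel in such an order, one of its available implementations and a device to run it on; `TopologicalSorter` raises `CycleError` if the graph is not a DAG.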
POMDP Architecture. The overall architecture of the
Symphony POMDP model is illustrated in Fig. 3. The first
part of the POMDP models the latent state bˆt of the com-
puter system. For the scheduling problem, bˆt corresponds to
resource utilization of various components of the computer
system. Utilization of some of the resources can be mea-
sured directly in software (e.g., the amount of free memory);
however, the different layers of abstraction of the computer
stack hide some others from direct measurement. For example, in Fig. 1 (§1), PCIe link bandwidth cannot be directly measured. However, it can
be measured indirectly by using the number of outstanding
requests to memory from each PCIe device and by using the
topology of the PCIe network. In essence, we statistically
relate the back pressure of one resource on another, until we
can find a resource that can be directly measured via real-
time performance counter (PC) measurements (ot) (Dreyer
& Alpert, 1997). We refer to such resources whose uti-
lization cannot be directly measured as hidden resources.
PCs are special-purpose registers present in the CPU and
other accelerators for characterization of an application’s be-
havior and identification of microarchitectural performance
bottlenecks. Specifically, we use a BN to (i) model aleatoric
uncertainty in measurements, and (ii) encode our knowl-
edge about system architecture in terms of invariants or
statistical relationships between the measurements. Infer-
ence on that BN then gives us an accurate estimate of the
latent state of the system. Second, we use an RNN (i.e.,
$f_\pi(\cdot)$ and $f_V(\cdot)$) to learn scheduling policies for user programs that minimize resource contention and maximize
performance. Those two ML models effectively decouple
system-architecture-specific and measurement-specific as-
pects of scheduling (the BN) from its optimization aspects
(the NN). The compelling value of the above architecture
(and its two constituent models) is that it can automatically
generate scheduling policies for the deployment of DFGs in
truly heterogeneous environments (that have CPUs, GPUs,
and FPGAs) without requiring configuration specifics, or
painstakingly tuned heuristics. The model improves over-
all performance and resource utilization, and enables fine-
grained resource sharing across workloads.
Performance Counters. PCs are generally relied upon to
conduct low-level performance analysis or tuning of per-
formance bottlenecks in applications. As the source of
such bottlenecks is generally the unavailability of system
resources, PCs can naturally be used to
estimate resource utilization of a system. Another benefit
of using PCs is that it is not necessary to modify an appli-
cation’s source code in order to make measurements. PCs
can be grouped into three categories: (i) those pertaining to
the processing fabric (CPU core or accelerators); (ii) those
pertaining to the memory subsystem; and (iii) those pertain-
ing to the system interconnect (in our case, PCIe). Fig. 4
illustrates the organization of a computer system as well as
the categories above. Fig. 5 shows a mapping between the
system organization and the PCs that are used in the BN
model (described below).²

² A complete list of the PCs used in this paper can be found in
the supplementary material.

[Figure 4. Organization of a multi-CPU computer. Components shown:
CPU, RAM, PCR (PCIe root complex), PCS (PCIe switch), GPU, NIC,
and FPGA. Legend: processing fabric; interconnect elements; memory
interconnect; system interconnect; memory (accelerator memory
omitted).]

BN Model. Measurements made from PCs have some inherent noise
(Weaver & McKee, 2008). The measurements can only be stored in
a fixed number of registers. Hence, only a fixed number of
measurements can be made at any one point in time. As a result,
one must make successive measurements that capture marginally
different system states. Particular performance counters might
become unavailable (or return incorrect values). Finally, if a
single scheduling agent is controlling a cluster of machines
(which is common in data centers), measurements made on
different machines will not be in sync and will often be delayed
by network latency. As a result, PCs are often sampled N times
between successive scheduler invocations to get around some of
the sources of error. To maximize the performance estimation
fidelity, we apply statistical methods
to systematically model the variance of the measurements.
For a single performance counter $o_t[c]$, the measured value
$m_c$ can be modeled as the true value $v_c$ plus measurement
noise $e_c$, i.e., $m_c = v_c + e_c$. Here, we focus only on
random errors and assume zero systematic error. That is a valid
assumption because the only reason for systematic errors is
hardware or software bugs. We assume that the error can be
modeled as $e_c \sim \mathcal{N}(0, \sigma)$ for some unknown
variance $\sigma$; hence, $\Pr(m_c \mid v_c) = \mathcal{N}(v_c, \sigma)$.
That follows from prior work based on extensive measurement
studies (Weaver & McKee, 2008). Now, given N measurements of the
value of the performance counter, we compute their sample mean
$\mu$ and sample variance $S$. A scaled and shifted t-distribution
describes the marginal distribution of the unknown mean of a
Gaussian when the dependence on the variance has been
marginalized out (Gelman et al., 1995); i.e.,
$v_c \sim \mu + (S/\sqrt{N})\,\mathrm{Student}(\nu = N - 1)$. In our
experiments, the confidence level of the t-distribution was 95%.
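The error model above can be illustrated with a short sketch. It is
not the authors' implementation: the function names are hypothetical,
and it assumes that the scale term uses the sample standard deviation
(the usual form of the result in Gelman et al.), whereas the text
calls S the sample variance. Given N readings of one counter, it
returns the location, scale, and degrees of freedom of the shifted,
scaled Student-t, and draws one plausible true value from it.

```python
import math
import random
import statistics


def counter_posterior(samples):
    """Return (loc, scale, dof) of the shifted, scaled Student-t for v_c.

    Assumption: the spread is the sample standard deviation (n - 1
    denominator), so scale = s / sqrt(N) and dof = N - 1.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate spread")
    loc = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return loc, spread / math.sqrt(n), n - 1


def draw_true_value(loc, scale, dof, rng):
    """Draw one sample of v_c ~ loc + scale * Student(dof)."""
    z = rng.gauss(0.0, 1.0)
    # A chi-square variate with `dof` degrees of freedom is Gamma(dof/2, 2).
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return loc + scale * z / math.sqrt(chi2 / dof)


if __name__ == "__main__":
    readings = [10, 12, 11, 13, 9]  # hypothetical counter readings
    loc, scale, dof = counter_posterior(readings)
    rng = random.Random(0)
    print(loc, scale, dof, draw_true_value(loc, scale, dof, rng))
```

Sampling from the t-distribution with only the standard library keeps
the sketch dependency-free; a statistics package would normally be
used instead.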
Now, given a distribution of $v_c$ for every element of $o_t$, we
describe the construction of the BN model. Our goal is to
model resource utilization (a number in [0, 1]) for a relevant
set of architectural resources R. To do so, we compose the PC
measurements ($v_c$) by using algebraic (deterministic)
relationships derived from information about the CPU
architecture. Processor performance manuals (Yasin, 2014; Intel
Corp., 2016; Hall et al., 2017) and/or vendor contributions in
OS codebases (e.g., in the perf module in Linux) provide such
information. When available in the latter format (which is
indeed the case for all modern Intel, AMD, ARM, and IBM CPUs),
these relationships can be automatically parsed and used to
construct the BN. As our error-corrected measurements are
defined in terms of distributions, the algebraic models that
encode static information about relationships (based on the
microarchitecture of the processor or topology of the system)
now define statistical relationships between the $v_c$s (based
on the Jacobian relationships described in §3). Fig. 5 shows an
example of the BN model.
However, the types and meanings of hardware counters vary from
one kind of architecture to another because of the variation in
hardware organizations. As a result, the model defined by the BN
is parametric, changing with different processors and system
topologies (i.e., across all the different types of systems in a
data center).

[Figure 5. Bayesian network (uses the plate notation) used to
estimate resource utilization. Node groups: processing fabric,
memory interconnect, and system interconnect. Node labels include
FP Scalar Inst. Retired, FP Vector Inst. Retired, FP Arithmetic
Utilization, Backend Utilization, Port Util., Divider, #Cores,
#Threads, μops issued, Core Utilization, CPU Utilization, Memory
BW Utilization, #Sockets, DRAM Lat., DRAM BW, Cache Utilization,
PCR Utilization, PCS Utilization, Switched DMA Utilization,
Direct Attached DMA Utilization, Interconnect Utilization, and
Outstanding Requests. Plates are labeled A, P, S, and L.]
Consider the example of identifying memory bandwidth
utilization for a CPU core. According to the processor
documentation, the utilization can be computed by measuring the
number of outstanding memory requests (which is available as a
PC), i.e.,
Outstanding Requests[≥ $\theta_{MB}$] / Outstanding Requests[≥ 1].³
That is, we identify the fraction of cycles in some time window
in which the CPU core stalls because of insufficient bandwidth.
Naturally, in order to sustain maximum performance, it is
necessary to ensure that no stalls occur. The value $\theta_{MB}$
is processor-specific and might not always be known. In such
cases, we use the training approach described in §3 to learn
$\theta_{MB}$. The procedure is repeated for all relevant system
utilization counters (marked as “Util.” in Fig. 5), which
together represent $\hat{b}_t$. Such a BN model for a 16-core
Intel Xeon processor (with all PCIe lanes populated) has 68
nodes, of which 32 are directly measured and the remainder are
computed through inference.
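As a rough illustration of the ratio above, the following sketch
computes the utilization from a per-cycle trace of outstanding
memory requests. The trace, the function name, and the threshold
value are hypothetical; real hardware exposes the two cycle counts
directly as counters rather than as a trace, and the threshold is
processor-specific (or learned, as described above).

```python
def bandwidth_utilization(outstanding_per_cycle, theta_mb):
    """Fraction of busy cycles in which outstanding requests reach theta_mb.

    Mirrors Outstanding Requests[>= theta_mb] / Outstanding Requests[>= 1].
    Assumes theta_mb >= 1, so the result lies in [0, 1].
    """
    if theta_mb < 1:
        raise ValueError("theta_mb must be at least 1")
    busy = sum(1 for x in outstanding_per_cycle if x >= 1)
    saturated = sum(1 for x in outstanding_per_cycle if x >= theta_mb)
    if busy == 0:
        return 0.0  # no memory activity in the window
    return saturated / busy


if __name__ == "__main__":
    trace = [0, 1, 2, 5, 6, 0, 8, 3]  # hypothetical occupancy per cycle
    print(bandwidth_utilization(trace, theta_mb=5))
```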
BN Retraining. The architectural information required to
build the BN can be found in processor manuals (Intel Corp.,
2016; Sudhakar & Srinivasan, 2019; Hall et al., 2017) as
well as in machine-parsable databases in the Linux kernel
source code as part of the perf package. The only human
intervention required in the process of building the BN is
for filtering out those resources that cannot be controlled
with software (because they change too quickly). The BN
model should only be rebuilt when the underlying hardware
configuration changes, which Mars & Tang (2013) observe
happens every 5–6 years in a data center.
Implementation Details. We collect system-wide (for all
processes) performance counter measurements for a variety of
hardware events (described in Table 1). System-wide collection
leads to occasional spurious measurements (e.g., from interrupt
handlers); however, it allows us to make holistic measurements
(e.g., capture system calls or drivers that perform memory and
DMA operations). We make the minimum measurements needed to
infer whether a kernel scheduled to a CPU hardware thread is
core-bound (floating point- and integer-intensive). This allows
us to make scheduling decisions on co-located kernels, i.e.,
those that get scheduled to SMT/hyperthreads bound to a core.
The majority of measurements are made at the level of un-core
events that capture the performance of the memory interconnect
and the system bus, in order to identify kernels that are
bandwidth-bottlenecked. We do not explicitly model GPU
performance counters, as low-level scheduling decisions (e.g.,
warp-level scheduling) in GPUs are obfuscated by the NVIDIA
runtime/driver.

³ Here X[≥ t] counts cycles in which X is at least the threshold t.

Table 1. Performance counters used in test evaluation. We have
disambiguated the names to ensure platform independence.
  On-core events: Core clock cycles; reference clock cycles;
    temperature; instructions (µops for Intel) issued;
    instructions (µops for Intel) retired; un-utilized slots due
    to mis-speculation.
  Un-core & memory controller events (per socket): #Read/write
    requests to DRAM (from all channels); #local DRAM accesses;
    #remote DRAM accesses; #read/write requests to DRAM (from all
    channels) from IO sources; #PCIe read; #PCIe write; QPI (for
    Intel)/Nest (for IBM) transactions.
  OS/driver events: Free memory (CPU, GPU, FPGA); total memory
    (CPU, GPU, FPGA).
NN Model. The second part of the POMDP-based schedul-
ing model uses an NN (see Fig. 3) to learn the optimal policy
with which to schedule user tasks given a belief state. The
NN takes two graphs as inputs. The first input is the belief
state $\hat{b}_t$, encoded as vertex labels on a graph that
describes the topology of a computer system (i.e., the
organization shown in Fig. 4), and input labels that correspond
to the locations of inputs in the topology. The color coding in
Figs. 4 and 5 shows a mapping (i.e., vertex labels) between
nodes in the topology graph and $\hat{b}_t$. The second input is the
user’s program expressed as a DFG. We use graph network
(GN) layers (Battaglia et al., 2018) to “embed” the graphs
into a set of embedding vectors. GNs have been shown to
capture node, edge, and locality information. We chose
small, fully connected NNs for modeling the functional
transformations in the GN layers. Prior work in scheduling
(e.g., Grandl et al. (2016b); Wu et al. (2012)) has shown
the benefit of considering temporal information to capture
the dependencies of system resources over time as well as
the time evolution of the user DFG. We capture those rela-
tionships (between the embeddings of the input graphs) by
using an RNN, specifically an LSTM layer (Hochreiter &
Schmidhuber, 1997).
The action space $\mathcal{A}$ of the model is fixed by the
number of kernels and processors available in the system, which
is known ahead of time. The action space consists of the
following types of actions. (i) Execution actions correspond to
execution of a kernel on a processor/accelerator. (ii) Recon-
figuration actions correspond to reconfiguration of a single
FPGA context to a kernel. (iii) No-Op actions correspond to
not scheduling any task in a particular scheduler invocation.
No-Ops are useful when the system resources are maxi-
mally subscribed, and execution of more tasks will hinder
performance. The scheduler is invoked every time there is
an idle processor/accelerator in the system (i.e., every time
a processor finishes the work assigned to it), causing the
system to take one of the above actions.
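A minimal sketch of how such a fixed action space could be
enumerated is shown below. The `Action` type, the kernel and device
names, and the assumption that every kernel can run on every
processor are illustrative only and do not come from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Action:
    kind: str                      # "execute", "reconfigure", or "noop"
    kernel: Optional[str] = None   # kernel to run or to load on an FPGA
    device: Optional[str] = None   # target processor/accelerator


def build_action_space(kernels: Sequence[str],
                       processors: Sequence[str],
                       fpgas: Sequence[str]) -> List[Action]:
    """Enumerate execution, reconfiguration, and No-Op actions."""
    # (i) Execution: run a kernel on a processor/accelerator.
    actions = [Action("execute", k, p) for k, p in product(kernels, processors)]
    # (ii) Reconfiguration: load a kernel into a single FPGA context.
    actions += [Action("reconfigure", k, f) for k, f in product(kernels, fpgas)]
    # (iii) No-Op: schedule nothing in this invocation.
    actions.append(Action("noop"))
    return actions


if __name__ == "__main__":
    space = build_action_space(["align", "sort"], ["cpu0", "gpu0", "fpga0"], ["fpga0"])
    print(len(space))
```

Because the list depends only on the sets of kernels and devices, its
size is known before training starts, which matches the fixed output
layer that a policy network requires.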
Reward Function. The reward $r_t$ is based on the objective of
minimizing the runtime of a user DFG. At time $t$,
$r_t = -\sum_{i=0}^{t} 1/T_i$, where $T_i$ is the wall clock time
taken to execute the $i$ actions executing in the system at time
$t$. We picked $r_t$ as it represents the “makespan” of the schedule,
a metric that is popularly used in the traditional scheduling
literature and accurately represents the user-facing perfor-
mance of the system. Note that parallel actions are not
double-counted in this formulation. The BN and NN mod-
els are trained end-to-end using minimization of Eqn. 1
through back-propagation, as described in §3.
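The reward expression can be evaluated as in the sketch below, which
takes the formula literally: it sums the reciprocals of a list of
wall-clock times and negates the result. How the times $T_i$ are
collected from the running system is not specified here, so the input
list and the function name are assumptions made for illustration.

```python
def reward(wall_clock_times):
    """Return r_t = -sum(1 / T_i) over the supplied wall-clock times (seconds)."""
    if any(t <= 0 for t in wall_clock_times):
        raise ValueError("wall-clock times must be positive")
    return -sum(1.0 / t for t in wall_clock_times)


if __name__ == "__main__":
    print(reward([2.0, 4.0]))  # hypothetical timings
```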
Implementation details of the BN and NN models are pre-
sented in the supplementary material.
5. Evaluation & Discussion
We evaluated Symphony along the following dimensions. (i) How
well does Symphony perform compared to the state of the art?
(ii) How does Symphony’s runtime affect scheduling decisions?
(iii) What are the savings in training time compared to
traditional methods? The evaluation testbed consisted of a
rack-scale cluster of twelve IBM Power8 CPUs, two NVIDIA K40
and six K80 GPUs, and two FPGAs. We illustrated the generality
of the techniques on a variety of real-world workloads that used
CPUs, GPUs, and FPGAs: (i) variant calling and genotyping
analysis (Van der Auwera et al., 2013) on human genome datasets
using tools presented in Banerjee et al. (2016; 2017; 2019a);
Li & Durbin (2009; 2010); Langmead et al. (2009); McKenna et al.
(2010); Nothaft et al. (2015); Nothaft (2015); Rimmer et al.
(2014); Zaharia et al. (2011); (ii) epilepsy detection and
localization (Varatharajah et al., 2017) on intra-cranial
electroencephalography data; and (iii) online security
analytics (Cao et al., 2015) for intrusion detection systems.
State of the Art. Traditional dynamic scheduling techniques
(e.g., Isard et al. (2009); Giceva et al. (2014); Lyerly et al.
(2018); Ousterhout et al. (2013); Zhuravlev et al. (2010);
Zaharia et al. (2010)) use manually tuned heuristics (e.g.,
fairness, shortest-job-first) that prioritize simplicity and
generality over achieving the best-case workload performance,
often allocating coarse-grained resources (e.g., GBs of memory,
CPU threads) and making simplifying assumptions about the
underlying workload. Several ML-based scheduling strategies have
also been proposed, wherein the above heuristics are learned
from data. They use a variety of black-box ML models, e.g.,
model-free deep RL (Mao et al., 2016; 2018), collaborative
filtering (Delimitrou & Kozyrakis, 2013; 2014), and other
traditional ML techniques like SVMs (e.g., Mars et al. (2011);
Mars & Tang (2013); Yang et al. (2013); Zhang et al. (2014)). A
common theme in these techniques is that of treating the system
as a black box and performing scheduling to optimize application
throughput metrics. The above approaches are not well-suited to
heterogeneous, accelerator-rich systems in which architectural
diversity necessitates the use of low-level resources, which
cannot be measured directly and are not semantically comparable
across processors. As points of comparison to Symphony, we used
Graphene (Grandl et al., 2016b), a heuristic-accelerated job
shop optimization solver⁴; Sparrow (Ousterhout et al., 2013), a
randomized scheduler; and Paragon (Delimitrou & Kozyrakis,
2013), a collaborative filtering-based scheduler.

[Figure 6. Comparing performance of Symphony to that of other
popular schedulers for kernel executions in DFGs. Plot: CDF
(y-axis, 0 to 1) of oracle-normalized runtime (x-axis, 0 to 1)
for Symphony, Graphene, Paragon, and Sparrow.]

[Figure 7. Percentage of application executions that show a
degradation in performance. Plot: % of kernels (y-axis, 0 to 1)
for each scheduler (Sparrow, Graphene, Paragon, Symphony,
Oracle), split into degradation buckets 0, 0-10%, 10-20%, and
>20%.]
Baseline for Comparison. We defined the oracle schedule to
correspond to the best performance possible for running an
application on the evaluation system. It corresponds to a
completely isolated execution of an application. Even in
isolation, different concurrently executing kernels of the same
application contend for resources and might cause performance
degradation. For the benchmark applications, we accounted for
that by exhaustively executing schedules of the application DFGs
to find the one with the lowest runtime (i.e., the oracle run).
We measured the runtime of kernel $i$ in workload $j$ (in the
oracle run) as $t^{\mathrm{oracle}}_{i,j}$ across all kernels and
workloads. $t^{\mathrm{oracle}}_{i,j}$ serves as the baseline for
assessing the performance of Symphony.
⁴ Graphene was not originally designed to execute on
heterogeneous systems. In the supplementary material, we explain
the modifications we made to the algorithm.

Effectiveness of Scheduling Model. First, we quantified how well
Symphony can handle scheduling of kernels in a DFG, taking into
account resource contention and interference (i) at the
intra-DFG level; and (ii) when executing with an unknown
co-located workload utilizing compute and I/O resources. To do
so, we measured the runtimes of each of the kernels $i$ in the
workload $j$ (as above) to compute $t^{s}_{i,j}$ for each
scheduler $s$ under test. In Fig. 6, we illustrate the
distribution of oracle-normalized runtimes for each of the
kernels in the workloads we tested, i.e., a distribution of
$t^{s}_{i,j} / t^{\mathrm{oracle}}_{i,j}$ across 500 executions of
the three above workloads. In the figure, a distribution whose
probability mass is closest to 1 is preferred, as it implies the
least slowdown compared to the oracle. We observe that the
proposed technique significantly outperformed the state of the
art. In our experiments, the median and tail (i.e., 99th
percentile) runtime of Symphony outperformed the second best (in
this case, Paragon) by close to 32%. At the 99th percentile, the
generated schedules performed at a 6% loss relative to the
oracle. Next, we quantified the performance of end-to-end user
workloads, shown in Fig. 7. Here, we calculated
$1 - (\sum_i t^{s}_{i,j}) / (\sum_i t^{\mathrm{oracle}}_{i,j})$
for all 500 runs of the DFGs and grouped them into buckets of
different kinds of normalized performance. Symphony
significantly outperformed the other scheduling techniques,
running up to 60% of the applications with no performance loss
relative to the oracle execution, and the rest with a
performance loss of less than 20%.
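The two metrics above can be sketched as follows. The function names
and the sample timings are hypothetical. The workload-level
expression is implemented exactly as written, so a scheduler that is
slower than the oracle yields a negative value; the bucketing step
assumes that the magnitude of that value is what is grouped into the
categories shown in Fig. 7.

```python
def normalized_runtimes(t_sched, t_oracle):
    """Per-kernel ratios t^s_{i,j} / t^oracle_{i,j}."""
    return [s / o for s, o in zip(t_sched, t_oracle)]


def workload_loss(t_sched, t_oracle):
    """1 - sum_i t^s_{i,j} / sum_i t^oracle_{i,j}, as written in the text."""
    return 1.0 - sum(t_sched) / sum(t_oracle)


def degradation_bucket(loss, tol=1e-9):
    """Map the magnitude of the loss to the bucket labels of Fig. 7."""
    magnitude = abs(loss)
    if magnitude <= tol:
        return "0"
    if magnitude <= 0.10:
        return "0-10%"
    if magnitude <= 0.20:
        return "10-20%"
    return ">20%"


if __name__ == "__main__":
    oracle = [1.0, 2.0]   # hypothetical oracle kernel runtimes (s)
    sched = [1.0, 2.5]    # hypothetical runtimes under a scheduler (s)
    print(normalized_runtimes(sched, oracle))
    print(degradation_bucket(workload_loss(sched, oracle)))
```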
Latency. There are two latencies to consider in comparing
schedulers: the latency of the entire user workload (“LW”,
shown in Fig. 6), and the latency of the scheduler execution
(“LS”, shown in Fig. 8). In Fig. 8, we show two configurations
of the Symphony scheduler: (i) “No Opt”, which uses a belief
propagation-based update for the BN (and MCMC-based inference);
and (ii) “All Opt”, which uses the sampling technique described
in §3, accelerators⁵ to perform inference, and task batching
(described below). LW (≥ LS) is the user-facing metric of
interest. Symphony outperforms all baselines in terms of LW. In
terms of median LS, Symphony is 1.8× and 1.6× faster than
Paragon and Graphene, respectively. In contrast, Sparrow, which
randomly assigns tasks to processors, has 3.6× lower median
latency than Symphony. However, the reduced LS comes at the cost
of increased LW (see Fig. 6).
⁵ The accelerators include an NVIDIA K80 GPU for NN inference
and an FPGA for BN inference using Banerjee et al. (2019b).

Batching Task Execution. A key concern with Symphony is its
large tail latency (100× larger than its median; see Fig. 8)
compared to the other schedulers (which have deterministic
runtime). This increased latency is brought about by Symphony
having to perform significantly more computation if an RL policy
update is triggered. The scheduler latency adversely affects LW,
as the time spent executing scheduler calls is time not utilized
to make progress on the user workload. In order to deal with
this issue, our evaluation executed Symphony on batches of tasks
instead of single tasks, thereby amortizing the cost of
executing Symphony across the batch. Task batching works
synergistically with the sampling-based gradient propagation
technique to reduce the tail latency by as much as 12× (see
Fig. 3). Fig. 9 demonstrates the average improvement in LW
normalized to the oracle over a range of batch sizes. We observe
that the optimal value for batch size is about 128 tasks per
batch. This corresponds to the “All Opt” configuration in
Figs. 8 and 10 as well as Figs. 6 and 7. The “No Opt”
configuration in Fig. 8 is computed at a batch size of one.

[Figure 8. Symphony’s latency (“All Opt” & “No Opt”) compared to
prior work. Plot: CDF (y-axis, 0.1 to 1) of scheduler latency in
seconds (x-axis, log scale, 10^-3 to 10^1) for Sparrow, Paragon,
Graphene, No Opt, and All Opt.]

[Figure 9. Symphony’s performance (oracle normalized, in %) with
varying batch size. Plot: normalized runtime in % (y-axis, log
scale, 0.0001 to 10) versus batch size in tasks (x-axis, log
scale, 1 to 1000).]

[Figure 10. Training time for Symphony. An iteration is 2 RL
episodes of 20 steps. Plot: loss over oracle (y-axis, 0.3 to 1)
versus iteration count (x-axis, 0 to 12000) for RNN, No Opt.,
and All Opt.]
Training Time. Finally, we quantified the improvement in
training time offered by Symphony using the sampling-based
gradient computation methodology presented in §3. We used the
following baselines for evaluation: (i) a model-free RNN
(labeled “RNN” in Fig. 10); and (ii) the “All Opt.” and “No
Opt.” configurations from above. The RNN model here replaces the
BN (and inference) and system-topology-embedding GN (in Fig. 3)
with a 3-layer, fully connected NN to compute an embedding for
$o_t$. Fig. 10 illustrates the differences in performance of
these configurations with respect to degradation in performance
of the user DFGs relative to the oracle schedule (i.e.,
$1 - (\sum_i t^{s}_{i,j}) / (\sum_i t^{\mathrm{oracle}}_{i,j})$).
We observe that the RNN is significantly less sample-efficient
than the proposed POMDP is; specifically, it is ~2.2× worse than
Symphony. Further, linearly extrapolating time to convergence
from iteration $12 \times 10^3$, the RNN would need
$> 48 \times 10^3$ iterations to achieve the same accuracy as
Symphony.
The difference in training time for the “No Opt.” and “All Opt.”
configurations in Fig. 10 can be attributed to (i) time taken to
perform back-propagation for policy updates; and (ii) effective
scheduler latency. Linearly extrapolating the training loss, we
observe that “All Opt.” is at least 4.3× more sample-efficient
than “No Opt.” to reach a 30% mean loss relative to the oracle.
That reduction is significant because the continuous churn of
user workloads and machine configurations in a cloud, as pointed
out in Mars et al. (2011), would require that the scheduling
model be periodically retrained. In absolute terms, the “All
Opt.” configuration is able to achieve ~30% mean loss relative
to the oracle scheduler in 700 hours of training and ~4400
iterations of workload execution. That corresponds to
approximately 500 hours of system execution; hence, the total
process takes 1200 hours. Though this might appear to be over 7
weeks of time, in wall clock time this is approximately 2 weeks
because we use parallel A3C-based training. In fact, the
limiting factor here is the availability of FPGAs, of which we
have only 2 in the evaluation cluster, hence limiting the number
of RL episodes that can be run in parallel.
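The arithmetic behind the "over 7 weeks" figure can be checked with
the short calculation below. The constants are the hour counts
quoted in the text; the variable and function names are illustrative.

```python
TRAINING_HOURS = 700      # time spent in training, as reported above
EXECUTION_HOURS = 500     # time spent executing workloads, as reported above
HOURS_PER_WEEK = 24 * 7


def total_hours():
    """Total sequential cost of reaching ~30% mean loss."""
    return TRAINING_HOURS + EXECUTION_HOURS


def total_weeks():
    """The same cost expressed in weeks of sequential time."""
    return total_hours() / HOURS_PER_WEEK


if __name__ == "__main__":
    print(total_hours(), round(total_weeks(), 2))
```

At 168 hours per week, 1200 sequential hours is a little over 7
weeks, which is the quantity that parallel training reduces to about
2 weeks of wall-clock time.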
6. Conclusion
This paper presents (i) a domain-driven Bayesian RL model
for scheduling that captures the statistical dependencies be-
tween architectural resources; and (ii) a sampling-based
technique that allows the computation of gradients of a
Bayesian model without performing full probabilistic
inference. As data center architectures become more complex
(Asanović, 2014; Shao & Brooks, 2015), techniques
like the one proposed here will be critical in the deployment
of future accelerated applications.
Acknowledgments
We thank K. Saboo, S. Lumetta, W-M. Hwu, K. Atchley,
and J. Applequist for their insightful comments on the early
drafts of this manuscript. This research was supported in
part by the National Science Foundation (NSF) under Grant
Nos. CNS 13-37732 and CNS 16-24790; by IBM under
a Faculty Award and through equipment donations; and
by Xilinx and Intel through equipment donations. Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do
not necessarily reflect the views of the NSF, IBM, Xilinx or
Intel.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean,
J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur,
M., Levenberg, J., Monga, R., Moore, S., Murray, D. G.,
Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke,
M., Yu, Y., and Zheng, X. TensorFlow: A System for
Large-scale Machine Learning. In Proceedings of the
12th USENIX Conference on Operating Systems Design
and Implementation, pp. 265–283. USENIX Association,
2016.
Asanović, K. FireBox: A Hardware Building Block for
2020 Warehouse-Scale Computers. Santa Clara, CA,
February 2014. USENIX Association.
Asanović, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer,
K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K.,
Wawrzynek, J., Wessel, D., and Yelick, K. A View of the
Parallel Computing Landscape. Commun. ACM, 52(10):
56–67, October 2009.
Astrom, K. J. Optimal control of Markov processes with
incomplete state information. Journal of Mathematical
Analysis and Applications, 10(1):174–205, 1965.
Banerjee, S. S., Athreya, A. P., Mainzer, L. S., Jongeneel,
C. V., Hwu, W.-M., Kalbarczyk, Z. T., and Iyer, R. K.
Efficient and scalable workflows for genomic analyses.
In Proceedings of the ACM International Workshop on
Data-Intensive Distributed Computing, DIDC ’16, pp.
27–36, 2016.
Banerjee, S. S., El-Hadedy, M., Tan, C. Y., Kalbarczyk, Z. T.,
Lumetta, S. S., and Iyer, R. K. On accelerating pair-HMM
computations in programmable hardware. In Proc. 27th
International Conference on Field Programmable Logic
and Applications, FPL 2017, Ghent, Belgium, September
4-8, 2017, pp. 1–8, 2017.
Banerjee, S. S., El-Hadedy, M., Lim, J. B., Kalbarczyk,
Z. T., Chen, D., Lumetta, S. S., and Iyer, R. K. ASAP: Ac-
celerated Short-Read Alignment on Programmable Hard-
ware. IEEE Transactions on Computers, 68(3):331–346,
March 2019a.
Banerjee, S. S., Kalbarczyk, Z. T., and Iyer, R. K. AcMC²:
Accelerating Markov Chain Monte Carlo Algorithms
for Probabilistic Models. In Proceedings of the Twenty-
Fourth International Conference on Architectural Support
for Programming Languages and Operating Systems, pp.
515–528, 2019b.
Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-
Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti,
A., Raposo, D., Santoro, A., Faulkner, R., et al. Rela-
tional inductive biases, deep learning, and graph networks.
arXiv preprint arXiv:1806.01261, 2018.
Binder, J., Koller, D., Russell, S., and Kanazawa, K. Adap-
tive Probabilistic Networks with Hidden Variables. Ma-
chine Learning, 29(2/3):213–244, 1997.
Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N.,
Goglin, B., Mercier, G., Thibault, S., and Namyst, R.
hwloc: A Generic Framework for Managing Hardware
Affinities in HPC Applications. In Proc. 2010 18th Eu-
romicro Conference on Parallel, Distributed and Network-
based Processing, pp. 180–186, Feb 2010.
Cao, P., Badger, E., Kalbarczyk, Z., Iyer, R., and Slagell, A.
Preemptive intrusion detection: Theoretical framework
and real-world measurements. In Proceedings of the 2015
Symposium and Bootcamp on the Science of Security,
HotSoS ’15, pp. 5:1–5:12, 2015.
Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry,
R., Bradshaw, R., and Weizenbaum, N. FlumeJava: Easy, Efficient
Data-Parallel Pipelines. In ACM SIGPLAN Conference
on Programming Language Design and Implementation
(PLDI), pp. 363–375, 2010.
Chowdhury, M., Zhong, Y., and Stoica, I. Efficient coflow
scheduling with varys. In Proceedings of the 2014 ACM
Conference on SIGCOMM, SIGCOMM ’14, pp. 443–454,
2014.
Dagum, P. and Luby, M. Approximating probabilistic infer-
ence in Bayesian belief networks is NP-hard. Artificial
Intelligence, 60(1):141–153, 1993.
Delimitrou, C. and Kozyrakis, C. Paragon: QoS-aware
Scheduling for Heterogeneous Datacenters. SIGPLAN
Not., 48(4):77–88, March 2013.
Delimitrou, C. and Kozyrakis, C. Quasar: Resource-
efficient and QoS-aware Cluster Management. In Pro-
ceedings of the 19th International Conference on Archi-
tectural Support for Programming Languages and Oper-
ating Systems, ASPLOS ’14, pp. 127–144, 2014.
Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert,
M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and
Zhokhov, P. OpenAI Baselines. https://github.
com/openai/baselines, 2017.
Doweck, J. Inside 6th generation Intel Core code
named Skylake: New Microarchitecture and Power
Management. https://www.hotchips.org/wp-
content/uploads/hc_archives/hc28/
HC28.23-Tuesday-Epub/HC28.23.90-High-
Perform-Epub/HC28.23.911-Skylake-
Doweck-Intel_SK3-r13b.pdf, 2016. Accessed
2019-03-05.
Dreyer, R. S. and Alpert, D. B. Apparatus for monitoring
the performance of a microprocessor, August 1997. US
Patent 5,657,253.
Foerster, J. N., Assael, Y. M., de Freitas, N., and White-
son, S. Learning to communicate to solve riddles with
deep distributed recurrent Q-networks. arXiv preprint
arXiv:1602.02672, 2016.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. Bayesian
Data Analysis. Chapman & Hall, New York, 1995.
Giceva, J., Alonso, G., Roscoe, T., and Harris, T. Deploy-
ment of query plans on multicores. Proc. VLDB Endow.,
8(3):233–244, November 2014.
Grandl, R., Chowdhury, M., Akella, A., and Anantha-
narayanan, G. Altruistic Scheduling in Multi-resource
Clusters. In Proceedings of the 12th USENIX Conference
on Operating Systems Design and Implementation, pp.
65–80. USENIX Association, 2016a.
Grandl, R., Kandula, S., Rao, S., Akella, A., and Kulkarni,
J. Graphene: Packing and Dependency-aware Scheduling
for Data-parallel Clusters. In Proceedings of the 12th
USENIX Conference on Operating Systems Design and
Implementation, pp. 81–97, 2016b.
Hall, B., Bergner, P., Housfater, A. S., Kandasamy, M.,
Magno, T., Mericas, A., Munroe, S., Oliveira, M.,
Schmidt, B., Schmidt, W., et al. Performance optimiza-
tion and tuning techniques for IBM Power Systems pro-
cessors including IBM POWER8. IBM Redbooks, 2017.
Hausknecht, M. and Stone, P. Deep recurrent Q-learning
for partially observable MDPs. In 2015 AAAI Fall Sym-
posium Series, 2015.
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
Joseph, A. D., Katz, R., Shenker, S., and Stoica, I.
Mesos: A platform for fine-grained resource sharing in
the data center. In Proceedings of the 8th USENIX Confer-
ence on Networked Systems Design and Implementation,
NSDI’11, pp. 295–308. USENIX Association, 2011.
Hochreiter, S. and Schmidhuber, J. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson,
S. Deep variational reinforcement learning for POMDPs.
arXiv preprint arXiv:1806.02426, 2018.
Intel Corp. Intel 64 and ia-32 architectures optimization
reference manual. Intel Corporation, Sept, 2014.
Intel Corp. Intel® 64 and IA-32 Architectures Software
Developer Manuals. https://software.intel.
com/en-us/articles/intel-sdm, 2016. Ac-
cessed 2019-03-05.
Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar,
K., and Goldberg, A. Quincy: Fair scheduling for dis-
tributed computing clusters. In Proceedings of the ACM
SIGOPS 22nd Symposium on Operating Systems Princi-
ples, SOSP ’09, pp. 261–276, 2009.
Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T.,
Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforce-
ment learning with unsupervised auxiliary tasks. arXiv
preprint arXiv:1611.05397, 2016.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Plan-
ning and acting in partially observable stochastic domains.
Artificial Intelligence, 101(1–2):99–134, 1998.
Karkus, P., Hsu, D., and Lee, W. S. QMDP-Net: Deep
learning for planning under partial observability. In Ad-
vances in Neural Information Processing Systems, pp.
4694–4704, 2017.
Kleen, A. PMU-Tools. https://github.com/
andikleen/pmu-tools, 2010. Accessed 2019-03-
05.
Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In
Advances in neural information processing systems, pp.
1008–1014, 2000.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L.
Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biol, 10(3):
R25, 2009.
Li, H. and Durbin, R. Fast and accurate short-read
alignment with Burrows–Wheeler transform. Bioinformatics,
25(14):1754–1760, May 2009. doi: 10.1093/
bioinformatics/btp324. URL http://dx.doi.org/
10.1093/bioinformatics/btp324.
Li, H. and Durbin, R. Fast and accurate long-read alignment
with Burrows–Wheeler transform. Bioinformatics, 26(5):
589–595, 2010.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J.,
Homer, N., Marth, G., Abecasis, G., Durbin, R., et al. The
sequence alignment/map format and SAMtools. Bioinfor-
matics, 25(16):2078–2079, 2009.
Lyerly, R., Murray, A., Barbalace, A., and Ravindran, B.
Aira: A framework for flexible compute kernel execution
in heterogeneous platforms. IEEE Transactions on Par-
allel and Distributed Systems, 29(2):269–282, Feb 2018.
ISSN 1045-9219. doi: 10.1109/TPDS.2017.2761748.
Mao, H., Alizadeh, M., Menache, I., and Kandula, S. Re-
source management with deep reinforcement learning. In
Proceedings of the 15th ACM Workshop on Hot Topics in
Networks, pp. 50–56. ACM, 2016.
Mao, H., Schwarzkopf, M., Venkatakrishnan, S. B.,
Meng, Z., and Alizadeh, M. Learning scheduling al-
gorithms for data processing clusters. arXiv preprint
arXiv:1810.01963, 2018.
Mars, J. and Tang, L. Whare-map: Heterogeneity in "homo-
geneous" warehouse-scale computers. SIGARCH Comput.
Archit. News, 41(3):619–630, June 2013.
Mars, J., Tang, L., and Hundt, R. Heterogeneity in “ho-
mogeneous” warehouse-scale computers: A performance
opportunity. IEEE Comput. Archit. Lett., 10(2):29–32,
July 2011. ISSN 1556-6056.
Mastrolilli, M. and Svensson, O. (Acyclic) job shops are hard
to approximate. In 2008 49th Annual IEEE Symposium
on Foundations of Computer Science, pp. 583–592, Oct
2008. doi: 10.1109/FOCS.2008.36.
Mastrolilli, M. and Svensson, O. Improved bounds for
flow shop scheduling. In International Colloquium on
Automata, Languages, and Programming, pp. 677–688.
Springer, 2009.
McCool, M., Reinders, J., and Robison, A. Structured
Parallel Programming: Patterns for Efficient Computa-
tion. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 1st edition, 2012. ISBN 9780123914439,
9780124159938.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibul-
skis, K., Kernytsky, A., Garimella, K., Altshuler, D.,
Gabriel, S., Daly, M., and DePristo, M. A. The genome
analysis toolkit: A MapReduce framework for analyzing
next-generation DNA sequencing data. Genome Research,
20(9):1297–1303, Jul 2010. doi: 10.1101/gr.107524.110.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje-
land, A. K., Ostrovski, G., et al. Human-level control
through deep reinforcement learning. Nature, 518(7540):
529, 2015.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Harley, T.,
Lillicrap, T. P., Silver, D., and Kavukcuoglu, K. Asyn-
chronous methods for deep reinforcement learning. In
Proceedings of the 33rd International Conference on
Machine Learning - Volume 48, ICML’16, pp. 1928–1937.
JMLR.org, 2016.
Narasimhan, K., Kulkarni, T., and Barzilay, R. Language
understanding for text-based games using deep reinforce-
ment learning. arXiv preprint arXiv:1506.08941, 2015.
Nothaft, F. Scalable genome resequencing with ADAM and
avocado. Master’s thesis, EECS Department, University
of California, Berkeley, May 2015.
Nothaft, F., Massie, M., Danford, T., Zhang, Z., Laser-
son, U., Yeksigian, C., Kottalam, J., Ahuja, A., Ham-
merbacher, J., Linderman, M., Franklin, M. J., Joseph,
A. D., and Patterson, D. A. Rethinking data-intensive
science using scalable analytics systems. In Proceedings
of the 2015 ACM SIGMOD International Conference on
Management of Data, SIGMOD ’15, pp. 631–646, New
York, NY, USA, 2015. ACM. ISBN 978-1-4503-2758-9.
doi: 10.1145/2723372.2742787.
Ousterhout, K., Wendell, P., Zaharia, M., and Stoica, I. Spar-
row: Distributed, low latency scheduling. In Proceed-
ings of the Twenty-Fourth ACM Symposium on Operating
Systems Principles, SOSP ’13, pp. 69–84, New York,
NY, USA, 2013. ACM. ISBN 978-1-4503-2388-8. doi:
10.1145/2517349.2522716.
Rimmer, A., Phan, H., Mathieson, I., Iqbal, Z., Twigg, S.
R. F., Wilkie, A. O. M., McVean, G., and Lunter, G.
Integrating mapping-, assembly- and haplotype-based
approaches for calling variants in clinical sequencing
applications. Nature Genetics, 46(8):912–918, Jul 2014.
Russell, S., Binder, J., Koller, D., and Kanazawa, K. Local
learning in probabilistic networks with hidden variables.
In Proceedings of the 14th International Joint Conference
on Artificial Intelligence - Volume 2, IJCAI’95, pp. 1146–
1152, San Francisco, CA, USA, 1995. Morgan Kaufmann
Publishers Inc. ISBN 1-55860-363-8.
Shachter, R. D. Bayes-ball: The rational pastime (for de-
termining irrelevance and requisite information in be-
lief networks and influence diagrams). arXiv preprint
arXiv:1301.7412, 2013.
Shao, Y. and Brooks, D. Research Infrastructures for Hard-
ware Accelerators. Synthesis Lectures on Computer Ar-
chitecture. Morgan & Claypool Publishers, 2015.
Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A.,
Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz,
N., Barreto, A., et al. The Predictron: End-to-end learning
and planning. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pp. 3191–
3199. JMLR, 2017.
Stuecheli, J., Blaner, B., Johns, C. R., and Siegel, M. S.
CAPI: A Coherent Accelerator Processor Interface. IBM
Journal of Research and Development, 59(1):7:1–7:7,
Jan 2015. ISSN 0018-8646. doi: 10.1147/JRD.2014.
2380198.
Sudhakar, A. T. and Srinivasan, M. IBM POWER
in-memory collection counters.
https://developer.ibm.com/articles/power9-in-memory-collection-counters/, 2019.
Accessed 2019-03-05.
Terpstra, D., Jagode, H., You, H., and Dongarra, J. Col-
lecting Performance Data with PAPI-C. In Müller, M. S.,
Resch, M. M., Schulz, A., and Nagel, W. E. (eds.), Tools
for High Performance Computing 2009, pp. 157–173,
Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
Van der Auwera, G. A., Carneiro, M. O., Hartl, C., Poplin,
R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir,
K., Roazen, D., Thibault, J., Banks, E., Garimella, K. V.,
Altshuler, D., Gabriel, S., and DePristo, M. A. From
FastQ data to high-confidence variant calls: The genome
analysis toolkit best practices pipeline. Current Protocols
in Bioinformatics, 43(1):11.10.1–11.10.33, 2013.
Varatharajah, Y., Chong, M. J., Saboo, K., Berry, B.,
Brinkmann, B., Worrell, G., and Iyer, R. EEG-GRAPH: A
Factor-Graph-Based Model for Capturing Spatial, Tempo-
ral, and Observational Relationships in Electroencephalo-
grams. In Advances in Neural Information Processing
Systems, pp. 5377–5386, 2017.
Weaver, V. M. and McKee, S. A. Can hardware performance
counters be trusted? In 2008 IEEE International Sympo-
sium on Workload Characterization, pp. 141–150. IEEE,
2008.
Wu, H., Diamos, G., Cadambi, S., and Yalamanchili, S. Ker-
nel Weaver: Automatically Fusing Database Primitives
for Efficient GPU Computation. In 2012 45th Annual
IEEE/ACM International Symposium on Microarchitec-
ture, pp. 107–118, Dec 2012.
Xu, L., Butt, A. R., Lim, S., and Kannan, R. A
Heterogeneity-Aware Task Scheduler for Spark. In Proc.
2018 IEEE International Conference on Cluster Comput-
ing (CLUSTER), pp. 245–256, Sep. 2018.
Yang, H., Breslow, A., Mars, J., and Tang, L. Bubble-flux:
Precise Online QoS Management for Increased Utiliza-
tion in Warehouse Scale Computers. In Proceedings of
the 40th Annual International Symposium on Computer
Architecture, ISCA ’13, pp. 607–618, 2013.
Yasin, A. A Top-Down method for performance analysis
and counters architecture. In Proc. 2014 IEEE Interna-
tional Symposium on Performance Analysis of Systems
and Software (ISPASS), pp. 35–44, March 2014.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K.,
Shenker, S., and Stoica, I. Delay Scheduling: A Sim-
ple Technique for Achieving Locality and Fairness in
Cluster Scheduling. In Proceedings of the 5th European
Conference on Computer Systems, pp. 265–278, 2010.
Zaharia, M., Bolosky, W. J., Curtis, K., Fox, A., Patterson,
D., Shenker, S., Stoica, I., Karp, R. M., and Sittler, T.
Faster and more accurate sequence alignment with SNAP.
arXiv preprint arXiv:1111.5572, 2011.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J.,
McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I.
Resilient Distributed Datasets: A Fault-tolerant Abstrac-
tion for In-memory Cluster Computing. In Proceedings
of the 9th USENIX Conference on Networked Systems
Design and Implementation, NSDI’12, pp. 15–28, 2012.
Zhang, Y., Laurenzano, M. A., Mars, J., and Tang, L. SMiTe:
Precise QoS Prediction on Real-System SMT Processors
to Improve Utilization in Warehouse Scale Computers. In
2014 47th Annual IEEE/ACM International Symposium
on Microarchitecture, pp. 406–418, Dec 2014.
Zhu, P., Li, X., Poupart, P., and Miao, G. On improving
deep reinforcement learning for POMDPs. arXiv preprint
arXiv:1804.06309, 2018.
Zhuravlev, S., Blagodurov, S., and Fedorova, A. Addressing
Shared Resource Contention in Multicore Processors via
Scheduling. SIGPLAN Not., 45(3):129–142, March 2010.
Zook, J. M., Catoe, D., McDaniel, J., Vang, L., Spies, N.,
Sidow, A., Weng, Z., Liu, Y., Mason, C. E., Alexander, N.,
Henaff, E., McIntyre, A. B., Chandramohan, D., Chen, F.,
Jaeger, E., Moshrefi, A., Pham, K., Stedman, W., Liang,
T., Saghbini, M., Dzakula, Z., Hastie, A., Cao, H., Deikus,
G., Schadt, E., Sebra, R., Bashir, A., Truty, R. M., Chang,
C. C., Gulbahce, N., Zhao, K., Ghosh, S., Hyland, F.,
Fu, Y., Chaisson, M., Xiao, C., Trow, J., Sherry, S. T.,
Zaranek, A. W., Ball, M., Bobe, J., Estep, P., Church,
G. M., Marks, P., Kyriazopoulou-Panagiotopoulou, S.,
Zheng, G. X., Schnall-Levin, M., Ordonez, H. S., Mu-
divarti, P. A., Giorda, K., Sheng, Y., Rypdal, K. B., and
Salit, M. Extensive sequencing of seven human genomes
to characterize benchmark reference materials. Scientific
Data, 3:160025, Jun 2016.
Supplementary Material
A. Extended Motivational Example
Current schedulers prioritize the use of simple online heuris-
tics (Grandl et al., 2016b) and coarse-grained resource buck-
eting (e.g., core counts, free memory) and require user
labeling of commonly used system resources (Hindman
et al., 2011; Grandl et al., 2016a) to make scheduling de-
cisions. Those approaches are untenable in truly heteroge-
neous settings as (i) defining such heuristics is difficult over
the combinatorial space of application-processor/accelerator
configurations; and (ii) user-based resource usage labeling
requires in-depth understanding of the underlying system.
This paper demonstrates the use of ML to automatically
infer such heuristics and their evolution over time as new
user workloads and/or new accelerators are added.
A.1. Dealing with Architectural Heterogeneity
We reiterate that state-of-the-art schedulers do not model the
emergent heterogeneous compute platforms that are being
widely adopted in data centers and hence leave a lot to be
desired (as can also be seen in the performance of our base-
lines). Consider, for example, the execution of the forward
algorithm on PairHMM models (Banerjee et al., 2017), a
computation that is commonly performed in computational
genomics workloads. Fig. 11 shows the significant diversity
(nearly 100×) in performance of this single workload across
CPUs (from Intel and IBM), GPUs (two models of GPUs
from NVIDIA) and FPGA implementations. The increasing
heterogeneity necessitates rethinking of the design and im-
plementation of future schedulers, as the current approach
will require an extraordinary amount of manual tuning and
expertise to adapt to the emergent systems. In contrast,
the proposed technique eliminates that work and automates
the whole process of learning the right granularity of re-
sources and scheduling workloads in cloud-based, dynamic,
multi-tenant environments, thereby improving application
performance and system utilization, all with minimal human
supervision. Prior work uses microarchitectural throughput
metrics such as clock cycles per instruction (Giceva et al.,
2014; Delimitrou & Kozyrakis, 2013; 2014; Mars et al.,
2011; Mars & Tang, 2013) as proxies for processor affinities.
In our case, such metrics are not usable because of the
wide diversity in processors, i.e., CPU-centric units cannot
describe the performance of GPUs/FPGAs.
[Figure 11: scatter plot on logarithmic axes. x-axis: Normalized
Throughput; y-axis: Normalized Throughput/Watt. Points: Intel
Xeon, GPU - K40, FPGA, GPU - K80, IBM Power8.]
Figure 11. Architectural diversity leading to varied performance
for the PairHMM kernel.
[Figure 12: pairwise matrix over compute kernels, with the same
kernel labels on both axes and a color scale from 0.6 to 1.
Labels as extracted (some adjacent fragments may belong to a
single label): KMerSearch, LandauVishkin, Landau-Vishkin, CIGAR,
FilterPairs, Entropy, Gaussian, Thresholding, PairHMM, KMerConst,
DeBruijn, SmithWaterman, GenomeDiff, Genotype.]
Figure 12. Degradation in runtime of co-located kernels due to
shared resource contention.
A.2. Dealing with Resource Granularity
Traditional schedulers use coarse-grained resource bucketing,
i.e., they schedule macro-resources like CPU core
counts and GBs of memory. That simplifies the design of
the scheduling algorithms (both the optimization algorithms
and attached heuristics), but leaves the scheduler unable to
measure low-level sources of resource contention in the system.
Contention for such low-level resources is often the cause of
performance degradation and variability. Consider, for example,
the concurrent execution of several compute kernels
(described in Appendix C.2) on co-located hyper-threads
(i.e., threads that share resources on a single core) on an Intel
CPU. If we abstract the problem at the level of CPU threads
and allocated memory, then those kernels should execute as if
in isolation. The normalized runtime variation that actually
occurs is illustrated in Fig. 12. We observe a slowdown of as
much as 40% (i.e., a co-located kernel achieves only 60% of its
isolated performance) for some combinations of kernels, and
almost no slowdown for others. That problem is further
exacerbated by the architectural diversity in processors that
we described earlier.
The proposed technique accounts for such contention by
explicitly collecting information on low-level system state
through performance counter measurements, and by estimating
resource usage in the system by encoding those
measurements in its POMDP model.
[Figure 13: block diagram. Labels: Computer Systems; PMU
Measurements o_t; Bayesian Inference on the BN Model,
Pr(s_t, a_{t-1}, o_t; Θ_BN); b̂_t; System Topology T; Data Flow
Graph G_t; Neural Network with Graph Network 1, Graph Network 2,
LSTM, and FC layers; Actor f_π(b̂_t; Θ_π) with action
â_t ∼ π(a_t | b̂_t); Critic f_V(b̂_t; Θ_V) with output V(b̂_t).
Inset, "Graph network layers corresponding to NN": Edge Level
(φ_e), Vertex Level (φ_v), Global Level (φ_u), aggregations
ρ_{e→v}, ρ_{v→u}, ρ_{e→u}, and attributes u, V, u′.]
Figure 13. Proposed POMDP model.
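To make the slowdown numbers quoted for Fig. 12 concrete, the following is a minimal sketch of the arithmetic. It assumes that the plotted value is the ratio of isolated to co-located runtime (an assumption on our part, as are the function names and the example runtimes).

```python
def normalized_performance(isolated_s: float, colocated_s: float) -> float:
    """Ratio of isolated to co-located runtime; 1.0 means no interference."""
    if isolated_s <= 0 or colocated_s <= 0:
        raise ValueError("runtimes must be positive")
    return isolated_s / colocated_s


def slowdown(isolated_s: float, colocated_s: float) -> float:
    """Fraction of performance lost to co-location (0.0 means none)."""
    return 1.0 - normalized_performance(isolated_s, colocated_s)


# Hypothetical runtimes: 6 s in isolation, 10 s next to a contending kernel.
perf = normalized_performance(6.0, 10.0)
loss = slowdown(6.0, 10.0)
print(f"normalized performance = {perf:.2f}, slowdown = {loss:.0%}")
```

Under that assumption, a value of 0.6 on the color scale corresponds to a 40% slowdown, and a value of 1 to no measurable interference.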
B. Implementation Details
The scheduling framework functions as follows.
1. The scheduler first makes measurements by using the
available processor performance counters (e.g., instruc-
tions retired, cache misses).
2. When a processor becomes idle (finishes running the
current kernel), it invokes the scheduler.
3. The measurements are fed into the scheduler’s BN model
as input. Using those measurements, the BN model com-
putes the utilization of different levels of architectural
resources in the system (e.g., memory bandwidth utiliza-
tion, PCIe link utilization). We refer to those utilizations
as the state of the system.
4. The computed utilization numbers, user programs represented
as a DFG, and a system topology graph are fed
into an NN. The NN produces a scheduling decision that
is actuated in the system. Each action is a
kernel-processor pair.
5. Finally, the scheduler gets feedback from the system (i.e.,
the reward) in terms of the time it took for the job to run
as a result of its scheduling decision.
6. While in training mode, if an incorrect decision is made,
Symphony enqueues an update of the policy parameters
using back-propagation on the A2C/A3C loss function.
An incorrect decision is one where kernel input-output
dependencies are not respected, or a kernel-accelerator
pair is picked where the accelerator does not provide an
implementation of the kernel.
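The validity check described in step 6 can be sketched as follows. This is an illustrative sketch rather than Symphony's implementation: the `Kernel` class, the `is_valid_action` function, and the kernel and processor names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Kernel:
    name: str
    # Names of kernels whose outputs this kernel consumes (DFG predecessors).
    deps: set = field(default_factory=set)


def is_valid_action(kernel, processor, finished, implementations):
    """Return True if scheduling `kernel` on `processor` is a valid action.

    finished: set of kernel names that have already completed.
    implementations: dict mapping processor name -> set of kernel names
        for which that processor provides an implementation.
    """
    deps_ok = kernel.deps <= finished  # input-output dependencies respected
    impl_ok = kernel.name in implementations.get(processor, set())
    return deps_ok and impl_ok


# Hypothetical two-kernel DFG: "align" feeds "call".
align = Kernel("align")
call = Kernel("call", deps={"align"})
impls = {"cpu0": {"align", "call"}, "fpga0": {"align"}}

# Dependencies met and the FPGA implements "align".
print(is_valid_action(align, "fpga0", set(), impls))
# "align" has not finished yet, so "call" cannot run.
print(is_valid_action(call, "cpu0", set(), impls))
# The FPGA provides no implementation of "call".
print(is_valid_action(call, "fpga0", {"align"}, impls))
```

In training mode, an action for which such a check fails would be the trigger for enqueuing a policy update.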
B.1. Graph Network Details
The structure of the graph network used in the proposed
model is illustrated in Fig. 13. The numbers of parameters
used in the different layers of the graph network are listed
in Table 2.
Table 2. Mapping of the graph network layer functions in Fig. 13.
We use the notation FCNN(a, b) to denote a fully connected
network with two hidden layers of a and b hidden units,
respectively.
Function in GN    Function in 1    Function in 2
φe                FCNN(64, 32)     FCNN(64, 32)
φv                FCNN(32, 16)     –
φu                FCNN(16, 16)     FCNN(32, 16)
ρe→v              ∑e               –
ρv→u              ∑v               –
ρe→u              –                ReLU(e)
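To illustrate how the update functions (φ) and aggregation functions (ρ) in Table 2 fit together, the following is a minimal sketch of one graph-network pass with sum aggregations for ρe→v and ρv→u and no ρe→u, as in the first column of the table. It is a generic sketch under our own assumptions: the toy φ functions stand in for the FCNNs, and the function names and example graph are hypothetical.

```python
def vec_sum(vectors, dim):
    """Element-wise sum of equal-length vectors (zero vector if empty)."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total


def gn_block(vertices, edges, u, phi_e, phi_v, phi_u):
    """One graph-network pass.

    vertices: list of vertex feature vectors.
    edges: list of (src, dst, feature_vector) tuples.
    u: global feature vector.
    """
    # Edge level: update each edge from its features and its endpoints.
    new_edges = [(s, d, phi_e(e, vertices[s], vertices[d], u))
                 for s, d, e in edges]
    edge_dim = len(new_edges[0][2]) if new_edges else 0

    # rho_{e->v}: sum the updated features of each vertex's incoming edges,
    # then apply the vertex-level update.
    new_vertices = []
    for i, v in enumerate(vertices):
        incoming = [e for _, d, e in new_edges if d == i]
        new_vertices.append(phi_v(vec_sum(incoming, edge_dim), v, u))

    # rho_{v->u}: sum over all updated vertices, then the global update.
    vertex_dim = len(new_vertices[0]) if new_vertices else 0
    new_u = phi_u(vec_sum(new_vertices, vertex_dim), u)
    return new_vertices, new_edges, new_u


# Toy stand-ins for the learned update functions (FCNNs in Table 2).
def phi_e(e, v_src, v_dst, u):
    return [e[0] + v_src[0] + v_dst[0]]


def phi_v(agg_e, v, u):
    return [agg_e[0] + v[0]]


def phi_u(agg_v, u):
    return [agg_v[0] + u[0]]


vertices = [[1.0], [2.0], [3.0]]
edges = [(0, 1, [0.5]), (1, 2, [1.5]), (0, 2, [2.0])]
new_vertices, new_edges, new_u = gn_block(
    vertices, edges, [0.0], phi_e, phi_v, phi_u)
print(new_vertices, new_u)
```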
B.2. Hyperparameters
The hyperparameters used to train the proposed POMDP
model are listed in Table 3.
B.3. System Measurement Details
Topology Information. Consider the example of a standard
NUMA-based computing system with PCIe-based accelerators,
shown in Fig. 14. The system contains (i) multiple
CPUs which have non-uniform access to memory, (ii) sev-
eral accelerators (including GPUs and FPGAs) each with
their own memory, and (iii) a system interconnect which
connects all of the components of the system together. Sym-
phony encodes the system topology as a graph T = (P,N)
(also shown in Fig. 14). The nodes of the graph P corre-
spond to the processing elements (and attached memory)
and memory/system interconnects. Each of these nodes
p ∈ P has an attached resource utilization vector. For
example, in an Intel processor, the utilization vector would
include the utilization of micro-op issue ports, floating-point
units, etc. (Doweck, 2016; Intel Corp., 2014).
Table 3. Hyperparameters used in the model.
Hyperparameter        Value
Learning Rate         0.005
LSTM Unroll Length    20
ns                    20
ne                    2
[Figure 14: system diagram (left) and its graph encoding (right).
Diagram labels: CPU 0, CPU 1, Memory, Memory Bus, PCIe Switch,
PCIe Bus, PCIe Backplane, GPU, CAPI FPGA, NIC, PCIe Device, To
Network. Legend: Directly Connected PCIe Lanes; PCIe lanes
connected over switches; PCIe Based Accelerator; PCIe Bus;
Memory Interconnect; CPU Thread; CPU Core.]
Figure 14. Example of a dual-socket NUMA-based system topology
with a PCIe interconnect and PCIe devices. The figure on the
right shows a graph encoding of the topology.
The scheduler queries the system topology and builds the
topology graph T (which is used as an input to the RL
agent) using hwloc (Broquedis et al., 2010). hwloc pro-
vides information about CPU cores, caches, NUMA mem-
ory nodes, and the PCIe interconnect layout (i.e., connec-
tions between the PCIe root complex and PCIe switches),
as well as connection information on peripheral acceler-
ators, storage, and network devices in the system. The
scheduler does not explicitly model the rack-scale or data
center network (unlike some previous approaches, e.g., Is-
ard et al. (2009); Chowdhury et al. (2014)), but the BN
and RL model can be extended to do so. Our measurements
consider injection bandwidth at the network interface
card (NIC) to be a proxy for network performance, i.e.,
the NIC is modeled as an accelerator that accepts data at
min(PCIe Bandwidth, Injection Bandwidth).
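As a rough illustration of the topology encoding described above, the sketch below stores a utilization vector per node and the connections between nodes, and models the NIC rate as the minimum of the two bandwidths. The class, the node names, and the bandwidth values are hypothetical; in Symphony the topology is obtained from hwloc rather than built by hand.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyGraph:
    """Topology graph whose nodes carry a resource utilization vector."""
    utilization: dict = field(default_factory=dict)  # node -> list of floats
    links: dict = field(default_factory=dict)        # node -> set of neighbors

    def add_node(self, name, utilization):
        self.utilization[name] = list(utilization)
        self.links.setdefault(name, set())

    def add_link(self, a, b):
        # Links are treated as undirected in this sketch.
        self.links[a].add(b)
        self.links[b].add(a)


def nic_bandwidth(pcie_gb_per_s, injection_gb_per_s):
    """Rate at which the NIC, modeled as an accelerator, accepts data."""
    return min(pcie_gb_per_s, injection_gb_per_s)


topo = TopologyGraph()
topo.add_node("cpu0", [0.0, 0.0])    # e.g., issue-port and FPU utilization
topo.add_node("pcie_switch0", [0.0])
topo.add_node("nic0", [0.0])
topo.add_link("cpu0", "pcie_switch0")
topo.add_link("pcie_switch0", "nic0")

print(sorted(topo.links["pcie_switch0"]))
# Hypothetical bandwidths in GB/s: the slower of the two limits the NIC.
print(nic_bandwidth(pcie_gb_per_s=16.0, injection_gb_per_s=12.5))
```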
Performance Counter Measurements. Configuring and accessing
performance counters requires kernel-mode privileges;
hence, Linux exposes those operations through system calls
that configure and read the performance counter data.
Symphony uses a combination of user-space tools, e.g.,
libPAPI (Terpstra et al., 2010), PMUTools (Kleen, 2010),
and perf, that wrap around the system call interface to make
both system-specific and system-independent measurements.
We configure the performance counters to make system-wide
measurements (i.e., for all processes). Configuring the
counters in that way might incur security risks, particularly
by opening up side channels through which attackers could
infer workload characteristics. However, analysis or
mitigation of such risks is not in the scope of this paper
and may form the basis of future work.
[Figure 15 diagram not reproduced. Labels: Host CPU (Symphony Runtime Env; FPGA Driver and Library); PCIe; PCIe Endpoint; Storage Device; NIC; GPU; Control Unit; Pipelined Interconnect; Memory Controller; On-board DRAM; On-chip Memory; PE(0,0) … PE(n,3); Systolic element; Pipelined systolic connections (Application Accelerators); PCIe DMA Controller; CAPI Controller (can use either PCIe or CAPI); P2P Device Access. Legend: Runtime System Controlled; Dynamically Reconfigured; OS Controlled; Off-the-shelf IPs.]
Figure 15. Architecture of the FPGA-based hardware co-processor
controlled by Symphony.
All kernel executions are non-preemptive in the context
of the proposed runtime; however, the OS scheduler can
preempt CPU threads. Further, we prevent the OS scheduler
from rebalancing tasks/threads once they are assigned to a
particular CPU. This is achieved by explicitly setting affinities of
threads to cores (i.e., pinning them).
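As an illustration of pinning, the following minimal sketch sets the affinity of the calling thread to a single CPU. The paper does not state which API Symphony uses. This sketch assumes Linux and uses Python's `os.sched_setaffinity`, and the helper name `pin_current_thread` is hypothetical.

```python
import os


def pin_current_thread(cpu: int) -> set:
    """Pin the calling thread to one CPU and return the resulting affinity set.

    Relies on os.sched_setaffinity, which Python provides on Linux only.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    os.sched_setaffinity(0, {cpu})  # pid 0 refers to the calling thread
    return set(os.sched_getaffinity(0))


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    print(pin_current_thread(allowed[0]))  # the OS will no longer migrate this thread
    os.sched_setaffinity(0, allowed)       # restore the original affinity
```

Once the affinity set contains a single CPU, the OS scheduler can still preempt the thread but cannot move it to another core.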
Performance Penalties. Monitoring performance counters
without taking interrupts is almost free.
In our implementation, we capture on-core performance
counters directly before and after a single kernel invocation.
Un-core performance counters are measured periodically
(every million dynamic instructions on a core) by using a
performance monitoring interrupt. On an IBM PowerPC
processor, the interrupt handler initiates a DMA transfer of
the performance counters to memory (Sudhakar & Srini-
vasan, 2019), thereby incurring no performance penalty
(other than the time to service the interrupt). On Intel pro-
cessors, the interrupt handler has to explicitly read the per-
formance counter registers and write them to memory. In
our tests (on Intel processors), we observed a ~3% performance
penalty for applications with interrupts enabled. That
penalty corresponds to the execution of a user-mode interrupt
with an average latency of 900 ns.
Distributed Execution. In our evaluation, we deployed
Symphony in a rack-scale distributed context (over
an EDR Infiniband network fabric) as a centralized sched-
uler controlling all processing resources. Here, all the perfor-
mance counter measurements are sent over the network to a
centralized server that makes scheduling decisions. This ap-
proach works well at the scale of a rack, where all resources
are essentially one hop away at 0.2-µs latency. Extending
Symphony to larger or slower networks might present challenges,
because network latency could cause stale performance
counter data to reach the scheduler. We will address these
challenges in future work.
Table 4. Hardware specifications of test cluster.
Name  #  Specifications
M1    2  CPU IBM Power8 (SMT 8); 870 GB RAM; GPU NVIDIA K80; FPGA Alpha Data 7V3
M2    4  CPU IBM Power8 (SMT 4); 512 GB RAM; GPU NVIDIA K40; FPGA Nallatech 385
N     1  Mellanox FDR Infiniband
B.4. Dynamically Reconfigurable FPGA Accelerator
Our implementation and evaluation of Symphony use a
custom FPGA accelerator (see Fig. 15). Due to space limitations,
we describe the features of the accelerator only briefly here.
• Processing Elements (PEs). The co-processor is optimized
to execute the computational kernels as a single
instruction of the application. Sets of four neighboring
PEs are directly connected as a systolic element, thereby
enabling high-bandwidth data transfer between PEs and
forming the quantum of reconfiguration.
• Host-FPGA Communication. The board interfaces with
the host CPU over PCIe and can be configured to commu-
nicate with the host processor over this interface in one of
two ways: (i) using direct memory access (DMA) to the
host's memory over the PCIe bus, or (ii) using IBM’s coherent
accelerator processor interface (CAPI) (Stuecheli
et al., 2015).
• Dynamic Reconfiguration. The configuration of the accelerator
(i.e., which kernels' PEs are available at any time)
is controlled by Symphony. Symphony treats the reconfiguration
of the accelerator as a kernel that has to be
dispatched to the FPGA. The state of the accelerator is
fed into Symphony along with the system topology T.
• Launching Kernels. Recall that CPU executors (i.e.,
threads that are given tasks to execute) are pinned or
bound to an underlying hardware SMT thread. Accelerators,
however, require the CPUs to initiate their execution. As a
result, each accelerator in the system is assigned a proxy
executor thread that orchestrates its execution (i.e., launches
kernels, polls for completion, etc.). These executors are
responsible for managing their own queues of
tasks that are “waiting” for execution.
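The proxy-executor arrangement described in the last bullet can be sketched as a host thread that owns a queue of waiting tasks and runs them one at a time. This is a simplified illustration and not Symphony's implementation. The class name `ProxyExecutor` and its methods are hypothetical, and a plain Python callable stands in for launching a kernel on the device and polling for its completion.

```python
import queue
import threading
from typing import Callable, List


class ProxyExecutor(threading.Thread):
    """Host thread that launches tasks for one accelerator, one at a time."""

    def __init__(self, name: str) -> None:
        super().__init__(name=name, daemon=True)
        self._waiting: queue.Queue = queue.Queue()  # tasks waiting for execution
        self.results: List[object] = []

    def submit(self, task: Callable[[], object]) -> None:
        """Enqueue a task; it waits until all earlier tasks have finished."""
        self._waiting.put(task)

    def shutdown(self) -> None:
        """Ask the executor to stop once the queued tasks have run."""
        self._waiting.put(None)

    def run(self) -> None:
        while True:
            task = self._waiting.get()
            if task is None:
                break
            # Stand-in for "launch on the device, then poll for completion":
            # the callable runs to completion here, without preemption by
            # other queued tasks.
            self.results.append(task())


if __name__ == "__main__":
    fpga = ProxyExecutor("fpga0-proxy")
    fpga.start()
    fpga.submit(lambda: "reconfigure")   # reconfiguration dispatched like a kernel
    fpga.submit(lambda: "align-kernel")
    fpga.shutdown()
    fpga.join()
    print(fpga.results)
```

Giving each accelerator its own thread and queue keeps launch and polling logic off the pinned CPU executors, and a reconfiguration request can be queued in the same way as any other kernel.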
C. Evaluation Environment
C.1. Evaluation System
All evaluation experiments are performed on an 11-node
rack-scale test-bed of IBM Power8 CPUs, NVIDIA K40
and K80 GPUs, as well as FPGAs (listed in Table 4). All the
machines in the cluster are connected using a single-switch
EDR Infiniband network.
C.2. Evaluation Workloads
We illustrate the generality of the proposed approach on a
variety of real-world workloads (listed in Table 5) that use
CPUs, GPUs, and FPGAs:
1. variant-calling and genotyping analysis (Van der Auw-
era et al., 2013) on human genome datasets appropriate
for clinical use (consisting of Align, IR, and HC in
Table 5),
2. epilepsy detection and localization (Varatharajah et al.,
2017) on intra-cranial electroencephalography data;
and
3. online security analytics (Cao et al., 2015) on
network- and host-level intrusion detection system
event-streams.
For the variant-calling and genotyping workload, we use the
NA12878 genome sample from the GIAB consortium (Zook
et al., 2016) for all our experiments, as it is representative
of human clinical datasets. For the EEG and AT workloads,
we use the same datasets as discussed in the original papers.
Table 5. Workloads used in the evaluation.
Application              CPU  GPU  FPGA  Implementations
Alignment (Align)        ✓    ✓    ✓     (Li & Durbin, 2009; 2010; Langmead et al., 2009; Zaharia et al., 2011; Banerjee et al., 2019a; 2016)
Indel Realignment (IR)   ✓    ✗    ✗     (McKenna et al., 2010; Nothaft et al., 2015)
Variant Calling (HC)     ✓    ✓    ✓     (Li et al., 2009; McKenna et al., 2010; Nothaft, 2015; Rimmer et al., 2014; Banerjee et al., 2017)
EEG-Graph (EEG)          ✓    ✓    ✓     (Varatharajah et al., 2017; Banerjee et al., 2019b)
AttackTagger (AT)        ✓    ✓    ✓     (Cao et al., 2015; Banerjee et al., 2019b)
