Intelligent Orchestration of ADAS Pipelines on Next Generation
  Automotive Platforms by Ghose, Anirban et al.
Anirban Ghose, Srijeeta Maity, Arijit Kar, Kaustubh Maloo, Soumyajit Dey
Indian Institute of Technology, Kharagpur
Email: anirban.ghose@cse.iitkgp.ernet.in,{srijeeta.maity, arijit.kar14, kaustubh.maloo}@iitkgp.ac.in, soumya@cse.iitkgp.ac.in
Abstract—Advanced Driver-Assistance Systems (ADAS) is one of the
primary drivers behind increasing levels of autonomy, driving comfort in
this age of connected mobility. However, the performance of such systems
is a function of execution rate which demands on-board platform-level
support. With GPGPU platforms making their way into automobiles,
there exists an opportunity to adaptively support high execution rates
for ADAS tasks by exploiting architectural heterogeneity, keeping in
mind thermal reliability and long-term platform aging. We propose a
future-proof, learning-based adaptive scheduling framework that lever-
ages Reinforcement Learning to discover suitable scenario based task-
mapping decisions for accommodating increased task-level throughput
requirements.
Index Terms—ADAS, OpenCL, Machine Learning, Control Theory,
Heterogeneous Multicore, Real Time Scheduling
I. INTRODUCTION
Recent versions of automotive software standards like adap-
tive AUTOSAR [1] recommend that multiple software features
should share compute platforms in an adaptive co-scheduled man-
ner accommodating dynamic mapping and scheduling of software
tasks/features. In the context of Advanced Driver-Assistance (ADAS)
software, while existing works partially explore this promise [2], they
do not address how ADAS functionalities can be mapped dynamically
on modern multicore platforms which are also heterogeneous in
nature, i.e. different cores on the SoC adhere to different computing
paradigms like control-flow intensive processing of CPUs, SIMD
style high throughput processing of GPUs, re-configurable blocks like
FPGAs etc. Accommodating dynamic task mapping requests can be
difficult in heterogeneous architectures for two reasons:
i) the suitability of a specific task on a core type needs to be learned
using profiling runs, and ii) a mapping request needs to be satisfied while
keeping in mind the architectural demands of other existing tasks.
An ADAS system constitutes multiple object detection pipelines
that process sensor data periodically leveraging state of the art
Deep Neural Networks (DNNs) and Convolutional Neural Networks
(CNNs). The pipelines are used to detect objects in the vicinity
and accordingly dispatch commands to other vehicular subsystems
such as park assist, anti-lock braking systems etc. for taking relevant
actions. Therefore, there exists a natural requirement for real time
guarantees for executing these object detection pipelines. Recent
works [2], [3] emphasize designing efficient scheduling algorithms
at the system level, in addition to algorithmic optimizations on neural
network workloads [4], [5], for meeting these real time requirements.
Additionally, ADAS detection pipelines impose different frames per
second (FPS) requirements for different detection tasks depending on
the current environmental context. Even for the same detection task,
given different driving scenarios, these FPS requirements are subject
to change to meet a desired level of object detection accuracy. For
example, a pedestrian detection system would like to process images
at a higher frame rate if the on-board GPS points to the fact that
the vehicle is approaching a congested area. In situations like this,
for accommodating increased FPS requests, the underlying scheduler
must allocate resources for increased object detection accuracy while
maintaining real time guarantees. The state of the art ADAS schedul-
ing algorithms for heterogeneous CPU-GPU platforms optimize the
execution of ADAS pipelines on the GPU by i) decreasing the overall
latency of detection jobs via software pipelining approaches so as to
increase the object detection accuracy [2], ii) leveraging algorithmic
fusion techniques for processing multiple frames concurrently [2]
or iii) opting for data fusion approaches followed by concurrent
execution of multiple DNN workloads in the GPU hardware [3].
We note that the existing approaches are not equipped to handle
tasks with time varying dynamic FPS requirements dictated by
different driving contexts. Additionally, leveraging a single line of
approach for scheduling these pipelines may not always yield the
best possible scheduling solution. For example, fusing and processing
frames concurrently on hardware imposes a high memory footprint
for selected pipelines and may not always be feasible.
Given the set of ADAS detection workloads, designing automotive
computing solutions with the ability to sustain the maximum FPS
requirements of all detection pipelines simultaneously may lead to
over-provisioning of resources on a restricted memory architecture,
and also increased power consumption and thermal aging of the
heterogeneous platform. A generalized approach exploring oppor-
tunistic GPGPU optimizations must be envisioned that combines the
techniques mentioned above for ascertaining optimal application to
architecture scheduling decisions at runtime while keeping in mind
the overall power budget of the target platform. Since investigating
every possible mapping decision imposes a considerably large search
space of decisions to be evaluated for finding an optimal solution,
an intelligent task manager needs to be designed which will predict
task-device mapping decisions for each pipeline subject to dynamic
FPS requirements over time.
[Figure: block diagram with the labels Oracle, Request, Inference Engine, Scheduling Decisions, ADAS ECU (CPU, GPU), Environment Frames, Cloud, Edge]
Fig. 1. Runtime System Overview
The present work proposes an intelligent runtime system which can
manage the mapping and scheduling of ADAS detection pipelines
in next-generation automotive embedded platforms in a self-learning
fashion so that the time varying dynamic detection requirements of
existing pipelines are efficiently managed while maintaining real time
guarantees. An overview of the proposed software architecture for our
runtime system is depicted in Fig. 1. We assume there exists an oracle
executing as a service on the vehicle software stack which keeps
track of environmental parameters such as the terrain the vehicle is
currently driving in, observes the output of existing ADAS detection
pipelines, interacts with on-board sensors (e.g. GPS) and generates
requests that characterize FPS requirements for detection job(s).
Our proposed intelligent runtime system comprises an inference
engine which exists as a service running on a cloud server. The engine
leverages a learned model trained using Reinforcement Learning (RL)
techniques. The model is specific for the current set of detection
pipelines executing on the ADAS ECU and is periodically retrained
in the cloud whenever an over-the-air software update occurs such
as i) the injection of a new detection pipeline in the current set of
jobs and ii) the refinement of parameters for an existing pipeline.
The inference engine uses the learned model to determine a set of
task-device mapping decisions and accordingly informs the oracle
whether the requests can be accommodated or not.
The admissible decisions are communicated to a low level sched-
uler running on the ADAS ECU. The decisions reported by the
inference engine are based on ground-truth learned models which
assume that latency of a task for a predicted mapping decision
remains constant for all invocations on a given device. However, this
is not the case for a shared compute platform due to interference
from other tasks. For handling such mispredicted scenarios, the low-
level scheduler employs state-of-the-art control-theoretic scheduling
approaches that apply relevant core level DVFS to ensure predictable
task latency. The salient features of the proposed work are summa-
rized as follows.
• We characterize a Reinforcement Learning (RL) based problem
formulation in the context of real time scheduling on ADAS
platforms and present a training methodology for the same.
• We present an inference engine which is a discrete event
scheduling simulator that determines task-device mapping de-
cisions subject to dynamic oracle requests.
• We create an intelligent runtime scheduler which leverages
control-theoretic schemes to mitigate potential deadline misses
due to bad quality mapping decisions. We provide extensive
validation results justifying the usefulness of our RL assisted
ADAS deployment architecture.
II. PROBLEM FORMULATION
Let $\mathcal{J} = \{G_1, G_2, \dots, G_N\}$ be the set of $N$ ADAS jobs to be
scheduled. We model an ADAS job as a directed acyclic graph (DAG)
$G_k = \langle T_k, E_k \rangle$ where $T_k = \{t^k_1, t^k_2, \dots, t^k_n\}$ denotes the set of
tasks and $E_k \subseteq T_k \times T_k$ denotes the set of edges, where each edge
$(t^k_i, t^k_j)$ denotes that task $t^k_j$ cannot start execution until
$t^k_i$ has finished. In the context of ADAS detection
pipelines, each task refers to a data parallel computational kernel
[6]. Given $\mathcal{J}$, we denote an oracle request to be of the form $R =
\{\langle G_1, w_1, p_1 \rangle, \langle G_2, w_2, p_2 \rangle, \dots, \langle G_N, w_N, p_N \rangle\}$ where each tuple
$\langle G_k, w_k, p_k \rangle$ specifies that job $G_k \in \mathcal{J}$ arrives periodically with period
$p_k$ and processes $w_k$ frames in each execution instance. The set of
requests $R$ is made by the oracle based on observations of the current
driving scenario. As long as the scenario does not change, the request
is maintained at its current state. Note that existing ideas of processing
multiple frames concurrently at a given rate, as well as fusing the
decision over multiple frames by increasing the execution frequency,
can both be captured by our task models. Given the specification for
each DAG $G_k$ in $R$, let us denote $\mathcal{G}_k = \{G^1_k, \dots, G^h_k\}$ as the set
of all execution instances of DAG $G_k$, where $h = H/p_k$ and $H$ is
the hyper-period, i.e. the l.c.m. of the periods (inverse of rate)
of the DAGs. We denote the $j$-th execution instance of a DAG $G_k$ as
$G^j_k$ and the $i$-th task belonging to it as $t^{j,k}_i$. We define the set
$\mathcal{G} = \mathcal{G}_1 \cup \mathcal{G}_2 \cup \dots \cup \mathcal{G}_N$ to be the set of all execution instances of all
DAGs in the job set $\mathcal{J}$ executing in the hyper-period $H$. Given $\mathcal{G}$,
we denote a hyper-period snapshot $\mathcal{H}$ to be the set of tasks of each
DAG execution instance $G^j_k \in \mathcal{G}$ waiting for a dispatch decision. We
say that every DAG $G^j_k \in \mathcal{G}$ has finished execution iff $\mathcal{H}$ becomes
empty. We denote $\mathcal{F}(\mathcal{H})$ as the head of $\mathcal{H}$ comprising tasks that are
ready to execute and define it as follows.
Definition II.1. Given a hyper-period snapshot $\mathcal{H}$, the frontier $\mathcal{F}(\mathcal{H})$
is a set of independent tasks belonging to a subset of DAGs
$\mathcal{G}' \subseteq \mathcal{G}$ such that the following precedence constraints hold: i) each
DAG $G^j_k = \langle T^j_k, E^j_k \rangle \in \mathcal{G}'$ must finish execution before $G^{j+1}_k$ and
ii) for each DAG $G^j_k \in \mathcal{G}'$, the predecessors of each task $t^{j,k}_i \in T^j_k$
belonging to $\mathcal{F}(\mathcal{H})$, i.e. tasks $t^{j,k}_l$ such that $(t^{j,k}_l, t^{j,k}_i) \in E^j_k$, have
finished execution.
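As a small illustration of the hyper-period and the instance count $h = H/p_k$ used above, the following Python sketch computes both for a set of periods. The function names and the integer periods (20 and 30 time units) are hypothetical and chosen only so that the result matches a job set with three instances of one DAG and two of the other.

```python
from functools import reduce
from math import gcd


def hyper_period(periods):
    """Least common multiple of all (integer) DAG periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def instance_counts(periods):
    """Number of execution instances h = H / p_k of each DAG in one hyper-period."""
    H = hyper_period(periods.values())
    return {name: H // p for name, p in periods.items()}


# Hypothetical periods in arbitrary time units (not taken from the paper).
periods = {"G1": 20, "G2": 30}
H = hyper_period(periods.values())
counts = instance_counts(periods)
```

With these illustrative periods the hyper-period is 60, so G1 contributes three instances and G2 contributes two.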
The frontier F is ordered by the following ranking measures.
Definition II.2. The rank measure blevel of a task $t$ in DAG
$G = \langle T, E \rangle$ represents the best case execution time estimate to finish
tasks in the longest path starting from $t$ to a task that has no
successors in $G$, assuming all resources are available, and is computed
as $blevel(t) = e_t + \max_{t' \in succ(t)} blevel(t')$, where $e_t$ is the worst case
execution time (WCET) of task $t$, $succ(t) = \{t' \mid (t, t') \in E\}$, and the
maximum is taken as 0 for a task without successors.
Definition II.3. The rank measure local deadline of a task $t$ in
DAG instance $G = \langle T, E \rangle$ represents the absolute deadline of $t$ and is
computed as $local\_deadline(t) = d - blevel(t) + e_t$, where $d$ is the
absolute deadline of $G$, blevel is the aforementioned rank measure
and $e_t$ is the WCET of $t$.
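The two rank measures can be sketched in a few lines of Python. The DAG, the WCET values and the DAG deadline below are hypothetical; only the two formulas follow Definitions II.2 and II.3.

```python
from functools import lru_cache

# Hypothetical 4-task DAG: each task maps to the list of its successors.
succ = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"], "t4": []}
# Illustrative WCET values e_t in milliseconds (not taken from the paper).
wcet = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 2.0}


@lru_cache(maxsize=None)
def blevel(task):
    """e_t plus the largest blevel among the successors (0 for a sink task)."""
    return wcet[task] + max((blevel(s) for s in succ[task]), default=0.0)


def local_deadline(task, dag_deadline):
    """Latest finish time of `task` that still lets its longest path meet d."""
    return dag_deadline - blevel(task) + wcet[task]


# Order tasks by increasing local deadline, as done for the frontier.
ranked = sorted(succ, key=lambda t: local_deadline(t, dag_deadline=10.0))
```

Here the source task t1 has the longest remaining path and therefore the earliest local deadline, so it is ranked first.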
Task execution flow: Let us consider, for some hyper-period snapshot
$\mathcal{H}$, a frontier of tasks $\mathcal{F}(\mathcal{H})$ sorted by the local deadline ranking
measure. Given this sorted list of tasks in $\mathcal{F}(\mathcal{H})$, the set of
available devices $\mathcal{P}$ in a target heterogeneous multicore, and the task
$t_{min} \in \mathcal{F}(\mathcal{H})$ with the minimum local deadline as inputs, let a
mapping function $\mathcal{M}$ return a task-device mapping $m = \langle T, P \rangle$
where $T$ is a set of tasks comprising the task $t_{min}$ and
its descendants, and $P \in \mathcal{P}$ is one device of the target
heterogeneous platform. The choice of $T$ is motivated by the fact
that in certain runtime contexts it may be beneficial that multiple
tasks/kernels are fused and mapped to $P$ for achieving better register
and cache usage and avoiding the launch overhead of individual tasks.
Related works [7], [8] investigate the efficacy of kernel fusion on
heterogeneous CPU/GPU architectures by considering different runtime
contexts. The primary
objective of this work lies in learning these contexts in the form of
a policy function $\pi$ which may be used to design $\mathcal{M}$ that modifies
the hyper-period states. The reason for leveraging a learning based
approach can be attributed to the large space of scheduling decisions
that are possible using kernel fusion. Considering a DAG $G$ of depth
$D$, we assume only vertical fusion, i.e. all tasks selected up to a
particular depth in the mapping decision $m = \langle T, P \rangle$ are fused and
mapped to $P$. The total number of possible fusion based mapping
configurations for the entire DAG $G$ is therefore equal to the number
of integer compositions of $D$, which is $2^{D-1}$ [9]. Furthermore, since
each fusion based mapping configuration has the option of getting
mapped to a CPU or GPU device, the total number of possible
scheduling decisions for a single DAG is actually $\omega(2^D)$. Considering
the $N$ jobs in $\mathcal{J}$, the total space of scheduling decisions is $\omega(2^{ND})$. We
next elaborate with an illustrative example how $\mathcal{M}$ is used to obtain
mapping decisions in this exponential search space.
Given tmin ∈ F(H), the function M is applied on H to yield
a new hyper-period snapshot H′ and a new frontier F(H′). This
process of applying M and generating subsequent hyper-period
snapshots is continued until each DAG in G has been scheduled.
[Figure: successive hyper-period snapshots $\mathcal{H}^{(0)}, \mathcal{H}^{(1)}, \dots, \mathcal{H}^{(k)}$ for the job set $\mathcal{J} = \{G_1, G_2\}$ (three instances of $G_1$, two instances of $G_2$, four tasks per instance), each with its frontier $\mathcal{F}$ and an application of $\mathcal{M}$, next to a CPU/GPU Gantt chart over time]
Fig. 2. RL Assisted Task Mapping
We present a representative example depicting a complete schedule
of task-device mappings for a job set $\mathcal{J} = \{G_1, G_2\}$ in Fig. 2.
Given a hyper-period $H$, there exist three instances of job $G_1$
(deadlines are shown using dashed red lines) and two instances
of job $G_2$ (deadlines are shown using dotted blue lines). The set
$\mathcal{G}$ is therefore $\{G^1_1, G^2_1, G^3_1, G^1_2, G^2_2\}$. Initially the frontier contains
the tasks of $G^1_1$ and $G^1_2$ which have no predecessors, i.e. $\mathcal{F} = \{t^{1,1}_1, t^{1,2}_1\}$.
Tasks are selected from $\mathcal{F}$ based on the local deadline rank measure.
The mapping function $\mathcal{M}$ generates a task-device mapping decision
$m = \langle \{t^{1,1}_1, t^{1,1}_2\}, CPU \rangle$. This is shown in the Gantt chart on the
right hand side of Fig. 2. This in turn creates the hyper-period
snapshot $\mathcal{H}^{(1)}$ as depicted in the figure. The corresponding list of
mapping decisions is depicted in the Gantt chart schedule in Fig. 2.
We summarize our formulation of learning based dispatch as follows.
Given a set $\mathcal{G}$ constructed from a job set $\mathcal{J}$ of ADAS detection
pipelines for a given oracle request $R$ to be executed on the
heterogeneous platform $\mathcal{P}$, the objective of the proposed scheduling
scheme is to leverage a policy function $\pi$ trained using RL which
dictates the choice of the task-device mapping function $\mathcal{M}$.
The set of task-mapping decisions reported by $\pi$ is platform
agnostic and does not consider the latency variation that might occur
due to platform level interference factors such as thermal throttling,
memory thrashing, shared memory contention by other tasks etc.
This variation may potentially lead to scenarios where the actual
deadline miss rate exceeds the predicted deadline miss rate. For this
purpose, in the deployment phase, the scheduling scheme monitors
the execution times of each task-device mapping decision and enters
into a safe control-theoretic mode of dispatch whenever it observes
a potential deadline violation. To understand this we define the
following measure.
Definition II.4. The slack measure for the $z$-th task mapping decision
$m = \langle T, P \rangle$ pertaining to some DAG instance $G^j_k$ is defined as
$sl^{(z)}(G^j_k) = D^{(z)}(T) - wrt(T)$, where $D^{(z)}(T)$ denotes the current
time remaining to meet the deadline for the DAG instance $G^j_k$ and
$wrt(T) = \sum_{t \in T \cup SUCC(T)} e_t$ denotes the worst case remaining
time for executing the tasks belonging to $T$ and the set $SUCC(T)$, which
comprises all descendants of tasks in $T$ in $G^j_k$.
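The slack of Definition II.4 can be sketched as follows. The chain-shaped DAG, the WCET values and the time to deadline are hypothetical; the helper names are introduced here only for illustration.

```python
def descendants(tasks, succ):
    """All tasks reachable from `tasks` through successor edges."""
    seen, stack = set(), list(tasks)
    while stack:
        for nxt in succ[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def slack(tasks, succ, wcet, time_to_deadline):
    """sl = D(T) - wrt(T), with wrt summed over T and all of its descendants."""
    remaining = set(tasks) | descendants(tasks, succ)
    wrt = sum(wcet[t] for t in remaining)
    return time_to_deadline - wrt


# Hypothetical chain t1 -> t2 -> t3 -> t4 with illustrative WCETs (ms).
succ = {"t1": ["t2"], "t2": ["t3"], "t3": ["t4"], "t4": []}
wcet = {"t1": 2.0, "t2": 3.0, "t3": 1.0, "t4": 2.0}

# Mapping decision T = {t3} observed with 5 ms left until the DAG deadline.
sl = slack({"t3"}, succ, wcet, time_to_deadline=5.0)
```

In this example the remaining work is t3 and t4 (3 ms in total), which leaves a slack of 2 ms.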
In order to ensure that deadline requirements are met at runtime,
the inequality $sl^{(z)}(G^j_k) \geq \delta_{sl}$ must be respected, where we use $\delta_{sl}$
to model the overall uncertainty associated with the WCET estimates
of the tasks constituting the DAG. The scheduling scheme enters into
the safe mode if it observes that $sl^{(z)}(G^j_k) < \delta_{sl}$ and applies core-
level DVFS iteratively to each successive mapping decision for $G^j_k$
until it is ensured that $sl^{(z')}(G^j_k) \geq \delta_{sl}$ where $z' > z$.
[Figure: CPU/GPU Gantt chart over time annotating, for each mapping decision, the time to deadline $D^{(z)}(T_i)$, the worst case remaining time $wrt(T_i)$ and the slack $sl^{(z)}(G^j_k)$ relative to $\delta_{sl}$, for the task groups $T_1, T_2, T_3$ and $T'_1, T'_2, T'_3$]
Fig. 3. Safe Low Level Scheduling
We elaborate the safe low-level dispatch mechanism with the
help of Fig. 3. The sequence of mapping decisions for $G^1_1$ is
the set $\{\langle T_1, CPU \rangle, \langle T_2, CPU \rangle, \langle T_3, GPU \rangle\}$. We observe for $T_2$
that $sl^{(2)}(G^1_1) \geq \delta_{sl}$ and thus the safe mode of dispatch for
the scheduler is not engaged. However for $T_3$, we observe that
$sl^{(3)}(G^1_1) < \delta_{sl}$. Even though $T_1$ had finished execution, $T_3$ could
not start immediately because the GPU device was engaged by
tasks in DAG $G^1_2$. This in turn affected the deadline requirement
$D^{(3)}(T_3)$ for $T_3$, initiating the safe mode of dispatch, which increased
the frequency of the GPU device to ensure that $G^1_1$ respected its
deadline. Similarly, considering the set of mapping decisions for $G^2_2$,
i.e. $\{\langle T'_1, CPU \rangle, \langle T'_2, GPU \rangle, \langle T'_3, CPU \rangle\}$, it may be observed in
Fig. 3 that the scheduler engages the safe mode for $T'_1$ and $T'_2$ and
applies core-level DVFS for the CPU and GPU devices respectively.
The delay in starting $T'_1$ due to resource contention of both CPU and
GPU devices by tasks in $G^2_1$ violates the slack constraints and forces
the safe mode to be initiated.
The proposed system-level solution for real time ADAS scheduling
in this communication therefore operates in two distinct phases -
i) an AI enabled approach which searches through the exponential
space of scheduling decisions and intelligently selects a global set
of task-mapping decisions for each ADAS pipeline and ii) a control-
theoretic scheme which performs locally for a particular pipeline. The
first phase extracts from the exponential search space, the possibly
best global scheduling decisions. The second phase is initiated for
these decisions only if the runtime slack constraints defined above
are violated. The combined approach therefore ensures opportunistic
switching to high frequency mode thus reducing thermal induced
degradation of lifetime reliability for the overall platform; something
which can happen with pure frequency scaling based task scheduling
techniques.
III. METHODOLOGY
In the recent past, several works [10]–[12] have emerged which
leverage deep reinforcement learning methods to learn an optimal
policy function for solving scheduling problems. We leverage sample
efficient RL approaches such as Q-learning with experience replay
for training DQNs as well as Double DQNs (DDQNs) for learning
state-action value functions Q(s, a) that characterize the goodness
of choosing an action a given a state s. The overall methodology
depicting the training process for ascertaining Q(s, a), followed by
subsequent usage of the trained model in the deployment phase is
next elaborated with the help of Fig. 4.
Training Phase: The input to the training phase is a set of oracle
requests pertaining to the set of ADAS jobs J and the set of target
platform devices P . The output is a state-action value function Q.
For each such oracle request, the training process involves a series of
steps for updating the network weights of Q(s, a). This is explained
as follows.
(i) State Extraction: Given an oracle request $R$, the hyper-
period snapshot $\mathcal{H}$ is first constructed. The observed state vec-
tor $s$ for $\mathcal{H}$ at any point of time is defined as a vector
$[dt_1, dt_2, \dots, dt_{|\mathcal{P}|}, rt_1, rt_2, \dots, rt_{|\mathcal{J}|}, dr_1, dr_2, \dots, dr_{|\mathcal{J}|}]$ where
$dt_i$ represents the time left for the $i$-th device in $\mathcal{P}$ to become free.
The quantity $rt_i$ denotes the 'best case' time estimate remaining for
the currently executing instance of DAG $G_i = \langle T_i, E_i \rangle \in \mathcal{J}$ to
finish. This is calculated from the blevel estimate of the current task
$t \in T_i$ that lies in the frontier $\mathcal{F}$. The quantity $dr_i$ represents the
'time to deadline' estimate within which the same instance of the $i$-th
DAG in $\mathcal{J}$ must finish so that the deadline of $G_i$ is respected. This is
obtained as the difference between the absolute deadline of the DAG
instance of $G_i$ in that hyper-period and the time elapsed since the
beginning of the hyper-period. The three quantities defined together capture resource
availability ($dt_i$), an estimate of the 'best case' time for a DAG to finish
completely ($rt_i$), and an estimate of the time in which the DAG must
finish for respecting deadline constraints ($dr_i$). In Fig.
4, considering a system with 1 CPU and 1 GPU device and a total
of 4 periodic DAGs, a sample state vector is depicted.
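The layout of the state vector can be sketched as below for the 1 CPU, 1 GPU and 4 DAG setting mentioned above. The function name and all numeric values are hypothetical; only the ordering of the three groups of entries follows the text.

```python
def build_state(device_free_in, best_case_remaining, time_to_deadline):
    """Concatenate [dt_1..dt_|P|, rt_1..rt_|J|, dr_1..dr_|J|] into one list."""
    if len(best_case_remaining) != len(time_to_deadline):
        raise ValueError("need one rt entry and one dr entry per DAG")
    return list(device_free_in) + list(best_case_remaining) + list(time_to_deadline)


# Illustrative values in milliseconds (not measured data).
state = build_state(
    device_free_in=[0.0, 4.5],                   # CPU is free, GPU busy for 4.5 ms
    best_case_remaining=[7.0, 3.0, 9.0, 6.0],    # blevel-based estimate per DAG
    time_to_deadline=[10.0, 12.0, 9.5, 20.0],    # time left until each deadline
)
```

With two devices and four DAGs the vector has 2 + 4 + 4 = 10 entries.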
(ii) Action Selection: Given the observed state $s$ as input, the action
$a$ is determined using an $\epsilon$-greedy policy function. Given the action
$a$ and the task with the earliest local deadline $t_{min} \in \mathcal{F}(\mathcal{H})$, we
define the mapping function $\mathcal{M}(t_{min}, a) = \langle T, P \rangle$ where $T$ is the
set of tasks comprising $t_{min}$ and its descendants up to a depth $d$ and
$P \in \mathcal{P}$ is a device. In our setting, the value of the action $a$ can be used to
infer both $d$ and $P$ as follows. An action $a$ is represented as an integer
$a \in [0, n(D+1)-1]$ where $D$ denotes the maximum height
of a DAG $G_i \in \mathcal{J}$ and $n = |\mathcal{P}|$ is the total number of devices/cores
in the target platform. An action value $a = i(D+1) + d$ represents
fusing a task with its descendants up to depth $d$ and mapping the result to the
$i$-th device, so that given $a$, we have fusion depth $d = a \bmod (D+1)$
and device id $i = \lfloor a/(D+1) \rfloor$, considering $[0, D]$ as the domain of depth
values and $[0, n-1]$ as the domain of device ids.
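The action encoding above amounts to a mixed-radix index, which the following sketch makes explicit. The function names are hypothetical, and the values n = 2 and D = 5 are chosen only for illustration.

```python
def encode_action(device_id, depth, max_depth):
    """a = i * (D + 1) + d for device id i and fusion depth d."""
    if not 0 <= depth <= max_depth:
        raise ValueError("depth must lie in [0, D]")
    return device_id * (max_depth + 1) + depth


def decode_action(action, max_depth):
    """Recover (device id, fusion depth) as (a // (D + 1), a % (D + 1))."""
    return divmod(action, max_depth + 1)


# Illustrative setting: n = 2 devices and maximum DAG height D = 5.
n_devices, max_depth = 2, 5
n_actions = n_devices * (max_depth + 1)
device_id, depth = decode_action(8, max_depth)
```

For these values there are 12 distinct actions, and action 8 decodes to device 1 with fusion depth 2.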
(iii) Reward Assignment: Once the mapping function $\mathcal{M}(t_{min}, a)$
is applied, the current hyper-period snapshot $\mathcal{H}$ is updated and the
subsequent next state $s'$ is observed. An incomplete transition tuple
of the form $\langle s, a, s', \_ \rangle$ is pushed to a replay buffer memory $\mathcal{B}$ which
is used during training updates. We note that a transition tuple is
pushed to $\mathcal{B}$ for each such action mapping decision for some task $t^{j,k}_i$
belonging to DAG instance $G^j_k$. The final empty element of the tuple
represents the future reward $r$ for this transition, which is filled in
once the DAG instance $G^j_k$ finishes execution. A reward of
$-1$ is assigned if it is observed that the finishing timestamp of $G^j_k$
exceeds its absolute deadline and a reward of $+1$ otherwise.
The same reward value $r$ is used to update every incomplete
transition tuple in $\mathcal{B}$ pertaining to each task action mapping decision
taken during the lifetime of $G^j_k$.
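One way to organize this deferred reward bookkeeping is sketched below. The class and method names are hypothetical and the states are placeholder lists; the sketch only illustrates holding transitions back until their DAG instance finishes and then stamping them all with the same reward.

```python
from collections import defaultdict


class DeferredRewardBuffer:
    """Replay buffer whose transitions receive their reward only once the
    DAG instance they belong to has finished (hypothetical helper class)."""

    def __init__(self):
        self.complete = []                  # (s, a, r, s_next) tuples
        self.pending = defaultdict(list)    # DAG instance -> [(s, a, s_next)]

    def push(self, dag_instance, s, a, s_next):
        """Store an incomplete transition for one mapping decision."""
        self.pending[dag_instance].append((s, a, s_next))

    def finish(self, dag_instance, finish_time, deadline):
        """Assign -1 on a deadline miss, +1 otherwise, to all its transitions."""
        reward = -1.0 if finish_time > deadline else 1.0
        for s, a, s_next in self.pending.pop(dag_instance, []):
            self.complete.append((s, a, reward, s_next))
        return reward


buf = DeferredRewardBuffer()
buf.push("G1#1", [0.0], 3, [1.0])
buf.push("G1#1", [1.0], 7, [2.0])
r = buf.finish("G1#1", finish_time=12.0, deadline=10.0)
```

Only the tuples in `complete` would be eligible for sampling during training updates.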
(iv) Training Updates: Training updates are done by sampling
the replay memory $\mathcal{B}$ randomly, constructing a set $B \subseteq \mathcal{B}$ of
complete transition tuples and calculating the average loss
$L = \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \mathcal{L}(\delta(s, a, r, s'))$,
where $\mathcal{L}$ is the Huber loss function [13] of the error term $\delta(s, a, r, s')$. The quan-
tity $\delta(s, a, r, s')$ represents the temporal difference (TD) error for
a transition tuple $(s, a, r, s')$. We consider two standard training
paradigms: i) learning a simple Deep Q Network (DQN) where the
TD error is $\delta(s, a, r, s') = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))$ and ii)
learning a Double DQN (DDQN) [14] where $\delta(s, a, r, s') =
Q(s, a) - (r + \gamma \hat{Q}(s', \arg\max_{a'} Q(s', a')))$, with $\hat{Q}$ denoting the
target network of [14] that evaluates the action selected by $Q$. The quantity $\gamma$ represents
the discount factor and is set to one given the episodic setting
of our problem. The network is updated by mini-batch gradient
descent using the loss calculated multiple times for each training
run pertaining to a given oracle request.
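The two TD errors and the averaged Huber loss can be illustrated with plain Python lists standing in for network outputs. This is a sketch under assumptions: the DDQN variant uses the standard two-network arrangement of Double DQN (an online network selects the action, a separate target network scores it), the Huber threshold of 1 is a common default rather than a value from the text, terminal-state handling is omitted, and all Q-values are made up.

```python
def huber(delta, kappa=1.0):
    """Huber loss: quadratic for |delta| <= kappa, linear beyond it."""
    a = abs(delta)
    return 0.5 * delta * delta if a <= kappa else kappa * (a - 0.5 * kappa)


def td_error_dqn(q_sa, reward, q_next, gamma=1.0):
    """Q(s, a) - (r + gamma * max_a' Q(s', a'))."""
    return q_sa - (reward + gamma * max(q_next))


def td_error_ddqn(q_sa, reward, q_next_online, q_next_target, gamma=1.0):
    """The online network picks the next action; the target network scores it."""
    best = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return q_sa - (reward + gamma * q_next_target[best])


def mean_loss(errors):
    """Average Huber loss over a mini-batch of TD errors."""
    return sum(huber(e) for e in errors) / len(errors)


# Made-up Q-values for a two-action toy case.
e_dqn = td_error_dqn(0.5, 1.0, [0.2, 0.9])
e_ddqn = td_error_ddqn(0.5, 1.0, [0.2, 0.9], [0.4, 0.1])
loss = mean_loss([e_dqn, e_ddqn])
```

In the toy case the DQN target uses the maximum next-state value, while the DDQN target uses the target network's lower estimate for the same action, which yields a smaller error magnitude.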
The overall steps discussed so far for updating the network $Q$
are repeated for each oracle request $R_i$ a total of num_runs
times, where num_runs is an experimental parameter. The entire
process of invoking training updates for each run of each oracle
request is iteratively repeated until the average reward observed for
oracle requests converges. The trained network $Q$ is used to obtain
the corresponding optimal policy $\pi^*(s) = \arg\max_a Q(s, a)$ which
is leveraged in the deployment phase.
Deployment Phase: Given an oracle request $R_i$ and the resulting set
of DAG instances $\mathcal{G}$, the inference engine considers the hyper-period
snapshot $\mathcal{H}$ corresponding to $R_i$ and iteratively performs the following
steps: i) observes state $s$, ii) uses $\pi^*(s)$ to select action $a$, iii)
applies the mapping function $\mathcal{M}$ using the selected action $a$ on $\mathcal{H}$. The entire
inference process invokes multiple inference passes over the learned
network until $\mathcal{H}$ becomes empty, finally yielding a set of task-device
mapping decisions $\Pi(\mathcal{G})$. Using these decisions and the available
WCET estimates of tasks, the engine simulates the schedule specified
in $\Pi(\mathcal{G})$, assesses the percentage of deadline misses and accordingly
suggests admissible scheduling decisions to the low-level scheduler.
1: $sp(0) \leftarrow 1$
2: for each mapping $m_z \in \Pi(G^j_k)$ do
3:     $\langle T_z, P_z \rangle \leftarrow m_z$
4:     $D^{(z)}(T_z) \leftarrow$ observe current time to deadline
5:     $sl^{(z)}(G^j_k) \leftarrow D^{(z)}(T_z) - wrt(T_z)$
6:     if $sl^{(z)}(G^j_k) < \delta_{sl}$ then
7:         $sp(z) \leftarrow sp(z-1) + \rho \cdot err(z)/b(z)$
8:         $freq \leftarrow lookup(sp(z), P_z)$
9:         set $P_z$ frequency to $freq$
Algorithm 1: Control Theoretic Scheduling Scheme
The low level scheduler maps tasks following the decisions specified
in $\Pi(\mathcal{G})$ and uses the local control theoretic scheduling scheme
outlined in Algorithm 1 for each sequence of mapping decisions
$\Pi(G^j_k)$ pertaining to each DAG instance $G^j_k$. The scheme is inspired
by the state-of-the-art pole-based self-tuning control techniques
[15] that dynamically model the speedup of $T$ as a function of the core
clock frequencies of the device $P$. The algorithm iterates over each
mapping decision $m_z = \langle T_z, P_z \rangle$ in $\Pi(G^j_k)$ (lines 2-9), checks the
slack constraint $sl^{(z)}(G^j_k) \geq \delta_{sl}$ and increases the core frequency
of $P_z$ for the duration of $T_z$ if required. This is done using the
speedup equation $sp(z) = sp(z-1) + \rho \cdot err(z)/b(z)$ (line 7), where
$sp(z)$ denotes the speedup requirement for $T_z$ such that the error
term $err(z) = sl^{(z)}(G^j_k) - \delta_{sl}$ is rendered positive. The quantity
$b(z) = \sum_{t \in T_z} e_t$ represents the WCET estimate for executing $T_z$
at the baseline frequency and $\rho$ represents the pole value of the
controller. Given the required speedup, the scheme uses lookup tables
computed offline during the profiling phase, which map speedup
values of $T_z$ to core frequency values of the device $P_z$, to obtain
the required operating frequency $freq$. The core-level frequency of
$P_z$ is increased to $freq$ and task $T_z$ is executed. This process is
repeated only for those task-device mapping instances that violate
the slack constraints.
[Figure: training phase (oracle request generation for $\mathcal{J}$, state extraction from the hyper-period snapshot $\mathcal{H}$, ε-greedy policy, mapping function $\mathcal{M}(t_1, a)$, replay buffer of transitions $\langle s, a, s', \_ \rangle$ updated with the obtained reward when a DAG instance finishes, network weight updates) and deployment phase (inference engine, mapping decisions, low level scheduler with policy dispatch, safe dispatch, controller, lookup table and frequency tuning on CPU and GPU). The sample state lists ready times $dt_1, dt_2$ for CPU and GPU, DAG remaining times $rt_1, \dots, rt_4$ and DAG deadline requirements $dr_1, \dots, dr_4$]
Fig. 4. Training and Inference Methodology Overview
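A minimal sketch of the safe dispatch loop is given below, under several assumptions that are not from the text. The speedup update uses the slack shortfall $\delta_{sl} - sl$, so that the requested speedup grows when the constraint is violated (this sign convention is an assumption). The lookup table, frequencies, pole value and per-decision numbers are all hypothetical. Decisions that satisfy the slack constraint are assumed to run at the baseline (first) table entry, and the time to deadline is passed in rather than observed at run time.

```python
import bisect


def required_speedup(prev_speedup, slack, delta_sl, baseline_wcet, pole):
    """Raise the speedup in proportion to the slack shortfall (assumed sign)."""
    shortfall = delta_sl - slack   # positive exactly when the constraint is violated
    return prev_speedup + pole * shortfall / baseline_wcet


def lookup_frequency(speedup, table):
    """table: (speedup, freq_mhz) pairs sorted by speedup; return the lowest
    frequency reaching `speedup`, or the highest one available."""
    speeds = [s for s, _ in table]
    idx = min(bisect.bisect_left(speeds, speedup), len(table) - 1)
    return table[idx][1]


def safe_dispatch(decisions, delta_sl, pole, table):
    """decisions: (time_to_deadline, wrt, baseline_wcet) per mapping decision."""
    speedup, plan = 1.0, []
    for time_to_deadline, wrt, baseline_wcet in decisions:
        slack = time_to_deadline - wrt
        if slack < delta_sl:
            speedup = required_speedup(speedup, slack, delta_sl, baseline_wcet, pole)
            plan.append(lookup_frequency(speedup, table))
        else:
            plan.append(table[0][1])   # constraint met: stay at baseline frequency
    return plan


# Hypothetical offline profile: speedup -> core frequency in MHz.
table = [(1.0, 600), (1.5, 1000), (2.0, 1400)]
plan = safe_dispatch([(10.0, 6.0, 4.0), (5.0, 4.5, 2.0)], delta_sl=1.0, pole=0.5, table=table)
```

In this toy run the first decision has ample slack and keeps the baseline frequency, while the second falls short of the threshold and is moved to the next frequency step.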
IV. EXPERIMENTAL RESULTS
We consider the Odroid XU4 embedded heterogeneous platform
comprising two quad-core ARM CPUs (Big and Little), and one Mali
GPU. We map 1) the host OS (Ubuntu 18.04 LTS) on two cores of
Little CPU, 2) our low level scheduler as an independent OpenCL
process in the other two cores of Little CPU, 3) the ADAS detection
pipelines in the Big CPU and the GPU. We leverage OpenCL [6], a
popular heterogeneous computing language for implementing these
object detection pipelines. For our experimental evaluation, we have
implemented a total of four representative object detection pipelines
from scratch where two pipelines (G1 and G2) represent vanilla DNN
benchmark implementations, each comprising 5 tasks and the remain-
ing two (G3 and G4) represent CNN benchmark implementations,
each comprising 6 tasks. We have built these pipelines using platform
optimized implementations of elementary data parallel kernels (such
as convolution, general matrix multiplication, pooling, softmax, etc.)
available in the ARM OpenCL SDK [16]. Our experiments require
profiling data for each task as well as each fused task variant on the
target platform for setting up the environment in our training phase.
We have developed a code template generator which automatically
synthesizes OpenCL code for all possible fused task variants that
are possible for each pipeline (5 × (5 − 1)/2 = 10 for the DNN
benchmarks and 15 for the CNN benchmarks). This is useful for
on-the-fly fused variant generation of future pipelines which may be
downloaded on the platform.
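The variant counts quoted above can be reproduced with a short enumeration, assuming that each pipeline is a linear chain and that a fused variant is a contiguous run of at least two tasks. Both assumptions are ours; the text gives only the count formula.

```python
def fused_variants(tasks):
    """All contiguous runs of at least two tasks in a linear pipeline."""
    n = len(tasks)
    return [tuple(tasks[i:j]) for i in range(n) for j in range(i + 2, n + 1)]


# Hypothetical task names for a 5-task and a 6-task pipeline.
dnn_variants = fused_variants(["t1", "t2", "t3", "t4", "t5"])
cnn_variants = fused_variants(["t1", "t2", "t3", "t4", "t5", "t6"])
```

Under these assumptions a pipeline of n tasks has n(n-1)/2 fused variants, which gives 10 for five tasks and 15 for six.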
Environment Setup: The WCET estimates of each task and each
fused task variant in each of the pipelines are obtained by lever-
aging a co-run degradation based profiling approach outlined in
[17]. While profiling each benchmark on a particular device (Big
CPU or Mali GPU), we execute a micro-kernel benchmark contin-
uously on the other device in parallel to ensure maximum shared
memory interference on the target platform. The WCET estimates
$\tau_{CPU}(t_i)$ and $\tau_{GPU}(t_i)$ represent the time taken (averaged over
10 profiling runs) to execute the task $t_i$ on the CPU or GPU
device respectively in the worst possible scenario when the system
memory bandwidth is completely exploited. Using these WCET
estimates, the overall WCET of the DAG $G_i = \langle T_i, E_i \rangle$ is given
by $\tau(G_i) = \sum_{t \in T_i} \max(\tau_{CPU}(t), \tau_{GPU}(t))$.
The oracle request R for our experiments takes the form
{〈G1, w1, p1〉, 〈G2, w2, p2〉, 〈G3, w3, p3〉, 〈G4, w4, p4〉}, with wi = 1
for all DAGs. For generating oracle requests, we vary pi for each
Gi with values from the set {τ(Gi), 2τ(Gi), 3τ(Gi)}. Since
each DAG processes one frame at a time and can arrive using one of
the three period values, the total number of oracle requests possible
for the job set comprising 4 DAGs is 3^4 = 81. We train our DQN
and DDQN using these 81 oracle requests and discuss our findings
below.
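The enumeration of the 81 oracle requests can be sketched as follows. The DAG names and WCET values are hypothetical placeholders, and the tuple layout (name, weight, period) is an illustrative encoding of 〈Gi, wi, pi〉 rather than the data structure used in our implementation.

```python
from itertools import product


def oracle_requests(dag_wcets, multipliers=(1, 2, 3), weight=1):
    """Enumerate every request: one period multiplier per DAG, fixed weight."""
    requests = []
    for combo in product(multipliers, repeat=len(dag_wcets)):
        # Period p_i = k * tau(G_i) for the multiplier k chosen for DAG i.
        request = [(name, weight, k * tau)
                   for (name, tau), k in zip(dag_wcets.items(), combo)]
        requests.append(request)
    return requests


# Hypothetical DAG-level WCETs tau(G_i) in milliseconds.
wcets = {"G1": 40.0, "G2": 55.0, "G3": 30.0, "G4": 70.0}
reqs = oracle_requests(wcets)
print(len(reqs))  # 3**4 = 81
```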
Fig. 5. Training Results (sub-figures: DQN, DDQN)
Training Results: We set the number of training runs per oracle
request, i.e. num_runs, to 100. The training algorithm processes the same
oracle request R, i.e. it explores schedules for the same resultant
set of DAGs G, for a total of 100 episodes before processing the
next oracle request R′. Additionally, when the training algorithm
moves from one request R to the next request R′, it is
ensured that only one period value in R is changed to
yield R′. This is done so that, after the Q-Network has been learned
for a given oracle request R, the RL environment does not change
drastically while R′ is processed. The neural
network architecture used for both the DQN and the DDQN contains one
input layer of size 10, one hidden layer of size 16 with ReLU activation
and one output layer of size 12 equipped with a softmax function
for predicting action probabilities. The corresponding training results
are summarized in Fig. 5. In both sub-figures, the x-axis is
labelled with the training epoch number, where each epoch consists
of a total of 81 × 100 = 8100 episodes. Each point of the blue
line plot represents the mean reward for one episode. The yellow
line plot presents the general trend of the reward, where each point
represents the mean reward averaged over a consecutive set of 500
episodes. From the two sub-figures, we may conclude that DDQN
exhibits more stable training behaviour than DQN, whose
rewards oscillate between positive and negative values for the
entire duration of training. This may be attributed to the fact that
DQN training suffers from over-estimation of action values,
thereby learning sub-optimal policies [14]. We next compare the
schedules generated by our DDQN with a baseline policy, explained as
follows.
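The requirement that consecutive oracle requests differ in exactly one period value can be met in several ways. One possibility, shown here purely as an illustrative assumption and not as the ordering used in our training runs, is a reflected mixed-radix Gray code over the period indices of the 4 DAGs.

```python
def gray_order(num_dags=4, num_periods=3):
    """List period-index tuples so that neighbours differ in one position."""
    if num_dags == 0:
        return [()]
    rest = gray_order(num_dags - 1, num_periods)
    order = []
    for k in range(num_periods):
        # Walk the shorter sequence backwards on odd passes so that the
        # tuples on either side of a block boundary share the same tail.
        block = rest if k % 2 == 0 else rest[::-1]
        order.extend((k,) + r for r in block)
    return order


order = gray_order()
print(len(order), order[:4])
```

Each tuple holds one period index per DAG (0, 1 or 2 for τ, 2τ and 3τ), so the 81 tuples cover all requests exactly once while changing a single period between successive requests.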
Fig. 6. Baseline vs. RL Agent: (a) deadline miss %, (b) average lateness.
|Rx|: number of oracle requests with TI(R) = x; |R1.14| = 9, |R1.71| = 18,
|R2.28| = 27, |R2.85| = 18, |R3.42| = 9.
Testing Results: We use the classical Global EDF scheduling
algorithm outlined in [18], a dynamic EDF scheduling
algorithm for executing DAGs on multicore processors, as our baseline
policy. At any point of time, the algorithm considers the task ti ∈ F
with the minimum local deadline and simply dispatches it to the device
on which the WCET of the task, τ(ti), is minimum. A comparative
evaluation of the schedules observed using the baseline
algorithm and the schedules reported by our RL scheme is presented
in Fig. 6. The blue line plot in Fig. 6(a) represents the percentage of
deadline misses observed in a hyper-period for schedules determined
by the baseline algorithm and the orange line plot represents the
same for schedules dictated by the DDQN. The y-axis gives the
deadline miss percentage and each point on the x-axis represents an
oracle request R characterized by a throughput index (TI) value. The
throughput index represents the throughput requirement of an oracle
request R in terms of floating point operations per second (FLOPS)
and is calculated as $TI(R) = \sum_i FLOPs(G_i) \times w_i / p_i$, where
FLOPs(Gi) represents the total number of floating point operations
required for a detection pipeline, and wi and pi are as discussed earlier.
It may be observed that for oracle requests demanding higher TI
values, the percentage of deadline misses increases up to 80% for
the baseline algorithm whereas it remains below 20% for the RL
scheme. Fig. 6(b) is a horizontal bar chart where each bar
gives the average lateness observed for an oracle request (green bars
for the RL scheme and red bars for the baseline). For an oracle request,
the average lateness of the resultant set of DAGs G is computed by
averaging over the individual lateness values of each DAG. It may be
observed that for oracle requests with higher throughput index, the
baseline reports positive average lateness values for most schedules
whereas the RL scheme consistently reports negative average lateness
values, implying that on average DAG instances finish before their
deadlines under the RL scheme.
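The three metrics used in this comparison can be written down compactly. The sketch below assumes the usual real-time definition of lateness (finish time minus deadline) and uses hypothetical FLOP counts, periods, finish times and deadlines; none of these numbers come from our experiments.

```python
def throughput_index(request):
    """TI(R) = sum_i FLOPs(G_i) * w_i / p_i for entries (flops, w, p)."""
    return sum(flops * w / p for flops, w, p in request)


def average_lateness(finish_times, deadlines):
    """Mean of (finish - deadline); a negative value means early on average."""
    lateness = [f - d for f, d in zip(finish_times, deadlines)]
    return sum(lateness) / len(lateness)


def miss_percentage(finish_times, deadlines):
    """Percentage of DAG instances finishing after their deadline."""
    misses = sum(f > d for f, d in zip(finish_times, deadlines))
    return 100.0 * misses / len(deadlines)


# Hypothetical request: (FLOPs, weight, period in seconds) per DAG.
request = [(2.0e9, 1, 0.5), (1.0e9, 1, 0.25)]
finish = [8.0, 22.0, 27.0, 39.0]      # hypothetical finish times
deadline = [10.0, 20.0, 30.0, 40.0]   # hypothetical absolute deadlines
print(throughput_index(request))           # 8e9 FLOPS
print(average_lateness(finish, deadline))  # -1.0
print(miss_percentage(finish, deadline))   # 25.0
```

Note that a schedule can show negative average lateness while still missing individual deadlines, which is why both panels of Fig. 6 are reported.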
Target Platform Results: We consider that the inference engine
admits an oracle request if the deadline miss percentage of the
computed schedule is less than a threshold th = 15%. To establish the
efficacy of our low level scheduler, we select borderline admissible
schedules (with deadline miss % d close to but less than th) and
present our findings in Table I. Each row represents results for some
request R, with the resulting DAG set G (characterized by 〈|G|, d〉)
as reported by the inference engine. In each case, Column 2 provides
the number of misses reported by the inference engine, while Columns
3 and 4 report the actual deadline misses when the inferred schedule
is deployed without and with the safe dispatch mode of the low level
scheduler engaged, respectively. It may be observed that for schedules
suffering from a high miss % under actual deployment without safe
mode, the low level scheduler improves their performance with safe
mode engaged, thus correcting the poor deployed behaviour of the
original inference.
TABLE I
DEPLOYED LOW LEVEL SCHEDULING RESULTS
〈|G|, d%〉    Inference Engine    Deployed System     Safe mode
              #misses             #misses             #misses
〈39, 10%〉   4                   4                   4
〈29, 10%〉   3                   3                   1
〈35, 14%〉   5                   10 ⇒ 28% > th       4 ⇒ 11% < th
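The percentages in Table I can be checked against the admission threshold with a few lines of arithmetic. The sketch assumes that the miss percentage is the number of misses divided by |G|, which appears consistent with the d% values in the first column; the function names are illustrative.

```python
TH = 15.0  # admission threshold in percent, as stated in the text


def miss_pct(misses, num_dags):
    """Deadline miss percentage, assuming d = misses / |G|."""
    return 100.0 * misses / num_dags


def admissible(misses, num_dags, th=TH):
    """A schedule is admitted only if its miss percentage is below th."""
    return miss_pct(misses, num_dags) < th


# Rows of Table I: (|G|, inference engine, deployed, safe mode) miss counts.
rows = [(39, 4, 4, 4), (29, 3, 3, 1), (35, 5, 10, 4)]
for size, inferred, deployed, safe in rows:
    verdicts = [admissible(m, size) for m in (inferred, deployed, safe)]
    print(size, verdicts)
```

For the last row, 10 misses out of 35 is roughly 28%, which exceeds th, whereas 4 out of 35 is roughly 11% and falls back below it.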
V. CONCLUSION
Our proposed combination of RL and a low-level performance recovery
technique is possibly the first approach that synergizes AI techniques
with real-time control-theoretic scheduling techniques towards
generating kernel-fusion based runtime mapping decisions for real-time
heterogeneous platforms accelerating ADAS workloads. Future
work entails incorporating an edge-to-cloud feedback mechanism
so that on-board mapping decision observations can be leveraged to
refine the cloud-based inference model using periodic updates.
REFERENCES
[1] S. Fürst et al., “AUTOSAR–A Worldwide Standard is on the Road,” in
VDI Congress Electronic Systems for Vehicles 2009.
[2] M. Yang et al., “Re-thinking CNN Frameworks for Time-Sensitive
Autonomous-Driving Applications: Addressing an Industrial Challenge,”
in RTAS 2019.
[3] H. Zhou et al., “S^3DNN: Supervised Streaming and Scheduling
for GPU-Accelerated Real-Time DNN Workloads,” in RTAS 2018.
[4] S. Ren et al., “Faster R-CNN: Towards Real-time Object Detection
with Region Proposal Networks,” in NIPS 2015.
[5] J. Redmon et al., “You Only Look Once: Unified, Real-time Object
Detection,” in CVPR 2016.
[6] J. Stone et al., “OpenCL: A Parallel Programming Standard for
Heterogeneous Computing Systems,” CiSE, vol. 12, no. 3, p. 66, 2010.
[7] B. Qiao et al., “Automatic Kernel Fusion for Image Processing
DSLs,” in SCOPES 2018.
[8] Y. Xing et al., “DNNVM: End-to-end Compiler Leveraging Heterogeneous
Optimizations on FPGA-based CNN Accelerators,” TCAD,
2019.
[9] J. Opdyke, “A Unified Approach to Algorithms Generating Unrestricted and
Restricted Integer Compositions and Integer Partitions,” JMMA, vol. 9, pp.
53–97, 2010.
[10] H. Mao et al., “Resource Management with Deep Reinforcement
Learning,” in HotNets 2016.
[11] Z. Fang et al., “QoS-aware Scheduling of Heterogeneous Servers for
Inference in Deep Neural Networks,” in CIKM 2017.
[12] G. Domeniconi et al., “CuSH: Cognitive ScHeduler for Heterogeneous
High Performance Computing System,” in DRL4KDD 2019.
[13] P. Huber, Robust Statistics. Springer, 2011.
[14] H. Van Hasselt et al., “Deep Reinforcement Learning with Double
Q-learning,” in AAAI 2016.
[15] N. Mishra et al., “CALOREE: Learning Control for Predictable Latency
and Low Energy,” in SIGPLAN Notices 2018.
[16] ARM, “Mali OpenCL Compute SDK,”
https://developer.arm.com/ip-products/processors/machine-learning/compute-library, 2013.
[17] Q. Zhu et al., “Co-run Scheduling with Power Cap on Integrated
CPU-GPU Systems,” in IPDPS 2017.
[18] M. Qamhieh et al., “Global EDF Scheduling of Directed Acyclic
Graphs on Multiprocessor Systems,” in RTNS 2013.
