Response-time analysis of DAG tasks supporting heterogeneous computing by Serrano Gracia, María Astón & Quiñones, Eduardo
Response-Time Analysis of DAG Tasks Supporting
Heterogeneous Computing
Maria A. Serrano
Barcelona Supercomputing Center (BSC)








Hardware platforms are evolving towards parallel and heterogeneous
architectures to meet the growing demand for performance in the
real-time domain. Parallel programming models are fundamental to
exploit the performance capabilities of these architectures. This paper
proposes a novel response time analysis (RTA) for verifying the
schedulability of DAG tasks supporting heterogeneous computing. It
analyzes the impact of executing part of the DAG in the accelerator
device. As a result, the response time upper bound of the system is
more precise than the one provided by existing RTAs targeting
homogeneous architectures.
1 INTRODUCTION
Parallel and heterogeneous hardware architectures have become
mainstream in the embedded domain to cope with increasing performance
requirements. These architectures integrate low-power general-purpose
multi-cores (known as the host) with dedicated accelerator devices such
as DSP fabrics, GPUs or FPGAs. Some examples are the
NVIDIA Tegra X1 [14], TI Keystone II [20] and Xilinx UltraScale [11].
Parallel programming models are fundamental to effectively ex-
ploit the huge performance capabilities of these architectures. As
an example, OpenMP [15] is increasingly being adopted in architec-
tures targeting embedded systems. As a matter of fact, all parallel
architectures presented above support OpenMP in their software
development kit. Moreover, OpenMP incorporates a host-centric
acceleration model to efficiently offload code and data to devices.
Functional and non-functional verification is fundamental when
designing real-time embedded systems. However, the use of parallel
programming models like OpenMP involves many challenges to
assure that software satisfies both functional and non-functional
requirements. This paper addresses the latter, focusing on the worst-
case response time analysis of OpenMP programs. It is worth men-
tioning however that recent works have addressed functional veri-
fication of OpenMP programs [16], demonstrating the benefits of
using OpenMP in real-time embedded systems.
Regarding timing verification, the sporadic DAG task model [6]
analyzes the response time of systems composed of parallel tasks
modeled with directed acyclic graphs (DAGs) [4, 9, 12]. Interestingly,
recent works demonstrate that this model resembles the OpenMP
tasking model [13, 19, 21, 22]. However, none of the previous works
considers the impact that heterogeneous computing has on response
time analysis.
DAC ’18, June 24–29, 2018, San Francisco, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5700-5/18/06...$15.00
https://doi.org/10.1145/3195970.3196104
This paper introduces a new response time analysis for the spo-
radic DAG tasks model supporting parallel and heterogeneous com-
puting. In heterogeneous computing, the workload offloaded into
the accelerator device does not cause any interference on the paral-
lel workload executed in the host, and vice versa. Therefore, our
analysis takes this into account and first identifies the portion of
the DAG that can potentially execute in parallel with the offloaded
workload. Then, DAG transformation techniques are used to safely
reduce the interference factor, given that the impact of offloading
workload is to reduce the interference in the host.
Our results reveal that, compared to existing RTA targeting
homogeneous architectures, the response time upper bound is
significantly reduced when the offloaded computation represents more
than 10% of the overall DAG task workload. In fact, our response time
is comparable to the one obtained with an ILP solution (only applicable
to small DAGs composed of up to 100 nodes). Interestingly, our DAG
transformation technique improves the average performance of
the task by up to 23% when considering a host featuring 16 cores.
2 SYSTEM MODEL
Consider a parallel heterogeneous architecture composed of a host
processor with $m$ identical cores and a single accelerator device
(e.g. an FPGA, GPU or DSP fabric). Moreover, consider a host-centric
acceleration model in which the host offloads code and data to the
accelerator device and collects results.
A parallel real-time task is represented by $\tau = \langle G, T, D \rangle$.
$G = (V, E)$ is the DAG modeling its parallel execution.
$V = \{v_1, v_2, \ldots, v_n, v_{Off}\}$ is the set of nodes. Nodes
$v_i \in V$, $1 \le i \le n$, represent sequential jobs executed in the
host, and node $v_{Off}$ represents the workload executed in the
accelerator device, named the offloaded node (there is only one). Each
node in $V$ is characterized by its worst-case execution time (WCET),
$C_i$ or $C_{Off}$. $E \subseteq V \times V$ is the set of edges
representing precedence constraints among pairs of nodes. If
$(v_1, v_2) \in E$, then $v_1$ must complete before $v_2$ can begin
execution. Transitive edges do not exist, i.e. if $(v_1, v_2) \in E$ and
$(v_2, v_3) \in E$ then $(v_1, v_3) \notin E$. Nodes with no incoming
arcs are sources of the DAG, while nodes with no outgoing arcs are
sinks. Without loss of generality, we assume that each DAG has exactly
one source node $v_1$ and one sink node $v_n$. If this is not the case,
a dummy source/sink node with zero WCET can be added to the DAG, with
edges to/from all the source/sink nodes. Finally, $T$ is the minimum
inter-arrival time of $\tau$ and $D$ is the constrained relative
deadline ($D \le T$).
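The dummy source/sink normalization mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary/edge-set representation and the names `add_dummy_source_sink`, `v_src` and `v_snk` are assumptions introduced here.

```python
def add_dummy_source_sink(wcet, edges):
    """Return (wcet, edges) with exactly one source and one sink.

    wcet:  dict mapping node name -> WCET
    edges: iterable of (u, v) precedence pairs
    "v_src" and "v_snk" are hypothetical names for the zero-WCET dummy
    nodes; they are assumed not to clash with existing node names.
    """
    wcet, edges = dict(wcet), set(edges)
    has_pred = {v for _, v in edges}
    has_succ = {u for u, _ in edges}
    sources = [v for v in wcet if v not in has_pred]
    sinks = [v for v in wcet if v not in has_succ]
    if len(sources) > 1:
        # One dummy source with an edge to every original source.
        wcet["v_src"] = 0
        edges |= {("v_src", s) for s in sources}
    if len(sinks) > 1:
        # One dummy sink with an edge from every original sink.
        wcet["v_snk"] = 0
        edges |= {(s, "v_snk") for s in sinks}
    return wcet, edges
```

Because the dummy nodes have zero WCET, they change neither the volume nor the critical path length of the DAG.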
Figure 1: Heterogeneous DAG task scheduling example. (a) Heterogeneous DAG; (b) best-case scheduling; (c) worst-case scheduling.
Two interesting DAG properties are $vol(G)$ and $len(G)$.
$vol(G) = \sum_{v_j \in V} C_j$ represents the volume of the DAG. In a
parallel architecture, the volume denotes the WCET of the task when
executing sequentially on a single core in the host and a single
accelerator, assuming that core and accelerator cannot execute in
parallel. $len(G)$ is the length of the critical path of the DAG, i.e.
the longest path. It corresponds to the minimum amount of time needed
to execute the task on a sufficiently large number of host cores.
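As an illustration of these two properties, the following sketch computes the volume and the critical path length of a DAG. The representation (a WCET dictionary plus a list of edges), the function names `vol` and `length`, and the example values are assumptions made for this sketch only.

```python
from graphlib import TopologicalSorter


def vol(wcet):
    """Volume: sum of the WCETs of all nodes (offloaded node included)."""
    return sum(wcet.values())


def length(wcet, edges):
    """Critical path length: longest path, weighted by node WCET."""
    preds = {v: set() for v in wcet}
    for u, v in edges:
        preds[v].add(u)
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(preds).static_order():
        finish[v] = wcet[v] + max((finish[u] for u in preds[v]), default=0)
    return max(finish.values(), default=0)


# Hypothetical four-node fork-join DAG (not taken from the paper).
wcet = {"v1": 1, "v2": 4, "v3": 2, "v4": 3}
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]
print(vol(wcet), length(wcet, edges))  # prints "10 8"
```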
Interestingly, our system model resembles the OpenMP parallel
programming model [15][21]. OpenMP tightly couples an advanced tasking
model with a host-centric acceleration model. It incorporates
easy-to-use data clauses to express data directionality when moving
data to and from the device memories. OpenCL and CUDA provide similar
functionality at a lower level. A compiler method to derive an
OpenMP-DAG compliant with the OpenMP semantics is proposed in [22]. The
OpenMP accelerator model can be easily incorporated into the DAG by
distinguishing those nodes executed in the host from those executed in
the device.
3 HETEROGENEOUS MODEL
3.1 Starting Point: Homogeneous model
In [19], the authors computed a response time upper bound of a DAG
task $\tau$ running on $m$ homogeneous cores as:
$$R^{hom}(\tau) = len(G) + \frac{1}{m}\bigl(vol(G) - len(G)\bigr) \qquad (1)$$
where $len(G)$ is the length of the critical path of $G$ and $vol(G)$
its volume. The factor $\frac{1}{m}\bigl(vol(G) - len(G)\bigr)$
upper-bounds the self-interference, i.e. the interference contribution
from the task itself to its critical path. In order to verify the
schedulability of $\tau$, the result provided by Equation 1 must be
compared with $\tau$'s relative deadline $D$: $R^{hom}(\tau) \le D$.
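A minimal sketch of this homogeneous bound and the associated schedulability test is given below. The function names and the numeric values in the usage note are illustrative assumptions; exact rational arithmetic is used only to avoid rounding in the division by the number of cores.

```python
from fractions import Fraction


def r_hom(length, volume, m):
    """Homogeneous bound: len(G) + (vol(G) - len(G)) / m."""
    return length + Fraction(volume - length, m)


def is_schedulable(length, volume, m, deadline):
    """The task is deemed schedulable if the bound does not exceed D."""
    return r_hom(length, volume, m) <= deadline
```

For instance, with the hypothetical values len(G) = 8, vol(G) = 18 and m = 2, `r_hom(8, 18, 2)` evaluates to 13, so the task would pass the test for a deadline of 13 but not for a deadline of 12.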
3.2 Towards a heterogeneous model
Clearly, heterogeneous computing reduces the actual interference
compared to homogeneous computing, as the offloaded node does not
occupy resources in the host. However, this interference reduction in
the host may not imply a reduction in the response time, as the
precedence constraints defined in $E$ may defeat the benefits of
heterogeneity.
In order to illustrate this phenomenon, consider the DAG task
$\tau$ shown in Figure 1(a), composed of six nodes $v_1, \ldots, v_5$,
$v_{Off}$ (with WCETs shown in parentheses). The critical path is
$\{v_1, v_3, v_5\}$, of length $len(G) = 8$, and the self-interference
factor is $\frac{1}{m}\bigl(vol(G) - len(G)\bigr) = 5$, resulting in
$R^{hom}(\tau) = 13$. Since $v_{Off}$ does not execute in the host, one
might subtract its contribution from the self-interference factor, see
Figure 1(b), resulting in a reduced $R^{hom}(\tau) = 11$.
However, the reduction in the self-interference factor does not
guarantee a trustworthy response time upper bound because $v_{Off}$
may not necessarily execute in parallel with the nodes running in
the host. See Figure 1(c), in which all cores in the host remain idle
while $v_{Off}$ is running. In this case, the response time is 12, which
is higher than the reduced $R^{hom}(\tau)$ computed above, 11.
Figure 2: Transformation of the DAG task in Figure 1. (a) New DAG; (b) scheduling of the new DAG.
Overall, the DAG portion that potentially executes in parallel
with the offloaded node (and so reduces the interference) is not
guaranteed to actually execute in parallel with it.
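The arithmetic behind this example can be written out as follows. This is a reading of the numbers quoted in the text, taking the original critical path length to be 8 as stated in Section 3.3; the individual values of $m$, $vol(G)$ and $C_{Off}$ are not assumed.

```latex
\begin{align*}
R^{hom}(\tau) &= len(G) + \tfrac{1}{m}\bigl(vol(G) - len(G)\bigr) = 8 + 5 = 13,\\
\text{naive reduction:}\quad
 & len(G) + \tfrac{1}{m}\bigl(vol(G) - len(G) - C_{Off}\bigr)
   = 13 - \tfrac{C_{Off}}{m} = 11
   \;\Longrightarrow\; \tfrac{C_{Off}}{m} = 2,\\
\text{observed in Figure 1(c):}\quad
 & 12 > 11 .
\end{align*}
```

The naively reduced value is therefore not a valid upper bound: a legal schedule of the untransformed DAG exceeds it by one time unit.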
3.3 Safe self-interference reduction
In order to safely reduce the self-interference factor, it is first
necessary to guarantee that there is enough workload to be executed in
the host in parallel with $v_{Off}$. To do so, we propose an algorithm
that: (1) identifies the sub-DAG that may potentially execute in
parallel with $v_{Off}$, named $G^{Par} = (V^{Par}, E^{Par})$, and (2)
adds a synchronization point to guarantee that $G^{Par}$ and $v_{Off}$
actually execute in parallel.
Figure 2(a) shows the proposed transformation of the DAG presented
in Figure 1(a). By inserting a synchronization point between node
$v_4$ and nodes $v_2$, $v_3$, it is guaranteed that $v_{Off}$ and
$\{v_2, v_3\}$ execute in parallel. Figure 2(b) shows the scheduling of
the transformed DAG. Synchronization forces $v_1$ and $v_4$ to be
scheduled first, avoiding the scheduling scenario shown in Figure 1(c).
Clearly, this strategy may impact the average performance of
the tasks because: (1) the critical path can potentially grow (e.g.,
the length of the transformed DAG in Figure 2(a) is 10 instead of 8 in
the original DAG) and (2) the potential parallelism is reduced due to
the synchronization point (e.g., in Figure 2(a), $v_4$ can no longer be
executed in parallel with $v_2$ and $v_3$). Interestingly, our
experiments demonstrate the opposite effect when the offloaded workload
is large enough (see Section 5.2). The reason is that ensuring the
parallel execution of $G^{Par}$ and $v_{Off}$ avoids scheduling
scenarios in which the offloaded node is running while the host
processor remains idle, as shown in Figure 1(c).
Overall, this strategy allows us to derive an RTA for heterogeneous
architectures. It is presented in Section 4; first, we introduce
the algorithm to transform the DAG.
3.4 DAG Transformation Algorithm
Algorithm 1 generates a new DAG in which the parallel execution of
$v_{Off}$ and $G^{Par}$ is guaranteed. To do so, given a DAG
$G = (V, E)$, it first generates the transformed DAG $G' = (V', E')$,
which includes a new synchronization node $v_{sync}$ ($C_{sync} = 0$).
Then, it identifies the sub-DAG $G^{Par} = (V^{Par}, E^{Par})$, which
includes all the nodes that can potentially execute in parallel with
$v_{Off}$. $v_{sync}$ is introduced just before $v_{Off}$ and $G^{Par}$
so that they begin execution simultaneously.
Consider the example shown in Figure 3 in order to facilitate
the explanation of the algorithm. Figure 3(a) shows the original DAG
$G$, in which the synchronization point to be included is represented
with a dashed red line. Figure 3(b) shows the resultant DAG $G'$,
including the new synchronization node $v_{sync}$, represented as a red
square node, and $G^{Par}$.
3.4.1 Initialization. First, $Pred(v_{Off})$, the set of nodes from
which $v_{Off}$ can be reached, and $Succ(v_{Off})$, the set of nodes
reachable from $v_{Off}$, are computed (line 1). Then, the algorithm
initializes $V'$, which includes all the original nodes in $V$ plus the
synchronization node $v_{sync}$, and $E'$, which includes all the edges
in $E$. A local variable $directPred$ is used to store $v_{Off}$'s
direct predecessors (footnote 1).
3.4.2 Loop over $v_{Off}$'s direct predecessors. The transformation
starts with a loop (line 3) which iterates over $v_{Off}$'s direct
predecessors, $v_i$, and (1) adds $v_i$ to $directPred$, (2) adds an
edge from $v_i$ to the extra synchronization node $v_{sync}$ and (3)
removes the edge $(v_i, v_{Off})$. In Figures 3(a) and 3(b) this loop
operates over nodes $v_8$ and $v_9$ to remove their edges to $v_{Off}$
and to add new edges to the new node $v_{sync}$ (green edges). The
nested loop in line 6 updates the edges between $v_i$ and $v_i$'s
successors (nodes parallel to $v_{Off}$), since they are now
$v_{sync}$'s successors. In Figures 3(a) and 3(b) this loop removes
$(v_8, v_{11})$ and adds $(v_{sync}, v_{11})$ (black edges).
In line 9 a new edge between the extra synchronization node
$v_{sync}$ and the offloaded node $v_{Off}$ is added. This corresponds
to the yellow edge $(v_{sync}, v_{Off})$ in Figure 3(b).
3.4.3 Loop over $v_{Off}$'s other predecessors. The second part of
the algorithm, starting in line 10, iterates over all the nodes $v_i$
from which $v_{Off}$ can be reached, except its direct predecessors.
Then, a nested loop is used to check whether $v_i$'s successors, named
$v_j$, are parallel to $v_{Off}$ (line 12). If this is the case, then
$v_j$ becomes a successor of $v_{sync}$ instead of a successor of $v_i$
(line 13). Notice that, since transitive edges do not exist, it is not
required to check whether $v_j$ is in $Succ(v_{Off})$ to determine
whether $v_j$ is parallel to $v_{Off}$. In Figures 3(a) and 3(b), these
nested loops are used to remove the pink edges $(v_1, v_2)$ and
$(v_3, v_7)$ and to add the pink edges $(v_{sync}, v_2)$ and
$(v_{sync}, v_7)$.
3.4.4 Creating $G^{Par}$. Finally, the parallel sub-DAG $G^{Par}$ is
created, containing all the nodes parallel to $v_{Off}$ (line 14) and
all the corresponding edges involving these nodes (line 17). In
Figure 3(b), $G^{Par}$ is surrounded by a dashed blue line.
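The steps described in Sections 3.4.1 to 3.4.4 can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' code (their experiments were implemented in MATLAB). The function names, the set-of-pairs representation, and the explicit exclusion of the offloaded node from the parallel sub-DAG are assumptions of this sketch; the example DAG is hypothetical and only loosely modeled on the description of Figures 1 and 2.

```python
def reachable(start, adj):
    """All nodes reachable from `start` through `adj` (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for w in adj.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def transform_dag(nodes, edges, v_off, v_sync="v_sync"):
    """Return (V', E', V_par, E_par) following the steps of Algorithm 1."""
    succ_adj, pred_adj = {}, {}
    for u, v in edges:
        succ_adj.setdefault(u, set()).add(v)
        pred_adj.setdefault(v, set()).add(u)
    pred = reachable(v_off, pred_adj)                    # line 1
    succ = reachable(v_off, succ_adj)
    new_nodes = set(nodes) | {v_sync}                    # line 2
    new_edges = set(edges)
    direct_pred = {u for u, v in edges if v == v_off}    # lines 3-4
    for vi in direct_pred:
        new_edges.discard((vi, v_off))                   # line 5
        new_edges.add((vi, v_sync))
        for u, vj in list(new_edges):                    # lines 6-8
            if u == vi and vj != v_sync:
                new_edges.discard((vi, vj))
                new_edges.add((v_sync, vj))
    new_edges.add((v_sync, v_off))                       # line 9
    for vi in pred - direct_pred:                        # lines 10-13
        for u, vj in list(new_edges):
            if u == vi and vj not in pred:
                new_edges.discard((vi, vj))
                new_edges.add((v_sync, vj))
    # Line 14; v_off itself is removed explicitly (assumption of this sketch).
    par_nodes = set(nodes) - pred - succ - {v_off}
    par_edges = {(u, v) for u, v in edges                # lines 15-17
                 if u in par_nodes and v in par_nodes}
    return new_nodes, new_edges, par_nodes, par_edges


# Hypothetical six-node DAG: v4 is the only direct predecessor of v_off.
nodes = {"v1", "v2", "v3", "v4", "v5", "v_off"}
edges = {("v1", "v2"), ("v1", "v3"), ("v1", "v4"), ("v4", "v_off"),
         ("v2", "v5"), ("v3", "v5"), ("v_off", "v5")}
new_nodes, new_edges, par_nodes, par_edges = transform_dag(nodes, edges, "v_off")
```

On this example the sketch yields a parallel sub-DAG containing `v2` and `v3`, both of which become successors of the synchronization node, while `v1` and `v4` must complete before it.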
4 RESPONSE TIME ANALYSIS FOR
HETEROGENEOUS DAG TASKS
In this section we extend the RTA presented in Equation 1 to support
heterogeneous computation. Our analysis is based on the transformed
DAG task $\tau'$, in which $G^{Par}$ and $v_{Off}$ are guaranteed to
execute in parallel. This allows a reduction of the self-interference
factor, making the new response time upper bound more accurate
than $R^{hom}$.
Footnote 1: If $(v_i, v_j) \in E$ then $v_i$ is a direct predecessor of $v_j$.
Figure 3: Heterogeneous DAG task transformation $\tau \Rightarrow \tau'$. (a) Original DAG $G = (V, E)$; (b) new DAG $G' = (V', E')$.
Algorithm 1 Transform DAG $\tau \Rightarrow \tau'$
Input: $G = (V, E)$
Output: (1) $G' = (V', E')$; (2) $G^{Par} = (V^{Par}, E^{Par})$
1: Compute $Pred(v_{Off})$ and $Succ(v_{Off})$;
2: $V' = V \cup \{v_{sync}\}$; $E' = E$; $directPred = \emptyset$;
3: for each $(v_i, v_{Off}) \in E'$ do
4:     $directPred = directPred \cup \{v_i\}$
5:     $E' = E' \cup \{(v_i, v_{sync})\} \setminus \{(v_i, v_{Off})\}$
6:     for each $(v_i, v_j) \in E'$ do
7:         if $v_j \ne v_{sync}$ then
8:             $E' = E' \cup \{(v_{sync}, v_j)\} \setminus \{(v_i, v_j)\}$
9: $E' = E' \cup \{(v_{sync}, v_{Off})\}$
10: for each $v_i \in \{Pred(v_{Off}) \setminus directPred\}$ do
11:     for each $(v_i, v_j) \in E'$ do
12:         if $v_j \notin Pred(v_{Off})$ then
13:             $E' = E' \cup \{(v_{sync}, v_j)\} \setminus \{(v_i, v_j)\}$
14: $V^{Par} = V \setminus Pred(v_{Off}) \setminus Succ(v_{Off})$
15: for each $(v_i, v_j) \in E$ do
16:     if $v_i, v_j \in V^{Par}$ then
17:         $E^{Par} = E^{Par} \cup \{(v_i, v_j)\}$
Figure 4 shows the generic structure of a transformed DAG task
$\tau'$. Given this structure, since there is a synchronization node
$v_{sync}$, the relationship between $G^{Par}$ and $v_{Off}$ can be
classified as follows: either (1) the response time upper bound of
$G^{Par}$, denoted as $R^{hom}(G^{Par})$ (footnote 2), is greater than
or equal to the offloaded workload $C_{Off}$ (see Figure 5(a)); or (2)
$C_{Off}$ is greater than $R^{hom}(G^{Par})$ (see Figure 5(b)). From
this relationship, the following theorem considers three possible
execution scenarios to propose a new response time analysis supporting
heterogeneous computing:
Theorem 1. Consider a heterogeneous DAG task $\tau'$ whose general
structure is shown in Figure 4. Depending on the execution scenario,
its response time upper bound is computed as follows:
• Scenario 1. $v_{Off}$ does not belong to the critical path.
$$R^{het}(\tau') = len(G') + \frac{1}{m}\bigl(vol(G') - len(G') - C_{Off}\bigr) \qquad (2)$$
• Scenario 2.1. $v_{Off}$ belongs to the critical path and $C_{Off} \ge R^{hom}(G^{Par})$.
$$R^{het}(\tau') = len(G') + \frac{1}{m}\bigl(vol(G') - len(G') - vol(G^{Par})\bigr) \qquad (3)$$
• Scenario 2.2. $v_{Off}$ belongs to the critical path and $C_{Off} \le R^{hom}(G^{Par})$.
$$R^{het}(\tau') = len(G') - C_{Off} + len(G^{Par}) + \frac{1}{m}\bigl(vol(G') - len(G') - len(G^{Par})\bigr) \qquad (4)$$
Footnote 2: $R^{hom}(G^{Par})$ is computed with Equation 1. Notice that, for simplicity, the input is a DAG structure $G^{Par}$ instead of a task $\tau$.
Figure 4: Generic heterogeneous DAG.
Figure 5: Scheduling possibilities of the generic DAG task in Figure 4. (a) Scenarios 1 and 2.2; (b) Scenario 2.1.
Proof. The transformed DAG in Figure 4 includes a synchronization
node $v_{sync}$ ($C_{sync} = 0$) which guarantees that $G^{Par}$ and
$v_{Off}$ start their execution at the same time ($t_{sync}$ in
Figures 5(a) and 5(b)).
In case of Scenario 1 (represented in Figure 5(a)), since $v_{Off}$
does not belong to the critical path, there exists at least one path in
$G^{Par}$ whose length is greater than $C_{Off}$, i.e.
$len(G^{Par}) > C_{Off}$. Therefore,
$R^{hom}(G^{Par}) = len(G^{Par}) + \frac{1}{m}\bigl(vol(G^{Par}) - len(G^{Par})\bigr)$
must be greater than $C_{Off}$, and so $t_{Par} > t_{Off}$ in
Figure 5(a) always holds. As a consequence, $C_{Off}$ does not generate
interference that may increase the response time of $\tau'$ and it can
be safely subtracted from the self-interference factor, as done in
Equation 2.
In case of Scenarios 2.1 and 2.2, since $v_{Off}$ belongs to the
critical path, none of the nodes in $G^{Par}$ belong to it and so they
contribute to the self-interference factor.
In the former scenario (represented in Figure 5(b)), $C_{Off}$ is
greater than (or equal to) $R^{hom}(G^{Par})$, so $t_{Par} \le t_{Off}$
in Figure 5(b), and $G^{Par}$ cannot generate interference that may
increase the response time of $\tau'$. Hence, its complete workload
$vol(G^{Par})$ can be safely subtracted from the self-interference
factor, as done in Equation 3.
In the latter scenario (represented in Figure 5(a)), $C_{Off}$ is
smaller than (or equal to) $R^{hom}(G^{Par})$ (see
$t_{Off} \le t_{Par}$ in Figure 5(a)). Therefore, even though $v_{Off}$
belongs to the critical path, it does not determine the response time
of $\tau'$; $G^{Par}$ does instead. In this case, we can safely replace
$C_{Off}$ by $R^{hom}(G^{Par})$ in the critical path. Since the
contribution of $G^{Par}$ is also considered in the self-interference
factor, $vol(G^{Par})$ can be subtracted from it, in order not to count
it twice. By replacing the mentioned terms and subtracting
$vol(G^{Par})$, we obtain
$$R^{het}(\tau') = len(G') - C_{Off} + R^{hom}(G^{Par}) + \frac{1}{m}\bigl(vol(G') - len(G') - vol(G^{Par})\bigr)$$
$$= len(G') - C_{Off} + len(G^{Par}) + \frac{1}{m}\bigl(vol(G^{Par}) - len(G^{Par})\bigr) + \frac{1}{m}\bigl(vol(G') - len(G') - vol(G^{Par})\bigr).$$
By simplifying the terms, Equation 4 follows. □
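The case analysis of Theorem 1 can be sketched as a single function. This is a hedged illustration that assumes Equation 4 reads $len(G') - C_{Off} + len(G^{Par}) + \frac{1}{m}(vol(G') - len(G') - len(G^{Par}))$, as obtained at the end of the proof; the function and parameter names are hypothetical, and the caller is assumed to know whether the offloaded node lies on the critical path of the transformed DAG.

```python
from fractions import Fraction


def r_hom(length, volume, m):
    """Equation 1: len + (vol - len) / m."""
    return length + Fraction(volume - length, m)


def r_het(len_g, vol_g, len_par, vol_par, c_off, m, off_on_critical_path):
    """Response time upper bound of the transformed DAG (Theorem 1)."""
    if not off_on_critical_path:
        # Scenario 1 (Equation 2): remove C_Off from the interference.
        return len_g + Fraction(vol_g - len_g - c_off, m)
    if c_off >= r_hom(len_par, vol_par, m):
        # Scenario 2.1 (Equation 3): remove the whole volume of G_par.
        return len_g + Fraction(vol_g - len_g - vol_par, m)
    # Scenario 2.2 (Equation 4): G_par, not C_Off, determines the bound.
    return len_g - c_off + len_par + Fraction(vol_g - len_g - len_par, m)
```

With hypothetical values m = 2, len(G_par) = 4 and vol(G_par) = 8 (so that the homogeneous bound of G_par is 6), calling `r_het(10, 18, 4, 8, 6, 2, True)` falls in Scenario 2.1 and `r_het(9, 17, 4, 8, 5, 2, True)` falls in Scenario 2.2; both evaluate to 10.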
Figure 6: Percentage change of the average execution time of $\tau$ w.r.t. $\tau'$ when $n \in [100, 250]$.
It is important to remark that scenarios 2.1 and 2.2 are equivalent
when $C_{Off} = R^{hom}(G^{Par})$. Hence, if starting from Equation 4
we replace $C_{Off}$ by
$R^{hom}(G^{Par}) = len(G^{Par}) + \frac{1}{m}\bigl(vol(G^{Par}) - len(G^{Par})\bigr)$,
we directly reach Equation 3.
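The substitution can be spelled out step by step. The derivation below assumes that Equation 4 has the form $R^{het}(\tau') = len(G') - C_{Off} + len(G^{Par}) + \frac{1}{m}\bigl(vol(G') - len(G') - len(G^{Par})\bigr)$, which is the form obtained by simplifying the expression at the end of the proof.

```latex
\begin{align*}
R^{het}(\tau')
 &= len(G') - \Bigl[len(G^{Par}) + \tfrac{1}{m}\bigl(vol(G^{Par}) - len(G^{Par})\bigr)\Bigr]
    + len(G^{Par}) + \tfrac{1}{m}\bigl(vol(G') - len(G') - len(G^{Par})\bigr)\\
 &= len(G') + \tfrac{1}{m}\bigl(vol(G') - len(G') - len(G^{Par})
    - vol(G^{Par}) + len(G^{Par})\bigr)\\
 &= len(G') + \tfrac{1}{m}\bigl(vol(G') - len(G') - vol(G^{Par})\bigr),
\end{align*}
```

which is exactly the right-hand side of Equation 3.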
5 EXPERIMENTAL RESULTS
This section evaluates our response time analysis supporting
heterogeneous computing based on randomly generated DAG tasks
[12, 18]. In order to evaluate the accuracy of our response time
analysis, we implemented an ILP formulation (based on [13]) that
computes the minimum time interval needed to execute a given
heterogeneous DAG task on $m$ cores and one accelerator device. It
provides a node-to-core mapping such that the makespan of the
heterogeneous DAG task is minimized.
All the algorithms and experiments have been implemented in
MATLAB® and the ILP formulation has been coded and solved
with the IBM ILOG CPLEX Optimization Studio [10].
5.1 Experimental setup
All experiments consider a heterogeneous architecture composed
of a host processor with 2, 4, 8 or 16 cores and a single accelerator
device. Random DAG tasks are generated by recursively
expanding nodes either to terminal nodes or parallel sub-DAGs,
until a maximum recursion depth $maxdepth$ is reached. $maxdepth$
also determines the longest possible path of the DAG. The
probabilities of generating a parallel sub-DAG or a terminal node are
$p_{par}$ and $1 - p_{par}$, respectively. Moreover, the maximum number
of branches for any parallel sub-DAG is $n_{par}$ and the minimum
and maximum number of nodes of each DAG are $n_{min}$ and $n_{max}$,
respectively. The WCET of each node, except $v_{Off}$, is uniformly
selected as a positive integer in the interval
$[C_{min}, C_{max}] = [1, 100]$. Once a DAG is generated, we randomly
select $v_{Off}$ among all the nodes. $C_{Off}$ is assigned within the
interval $[1, C^{MAX}_{Off}]$, where $C^{MAX}_{Off}$ represents a
percentage (up to 60%) of the DAG's volume.
For each experiment, we generate 100 DAGs for each target value
of $C_{Off}$. Moreover, we used $p_{par} = 0.5$ and two types of DAG
tasks: (1) small tasks, with $n \le 100$, $n_{par} = 6$ and
$maxdepth = 3$ (longest path equals 7), used for the ILP solution,
which is not capable of dealing with larger tasks, and (2) large tasks,
with $n \in [100, 400]$, $n_{par} = 8$ and $maxdepth = 5$ (longest path
equals 11).
5.2 Impact of the DAG Transformation
This section evaluates the impact that the extra synchronization
point $v_{sync}$ has on the task's performance. To do so, we simulate
the execution of the original and transformed DAG tasks ($\tau$ and
$\tau'$, respectively), assuming the work-conserving breadth-first
scheduler implemented in GOMP, the OpenMP implementation in the GNU
Compiler Collection (GCC) [1].
Figure 7: Increment of $R^{het}(\tau')$ and $R^{hom}(\tau)$ w.r.t. the minimum makespan of $\tau$. (a) $m = 2$ cores, $n \in [3, 20]$; (b) $m = 8$ cores, $n \in [30, 60]$. Notice that the x-axes are different.
Figure 6 shows the percentage change (footnote 3) of the average
execution time of $\tau$ with respect to $\tau'$, when increasing the
offloaded workload $C_{Off}$ with respect to $\tau$'s volume from 1% to
70%. The experiment considers $m = 2, 4, 8$ and 16 cores and a number
of nodes $n \in [100, 250]$ (similar trends have been observed when
$n \in [250, 400]$).
$v_{sync}$ has a negative impact on the average performance of $\tau'$,
compared to $\tau$, when $C_{Off}$ represents a small portion of the
DAG's volume (less than 11%, 8%, 6% and 4.5% for $m = 2, 4, 8$ and 16,
respectively). The reason is that an extra synchronization point
limits the parallelism. This negative impact increases as the number
of cores increases (and so more parallelism can be potentially
exploited). When $C_{Off}$ represents 1% of the DAG's volume, $\tau$ is
3% faster than $\tau'$ for $m = 2$, and 15% faster for $m = 16$.
Surprisingly, when $C_{Off}$ increases the trend is inverted: $\tau$ is
24% slower than $\tau'$ for $m = 2$ when $C_{Off}$ represents 28% of
the DAG's volume, and 4% slower for $m = 16$ when $C_{Off}$ represents
8%. The reason is that $v_{sync}$ guarantees that the host processor is
not idle while executing $v_{Off}$ (see Figure 1(c)). The performance
benefit of $v_{sync}$ decreases as $m$ increases because the
self-interference factor has less impact as the number of cores
increases (see Theorem 1).
Finally, it is worth noting that for higher values of $C_{Off}$, the
difference between the performance of $\tau$ and $\tau'$ seems to
decrease. However, the absolute difference remains constant. As
$C_{Off}$ increases, it becomes the dominant factor in the execution
times of $\tau$ and $\tau'$, and so both increase equally. A constant
absolute difference, expressed as a percentage of a growing execution
time, necessarily decreases.
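The kind of simulation used in this section can be sketched with a simple event-driven list scheduler. This is a simplified stand-in, not GOMP's actual scheduler: it assumes that every node runs for exactly its WCET, that ready host nodes are dispatched in FIFO order, and that the offloaded node starts on the accelerator as soon as it is ready. The function name `simulate` and the WCET values are hypothetical; the values were chosen here so that the results agree with the numbers quoted in Sections 3.2 and 3.3, and they are not read from Figure 1.

```python
import heapq
from collections import deque


def simulate(wcet, edges, m, v_off):
    """Makespan of one DAG on m host cores plus one accelerator."""
    succs = {v: [] for v in wcet}
    missing = {v: 0 for v in wcet}           # unfinished predecessors
    for u, v in edges:
        succs[u].append(v)
        missing[v] += 1
    ready = deque(v for v in wcet if missing[v] == 0 and v != v_off)
    off_ready = v_off in wcet and missing[v_off] == 0
    running = []                             # heap of (finish_time, node)
    free_cores, now = m, 0
    while True:
        # Work-conserving dispatch: never leave a core idle if work is ready.
        while ready and free_cores:
            v = ready.popleft()
            heapq.heappush(running, (now + wcet[v], v))
            free_cores -= 1
        if off_ready:
            heapq.heappush(running, (now + wcet[v_off], v_off))
            off_ready = False
        if not running:
            return now
        now, v = heapq.heappop(running)      # advance to next completion
        if v != v_off:
            free_cores += 1
        for w in succs[v]:
            missing[w] -= 1
            if missing[w] == 0:
                if w == v_off:
                    off_ready = True
                else:
                    ready.append(w)


# Hypothetical WCETs; edge lists (not sets) keep the FIFO order deterministic.
wcet = {"v1": 1, "v2": 4, "v3": 6, "v4": 2, "v5": 1, "vOff": 4}
original = [("v1", "v2"), ("v1", "v3"), ("v1", "v4"), ("v4", "vOff"),
            ("v2", "v5"), ("v3", "v5"), ("vOff", "v5")]
transformed = [("v1", "v4"), ("v4", "vsync"), ("vsync", "vOff"),
               ("vsync", "v2"), ("vsync", "v3"),
               ("v2", "v5"), ("v3", "v5"), ("vOff", "v5")]
t_orig = simulate(wcet, original, 2, "vOff")
t_new = simulate({**wcet, "vsync": 0}, transformed, 2, "vOff")
```

With these assumptions the original DAG finishes at time 12 on two cores, because the host is idle while the offloaded node runs, whereas the transformed DAG finishes at time 10.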
5.3 Accuracy of the response time analysis
This section analyses the accuracy of $R^{het}$ (Equations 2, 3, 4) and
$R^{hom}$ (Equation 1) with respect to the minimum makespan of a
DAG task (ILP solver). Given the ILP complexity, we only consider
small tasks for which the ILP solver is able to provide an optimal
solution in less than 12 hours.
Figure 7 shows the increment of the response time upper bound
provided by $R^{hom}(\tau)$ (Equation 1) and $R^{het}(\tau')$
(Equations 2 to 4) with respect to the minimum makespan of $\tau$
computed by the ILP solver, when varying $C_{Off}$ with respect to
$\tau$'s volume. We evaluated 2, 4, 8 and 16 cores but, for the sake of
space, only results for $m = 2$ and 8 cores are shown (Figure 7(a) and
(b), respectively).
When $C_{Off}$ represents less than 2% of $vol(\tau)$, $R^{het}(\tau')$
is 19% and 54% higher than the minimum makespan for $m = 2$ and 8,
respectively. For $m = 4$ and 16, $R^{het}(\tau')$ is 40% and 57%
higher, respectively (not shown in the figure). This pessimism,
however, decreases as $C_{Off}$ increases, being less than 1% when
$C_{Off}$ represents 48.1%, 42.7%, 24.5% and 15% of $vol(\tau)$, for
$m = 2, 4, 8$ and 16, respectively. The reason is that $C_{Off}$
becomes the dominant factor of $R^{het}(\tau')$ and so $G^{Par}$ is no
longer relevant (see Figure 5(b)).
$R^{hom}(\tau)$ provides more accurate results than $R^{het}(\tau')$
when $C_{Off}$ represents less than 3.1% and 11.2% of $vol(\tau)$ for
$m = 2$ and 8, respectively. The reason for this trend, as also shown
in Section 5.2, is that $v_{sync}$ negatively impacts both the average
and the upper bound response time. For $m = 4$ and 16, $R^{hom}(\tau)$
provides better results when $C_{Off}$ represents less than 12.2% and
8.7%, respectively (not shown in the figure). This trend, however, is
inverted when $C_{Off}$ increases, and so $R^{het}(\tau')$ provides
more accurate results than $R^{hom}(\tau)$; e.g. when $R^{het}(\tau')$
provides a response time only 1% higher than the minimum makespan,
$R^{hom}(\tau)$ is up to 20% higher.
Footnote 3: The percentage change computes the relative change of two values from the same variable; in our case the average execution time.
Figure 8: Percentage of scenarios occurrence, $n \in [100, 250]$. (a) $m = 2$ cores; (b) $m = 8$ cores.
Figure 9: Percentage change of $R^{hom}(\tau)$ w.r.t. $R^{het}(\tau')$, $n \in [100, 250]$.
5.4 Homogeneous vs. Heterogeneous
This section evaluates our response time analysis $R^{het}(\tau')$,
compared with $R^{hom}(\tau)$. All figures consider randomly generated
DAGs with $n \in [100, 250]$ (similar trends are observed when
$n \in [250, 400]$).
In order to better understand the benefits brought by
$R^{het}(\tau')$, it is important first to understand the execution
scenarios presented in Theorem 1. For the randomly generated tasks,
Figure 8 shows the occurrence percentage of the scenarios described in
Section 4, when varying the percentage of $C_{Off}$ over $vol(\tau)$
from 0.12% to 50%. We evaluated heterogeneous architectures featuring
host processors with $m = 2, 4, 8$ and 16 cores but, for the sake of
space, only results for (a) $m = 2$ and (b) $m = 8$ are shown.
Scenario 1 is the dominant one when the percentage of $C_{Off}$
over $vol(\tau)$ is less than 8%. This scenario corresponds to the case
in which $v_{Off}$ does not belong to the critical path and is
therefore independent of $m$. From that point on, scenario 2.2 becomes
more relevant, as $v_{Off}$ belongs to the critical path but $C_{Off}$
is still smaller than the response time of $G^{Par}$. When $C_{Off}$
becomes higher than $R^{hom}(G^{Par})$, occurrences of scenario 2.1
increase. As $m$ increases, occurrences of scenario 2.1 start to
increase earlier because higher parallelism can be exploited in the
host, and so $R^{hom}(G^{Par})$ becomes smaller.
Interestingly, the intersection of scenarios 2.1 and 2.2, i.e. when
$C_{Off} = R^{hom}(G^{Par})$ (Equations 3 and 4 are equivalent),
results in the maximum benefit of $R^{het}$ with respect to $R^{hom}$
(see below). This occurs when $C_{Off}$ is 32%, 20%, 14% and 10% of
$vol(\tau)$ for $m = 2, 4, 8$ and 16, respectively. The reason is that
utilization of both host and device is maximized, i.e. there are fewer
idle times.
Figure 9 shows the percentage change of $R^{hom}(\tau)$ with respect to
$R^{het}(\tau')$, considering a host processor with $m = 2, 4, 8$ and
16 cores and varying $C_{Off}$ with respect to $vol(\tau)$ from 0.12%
to 50%. In general, our response time analysis $R^{het}(\tau')$
improves over $R^{hom}(\tau)$ when considering heterogeneous
computation. This improvement increases as $C_{Off}$ increases due to
the reduction of the self-interference factor. $R^{hom}$ only
outperforms $R^{het}$ for small values of $C_{Off}$ due to the negative
impact of the synchronization point. Concretely, this occurs when
$C_{Off}$ represents less than 1.6%, 3.4%, 4.6% and 5% of $vol(\tau)$
for $m = 2, 4, 8$ and 16, respectively.
As pointed out above, the maximum benefit of $R^{het}(\tau')$ with
respect to $R^{hom}(\tau)$ is reached when
$C_{Off} = R^{hom}(G^{Par})$. In this case, $R^{hom}(\tau)$ is 70%,
55%, 40% and 30% higher than $R^{het}(\tau')$ for $m = 2, 4, 8$ and 16,
respectively. Notice that as $m$ increases, the benefit of
$R^{het}(\tau')$ is smaller because the self-interference factor is
divided by $m$ (Equations 2 to 4).
Results presented in Figure 9 correspond to an average response
time upper bound over all generated DAG tasks. However, the
maximum observed difference between $R^{hom}(\tau)$ and
$R^{het}(\tau')$ is 95.0%, 82.5%, 65.3% and 47.7% for $m = 2, 4, 8$ and
16, respectively.
6 RELATED WORK
Parallel task models are increasingly being used in the real-time
domain. A response-time analysis is presented in [2] for fork/join
tasks under partitioned fixed-priority scheduling. The sporadic
DAG model [17][6] used in this work has also been considered
with conditional nodes [12][5], and with global [3][18], partitioned
[9] or federated [4] scheduling approaches. In [21] the authors study
the similarities between the DAG model and parallel programming models
such as OpenMP. A dynamic schedulability test is provided in [19]
and static scheduling heuristics are presented in [13].
Regarding heterogeneous architectures, real-time tasks have
been traditionally modeled as self-suspending tasks. Most of the
published work considers that tasks are scheduled on a uniprocessor
platform and use a device to accelerate part of the execution.
Unfortunately, it has been shown that many previous works concerning
the analysis of self-suspending tasks are flawed. Refer to
[8] for a complete review of self-suspending task theory and an
explanation of the existing misconceptions. Finally, in [7] the authors
design a framework to support real-time systems on FPGAs and
provide a response time analysis to verify the schedulability of a set
of tasks with software parts and hardware-accelerated functions.
7 CONCLUSIONS
This paper presents a novel response time analysis for verifying
the schedulability of DAG tasks supporting heterogeneous computing.
To do so, we first identify the portion of the DAG running
in the host, $G^{Par}$, that can potentially execute in parallel with
the workload offloaded to the device, $v_{Off}$. Secondly, we propose a
DAG transformation to guarantee the parallel execution of $G^{Par}$
and $v_{Off}$. We build our response time analysis upon this
transformation. Interestingly, besides the timing guarantees provided,
this DAG transformation also results in higher average performance
when the offloaded workload represents more than 10% of the DAG's
volume. The reason is that the scenario in which the host processor
is idle waiting for the device to finish is avoided. Our response
time analysis significantly outperforms the homogeneous one, by
70%, 55%, 40% and 30% for 2, 4, 8 and 16 cores in the host processor,
respectively. Moreover, for small DAG tasks (3-100 nodes), our
response time upper bound is comparable to the minimum makespan
derived with an ILP solution. In the future, we intend to improve the
analysis by considering (i) more tasks assigned to the accelerator
device, and (ii) more devices in the heterogeneous architecture.
ACKNOWLEDGMENTS
This work is supported by the Spanish Ministry of Science and
Innovation under contract TIN2015-65316-P.
REFERENCES
[1] GOMP project. URL: https://gcc.gnu.org/projects/gomp/.
[2] P. Axer, S. Quinton, M. Neukirchner, R. Ernst, B. Döbel, and H. Härtig. Response-
time analysis of parallel fork-join workloads with real-time constraints. In ECRTS,
2013.
[3] S. Baruah. Improved multiprocessor global schedulability analysis of sporadic
DAG task systems. In ECRTS, 2014.
[4] S. Baruah. The federated scheduling of systems of mixed-criticality sporadic
DAG tasks. In RTSS, 2016.
[5] S. Baruah, V. Bonifaci, and A. Marchetti-Spaccamela. The global EDF scheduling
of systems of conditional sporadic DAG tasks. In ECRTS, 2015.
[6] S. Baruah, V. Bonifaci, A. Marchetti-Spaccamela, L. Stougie, and A. Wiese. A
generalized parallel task model for recurrent real-time processes. In RTSS, 2012.
[7] A. Biondi, A. Balsini, M. Pagani, E. Rossi, M. Marinoni, and G. Buttazzo. A
framework for supporting real-time applications on dynamic reconfigurable
FPGAs. In RTSS, 2016.
[8] J.-J. Chen, G. Nelissen, W.-H. Huang, M. Yang, et al. Many suspensions many
problems: A review of self-suspending tasks in real-time systems. Tech. Rep, 854
(2nd ver.), March 2017.
[9] J. Fonseca, G. Nelissen, V. Nelis, and L. M. Pinho. Response time analysis of
sporadic DAG tasks under partitioned scheduling. In SIES, 2016.
[10] IBM ILOG CPLEX Optimization Studio. URL: http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer.
[11] S. Leibson and N. Mehta. Xilinx ultrascale: The next-generation architecture for
your next-generation architecture. Xilinx White Paper WP435, 2013.
[12] A. Melani, M. Bertogna, V. Bonifaci, A. Marchetti-Spaccamela, and G. C. Buttazzo.
Response-time analysis of conditional DAG tasks in multiprocessor systems. In
ECRTS, 2015.
[13] A. Melani, M. A. Serrano, M. Bertogna, I. Cerutti, E. Quinones, and G. Buttazzo.
A static scheduling approach to enable safety-critical OpenMP applications. In
ASP-DAC, 2017.
[14] NVIDIA Tegra. K1: A new era in mobile computing. NVIDIA White Paper, 2014.
[15] OpenMP Architecture Review Board. OpenMP 4.5 Complete Specification. URL:
http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf, 2015.
[16] S. Royuela, M. A. Serrano, A. Duran, X. Martorell, and E. Quinones. A functional
safety OpenMP for critical real-time embedded systems. In IWOMP, 2017.
[17] A. Saifullah, D. Ferry, C. Lu, and C. Gill. Real-time scheduling of parallel tasks
under a general DAG model. Tech. Rep, 2012.
[18] M. A. Serrano, A. Melani, M. Bertogna, and E. Quinones. Response-time analysis
of DAG tasks under fixed priority scheduling with limited preemptions. In DATE,
2016.
[19] M. A. Serrano, A. Melani, R. Vargas, A. Marongiu, M. Bertogna, and E. Quinones.
Timing characterization of OpenMP4 tasking model. In CASES, 2015.
[20] Texas Instruments. 66AK2Hxx Multicore Keystone II System-on-Chip (SoC).
November 2012.
[21] R. Vargas, E. Quinones, and A. Marongiu. OpenMP and timing predictability: a
possible union? In DATE, 2015.
[22] R. E. Vargas, S. Royuela, M. A. Serrano, X. Martorell, and E. Quinones. A light-
weight OpenMP4 run-time for embedded systems. In ASP-DAC, 2016.
