On the Analysis of Parallel Real-Time Tasks with Spin Locks by Jiang, Xu et al.
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, XX 201X 1
On the Analysis of Parallel Real-Time Tasks with
Spin Locks
Xu Jiang1,2, Nan Guan2∗, He Du1,2, Weichen Liu3, Wang Yi4
1 Northeastern University, China
2 The Hong Kong Polytechnic University, Hong Kong
3 Nanyang Technological University, Singapore
4 Uppsala University, Sweden
Abstract—Locking protocols are an essential component in resource management of real-time systems, coordinating mutually
exclusive accesses to shared resources from different tasks. Although the design and analysis of locking protocols have been
intensively studied for sequential real-time tasks, there has been little work on this topic for parallel real-time tasks. In this paper, we
study the analysis of parallel real-time tasks using spin locks to protect accesses to shared resources in three commonly used request
serving orders (unordered, FIFO-order and priority-order). What makes our analysis more accurate is that it
systematically analyzes the blocking time that may delay a task’s finishing time, jointly considering its impact on the total workload and the longest
path length, rather than analyzing the two separately and counting all blocking time as workload that delays a
task’s finishing time, as is commonly assumed in the state of the art.
Index Terms—Real-Time Scheduling, Spin Lock, Parallel tasks, Multi-core.
1 INTRODUCTION
Real-time systems are playing a more important role in our
daily life as computing is closely integrated to the physical
world. Violating timing constraints in such systems may
lead to catastrophic consequences such as loss of human life.
Therefore, real-time systems must manage resources in a way
such that timing correctness can be guaranteed. Locking
protocols are an essential component in resource management
of real-time systems, coordinating mutually exclusive
accesses to shared physical/logical resources by different
tasks. Inappropriate design or incorrect analysis of locking
protocols will lead to incorrect system timing behavior,
e.g., as in the famous software failure accident in Mars
Pathfinder [1].
Multi-cores are becoming mainstream hardware plat-
forms for real-time systems, to meet their rapidly increasing
requirements in high performance and low power consump-
tion. To fully utilize the processing capacity of multi-cores,
software should be parallelized. While locking protocols
for sequential real-time task systems have been intensively
studied in classical real-time scheduling theory [2], [3], [4],
[5], there is little work on this topic for parallel real-time
tasks. On the other hand, there has been much work on
scheduling algorithms and analysis techniques for parallel
real time tasks [6], [7], [8], where tasks are assumed to be
independent from each other and the locking issue is not
considered.
Recently, spin locks were studied for parallel real-time
tasks in [9] where each parallel task is scheduled exclusively
on several pre-assigned processors (i.e., by the federated
scheduling approach [6]). However, the analysis in [9] is
pessimistic. The contribution of our work is to develop
new techniques for the schedulability analysis of real-time
parallel tasks with spin locks that significantly improve the
analysis precision over the state of the art.
*Corresponding author: Nan Guan
Both [9] and our work only require knowledge of the
total worst-case execution time (WCET) Ci and longest path
length Li of each task, but not the exact graph structure (the
benefits of only using the abstract Ci and Li information
in the analysis will be discussed in Section 2.4). In [9]’s
analysis, all blocking time caused by spin locks is considered
to contribute to the workload that delays the finishing time
of a parallel task, which is added to Ci and Li in their worst-
case scenarios separately. This is quite pessimistic since much
of the blocking time cannot delay the finishing time of a parallel
task, due to parallelism and intra-task dependencies.
Moreover, the worst-case scenario leading to the maximal
increase to Ci is in general different from the worst-case
scenario leading to the maximal increase to Li.
To solve these problems, in this work we first develop
new schedulability analysis techniques for parallel tasks
with spin locks, where the blocking time contributing to
the workload that may delay a task’s finishing time is
systematically defined and analyzed. Further, we develop
blocking analysis techniques for three common request serv-
ing orders, i.e., unordered, FIFO-order and priority-order,
where the impact to Ci and Li is jointly considered thus
achieving higher analysis precision.
We conduct experiments to evaluate the precision improvement
of our new techniques compared with [9],
with both randomly generated tasks and workloads generated
according to realistic OpenMP programs. Experimental
results show that our techniques consistently outperform [9]
under different settings.
arXiv:2003.08233v1 [cs.DC] 18 Mar 2020
Fig. 1. An example of a DAG task τi.
2 PRELIMINARY
2.1 Task Model
We consider a task set T consisting of several periodic DAG
tasks T = {τ1, τ2, ..., τ|T |} to be executed on m processors.
A task τi has a period Ti, a relative deadline Di and a
workload structure modeled by a Directed Acyclic Graph
(DAG) Gi = 〈Vi, Ei〉, where Vi is the set of vertices and Ei is
the set of edges in Gi. Tasks have constrained deadlines, i.e.,
Di ≤ Ti. Each vertex v ∈ Vi is characterized by a worst-case
execution time (WCET) c(v). We use Ci to denote the total
WCET of all vertices of τi: $C_i = \sum_{v \in V_i} c(v)$. The utilization of
task τi is Ui = Ci/Ti and the density of task τi is Γi = Ci/Di.
In this paper, we only consider DAG tasks with Γi > 1, as
those with Γi ≤ 1 can be executed sequentially and handled
by existing techniques for sequential real-time tasks.
Each edge (u, v) ∈ Ei represents the precedence relation
between vertices u and v, where u is a predecessor of v,
and v is a successor of u. We assume each DAG has a
unique head vertex (with no predecessors) and a unique
tail vertex (with no successors). This assumption does not
limit the expressiveness of our model since one can always
add a dummy head/tail vertex to a DAG having multiple
entry/exit points. A complete path in a DAG task is a
sequence of vertices π = {v1, v2, ..., vp}, where the first
element v1 is the head vertex of Gi, the last element vp
is the tail vertex of Gi, and vj is a predecessor of vj+1
for each pair of consecutive elements vj and vj+1 in π.
The length of each path π is $len(\pi) = \sum_{v \in \pi} c(v)$. We use
Li to denote the longest length among all paths in Gi:
$L_i = \max_{\pi \in G_i}\{len(\pi)\}$. Task τi generates a potentially
infinite sequence of jobs, which inherit τi’s DAG workload
structure Gi. Let J be a job released by τi, then we use r(J)
to denote J ’s release time and f(J) to denote J ’s finish
time. The absolute deadline of J is calculated by r(J) + Di.
At runtime, we say a vertex (of a job J ) is eligible at some
time point if all its predecessors (of the same job J ) have
been finished and thus it can immediately execute if there
are available processors. Fig. 1 shows a DAG task example
τi with 7 vertices, where Ci = 10 and Li = 5 (the longest
path is {v1, v4, v6, v7} or {v1, v2, v6, v7}).
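As an illustration of the two quantities used throughout the analysis, the following Python sketch computes Ci and Li for a small DAG. The vertex names, WCETs and edges are hypothetical (they are not claimed to be the DAG of Fig. 1), and the function names are introduced here only for illustration.

```python
from functools import lru_cache

# Hypothetical DAG: WCET c(v) of each vertex and its successor list (edges).
WCET = {"v1": 1, "v2": 2, "v3": 1, "v4": 2, "v5": 2, "v6": 1, "v7": 1}
SUCC = {"v1": ["v2", "v3", "v4"], "v2": ["v6"], "v3": ["v5"],
        "v4": ["v6"], "v5": ["v7"], "v6": ["v7"], "v7": []}


def total_wcet(wcet):
    """C_i: sum of the WCETs of all vertices."""
    return sum(wcet.values())


def longest_path_length(wcet, succ):
    """L_i: largest sum of WCETs along any path of the DAG."""
    @lru_cache(maxsize=None)
    def longest_from(v):
        # Longest path that starts at v, including c(v) itself.
        tails = [longest_from(u) for u in succ[v]]
        return wcet[v] + max(tails, default=0)

    return max(longest_from(v) for v in wcet)


C_i = total_wcet(WCET)
L_i = longest_path_length(WCET, SUCC)
```

Only these two aggregate values, not the edge list itself, are needed by the analysis that follows.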
2.2 Resource and Lock Model
There is a limited set of serially-reusable shared resources
(called resources for short) Θ = {ℓ1, ℓ2, ..., ℓ|Θ|} in the system,
such as I/O ports, network links, message buffers,
or other shared data structures. Resources are protected by
spin locks, i.e., the program must acquire, hold and release the
lock affiliated to ℓq before, during and after executing the
code segment accessing ℓq. We assume the code segment
wrapped by a pair of lock acquisition and lock release
does not cross different vertices. A vertex must execute
non-preemptively when it is holding a lock. When a vertex
tries to acquire a lock affiliated to ℓq that is held by another vertex
(either from the same task or from another task), the acquiring
vertex must spin non-preemptively until it successfully
obtains the lock, and we say this vertex is spinning for ℓq.
When multiple vertices are spinning for the same re-
source at the same time, we consider three kinds of order in
which their requests will be served: unordered, FIFO-order
and priority-order. In priority-order, each task is assigned a
unique priority and all requests from vertices of the same task
have the same priority. Note that the priorities are only used
to decide the order when requests from different tasks to a
resource are served.
A vertex may access different shared resources and thus
hold different locks. However, we assume the locks are
non-nested, i.e., a vertex never acquires another lock when
holding a lock. We use Θi to denote the set of resources
accessed by vertices of task τi.
The worst-case time of each single access to ℓq by task τi
(i.e., the maximal duration for a vertex in τi to hold the lock
affiliated to ℓq once) is denoted by Li,q, and the worst-case
number of accesses to ℓq by τi is denoted by Ni,q. Note that
a vertex’s WCET includes the resource access time. In
contrast, the time spent by a vertex on spinning for some
resource, called blocking time [10], [11], is not included in the
WCET estimation.
TABLE 1
Notations adopted in this paper.
Notation | Description
τi | a DAG task
Gi | the workload structure of τi
Vi | the set of vertices in Gi
Ei | the set of edges of Gi
c(v) | WCET of a vertex v
Ci | total WCET of all vertices of τi
π | a path
λ | a key path
len(π) | the total WCET of all vertices on π
Li | the longest length among all paths of τi
ℓq | a shared resource
Ni,q | number of accesses to ℓq from τi
Li,q | the worst-case time of each single access to ℓq by τi
Wi | working time of a job of τi
Γi | idle time of a job of τi
Bi | blocking time of a job of τi
$B^{\lambda,I}_i$ | intra-task key path blocking time of a job of τi
$B^{\lambda,O}_i$ | inter-task key path blocking time of a job of τi
$B^{\bar{\lambda},I}_i$ | intra-task delay blocking time of a job of τi
$B^{\bar{\lambda},O}_i$ | inter-task delay blocking time of a job of τi
$B^{\tilde{\lambda},I}_i$ | intra-task parallel blocking time of a job of τi
$B^{\tilde{\lambda},O}_i$ | inter-task parallel blocking time of a job of τi
$I_i$ | defined in Lemma 3
$I^{I}_{i,q}$ | defined in (6)
$I^{O}_{i,q}$ | defined in (7)
$\eta^{q}_{i,j}$ | defined in (11)
$\Delta^{q}_{i,j}$ | defined in (18)
Ri | worst-case response time of τi
2.3 Scheduling Model
There are in total m processors in the system, which will
be partitioned into several subsets and each subset is as-
signed to a task. We use mi to denote the number of
processors assigned to task τi. At runtime, τi is scheduled
by a work-conserving scheduling algorithm [6] exclusively on
these mi processors. Note that although a task τi executes
exclusively on its own mi processors, its timing behavior
is still affected by other tasks due to contention on
the shared resources. The response time R(J) of a job J is
R(J) = f(J)−r(J), and the worst-case response time (WCRT)
Ri of task τi is the maximum R(J) among all its released
jobs J. Task τi is schedulable if Ri ≤ Di. The problem
addressed in this paper is how to partition the m processors
among the tasks such that every task is guaranteed to be schedulable.
2.4 Remark
The analysis techniques of this paper only require
knowledge of Ci and Li of each task τi, as well as Ni,q and
Li,q for each pair of task τi and resource ℓq. It is not required
to know the exact graph structure of the task, nor the
exact distribution of the resource access requests within the
task. This makes our analysis techniques general, in the
sense that they are directly applicable to more expressive
models, e.g., the conditional DAG model, as long as we
still can obtain the Ci, Li, Ni,q , Li,q information. Moreover,
as pointed out by [9], parallel programs are often data-
dependent and their internal graph structures usually can
only be unfolded at run time, so the exact graph structure of
a parallel task can vary from one release to the next. There-
fore, the analysis techniques using abstract information are
more practical than those relying on exact graph structure
information.
Of course if one can model the resource access behavior
in a more detailed manner, e.g., giving the exact worst-case
duration of each access and the information about which
resource is accessed by which part of the task at which
time point, it will certainly lead to more precise results in
general. However, in practice it is not always possible to
model realistic systems with such detailed information due
to the flexibility and non-determinism of software behavior.
The study of finer-grained resource access models and the
corresponding analysis techniques is left as future work.
Note that the main scope of this
paper is to present blocking and schedulability analysis
techniques for scheduling DAG tasks with spin locks.
We do not impose any constraint on the scheduler used for each
parallel program, as long as it is
work-conserving (e.g., EDF). A limitation of this paper
is that we assume locks to be non-nested. In fact, the analysis
of nested locks is more complicated even for sequential
tasks, and this problem is still largely open [12]. However,
when nested locks are used in practice, we can adopt
techniques such as group locks [12] to transform nested
locks into independent locks such that the techniques presented
in this paper are still applicable.
3 DISCUSSION OF EXISTING TECHNIQUES
There has been significant work on locking protocols and
blocking analysis for sequential tasks (see Section 7 for more
details). However, it is not appropriate to directly apply
blocking analysis techniques for sequential tasks to DAG
tasks.
Fig. 2. An example of blocking behavior of a DAG job: (a) an example of a DAG task; (b) a possible sequence.
First, the definition of blocking for DAG tasks differs
from that for sequential tasks. Under sequential task
models, the blocking time of each task is analyzed individ-
ually, and the exact definition of blocking time as well as
the blocking analysis techniques are developed according
to some particular schedulability tests [4], [10], [13] where
the blocking time can be accounted for. The main objective of
locking protocols (with blocking analysis) is to bound such
maximum blocking (e.g., the priority inversion blocking
[12], [13]) to an individual task. However, this is not the case
for DAG tasks where the schedulability analysis object is the
whole DAG task. For example, the DAG task in Figure 2.(a)
has 8 vertices where c(v7) = 2 and each of other vertices has
a WCET of 1. Vertices v2, v3, v5 and v6 each need to access the same shared
resource ℓq for 1 time unit, while the remaining vertices do
not need to access any shared resource. A possible execution
sequence of a job of τi is shown in Figure 2.(b) where
P1, P2 and P3 denote the processors. It can be observed
that v2 cannot be blocked by v3 if v2 blocks v3 in a
DAG job (which is also the case for sequential tasks but
each vertex must be analyzed individually in a worst-case
blocking scenario). Moreover, although v6 is blocked by v5,
the finishing time of the DAG job is not delayed. The reason
is that the impact of blocking time on the schedulability of
a DAG job is actually reflected by its impact on the progress
of a particular path, i.e., {v1, v3, v4, v7, v8} in Figure 2.(b).
This is quite different from the situation under sequential task
models where the blocking time of each task is analyzed
individually. To develop blocking analysis for DAG tasks,
we first need to systematically define the notion of blocking
and analyze which blocking should be accounted according
to its influence on the timing behavior of a DAG task.
Second, as discussed in Section 2.4, the exact distribu-
tion of resource access requests is not known under the
model considered in this paper. Therefore, it is impossible
to directly apply blocking analysis techniques for sequential
tasks to the task model considered in this paper. There
may also be cases where the exact graph structure and
more concrete information about the resource accesses are available. In this
case, one may utilize such concrete information and use
sequential locking protocols to perform blocking analysis.
We will evaluate the performance when directly applying
OMLP and its associated blocking analysis techniques on
DAG tasks in Section 6 to validate the problems discussed
in this section.
4 PREPARATION
In this section, we introduce some useful concepts and
present schedulability analysis techniques for parallel tasks
that are applicable irrespective of the locking protocols and
request serving orders. Then in the next section we will
apply these results to develop specific blocking analysis
techniques for unordered, FIFO-order and priority-order request
serving, respectively.
When we say a vertex is executing, it may be either
holding or not holding a lock. We say a processor is busy if
some vertex is executing or spinning on this processor, and
say a processor is busy with a vertex v if vertex v is executing
or spinning on this processor. A processor is said to be idle
if it is not busy.
Let Ji denote an arbitrary job of τi, which is released at
r(Ji) and finished at f(Ji). The total amount of time spent
on mi processors assigned to τi during [r(Ji), f(Ji)) is mi ·
(f(Ji)−r(Ji)), which can be divided into three disjoint parts
mi · (f(Ji)− r(Ji)) = Bi +Wi + Γi:
• Blocking Time Bi: the cumulative length of time on
mi processors spent on spinning.
• Working Time Wi: the cumulative length of time
on mi processors spent on executing workload of Ji
(either holding a lock or not).
• Idle Time Γi: the cumulative length of time on mi
processors being idle.
Fig. 3. Illustration of different times.
Fig. 3 shows a possible scheduling sequence of a job
of the task in Fig. 1 on 3 processors, with release time 0
and finish time 8. Suppose that v2, v3, v4 need to access
the same shared resource ℓq for 1 time unit, while the
remaining vertices do not need to access any shared resource.
The blocking time is 5 (the area wrapped by red solid lines),
the idle time is 9 (the area wrapped by blue dashed lines), and
the working time is 10 (the remaining area within [0, 8) on
all 3 processors).
Given mi processors assigned to task τi, we have:
Lemma 1. τi’s worst-case response time Ri is bounded by:
$R_i \le \frac{B_i + \Gamma_i + C_i}{m_i}$. (1)
Proof. The response time of Ji is f(Ji) − r(Ji). By mi ·
(f(Ji) − r(Ji)) = Bi + Wi + Γi and Wi ≤ Ci, we know
Ji’s response time is bounded by $\frac{B_i + C_i + \Gamma_i}{m_i}$. Since Ji is an
arbitrary job of τi, Ri is also bounded by $\frac{B_i + C_i + \Gamma_i}{m_i}$.
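The bound of Lemma 1 can be checked numerically against the schedule of Fig. 3. The sketch below is only an illustration: the function name is ours, and the inputs are the blocking, idle and working times stated for Fig. 3 (where the working time equals Ci = 10).

```python
def lemma1_response_bound(blocking, idle, total_wcet, m_i):
    """Right-hand side of (1): (B_i + Gamma_i + C_i) / m_i."""
    return (blocking + idle + total_wcet) / m_i


# Values read from the Fig. 3 example: B_i = 5, Gamma_i = 9, C_i = 10, m_i = 3.
bound = lemma1_response_bound(blocking=5, idle=9, total_wcet=10, m_i=3)
```

With these values the bound evaluates to 8, which coincides with the finish time 8 of the job released at time 0 in that example.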
By Lemma 1, the problem of bounding Ri boils down to
bounding Bi+Γi. Before going further into the analysis, we
first introduce the concept of key path:
Definition 1 (Key Path). A key path of job Ji, denoted by
λ = {v1, v2, ..., vp}, is a complete path in Gi such that, for all j with 1 < j ≤ p,
vj−1 is a predecessor of vj with the latest finish time among all
predecessors of vj.
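Definition 1 can be read as a backward walk from the tail vertex. The following sketch, under the assumption that the finish time of every vertex of a job is known from an execution trace, reconstructs one key path; the predecessor lists and finish times are hypothetical and do not correspond to any figure in this paper.

```python
def key_path(pred, finish, tail):
    """Walk back from the tail, always moving to the predecessor that finished last."""
    path = [tail]
    v = tail
    while pred[v]:  # the head vertex has no predecessors
        v = max(pred[v], key=lambda u: finish[u])
        path.append(v)
    path.reverse()
    return path


# Hypothetical job: predecessor lists and observed finish times of each vertex.
PRED = {"v1": [], "v2": ["v1"], "v3": ["v1"], "v4": ["v2", "v3"]}
FINISH = {"v1": 1, "v2": 3, "v3": 4, "v4": 5}

path = key_path(PRED, FINISH, tail="v4")
```

Ties between predecessors with equal finish times are broken arbitrarily here, which is consistent with the definition speaking of "a" key path rather than "the" key path.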
Lemma 2. Let λ = {v1, v2, ...vp} be a key path of Ji. All mi
processors must be busy at any time point in [r(Ji), f(Ji)) when
no processor is busy with vertices in λ.
Proof. Let vj and vj+1 be two successive elements in λ. By
Definition 1, all predecessors of vj+1 have finished at the
finish time of vj (and thus vj+1 is eligible for execution
at that time point). Therefore, all processors must be busy
between the finish time of vj and the starting time of vj+1.
Applying the above reasoning to each pair of successive
elements in λ, the lemma is proved.
In the following, we divide the blocking time Bi into
several disjoint parts. There are two dimensions to divide
Bi. First, we can divide Bi into:
• Key Path Blocking Time $B^{\lambda}_i$, the cumulative length
of time spent on spinning by a vertex in λ.
• Delay Blocking Time $B^{\bar{\lambda}}_i$, the cumulative length of
time on all mi processors spent on spinning during
all the subintervals in [r(Ji), f(Ji)) when no processor
is busy with a vertex in λ.
• Parallel Blocking Time $B^{\tilde{\lambda}}_i$, the cumulative length of
time on all other mi − 1 processors spent on spinning
during all the subintervals in [r(Ji), f(Ji)) when one
processor is busy with a vertex in λ.
In the second dimension we divide Bi according to
whether the processor is waiting for a resource locked by
the same task or by a different task:
• Intra-task Blocking Time, the cumulative length of
time spent on spinning and waiting for a resource
locked by the same task,
• Inter-task Blocking Time, the cumulative length of
time spent on spinning and waiting for a resource
locked by other tasks,
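so each of $B^{\lambda}_i$ (key path), $B^{\bar{\lambda}}_i$ (delay) and $B^{\tilde{\lambda}}_i$ (parallel) can be further divided into:
$B^{\lambda}_i = B^{\lambda,I}_i + B^{\lambda,O}_i$; $B^{\bar{\lambda}}_i = B^{\bar{\lambda},I}_i + B^{\bar{\lambda},O}_i$; $B^{\tilde{\lambda}}_i = B^{\tilde{\lambda},I}_i + B^{\tilde{\lambda},O}_i$
where the superscript I denotes intra-task blocking time
and O denotes inter-task blocking time. Finally, Bi can be
divided into the following six disjoint parts:
$B_i = B^{\lambda,I}_i + B^{\lambda,O}_i + B^{\bar{\lambda},I}_i + B^{\bar{\lambda},O}_i + B^{\tilde{\lambda},I}_i + B^{\tilde{\lambda},O}_i$. (2)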
Fig. 4. An example of blocking time.
We use the example in Fig. 4 to demonstrate different
types of blocking time. Suppose the upper part of the
figure is a running sequence of a job of the task τi in
Fig. 1. Suppose its key path is λ = {v1, v4, v6, v7}. The
lower part in Fig. 4 is a running sequence of a job of
another task τj with Vj = {u1, u2, u3, u4} and Ej =
{(u1, u2), (u1, u3), (u2, u4), (u3, u4)}. All vertices of τj have
the same WCET of 1. Vertices v2, v3 and v4 in τi and
u2, u3 in τj access the same shared resource. The blocks
wrapped by the red dashed lines indicate that a vertex in
the key path is executing or spinning. In this example, Bi is
divided into the six disjoint parts as follows:
• $B^{\lambda,I}_i = 2$ (intra-task key path blocking), which includes [3, 4) and [5, 6) on P3,
• $B^{\lambda,O}_i = 1$ (inter-task key path blocking), which includes [4, 5) on P3,
• $B^{\bar{\lambda},I}_i = 1$ (intra-task delay blocking), which includes [2, 3) on P2,
• $B^{\bar{\lambda},O}_i = 2$ (inter-task delay blocking), which includes [1, 2) on both P1 and P2,
• $B^{\tilde{\lambda},I}_i = 1$ (intra-task parallel blocking), which includes [3, 4) on P2,
• $B^{\tilde{\lambda},O}_i = 1$ (inter-task parallel blocking), which includes [4, 5) on P2.
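Lemma 3. The response time of Ji is upper bounded by:
$R(J_i) \le \frac{C_i + (m_i - 1) \cdot L_i + I_i}{m_i}$ (3)
where $I_i = (m_i - 1) \cdot B^{\lambda,I}_i + B^{\bar{\lambda},I}_i + m_i \cdot B^{\lambda,O}_i + B^{\bar{\lambda},O}_i$, with superscript λ denoting key path blocking and superscript $\bar{\lambda}$ denoting delay blocking.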
Proof. We start by deriving an upper bound for Γi. We
use len∗ to denote the sum of the lengths of the subintervals in
[r(Ji), f(Ji)) during which a processor is busy with a vertex
in λ (i.e., a vertex in λ is either executing or spinning).
By Lemma 2, we know a processor can be idle only in
these subintervals on mi − 1 processors, so Γi is bounded
by len∗ · (mi − 1). Moreover, the area len∗ · (mi − 1)
may not completely be idle time. Some vertex may be
executing/spinning in parallel with the execution/spinning
of vertices in the key path λ, which can be excluded from
len∗ · (mi − 1) to get a tighter upper bound for Γi. In
particular, we can subtract the following blocking time from
len∗ · (mi − 1) to still safely bound Γi:
• The parallel blocking time $B^{\tilde{\lambda}}_i = B^{\tilde{\lambda},I}_i + B^{\tilde{\lambda},O}_i$. This
type of blocking time occurs in parallel with the
execution/spinning of vertices in λ, which can be
excluded from the area len∗ · (mi − 1).
• The intra-task key path blocking time $B^{\lambda,I}_i$. When
some vertex in λ is experiencing intra-task blocking,
there must be a vertex in the same task τi holding the
corresponding lock, so the same amount of time as
$B^{\lambda,I}_i$ should be excluded from the area len∗ · (mi − 1).
By the above discussion, we can get
$\Gamma_i \le len^{*} \cdot (m_i - 1) - B^{\lambda,I}_i - B^{\tilde{\lambda},I}_i - B^{\tilde{\lambda},O}_i$ (4)
(An example illustrating the upper bound for Γi is provided
after the proof.)
On the other hand, we know len∗ is the sum of len(λ)
and the total amount of time when some vertex in λ is
spinning (for a resource held by either the same task or a
different task), i.e.,
$len^{*} = len(\lambda) + B^{\lambda,I}_i + B^{\lambda,O}_i$ (5)
By (4) and (5) we have:
$\Gamma_i \le (m_i - 2) \cdot B^{\lambda,I}_i + (m_i - 1) \cdot (B^{\lambda,O}_i + len(\lambda)) - (B^{\tilde{\lambda},I}_i + B^{\tilde{\lambda},O}_i)$
$\Rightarrow B_i + \Gamma_i \le (m_i - 1) \cdot len(\lambda) + I_i$ (by (2))
and, since len(λ) ≤ Li, by Lemma 1 the lemma is proved.
Now we give the intuition of the upper bound for Γi in
the above proof. For example, as shown in Fig. 4, a job of τi
is released at time 0 and finished at time 10. During intervals
[0, 1) and [3, 10), a vertex in the key path λ = {v1, v4, v6, v7}
is either executing or spinning. A processor can be idle only
in these two time intervals. Since len(λ) = 5, $B^{\lambda,I}_i = 2$ and
$B^{\lambda,O}_i = 1$, we have len∗ = 5 + 2 + 1 = 8, which equals the sum
of the lengths of intervals [0, 1) and [3, 10). Therefore, the
gross upper bound for Γi that counts the total area in all
the time intervals on all the processors in parallel with the
execution/spinning of vertices in λ is len∗ × (mi − 1) = 16.
In the following we show that part of this total area can be
excluded to bound Γi. [3, 4) and [5, 6) on P3 are the intra-
task key path blocking time. P1 is holding the lock in [3, 4)
and P2 is holding the lock in [5, 6), so we can subtract 2
units when counting the idle time. P2 is spinning during
[3, 5) ([3, 4) is intra-task parallel blocking time and [4, 5) is
inter-task parallel blocking time), so we can subtract another
2 units when counting the idle time. In summary, the idle
time Γi is bounded by Γi ≤ 16− 2− 2 = 12.
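The arithmetic of Lemma 3 on the Fig. 4 example can be reproduced with a few lines of Python. This is a sketch for illustration only: the function names are ours, mi = 3 is inferred from the three processors P1 to P3 in the example, and the inputs are the six blocking components listed for Fig. 4 together with Ci = 10 and Li = 5 of the task in Fig. 1.

```python
def blocking_term(m_i, key_intra, key_inter, delay_intra, delay_inter):
    """I_i of Lemma 3, built from key path and delay blocking only."""
    return (m_i - 1) * key_intra + delay_intra + m_i * key_inter + delay_inter


def lemma3_response_bound(c_i, l_i, m_i, i_term):
    """Right-hand side of (3)."""
    return (c_i + (m_i - 1) * l_i + i_term) / m_i


def idle_bound(len_key, m_i, key_intra, key_inter, par_intra, par_inter):
    """Right-hand side of (4), with len* expanded according to (5)."""
    len_star = len_key + key_intra + key_inter
    return len_star * (m_i - 1) - key_intra - par_intra - par_inter


# Blocking components of the Fig. 4 example.
I_i = blocking_term(m_i=3, key_intra=2, key_inter=1, delay_intra=1, delay_inter=2)
R_bound = lemma3_response_bound(c_i=10, l_i=5, m_i=3, i_term=I_i)
G_bound = idle_bound(len_key=5, m_i=3, key_intra=2, key_inter=1,
                     par_intra=1, par_inter=1)
```

Here the parallel blocking time enters only the idle-time bound, not the term Ii, which matches the structure of the proof above.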
From Lemma 3, the parallel blocking time does not contribute
to the total work that may delay the finishing time
of a parallel task, and the analysis now reduces to
bounding Ii, which consists of key path blocking time and
delay blocking time.
5 BLOCKING ANALYSIS
By adopting results we presented in Section 4, in the fol-
lowing we develop blocking analysis techniques for three
request serving orders. We define the contribution to Ii by
each individual resource ℓq, caused by intra- and inter-task
blocking, respectively:
$I^{I}_{i,q} = (m_i - 1) B^{\lambda,I}_{i,q} + B^{\bar{\lambda},I}_{i,q}$ (6)
$I^{O}_{i,q} = m_i B^{\lambda,O}_{i,q} + B^{\bar{\lambda},O}_{i,q}$ (7)
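We use $B^{\lambda,O,j}_{i,q}$ and $B^{\bar{\lambda},O,j}_{i,q}$ to denote the inter-task key
path blocking time and delay blocking time on ℓq of τi
caused by requests from task τj respectively where τj ≠ τi,
and we have:
$B^{\lambda,O}_{i,q} = \sum_{j \ne i} B^{\lambda,O,j}_{i,q}$
$B^{\bar{\lambda},O}_{i,q} = \sum_{j \ne i} B^{\bar{\lambda},O,j}_{i,q}$
Then we divide the contribution to $I^{O}_{i,q}$ by each individual
task τj ≠ τi:
$I^{O}_{i,q} = \sum_{j \ne i} I^{O,j}_{i,q} = \sum_{j \ne i} \left( m_i B^{\lambda,O,j}_{i,q} + B^{\bar{\lambda},O,j}_{i,q} \right)$.
Then Ii can be written as
$I_i = \sum_{\ell_q \in \Theta_i} \left( I^{I}_{i,q} + I^{O}_{i,q} \right) = \sum_{\ell_q \in \Theta_i} \left( I^{I}_{i,q} + \sum_{j \ne i} I^{O,j}_{i,q} \right)$ (8)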
We use x to denote the number of accesses to resource
ℓq by vertices in the key path λ. We know x lies in the range
[0, Ni,q], but do not know its exact value. We define $I^{I}_{i,q}(x)$
and $I^{O}_{i,q}(x)$ as the parameterized versions of $I^{I}_{i,q}$ and $I^{O}_{i,q}$
with respect to x respectively, then
$I_i \le \sum_{\ell_q \in \Theta_i} \max_{x \in [0, N_{i,q}]} \left( I^{I}_{i,q}(x) + I^{O}_{i,q}(x) \right)$.
In the following, for different access policies we bound
$I^{I}_{i,q}(x)$ and $I^{O}_{i,q}(x)$ for a particular x, with which we then
bound Ii.
5.1 Unordered
We first develop analysis techniques that are applicable
without distinguishing the specific order in which requests
are served.
Lemma 4. $I^{I}_{i,q} \le (m_i - 1)(N_{i,q} - x) L_{i,q}$.
Proof. The total access time to resource ℓq by vertices of Ji
not in the key path λ is at most (Ni,q − x)Li,q, which can be
divided into two disjoint parts, i.e., (Ni,q − x)Li,q = X + Y,
where
• X is the total access time to resource ℓq that causes
key path blocking. We know
$B^{\lambda,I}_{i,q} = X$ (9)
• Y is the total access time to ℓq that does not cause
key path blocking. By definition, key path blocking
and delay blocking cannot happen at the same time.
Therefore, any lock holding time that causes intra-task
delay blocking must be included in Y. Each time
unit in Y can cause at most (mi − 1) intra-task delay
blocking time (one processor is holding the lock and
at most mi − 1 processors are spinning). In summary,
the intra-task delay blocking is bounded by
$B^{\bar{\lambda},I}_{i,q} \le (m_i - 1) Y$ (10)
By (9) and (10) we have
$I^{I}_{i,q} = (m_i - 1) B^{\lambda,I}_{i,q} + B^{\bar{\lambda},I}_{i,q} \le (m_i - 1)(X + Y) = (m_i - 1)(N_{i,q} - x) L_{i,q}$
The lemma is proved.
Lemma 4 directly implies:
Corollary 1. $I^{I}_{i,q} \le (m_i - 1) N_{i,q} L_{i,q}$
In the following we bound $I^{O}_{i,q}$. We use $\eta^{q}_{i,j}$ to denote the
maximal number of jobs of τj that may have contention on
resource ℓq with the analyzed job Ji of task τi, which can be
computed by [12], [14]:
$\eta^{q}_{i,j} = \begin{cases} \left\lceil \frac{D_i + D_j}{T_j} \right\rceil & \text{if both } \tau_i \text{ and } \tau_j \text{ access } \ell_q \\ 0 & \text{otherwise} \end{cases}$ (11)
$\eta^{q}_{i,j} = 0$ if either τi or τj does not access ℓq, since there is
no inter-task blocking between τi and τj due to ℓq.
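Equation (11) translates directly into code. In the sketch below the function and parameter names are ours, and the deadlines and period used in the example calls are made-up numbers chosen only to exercise both branches.

```python
import math


def eta(d_i, d_j, t_j, i_accesses, j_accesses):
    """Number of jobs of task j that may contend with one job of task i on a resource.

    Follows (11): ceil((D_i + D_j) / T_j) if both tasks access the resource,
    and 0 otherwise.
    """
    if not (i_accesses and j_accesses):
        return 0
    return math.ceil((d_i + d_j) / t_j)


# Hypothetical parameters: D_i = 10, D_j = 8, T_j = 5.
contending_jobs = eta(10, 8, 5, i_accesses=True, j_accesses=True)
no_sharing = eta(10, 8, 5, i_accesses=True, j_accesses=False)
```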
Lemma 5. $I^{O,j}_{i,q} \le m_i \eta^{q}_{i,j} N_{j,q} L_{j,q}$
Proof. The maximum number of jobs of τj that may contend
with Ji on ℓq is $\eta^{q}_{i,j}$. The total access time to ℓq by all other
jobs of τj during [r(Ji), f(Ji)) is at most $\eta^{q}_{i,j} N_{j,q} L_{j,q}$. We
divide it into two disjoint parts $\eta^{q}_{i,j} N_{j,q} L_{j,q} = X + Y$, where:
• X is the total access time to resource ℓq by τj that
causes key path blocking. We know
$B^{\lambda,O,j}_{i,q} = X$ (12)
• Y is the total access time to ℓq by τj that does not
cause key path blocking. By definition, key path
blocking and delay blocking cannot happen at the
same time. Therefore, any resource access time that
causes inter-task delay blocking must be included in
Y. Each time unit in Y can cause at most mi inter-task
delay blocking time (at most mi processors are
spinning). Therefore, the inter-task delay blocking is
bounded by
$B^{\bar{\lambda},O,j}_{i,q} \le m_i Y$ (13)
By (12) and (13) we have
$I^{O,j}_{i,q} = m_i B^{\lambda,O,j}_{i,q} + B^{\bar{\lambda},O,j}_{i,q} \le m_i (X + Y) = m_i \eta^{q}_{i,j} N_{j,q} L_{j,q}$
Now we are ready to bound τi’s worst-case response
time.
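Theorem 1. For unordered, Ri is bounded by:
$R_i \le \frac{C_i + (m_i - 1)\left(L_i + \sum_{\ell_q \in \Theta_i} N_{i,q} L_{i,q}\right)}{m_i} + \sum_{j \ne i} \sum_{\ell_q \in \Theta_i} \eta^{q}_{i,j} N_{j,q} L_{j,q}$
Proof. By condition (8), Corollary 1 and Lemma 5, we have
$I_i \le \sum_{\ell_q \in \Theta_i} \left( (m_i - 1) N_{i,q} L_{i,q} + m_i \sum_{j \ne i} \eta^{q}_{i,j} N_{j,q} L_{j,q} \right)$ (14)
and by Lemma 3 the theorem is proved.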
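A minimal sketch of how the bound of Theorem 1 could be evaluated is given below. The data layout (a dictionary of own accesses per resource and a list of contending accesses) and all numeric values are assumptions made for this illustration; they are not taken from the paper's experiments.

```python
def unordered_response_bound(c_i, l_i, m_i, own, contending):
    """Evaluate the Theorem 1 bound for one task.

    own:        {resource: (N_iq, L_iq)} for the analyzed task
    contending: list of (resource, eta, N_jq, L_jq), one entry per other task
                and resource; entries for resources not in `own` are ignored
    """
    intra = sum(n * l for n, l in own.values())
    inter = sum(eta * n * l for q, eta, n, l in contending if q in own)
    return (c_i + (m_i - 1) * (l_i + intra)) / m_i + inter


# Hypothetical task: C_i = 20, L_i = 5, m_i = 4, two accesses of length 1 to "q1".
own = {"q1": (2, 1)}
# One other task with eta = 3, one access of length 1 to "q1".
contending = [("q1", 3, 1, 1), ("q2", 5, 4, 2)]
R_i = unordered_response_bound(20, 5, 4, own, contending)
```

The entry for "q2" is skipped because the analyzed task does not access that resource, mirroring the restriction of both sums to ℓq ∈ Θi.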
Task τi is schedulable if Ri ≤ Di, so we can calculate the
value of mi for τi to be schedulable based on Theorem 1:
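Corollary 2. Task τi is schedulable on mi processors if
$D_i - \left( \sum_{j \ne i} \sum_{\ell_q \in \Theta_i} \eta^{q}_{i,j} N_{j,q} L_{j,q} + L_i + \sum_{\ell_q \in \Theta_i} N_{i,q} L_{i,q} \right) > 0$ (15)
and
$m_i = \left\lceil \frac{C_i - \left( L_i + \sum_{\ell_q \in \Theta_i} N_{i,q} L_{i,q} \right)}{D_i - \left( \sum_{j \ne i} \sum_{\ell_q \in \Theta_i} \eta^{q}_{i,j} N_{j,q} L_{j,q} + L_i + \sum_{\ell_q \in \Theta_i} N_{i,q} L_{i,q} \right)} \right\rceil$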
If each task can get enough processors according to
Corollary 2, the whole system is schedulable. Otherwise,
the system is deemed unschedulable. This procedure
is shown in Algorithm 1.
Algorithm 1 Processor partitioning algorithm for unordered.
for each task τi ∈ T do
    if (15) is satisfied then
        calculate mi according to Corollary 2;
        if fewer than mi processors are available then
            return unschedulable
        end if
        assign mi processors to τi
    else
        return unschedulable
    end if
end for
return schedulable
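The partitioning procedure for the unordered case can be sketched in Python as follows. This is an illustrative reading of Corollary 2 and Algorithm 1, not the authors' implementation: each task is summarized by hypothetical precomputed sums `intra` (the sum of Ni,q Li,q over its resources) and `inter` (the double sum of η Nj,q Lj,q over other tasks and resources), and the task parameters are invented.

```python
import math


def processors_needed(c_i, l_i, d_i, intra, inter):
    """m_i from Corollary 2, or None if condition (15) does not hold."""
    slack = d_i - (inter + l_i + intra)
    if slack <= 0:
        return None
    return math.ceil((c_i - (l_i + intra)) / slack)


def partition(tasks, m):
    """Assign processors task by task; return the list of m_i or None if unschedulable."""
    remaining = m
    assignment = []
    for t in tasks:
        m_i = processors_needed(t["c"], t["l"], t["d"], t["intra"], t["inter"])
        if m_i is None or m_i > remaining:
            return None  # condition (15) fails or too few processors are left
        assignment.append(m_i)
        remaining -= m_i
    return assignment


# Two hypothetical tasks with density larger than 1.
TASKS = [
    {"c": 20, "l": 5, "d": 15, "intra": 2, "inter": 3},
    {"c": 30, "l": 4, "d": 12, "intra": 1, "inter": 2},
]
enough = partition(TASKS, m=8)
too_few = partition(TASKS, m=7)
```

In the unordered case the inter-task term does not depend on the number of processors given to other tasks, which is why each mi can be computed independently in a single pass.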
5.2 FIFO-order
In the following we develop analysis techniques for FIFO-
order. We first derive an upper bound for IIi,q(x) with a
particular x:
Lemma 6. $I^{I}_{i,q}(x) \le F^{I}(x)$ in FIFO-order, where
$F^{I}(x) = \left( (N_{i,q} - x)(m_i - 1) - \max\{1 - x, 0\} \Delta \right) L_{i,q}$
and $\Delta = \min\{N_{i,q}, m_i\} \left( m_i - \frac{\min\{N_{i,q}, m_i\} + 1}{2} \right)$.
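Proof. We prove the lemma in two cases.
1) x ≠ 0. By Lemma 4 we know for any x it holds:
$I^{I}_{i,q}(x) \le (N_{i,q} - x)(m_i - 1) L_{i,q}$ (16)
2) x = 0. In this case, the key path blocking $B^{\lambda,I}_{i,q} = 0$ and the delay blocking $B^{\bar{\lambda},I}_{i,q}$ is bounded
by the maximum blocking time that may be introduced
by Ni,q requests on mi processors, which
equals $\left( \frac{\alpha(\alpha - 1)}{2} + (m_i - 1)(N_{i,q} - \alpha) \right) L_{i,q}$ [9], where
α = min{Ni,q, mi}.
Then we have
$B^{\bar{\lambda},I}_{i,q} \le \left( \frac{\alpha(\alpha - 1)}{2} + (m_i - 1)(N_{i,q} - \alpha) \right) L_{i,q}$
$= \left( (m_i - 1) N_{i,q} - \alpha \left( m_i - \frac{\alpha + 1}{2} \right) \right) L_{i,q}$
$= \left( (m_i - 1) N_{i,q} - \Delta \right) L_{i,q}$
Therefore, when x = 0 (thus $B^{\lambda,I}_{i,q} = 0$) we have
$I^{I}_{i,q}(x) = 0 + B^{\bar{\lambda},I}_{i,q} \le \left( (m_i - 1) N_{i,q} - \Delta \right) L_{i,q}$
In summary, in both cases the lemma is proved.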
Lemma 7. $I^{O}_{i,q}(x) \le F^{O}(x)$ in FIFO-order, where
$F^{O}(x) = \sum_{j \ne i} \min\{ m_i \eta^{q}_{i,j} N_{j,q},\ (N_{i,q} + (m_i - 1)x)\, m_j \} L_{j,q}$
Proof. From Lemma 5, we have:
$I^{O,j}_{i,q}(x) \le m_i \eta^{q}_{i,j} N_{j,q} L_{j,q}$ (17)
With FIFO spin locks, at most mj requests from τj can be
spinning at the same time (in the queue waiting for ℓq), so each
request of Ji for ℓq is blocked by at most mj requests from
another task τj. Hence the key path blocking $B^{\lambda,O,j}_{i,q}$ for the x accesses to ℓq by vertices in λ
is bounded by
$B^{\lambda,O,j}_{i,q} \le x m_j L_{j,q}$
The remaining Ni,q − x accesses to ℓq are from vertices not
in λ, for which the delay blocking $B^{\bar{\lambda},O,j}_{i,q}$ is bounded by
$B^{\bar{\lambda},O,j}_{i,q} \le (N_{i,q} - x) m_j L_{j,q}$
Applying them to $I^{O,j}_{i,q}(x) = m_i B^{\lambda,O,j}_{i,q} + B^{\bar{\lambda},O,j}_{i,q}$ gives
$I^{O,j}_{i,q}(x) \le (x m_i m_j + (N_{i,q} - x) m_j) L_{j,q}$
$= (N_{i,q} + (m_i - 1)x)\, m_j L_{j,q}$
By taking the minimum of this bound and the bound in
(17), the lemma is proved.
By now we have bounded both $I^I_{i,q}(x)$ and $I^O_{i,q}(x)$ for resource $\ell_q$ with a particular $x$. Since $x$ is unknown, we need to find the value of $x$ in $[0, N_{i,q}]$ that leads to the maximal $I^I_{i,q}(x) + I^O_{i,q}(x)$. By doing this for each $\ell_q \in \Theta_i$, we obtain an upper bound for $I_i$ as follows:
Lemma 8. In FIFO-order, we have:
$$I_i \le \sum_{\ell_q \in \Theta_i} \max_{x \in [0, N_{i,q}]} \{F^I(x) + F^O(x)\}$$
Then by applying this to Lemma 3, we can bound the worst-case response time of $\tau_i$:
Theorem 2. In FIFO-order, $R_i$ is bounded by:
$$R_i \le \frac{C_i + (m_i - 1)L_i + \sum_{\ell_q \in \Theta_i} \max_{x \in [0, N_{i,q}]} \{F^I(x) + F^O(x)\}}{m_i}$$
where $F^I(x)$ and $F^O(x)$ are defined in Lemmas 6 and 7.
Proof. Proved by Lemma 3 and Lemma 8.
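As an illustration, the bound of Theorem 2 can be evaluated by enumerating $x$ for each resource. The sketch below is not the authors' implementation: the data layout (a dictionary per resource, a tuple per interfering task) is hypothetical, $\eta^q_{i,j}$ is assumed to be supplied by the caller, and $x$ is assumed to range over the integers $0, \dots, N_{i,q}$.

```python
from fractions import Fraction


def F_I(x, N_i, m_i, L_iq):
    """Intra-task bound F^I(x) of Lemma 6."""
    a = min(N_i, m_i)
    delta = a * (m_i - Fraction(a + 1, 2))
    return ((N_i - x) * (m_i - 1) - max(1 - x, 0) * delta) * L_iq


def F_O(x, N_i, m_i, others):
    """Inter-task bound F^O(x) of Lemma 7.

    others: one tuple (eta, N_j, m_j, L_jq) per other task tau_j.
    """
    scale = N_i + (m_i - 1) * x
    return sum(min(m_i * eta * N_j, scale * m_j) * L_jq
               for eta, N_j, m_j, L_jq in others)


def wcrt_fifo(C, L, m_i, resources):
    """Response-time bound of Theorem 2.

    resources: list of dicts with keys "N" (N_{i,q}), "Lq" (L_{i,q}) and "others".
    """
    blocking = 0
    for r in resources:
        # x is assumed to be an integer number of accesses in [0, N_{i,q}].
        blocking += max(F_I(x, r["N"], m_i, r["Lq"]) + F_O(x, r["N"], m_i, r["others"])
                        for x in range(r["N"] + 1))
    return Fraction(C + (m_i - 1) * L + blocking) / m_i


example = [{"N": 2, "Lq": 3, "others": [(1, 4, 2, 5)]}]
bound = wcrt_fifo(C=100, L=20, m_i=2, resources=example)
```

In this toy instance the maximum is reached at x = 2 (blocking of 40), so the bound is (100 + 20 + 40) / 2 = 80.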
If the number of processors mi assigned to each task is
given, we can use Theorem 2 to compute Ri and compare it
with Di to decide the schedulability of τi.
However, if the number of processors $m_i$ assigned to each task is not given and we are required to partition the total $m$ processors among the tasks, we are not able to directly compute $m_i$ for each task $\tau_i$. This is because the worst-case response time bound of a task in Theorem 2 (more specifically, $F^O(x)$) depends on the number of processors assigned to other tasks. Therefore, there is a cyclic dependency among the numbers of processors assigned to different
tasks: to decide mi for τi, we need to know mj for τj , while
to decide mj for τj , we need to know mi for τi.
In the following we present an algorithm to iteratively compute $m_i$ for each task $\tau_i$ in the presence of the cyclic dependency mentioned above. Initially, we set $m_i = \lceil (C_i - L_i)/(D_i - L_i) \rceil$ for each $\tau_i$, which is the number of processors needed to make $\tau_i$ schedulable without considering the shared resources [6]. This is a lower bound on the desired $m_i$. Starting with these initial values, we gradually increase $m_i$ for each $\tau_i$ until we find a set of $m_i$ values that makes all tasks schedulable according to Theorem 2. The pseudo-code of this procedure is presented in Algorithm 2.
Algorithm 2 Processor partitioning algorithm for FIFO-order.
for each $\tau_i$: $m_i \leftarrow \lceil (C_i - L_i)/(D_i - L_i) \rceil$
while (1) do
    update $\leftarrow$ 0
    for each task $\tau_i$ do
        for each resource $\ell_q \in \Theta_i$ do
            find $x \in [0, N_{i,q}]$ s.t. $F^I(x) + F^O(x)$ is maximal
        end for
        compute the WCRT bound $R'_i$ using Theorem 2
        if $R'_i > D_i$ then
            $m_i \leftarrow m_i + 1$; update $\leftarrow$ 1
        end if
    end for
    if $\sum_{\tau_i \in \tau} m_i > m$ then
        return unschedulable
    end if
    if update = 0 then
        return schedulable
    end if
end while
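The iteration of Algorithm 2 can be sketched in Python as below. The sketch is hedged: the response-time bound of Theorem 2 is abstracted as a caller-supplied function `wcrt(i, ms)` (a hypothetical interface), tasks are given as `(C, L, D)` triples with D > L assumed, and the initial value is clamped to at least one processor.

```python
import math


def partition_fifo(tasks, m, wcrt):
    """Iteratively grow the allocation until all tasks pass or processors run out.

    tasks: list of (C, L, D) triples, with D > L assumed.
    wcrt(i, ms): bound R'_i of task i under the current allocation ms.
    Returns the allocation list, or None if deemed unschedulable.
    """
    ms = [max(1, math.ceil((C - L) / (D - L))) for C, L, D in tasks]
    while True:
        update = False
        for i, (_, _, D) in enumerate(tasks):
            if wcrt(i, ms) > D:
                ms[i] += 1       # later tasks already see the increased value
                update = True
        if sum(ms) > m:
            return None
        if not update:
            return ms


# Toy bound used only for illustration: Graham's bound plus a blocking term
# that grows with the number of processors of the other tasks.
toy_tasks = [(100, 10, 40), (60, 10, 35)]


def toy_wcrt(i, ms):
    C, L, _ = toy_tasks[i]
    return L + (C - L) / ms[i] + sum(ms) - ms[i]
```

With the toy bound, `partition_fifo(toy_tasks, 16, toy_wcrt)` starts from [3, 2], raises both values once, and stops at [4, 3]; with only 6 processors it reports failure.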
It is necessary to mention that Algorithm 2 is a heuristic for computing an $m_i$ for each task that makes the task set schedulable. It is not optimal, in the sense that it does not necessarily find the minimum number of processors required for the whole task set to be schedulable. For example, suppose that after an iteration two tasks are not schedulable. Then, according to Algorithm 2, the number of processors assigned to each of these two tasks is increased by 1. However, it is possible that after the number of processors assigned to one of these two tasks is increased by 1, the other task becomes schedulable. Finding an optimal processor allocation algorithm is out of the scope of this paper and will be investigated in our future work.
5.3 Priority-Order
In the following we develop analysis techniques for priority-order. We use $\tau^H_i$ and $\tau^L_i$ to denote the sets of tasks with higher and lower priorities than $\tau_i$, respectively.
We first bound $I^I_{i,q}(x)$. Since different requests to a resource from the same task have the same priority, the upper bound of intra-task blocking time in priority-order is the same as in FIFO-order. Then we have:
Lemma 9. $I^I_{i,q}(x) \le P^I(x)$ in priority-order, where
$$P^I(x) = \big((N_{i,q} - x)(m_i - 1) - \max\{1 - x, 0\}\,\Delta\big) L_{i,q}$$
and
$$\Delta = \min\{N_{i,q}, m_i\}\left(m_i - \frac{\min\{N_{i,q}, m_i\} + 1}{2}\right).$$
Proof. The proof is the same as that of Lemma 6.
In the following we bound $I^O_{i,q}(x)$ in priority-order. We use $\Delta^q_{i,j}$ to denote the maximal number of jobs of $\tau_j$ that may have contention on resource $\ell_q$ with a single request from job $J_i$ of task $\tau_i$, which can be computed by [12], [14]:
$$\Delta^q_{i,j} = \begin{cases} \left\lceil \dfrac{dpr(\tau_i, \ell_q) + D_j}{T_j} \right\rceil & \text{if both } \tau_i \text{ and } \tau_j \text{ access } \ell_q \\ 0 & \text{otherwise} \end{cases} \qquad (18)$$
where $dpr(\tau_i, \ell_q)$ is the delay-per-request [9] on $\ell_q$ of $\tau_i$. It denotes the length of the time interval between the time at which a request for $\ell_q$ from $\tau_i$ is issued and the time at which it is served, and can be calculated by a fixed-point iteration (the calculation of $dpr(\tau_i, \ell_q)$ is the same as in [9], thus omitted here).
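Equation (18) translates directly into a small helper. The Python sketch below is illustrative only: the function name and the Boolean flag are our own, and the delay-per-request value is assumed to have been computed separately as in [9].

```python
import math


def max_contending_jobs(dpr, D_j, T_j, both_access):
    """Delta^q_{i,j} of (18): jobs of tau_j that may contend with one request of J_i.

    dpr: delay-per-request of tau_i on the resource (assumed precomputed).
    both_access: True if both tau_i and tau_j access the resource.
    """
    if not both_access:
        return 0
    return math.ceil((dpr + D_j) / T_j)
```

For example, with dpr = 30 and D_j = T_j = 100 the helper returns ceil(130 / 100) = 2.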
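Lemma 10. $I^O_{i,q}(x) \le P^O_L(x) + P^O_H(x)$ in priority-order, where
$$P^O_L(x) = (N_{i,q} + (m_i - 1)x) \max_{\tau_j \in \tau^L_i} \{L_{j,q}\},$$
and
$$P^O_H(x) = \sum_{\tau_j \in \tau^H_i} \min\{m_i \eta^q_{i,j} N_{j,q},\ (N_{i,q} + (m_i - 1)x)\, \Delta^q_{i,j} N_{j,q}\}\, L_{j,q}.$$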
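Proof. We divide $I^O_{i,q}(x)$ by each individual task according to its priority:
$$I^O_{i,q}(x) = \sum_{\tau_j \in \tau^L_i} I^{O,j}_{i,q}(x) + \sum_{\tau_j \in \tau^H_i} I^{O,j}_{i,q}(x)$$
With priority-ordered spin locks, each resource access request of $J_i$ for $\ell_q$ is blocked by at most one request from all tasks with lower priorities than $\tau_i$, so $\sum_{\tau_j \in \tau^L_i} B^{\lambda,O,j}_{i,q}$ for the $x$ accesses to $\ell_q$ of vertices in $\lambda$ is bounded by
$$\sum_{\tau_j \in \tau^L_i} B^{\lambda,O,j}_{i,q} \le x \max_{\tau_j \in \tau^L_i} \{L_{j,q}\}$$
The remaining $N_{i,q} - x$ accesses to $\ell_q$ are from vertices not in $\lambda$, for which $\sum_{\tau_j \in \tau^L_i} B^{\bar{\lambda},O,j}_{i,q}$ is bounded by
$$\sum_{\tau_j \in \tau^L_i} B^{\bar{\lambda},O,j}_{i,q} \le (N_{i,q} - x) \max_{\tau_j \in \tau^L_i} \{L_{j,q}\}$$
Applying them to $I^{O,j}_{i,q}(x) = m_i B^{\lambda,O,j}_{i,q} + B^{\bar{\lambda},O,j}_{i,q}$ gives
$$\sum_{\tau_j \in \tau^L_i} I^{O,j}_{i,q}(x) \le P^O_L(x). \qquad (19)$$
In the following, we focus on bounding $\sum_{\tau_j \in \tau^H_i} I^{O,j}_{i,q}(x)$. From Lemma 5, we have:
$$I^{O,j}_{i,q}(x) \le m_i \eta^q_{i,j} N_{j,q} L_{j,q} \qquad (20)$$
From (18), each resource access request of $J_i$ for $\ell_q$ is blocked by at most $\Delta^q_{i,j} N_{j,q}$ requests from $\tau_j$ in priority-order, so $\forall \tau_j \in \tau^H_i$, $B^{\lambda,O,j}_{i,q}$ for the $x$ accesses to $\ell_q$ of vertices in $\lambda$ is bounded by
$$B^{\lambda,O,j}_{i,q} \le x\, \Delta^q_{i,j} N_{j,q} L_{j,q}$$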
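The remaining $N_{i,q} - x$ accesses to $\ell_q$ are from vertices not in $\lambda$, for which $B^{\bar{\lambda},O,j}_{i,q}$ is bounded by
$$B^{\bar{\lambda},O,j}_{i,q} \le (N_{i,q} - x)\, \Delta^q_{i,j} N_{j,q} L_{j,q}$$
Applying them to $I^{O,j}_{i,q}(x) = m_i B^{\lambda,O,j}_{i,q} + B^{\bar{\lambda},O,j}_{i,q}$ gives
$$I^{O,j}_{i,q}(x) \le (x\, m_i \Delta^q_{i,j} N_{j,q} + (N_{i,q} - x)\, \Delta^q_{i,j} N_{j,q}) L_{j,q} = (N_{i,q} + (m_i - 1)x)\, \Delta^q_{i,j} N_{j,q} L_{j,q}$$
Taking the minimum of this bound and the bound in (20) gives us:
$$\sum_{\tau_j \in \tau^H_i} I^{O,j}_{i,q}(x) \le P^O_H(x).$$
Combining with (19), the lemma is proved.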
Then we can bound the worst-case response time of $\tau_i$ in priority-order:
Theorem 3. In priority-order, $R_i$ is bounded by:
$$R_i \le \frac{C_i + (m_i - 1)L_i + \sum_{\ell_q \in \Theta_i} \max_{x \in [0, N_{i,q}]} \{P^I(x) + P^O(x)\}}{m_i}$$
where $P^I(x)$ and $P^O(x) = P^O_L(x) + P^O_H(x)$ are defined in Lemmas 9 and 10.
Proof. The proof follows the same idea as the proof of Theorem 2 and is thus omitted here.
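The bound of Theorem 3 can be evaluated in the same way as in the FIFO case. The following Python sketch is an illustration under assumptions: the per-resource dictionary layout is hypothetical, $\eta^q_{i,j}$ and $\Delta^q_{i,j}$ are assumed to be supplied by the caller, $x$ is assumed to range over the integers $0, \dots, N_{i,q}$, and an empty lower-priority set is taken to contribute zero.

```python
from fractions import Fraction


def P_I(x, N_i, m_i, L_iq):
    """Intra-task bound P^I(x) of Lemma 9 (same expression as in FIFO-order)."""
    a = min(N_i, m_i)
    delta = a * (m_i - Fraction(a + 1, 2))
    return ((N_i - x) * (m_i - 1) - max(1 - x, 0) * delta) * L_iq


def P_O(x, N_i, m_i, lower_L, higher):
    """P^O(x) = P^O_L(x) + P^O_H(x) of Lemma 10.

    lower_L: list of L_{j,q} of the lower-priority tasks.
    higher: one tuple (eta, N_j, Delta_ij, L_jq) per higher-priority task.
    """
    scale = N_i + (m_i - 1) * x
    low = scale * max(lower_L, default=0)
    high = sum(min(m_i * eta * N_j, scale * d * N_j) * L_jq
               for eta, N_j, d, L_jq in higher)
    return low + high


def wcrt_priority(C, L, m_i, resources):
    """Response-time bound of Theorem 3 for one task."""
    blocking = sum(
        max(P_I(x, r["N"], m_i, r["Lq"]) + P_O(x, r["N"], m_i, r["lower_L"], r["higher"])
            for x in range(r["N"] + 1))
        for r in resources)
    return Fraction(C + (m_i - 1) * L + blocking) / m_i


example = [{"N": 2, "Lq": 3, "lower_L": [4, 6], "higher": [(1, 3, 1, 5)]}]
bound = wcrt_priority(C=100, L=20, m_i=2, resources=example)
```

Here the maximum over x is 54 (at x = 2), giving a bound of (100 + 20 + 54) / 2 = 87.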
Similarly to the FIFO-order case, we present an algorithm to iteratively compute the minimum $m_i$ for each task $\tau_i$ to be schedulable. We start by setting $m_i = \lceil (C_i - L_i)/(D_i - L_i) \rceil$ for each $\tau_i$ and then gradually increase $m_i$ until we find the minimum value of $m_i$ that makes $\tau_i$ schedulable according to Theorem 3. The pseudo-code is shown in Algorithm 3.
Algorithm 3 Processor partitioning algorithm for priority-order.
for each $\tau_i$: $m_i \leftarrow \lceil (C_i - L_i)/(D_i - L_i) \rceil$
for each task $\tau_i$ do
    while (1) do
        for each resource $\ell_q \in \Theta_i$ do
            find $x \in [0, N_{i,q}]$ s.t. $P^I(x) + P^O(x)$ is maximal
        end for
        compute the WCRT bound $R'_i$ using Theorem 3
        if $R'_i > D_i$ then
            $m_i \leftarrow m_i + 1$
        else
            break
        end if
    end while
end for
if $\sum_{\tau_i \in \tau} m_i > m$ then
    return unschedulable
else
    return schedulable
end if
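A Python sketch of Algorithm 3 is given below for illustration. The bound of Theorem 3 is abstracted as a caller-supplied function `wcrt(i, m_i)` (a hypothetical interface), tasks are `(C, L, D)` triples with D > L assumed, and the early exit once $m_i$ exceeds $m$ is a guard added in this sketch so that the inner loop always terminates.

```python
import math


def partition_priority(tasks, m, wcrt):
    """Per-task search for the smallest m_i that passes the response-time test.

    tasks: list of (C, L, D) triples, with D > L assumed.
    wcrt(i, m_i): bound R'_i of task i when it is given m_i processors.
    Returns the allocation list, or None if deemed unschedulable.
    """
    ms = []
    for i, (C, L, D) in enumerate(tasks):
        m_i = max(1, math.ceil((C - L) / (D - L)))
        while wcrt(i, m_i) > D:
            m_i += 1
            if m_i > m:  # guard added here; the pseudo-code loops unconditionally
                return None
        ms.append(m_i)
    return ms if sum(ms) <= m else None


# Toy bound for illustration: Graham's bound plus a fixed blocking term per task.
toy_tasks = [(100, 10, 40), (60, 10, 35)]
toy_blocking = [5, 0]


def toy_wcrt(i, m_i):
    C, L, _ = toy_tasks[i]
    return L + (C - L) / m_i + toy_blocking[i]
```

With the toy bound the first task needs 4 processors and the second 2, so the set passes with 6 processors and fails with 5.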
6 EVALUATIONS
In this section, we evaluate the performance of our approaches in terms of acceptance ratio, i.e., the ratio between the number of schedulable task sets and the total number of task sets, in comparison with the state-of-the-art:
• XU-U: Algorithm 1 for unordered spin locks.
• XU-F: Algorithm 2 for FIFO-ordered spin locks.
• XU-P: Algorithm 3 for priority-ordered spin locks.
• SON-F: test for FIFO spin locks in [9].
• SON-P: test for priority-ordered spin locks in [9].
In particular, we adopt an optimal priority assignment when evaluating XU-P and SON-P for priority-order: we try all permutations of priorities for each task set until either the task set is found to be schedulable or all permutations have been checked (see footnote 1). There are several other methods for priority assignment, such as assigning the locking priorities based on the tasks' relative deadlines or using simulated annealing to find an approximately optimal priority assignment [9]. However, we adopt the optimal priority assignment method to make a fair comparison with [9], where it is also shown to have the best performance.
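The exhaustive priority search described above can be sketched with the standard library as follows. The schedulability test is abstracted as a caller-supplied predicate `is_schedulable(order)`, which is a hypothetical interface taking a tuple of task identifiers listed from highest to lowest priority.

```python
from itertools import permutations


def find_priority_order(task_ids, is_schedulable):
    """Try every priority permutation and return the first one that passes.

    Returns None when no permutation makes the task set schedulable.
    The number of permutations grows factorially with the number of tasks.
    """
    for order in permutations(task_ids):
        if is_schedulable(order):
            return order
    return None
```

The search stops at the first passing permutation, which matches the stopping rule stated in the text.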
We compare the above approaches using both synthetic workload and workload generated according to realistic OpenMP programs. Note that we neither simulate scheduling nor actually execute any programs; instead, we test the schedulability of task sets (either synthetic workload or realistic OpenMP programs) by applying the approaches listed above to their parameters.
6.1 Synthetic Workload
We first compare the above approaches with randomly generated task systems. The DAG tasks are generated as follows:
• Task Graph Gi = 〈Vi, Ei〉: The task graph of each task is generated using the Erdős-Rényi method G(|Vi|, p) [15]. For each task, the number of vertices |Vi| is randomly chosen in [100, 400]. The WCET of each vertex is randomly picked in [250, 600]. These ranges for the number and WCETs of vertices are consistent with the measurement results in [16]. For each possible edge we generate a random value in [0, 1] and add the edge to the graph only if the generated value is less than a predefined threshold p = 0.1. As in [17], a minimum number of additional edges are added to make the task graph weakly connected.
• Deadline and Period: The deadline Di of each task τi is generated in a similar way to [9]: after Li is fixed, Di is generated according to a ratio between Li and Di randomly chosen in {0.125, 0.25}. The period Ti is set to be equal to Di.
• Resource: The number of resource types is in the range [1, 12]. The number of accesses to each resource by all tasks, $\sum_{\tau_i \in \tau} N_{i,q}$, is in the range [16, 1008], and is randomly distributed to different tasks. The maximal locking time $\max_{\forall \tau_i}\{L_{i,q}\}$ of each resource is in the range [5, 60] and each $L_{i,q}$ is randomly picked in $[1, \max_{\forall \tau_i}\{L_{i,q}\}]$.
Footnote 1. Note that enumerating all possible priority permutations may result in a combinatorial explosion when the number of tasks is large (we have at most 10 tasks in a task set in our experiments, i.e., in Figure 5.(e)). However, proposing methods of priority assignment is out of the scope of this paper. We choose this method only for comparison with the results from [9], where the optimal priority assignment is shown to have the best performance.
[Figure 5: acceptance ratio (Y-axis) of XU-U, XU-F, SON-F, SON-P and XU-P. Panels: (a) under different $U_{norm}$ (X-axis: normalized utilization); (b) under different $\sum_{\tau_i \in \tau} N_{i,q}$ (X-axis: number of accesses per resource); (c) under different $|\Theta|$ (X-axis: number of resource types); (d) under different $\max_{\forall \tau_i}\{L_{i,q}\}$ (X-axis: maximum locking time of each resource); (e) under different $n$ (X-axis: task number); (f) under realistic OpenMP programs (X-axis: normalized utilization).]
Fig. 5. Comparisons with the state-of-the-art.
Since we only focus on heavy tasks, a generated task with $U_i < 1$ is discarded and regenerated until a heavy task is obtained. For each task set, we generate $n$ tasks where $n$ is in [1, 14]. The normalized utilization $U_{norm}$ (the ratio between the total utilization and the number of processors) of each task set is predefined, which will be explained in detail for the configuration of each figure. After we generate all tasks in a task set, we compute the total utilization $U_\Sigma$ and set the number of processors according to the formula $m = \lceil U_\Sigma / U_{norm} \rceil$. The number of processors could become quite large (far more than 10 processors) when $U_{norm}$ is relatively low (e.g., lower than 0.2) or the number of tasks in a task set is large (e.g., more than 6 tasks). For each configuration (corresponding to one point on the X-axis), we generate 1000 task sets.
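The generation procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' generator: edges are only drawn from a lower-index to a higher-index vertex so that the graph is acyclic, WCETs are assumed to be integers, components are joined by linking consecutive vertices (one way of adding a minimum number of edges), and discarding tasks with $U_i < 1$ is left to the caller. All function names are hypothetical.

```python
import math
import random


def generate_dag(rng, n_range=(100, 400), wcet_range=(250, 600), p=0.1):
    """Erdős-Rényi style DAG: each pair i < j gets an edge i -> j with probability p."""
    n = rng.randint(*n_range)
    wcet = [rng.randint(*wcet_range) for _ in range(n)]
    succ = [[] for _ in range(n)]
    parent = list(range(n))  # union-find over weakly connected components

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                succ[i].append(j)
                parent[find(i)] = find(j)
    # Join the remaining components with one extra edge each.
    for v in range(1, n):
        if find(v - 1) != find(v):
            succ[v - 1].append(v)
            parent[find(v - 1)] = find(v)
    return wcet, succ


def workload_and_longest_path(wcet, succ):
    """Return (C_i, L_i); vertex indices are already a topological order."""
    longest = [0] * len(wcet)
    for v in reversed(range(len(wcet))):
        longest[v] = wcet[v] + max((longest[s] for s in succ[v]), default=0)
    return sum(wcet), max(longest)


def make_task(rng, **dag_options):
    """Generate one task; callers discard it if U < 1 (only heavy tasks are kept)."""
    wcet, succ = generate_dag(rng, **dag_options)
    C, L = workload_and_longest_path(wcet, succ)
    D = L / rng.choice([0.125, 0.25])  # ratio L_i / D_i
    return {"wcet": wcet, "succ": succ, "C": C, "L": L, "D": D, "T": D, "U": C / D}


def num_processors(total_utilization, u_norm):
    """m = ceil(U_sum / U_norm)."""
    return math.ceil(total_utilization / u_norm)
```

For example, a task set with total utilization 6 and $U_{norm} = 0.5$ is evaluated on `num_processors(6, 0.5)`, i.e., 12 processors.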
In Figure 5.(a)-(e), we set a basic configuration and in each group of experiments vary one parameter while keeping others unchanged. The basic configuration is as follows: $n = 4$, $U_{norm} = 0.5$, the number of resource types is 4, $\sum_{\tau_i \in \tau} N_{i,q} = 256$ and $\max_{\forall \tau_i}\{L_{i,q}\} = 15$.
Figure 5.(a) shows acceptance ratios of all tests under different normalized utilizations (X-axis). Figure 5.(b) evaluates the acceptance ratios under different $\sum_{\tau_i \in \tau} N_{i,q}$. We can observe that the acceptance ratios of all tests decrease as $\sum_{\tau_i \in \tau} N_{i,q}$ increases. Figure 5.(c) shows the acceptance ratios under different numbers of resource types. The acceptance ratios of all tests decrease as the number of resource types increases. In Figure 5.(d), resources are generated with different $\max_{\forall \tau_i}\{L_{i,q}\}$. The schedulability of all tests decreases as $\max_{\forall \tau_i}\{L_{i,q}\}$ increases. In Figure 5.(e), we generate different numbers of tasks in each configuration. The schedulability of XU-U, XU-F and SON-F decreases as the number of tasks increases whereas the schedulability of tests for priority order, i.e., XU-P and SON-P, is hardly affected by the number of tasks.
From the above results we see that tests for priority-
order perform better than those for FIFO-order and un-
ordered, and our approaches consistently outperform the
state-of-the-art under different parameter settings: XU-P
outperforms SON-P and XU-F outperforms SON-F. In par-
ticular, even if XU-U adopts less queue order information, it
still consistently outperforms SON-F due to our new anal-
ysis techniques which systematically analyze the blocking
time that may delay the finishing time of a parallel task and
jointly consider the impact of blocking time to both the total
workload and the longest path length.
In the following, we conduct experiments to evaluate the performance of both [9] and our results in comparison with locking protocols for sequential tasks. That is, we extend a locking protocol for sequential tasks to parallel tasks in a straightforward way, so that the points we make in Section 3 become clearer. Some modern analysis techniques for sequential tasks use Linear Programming (LP) to achieve higher precision, e.g., [14], [18]. They are not included for the following reasons. First, their blocking times are defined under schedulability tests for sequential tasks, which cannot be directly applied to parallel tasks (significant and non-trivial modifications would be required). Second, the LP-based techniques require significant computing resources and time because of their high complexity (weeks on clustered computers, as provided by the authors of [9]), whereas our tests and [9] are polynomial. OMLP is a well-known locking protocol for clustered scheduling of sequential tasks [13], and is also the work most relevant to this paper (DAG tasks scheduled under federated scheduling can be regarded as sequential tasks scheduled on clusters). In Fig. 6, we apply OMLP to DAG tasks in a straightforward manner, where each vertex in a DAG task is regarded as an independent sequential task. We first randomly distribute the generated requests of each task to its vertices. The priorities of the vertices in a DAG task are set according to their indexes, where a vertex with a smaller index has a higher priority. We then compute the S-oblivious PI-blocking for each vertex according to the blocking analysis techniques presented in [13] and add the PI-blocking to the WCET of the vertex, after which the longest path length and the total WCET of all vertices of $\tau_i$ are denoted by $L'_i$ and $C'_i$, respectively. Then we use the general schedulability test of federated scheduling for each DAG task [6], i.e., the response time of task $\tau_i$ is bounded by
$$R_i \le L'_i + \frac{C'_i - L'_i}{m_i}.$$
The schedulability of the task set is decided in a similar way to Algorithm 3 (the only difference is in the computation of $R_i$).
[Figure 6: acceptance ratio (Y-axis) of XU-U, XU-F, SON-F, SON-P, XU-P and OMLP. Panels: (a) under different $U_{norm}$ (X-axis: normalized utilization); (b) under different $n$ (X-axis: task number); (c) under different $\max_{\forall \tau_i}\{L_{i,q}\}$ (X-axis: maximum locking time of each resource).]
Fig. 6. Comparisons with OMLP.
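The inflation-based test described above can be sketched as follows. The sketch is illustrative only: the per-vertex PI-blocking values are assumed to be computed separately with the analysis of [13], vertices are assumed to be indexed in topological order, and the function name is hypothetical.

```python
def inflated_federated_bound(wcet, succ, pi_blocking, m_i):
    """R_i <= L'_i + (C'_i - L'_i) / m_i after adding blocking to each vertex WCET.

    wcet[v]: WCET of vertex v; succ[v]: successors of v (all with larger index);
    pi_blocking[v]: blocking charged to vertex v (assumed precomputed).
    """
    inflated = [c + b for c, b in zip(wcet, pi_blocking)]
    longest = [0] * len(inflated)
    for v in reversed(range(len(inflated))):
        longest[v] = inflated[v] + max((longest[s] for s in succ[v]), default=0)
    C_prime = sum(inflated)   # inflated total workload C'_i
    L_prime = max(longest)    # inflated longest path length L'_i
    return L_prime + (C_prime - L_prime) / m_i
```

For a three-vertex fork with WCETs [2, 3, 4], blocking [1, 0, 2] and two processors, the inflated values are C' = 12 and L' = 9, so the bound is 9 + 3 / 2 = 10.5.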
In Figure 6.(a)-(c), we set a basic configuration and in each group of experiments vary one parameter while keeping others unchanged. The basic configuration is as follows: $n = 4$, $U_{norm} = 0.6$, the number of resource types is 9, $\sum_{\tau_i \in \tau} N_{i,q} = 60$ and $\max_{\forall \tau_i}\{L_{i,q}\} = 15$. In comparison with the basic configuration of Figure 5, we have significantly reduced the total number of resource accesses in order to evaluate the performance of our results in a system with a small number of accesses (a case that is closer to practical scenarios). Note that both our work and [9] are based on the classic Graham's bound [19]. Thus, if there is no resource access contention, the schedulability of our results and that of [9] are the same, and the performance gap between our results and [9] becomes more significant when there is more resource access contention. From Figure 6, we can observe that our results still outperform [9]. Moreover, even though more concrete information is used (i.e., the exact distribution of requests), directly applying locking protocols and the associated blocking analysis techniques for sequential tasks to DAG tasks is quite pessimistic (as discussed in Section 3).
6.2 Realistic OpenMP Programs
In the following, we evaluate the above approaches with workload generated according to realistic OpenMP programs. OpenMP has supported task parallelism since version 3.0 [20], and such programs can be modeled as DAGs [16]. We collect 8 OpenMP programs (see Table 2) written in C from different benchmark suites and transform them into the DAG model. We measure the Ci and Li of each program, and the Ni,q and Li,q of each task for each shared resource, on a hardware platform with an Intel i7-7820HQ CPU @ 2.90GHz, a cache size of 8MB and a total memory of 4GB. The runtime and compiling environment is Ubuntu 12.04.5 LTS with gcc 4.9.4. We consider 10 different types of resources, where the first 4 are shared data objects in the operating system kernel accessed via system calls (i.e., time) or library calls (i.e., fprintf, printf, malloc). The remaining 6 are shared data structures or non-reusable routines protected by #pragma omp critical in the OpenMP program.
The measurement results are summarized in Table 2, where the time unit is µs. Note that the measurement results are not guaranteed to be safe upper bounds of the desired parameters. In order to obtain safe upper bounds, a comprehensive static analysis covering all the hardware and software behaviors is required. In this paper, we simply use these results to approximately represent the workload characteristics of these OpenMP programs. It may be noticed that the number of types of shared resources that each program accesses is not large. The resource access behaviors of the programs in bots-1.1.2 are similar because they are all commutative algorithms and may use similar library calls such as "malloc". These features do not affect our evaluations, whose main purpose is to show the impact of shared resources on the schedulability of realistic parallel programs and to compare the schedulability under different methods.
For each task set, we pick n programs (each being a DAG
task) in Table 2, where n is randomly chosen in [2, 5]. The
deadline Di of each task and the number of processors in
the system are set in the same way as Section 6.1.
Fig. 5.(f) shows acceptance ratios of all tests under differ-
ent normalized utilizations (X-axis). We can observe that the
acceptance ratios of all tests decrease in comparison with
Fig. 5.(a). This is because some applications have relatively
short deadlines and periods by our task generation method,
and thus have low tolerance to blocking time caused by
other tasks. Nevertheless, the results have the same trend
as in Fig. 5.(a): XU-F consistently outperforms SON-F while
XU-P outperforms SON-P.
7 RELATED WORK
There is plentiful literature on scheduling algorithms and analysis techniques for parallel real-time tasks [6], [7], [8], [23], [24], all of which assume tasks to be independent from each other and do not consider the locking issue.
Real-time locking protocols are well supported in uniprocessor systems. The Priority Inheritance Protocol (PIP) [3] is the first solution to address the priority inversion problem. There are several optimal protocols for uniprocessor real-time task systems, such as the Stack Resource Policy (SRP) [2] and the Priority Ceiling Protocol (PCP) [3], which guarantee bounded blocking time for a single resource access request and ensure deadlock freedom.
TABLE 2
Measurement results of OpenMP programs. Each resource column ℓ0 to ℓ9 lists N/L.
Benchmark          Application            Ci         Li        ℓ0     ℓ1    ℓ2    ℓ3    ℓ4    ℓ5     ℓ6     ℓ7       ℓ8      ℓ9
bots-1.1.2 [21]    alignment.for          313168     11446     22/2   1/2   2/2   0/0   0/0   0/0    0/0    0/0      0/0     0/0
bots-1.1.2 [21]    alignment.single       315981     9980      22/2   1/2   2/2   0/0   0/0   0/0    0/0    0/0      0/0     0/0
bots-1.1.2 [21]    fft                    274        58        21/2   1/4   2/2   0/0   0/0   0/0    0/0    0/0      0/0     0/0
bots-1.1.2 [21]    fib                    353        20        20/2   0/0   2/2   0/0   0/0   0/0    0/0    0/0      0/0     0/0
bots-1.1.2 [21]    sort                   1757       217       20/2   2/4   2/2   0/0   0/0   0/0    0/0    0/0      0/0     0/0
bots-1.1.2 [21]    floorplan              5843       92        36/2   6/1   2/2   0/0   4/1   0/0    0/0    0/0      0/0     0/0
OpenMPMicro [22]   MatrixMultiplication   5873246    106983    0/0    3/7   0/0   5/4   0/0   0/0    0/0    0/0      0/0     0/0
OpenMPMicro [22]   Square                 50000812   1000066   0/0    0/0   0/0   0/0   0/0   20/5   50/1   50/105   50/79   50/1
On multiprocessors, there are two major lock types: spin
locks and suspension-based semaphores. Much work has
been done for partitioned multiprocessor scheduling, such
as MPCP [5] and DPCP [25] and the Multiprocessor Stack
Resource Policy (MSRP) [26]. The Flexible Multiprocessor
Locking Protocol (FMLP) [10] is a family of locking protocols
which support both global and partitioned scheduling. The
Parallel Priority Ceiling Protocol (P-PCP) [27] is an exten-
sion of the PIP that attempts to avoid certain unfavorable
blocking situations. The family of O(m) Locking Protocols
(OMLP) [4], [13] is a suite of suspension-based locking
protocols that have proved to be asymptotically optimal
under suspension-oblivious analysis. Lakshmanan et al. [28]
proposed the Multiprocessor Priority Ceiling Protocol with
virtual spinning and Faggioli et al. [29] proposed a locking
protocol for reservation-based schedulers that includes pre-
emptable spinning.
A recent work considering locks for parallel real-time
task model is [9], which adopts the federated scheduling
framework with spin locks. As mentioned before, [9] ana-
lyzes the impact of the blocking time to the total workload
and the longest path length separately, which leads to sig-
nificant pessimism in analysis precision. The contribution of
this paper is to address this pessimism in [9].
The locking protocols have been studied with other
graph-based task models, such as the DRT model [30] and
multi-frame task model [31]. However, these models are still
sequential (multiple edges going out from a vertex have
conditional branching semantics rather than forking).
8 CONCLUSIONS
We study the analysis of parallel real-time tasks with spin locks in three different orders under federated scheduling. A recent work [9] developed analysis techniques for this problem, which are pessimistic since all blocking time is assumed to delay the finishing time of a parallel task and the impact of blocking time on the total workload and on the longest path length of each task is analyzed separately. In this paper, we develop new schedulability and blocking analysis techniques to improve the analysis precision. In our future work, we will investigate blocking analysis on other (finer-grained) models.
REFERENCES
[1] M. Jones, “What really happened on mars rover pathfinder,” The
Risks Digest, vol. 19, no. 49, pp. 1–2, 1997.
[2] T. P. Baker, “Stack-based scheduling of realtime processes,” Real-
Time Systems, vol. 3, no. 1, pp. 67–99, 1991.
[3] L. Sha, R. Rajkumar, and J. P. Lehoczky, “Priority inheritance
protocols: An approach to real-time synchronization,” IEEE Trans-
actions on computers, vol. 39, no. 9, pp. 1175–1185, 1990.
[4] B. B. Brandenburg and J. H. Anderson, “Optimality results for
multiprocessor real-time locking,” in RTSS. IEEE, 2010, pp. 49–
60.
[5] R. Rajkumar, “Real-time synchronization protocols for shared
memory multiprocessors,” in ICDCS. IEEE, 1990, pp. 116–123.
[6] J. Li, J. J. Chen, et al., “Analysis of federated and global scheduling for parallel real-time tasks,” in ECRTS, 2014.
[7] C. Maia, M. Bertogna, et al., “Response-time analysis of synchronous parallel tasks in multiprocessor systems,” in RTNS, 2014.
[8] X. Jiang, X. Long, et al., “On the decomposition-based global edf scheduling of parallel real-time tasks,” in RTSS, 2016.
[9] S. Dinh, J. Li, K. Agrawal, C. Gill, and C. Lu, “Blocking analysis for
spin locks in real-time parallel tasks,” IEEE Transactions on Parallel
and Distributed Systems, vol. 29, no. 4, pp. 789–802, 2018.
[10] A. Block, H. Leontyev, B. B. Brandenburg, and J. H. Anderson, “A
flexible real-time locking protocol for multiprocessors,” in RTCSA.
IEEE, 2007, pp. 47–56.
[11] A. Wieder and B. B. Brandenburg, “On spin locks in autosar:
Blocking analysis of fifo, unordered, and priority-ordered spin
locks,” in RTSS. IEEE, 2013, pp. 45–56.
[12] B. Brandenburg and J. H. Anderson, “Scheduling and locking in
multiprocessor real-time operating systems,” Ph.D. dissertation,
Citeseer, 2011.
[13] B. B. Brandenburg and J. H. Anderson, “The omlp family of opti-
mal multiprocessor real-time locking protocols,” Design automation
for embedded systems, vol. 17, no. 2, pp. 277–342, 2013.
[14] M. Yang, A. Wieder, and B. B. Brandenburg, “Global real-time
semaphore protocols: A survey, unified analysis, and compari-
son,” in RTSS. IEEE, 2015, pp. 1–12.
[15] D. Cordeiro, G. Mounié, et al., “Random graph generation for scheduling simulations,” in ICST, 2010.
[16] Y. Wang, N. Guan, J. Sun, M. Lv, Q. He, T. He, and W. Yi,
“Benchmarking openmp programs for real-time scheduling,” in
RTCSA. IEEE, 2017, pp. 1–10.
[17] A. Saifullah, D. Ferry, et al., “Parallel real-time scheduling of dags,” Parallel and Distributed Systems, IEEE Transactions on, 2014.
[18] A. Wieder and B. B. Brandenburg, “On spin locks in autosar:
Blocking analysis of fifo, unordered, and priority-ordered spin
locks,” RTSS, 2013.
[19] R. L. Graham, “Bounds on multiprocessing timing anomalies,”
SIAM journal on Applied Mathematics, 1969.
[20] O. Board, “Openmp application program interface version 3.0,” in
The OpenMP Forum, Tech. Rep, 2008.
[21] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade,
“Barcelona openmp tasks suite: A set of benchmarks targeting the
exploitation of task parallelism in openmp,” in ICPP. IEEE, 2009,
pp. 124–131.
[22] V. V. Dimakopoulos, P. E. Hadjidoukas, and G. C. Philos, “A
microbenchmark study of openmp overheads under nested par-
allelism,” in International Workshop on OpenMP. Springer, 2008,
pp. 1–12.
[23] J. Fonseca, G. Nelissen, and V. Nélis, “Improved response time
analysis of sporadic dag tasks for global fp scheduling,” in Pro-
ceedings of the 25th international conference on real-time networks and
systems. ACM, 2017, pp. 28–37.
[24] X. Jiang, N. Guan, X. Long, and W. Yi, “Semi-federated scheduling
of parallel real-time tasks on multiprocessors,” in RTSS. IEEE,
2017, pp. 80–91.
[25] R. Rajkumar, L. Sha, and J. P. Lehoczky, “Real-time synchroniza-
tion protocols for multiprocessors,” in RTSS. IEEE, 1988, pp.
259–269.
[26] P. Gai, G. Lipari, and M. Di Natale, “Minimizing memory utiliza-
tion of real-time task sets in single and multi-processor systems-
on-a-chip,” in RTSS. IEEE, 2001, pp. 73–83.
[27] A. Easwaran and B. Andersson, “Resource sharing in global fixed-
priority preemptive multiprocessor scheduling,” in RTSS. IEEE,
2009, pp. 377–386.
[28] K. Lakshmanan, D. de Niz, and R. Rajkumar, “Coordinated task
scheduling, allocation and synchronization on multiprocessors,”
in RTSS. IEEE, 2009, pp. 469–478.
[29] D. Faggioli, G. Lipari, and T. Cucinotta, “The multiprocessor
bandwidth inheritance protocol,” in ECRTS. IEEE, 2010, pp. 90–
99.
[30] N. Guan, P. Ekberg, M. Stigge, and W. Yi, “Resource sharing
protocols for real-time task graph systems,” in ECRTS. IEEE,
2011, pp. 272–281.
[31] P. Ekberg, N. Guan, M. Stigge, and W. Yi, “An optimal resource
sharing protocol for generalized multiframe tasks,” Journal of
Logical and Algebraic Methods in Programming, vol. 84, no. 1, pp.
92–105, 2015.
Xu Jiang has received his BS degree in com-
puter science from Northwestern Polytechnical
University, China in 2009, received the MS de-
gree in computer architecture from Graduate
School of the Second Research Institute of
China Aerospace Science and Industry Corpora-
tion, China in 2012, and PhD from Beihang Uni-
versity, China in 2018. Currently, he is working
in Northeastern University, China. His research
interests include real-time systems, parallel and
distributed systems and embedded systems.
Nan Guan is currently an assistant professor at
the Department of Computing, The Hong Kong
Polytechnic University. Dr Guan received his BE
and MS from Northeastern University, China in
2003 and 2006 respectively, and a PhD from Up-
psala University, Sweden in 2013. Before joining
PolyU in 2015, he worked as a faculty member
in Northeastern University, China. His research
interests include real-time embedded systems
and cyber-physical systems. He received the
EDAA Outstanding Dissertation Award in 2014,
the Best Paper Award of IEEE Real-time Systems Symposium (RTSS)
in 2009, the Best Paper Award of Conference on Design Automation and
Test in Europe (DATE) in 2013.
He Du is currently a Ph.D. candidate at School
of Computer Science and Engineering, North-
eastern University. She received the Bachelor
degree from Northeastern University, Shenyang,
China, in 2015. Her research interests focus on parallel program analysis and multiprocessor real-time scheduling.
Weichen Liu received the B.Eng. and M.Eng.
degrees from the Harbin Institute of Technology,
Harbin, China, and the Ph.D. degree from the
Hong Kong University of Science and Technol-
ogy, Hong Kong. He is an Assistant Professor
with the School of Computer Science and Engi-
neering, Nanyang Technological University, Sin-
gapore. He has authored and co-authored over
70 publications in peer-reviewed journals, con-
ferences, and books. His current research inter-
ests include embedded and real-time systems,
multiprocessor systems, and network-on-chip. Dr. Liu was a recipient of
the Best Paper Candidate Awards from ASP-DAC 2016, CASES 2015,
and CODES+ISSS 2009, the Best Poster Awards from RTCSA 2017
and AMD-TFE 2010, and the most popular Poster Award from ASP-DAC
2017.
Wang Yi received the PhD in computer sci-
ence from Chalmers University of Technology,
Sweden, in 1991. He is a chair professor with
Uppsala University. His interests include mod-
els, algorithms and software tools for building
and analyzing computer systems in a systematic
manner to ensure predictable behaviors. He was
awarded with the CAV 2013 Award for contri-
butions to model checking of real-time systems,
in particular the development of UPPAAL, the
foremost tool suite for automated analysis and
verification of real-time systems. For contributions to real-time systems,
he received Best Paper Awards of RTSS 2015, ECRTS 2015, DATE
2013 and RTSS 2009, Outstanding Paper Award of ECRTS 2012 and
Best Tool Paper Award of ETAPS 2002. He is on the steering committee
of ESWEEK, the annual joint event for major conferences in embedded
systems areas. He is also on the steering committees of ACM EM-
SOFT (co-chair), ACM LCTES, and FORMATS. He serves frequently
on Technical Program Committees for a large number of conferences,
and was the TPC chair of TACAS 2001, FORMATS 2005, EMSOFT
2006, HSCC 2011, LCTES 2012 and track/topic Chair for RTSS 2008
and DATE 2012-2014. He is a member of Academy of Europe (Section
of Informatics) and a fellow of the IEEE.
