Simultaneous Multithreading Applied to Real Time by Osborne, Sims Hill et al.
Simultaneous Multithreading Applied to Real Time
Sims Hill Osborne












Existing models used in real-time scheduling are inadequate to take advantage of simultaneous
multithreading (SMT), which has been shown to improve performance in many areas of computing,
but has seen little application to real-time systems. The SMART task model, which allows for
combining SMT and real time by accounting for the variable task execution costs caused by
SMT, is introduced, along with methods and conditions for scheduling SMT tasks under global
earliest-deadline-first scheduling. The benefits of using SMT are demonstrated through a large-scale
schedulability study in which we show that task systems with utilizations 30% larger than what
would be schedulable without SMT can be correctly scheduled.
2012 ACM Subject Classification Computer systems organization → Real-time systems; Computer
systems organization → Real-time system specification; Software and its engineering → Scheduling;
Software and its engineering → Multithreading
Keywords and phrases real-time systems, simultaneous multithreading, soft real-time, scheduling
algorithms
Digital Object Identifier 10.4230/LIPIcs.ECRTS.2019.3
Related Version Longer version with all graphs at http://jamesanderson.web.unc.edu/papers/.
Supplement Material ECRTS 2019 Artifact Evaluation approved artifact available at
https://dx.doi.org/10.4230/DARTS.5.1.8
Funding Work supported by NSF grants CNS 1409175, CNS 1563845, CNS 1717589, and CPS
1837337, ARO grant W911NF-17-1-0294, and funding from General Motors.
1 Introduction
Simultaneous multithreading (SMT) is a technology developed in the 1980s and 90s that
allows multiple processes to issue instructions to different processor contexts, or threads, on a
single physical computing core, creating the illusion of multiple cores for every one core that
is actually present. It was designed to increase system utilization, particularly in the presence
of memory latency [6, 26]. SMT became widely available in 2002, when it was made available
on Intel processors [18]. Early experiments on the Pentium 4 showed that SMT could increase
throughput by a factor of more than 1.5 in the best case [1, 2, 25]. The first attempt to































© Sims Hill Osborne, Joshua J. Bakita, and James H. Anderson;
licensed under Creative Commons License CC-BY
31st Euromicro Conference on Real-Time Systems (ECRTS 2019).
Editor: Sophie Quinton; Article No. 3; pp. 3:1–3:22
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
enabling SMT and making every thread available for real-time work, it is possible to schedule
workloads with total utilizations up to 50 percent greater than what would be possible on
the same platform without SMT. While Jain et al. gave ample experimental evidence that
SMT can enable systems with higher utilization to be supported, neither they nor anyone
else, to our knowledge, has provided a schedulability test that takes SMT into account.
Unfortunately, SMT’s increase in throughput comes at the cost of longer and less
predictable execution times, caused by contention for limited hardware resources. Apparently,
the real-time systems community decided that this uncertainty makes SMT inappropriate for
real-time work. We question the validity of this assessment for soft real-time (SRT) systems
that may tolerate some tardiness. Evidence suggests that others are beginning to question this
assessment even in the context of safety-critical domains. In particular, the U.S. Federal Aviation
Administration has received requests to certify safety-critical applications that use SMT,
though they currently lack adequate techniques for doing so [21]. (We defer considerations
of safety-critical applications to future work.)
As evidence of the potential benefits of SMT, we present a sample of our results in
Fig. 1; a platform with 16 cores is capable of scheduling task systems with total utilizations
exceeding 20. We discuss this graph and others in Section 5.
Considered problem. We consider the problem of defining a scheduler for SRT systems
that reaps the benefits of SMT without sacrificing execution-cost predictability. Existing
models for analyzing real-time workloads do not allow us to specify how enabling SMT affects
a task, so to quantify the per-task effects of SMT, we introduce a new task model, SMART
(Simultaneous Multithreading Applied to Real Time). Using the SMART model, we attack
our problem by dividing it into three sub-problems:
Sub-Problem 1: Determine execution costs for tasks with SMT enabled. “Costs” is
plural for each task; one worst-case execution cost is not enough to define a task.
Sub-Problem 2: Decide which tasks should use SMT. How using SMT will affect any
given task is a function of what other tasks are using SMT.
Sub-Problem 3: Schedule so tasks using SMT do not interfere with tasks not using SMT.


















[Figure 1 plot. Panel title: 16 cores, task utilization in (0.0, 0.4].]
Figure 1 Schedulability on 16 cores with SMT. Note that the horizontal axis begins at utilization
16 and that schedulability does not begin to drop until utilization 20. Effectively, more than 20
cores worth of capacity can be had on a 16-core platform. We discuss this graph and others like it
in detail later.
The second sub-problem is particularly interesting. In general, allowing a task to execute
with SMT will decrease the demand the task places on the hardware platform but increase
the time needed for the task to execute. To address our problem, we need to balance the
advantages of decreasing platform demand with the disadvantages of increasing task execution
time. It is not enough to evaluate a task in isolation; every task that uses SMT may influence
every other task that uses SMT.
Motivation. Processors are expensive. For any workload, real time or not, it is desirable
to minimize the hardware cost needed to obtain a given level of performance. SMT is a
means to get the most work out of a given processor. Presently, SMT is widely implemented,
meaning there is a high chance that users are paying for SMT even if they are not using it.
A better understanding of SMT would allow for better use of existing hardware resources.
Related works. Snavely and Tullsen demonstrated that SMT performance is dependent
on which tasks share a core and introduced the term “symbiosis” to describe this concept
[24]. We have already mentioned Jain et al.’s work on SMT and real-time scheduling from
2002 [15]. Since then, Cazorla et al. [3], Gomes et al. [12, 13], and Zimmer et al. [28]
have proposed ways to eliminate the timing uncertainties associated with SMT by means of
detailed control over program execution and, in the case of Zimmer et al., a purpose-built
processor, FlexPRET. Cazorla et al. [3] and Lo et al. [17] gave methods to limit real-time
work to a small number of threads, leaving the remaining threads to execute only when doing
so will not interfere with real-time work. Mische et al. [20] proposed to use SMT to hide
context-switch times by using threads to switch task state in and out in the background.
Early work on the performance of tasks executed by hardware threads was done by Bulpin
[1], Bulpin and Pratt [2], Huang et al. [14], and Tuck and Tullsen [25]. A detailed analysis
of Intel’s microarchitecture, including the resource constraints that are relevant with SMT,
has been performed by Fog [11]. A preliminary version of our paper was presented as a
work in progress at RTSS 2018 [22].
Contribution and organization. We introduce the SMART task model, a method for
scheduling SMART tasks, and a related schedulability test. While other works focus on
modifying hardware to make SMT more predictable, our work allows for SMT-supported
real-time work to run on existing hardware and operating systems. We give results of
benchmark tests measuring the performance impacts of SMT with regard to execution times.
We show, using a schedulability study based on our benchmark results, that it is possible to
correctly schedule task systems with utilizations more than 30% greater than what would be
schedulable on the same platform without SMT enabled.¹
The rest of this paper is organized as follows. In Section 2, we give a brief overview of
SMT technology, discuss the shortcomings of the sporadic task model with regard to SMT,
and introduce the SMART model. In Section 3, we address Sub-Problems 2 and 3, showing
how SMT can be used to schedule otherwise unschedulable task systems. In Section 4, we
address Sub-Problem 1, how to determine appropriate costs. (Note that we address our
sub-problems in reverse order.) In Section 5, we present our schedulability experiments and
results. In Section 6, we conclude and suggest future directions for our research.
1 While Jain et al. [15] were able to schedule systems with up to 50% greater utilization, they define a
“correctly scheduled system” as one having a low number of observed deadline misses, whereas we define
correctness as all tasks having analytically guaranteed bounded tardiness.
Figure 2 Top: task execution without SMT.
Bottom: task execution with SMT.
Figure 3 Two tasks executing without SMT
(top) and with SMT (bottom). With SMT, each
task requires more time to complete individually,
but time for both tasks to complete is reduced.
2 What is a SMART Task?
Here we give a brief overview of SMT technology alongside the sporadic task model and its
limitations. We introduce SMART as an alternative model to address SMT.
2.1 SMT Basics
Cores with SMT enabled accept multiple instructions per cycle from multiple tasks, reducing
the number of wasted instruction slots. A detailed explanation is available in Eggers et al. [6], but we
illustrate the essentials in Example 1.
Example 1. Fig. 2 shows the effect of enabling SMT. At the top of the figure, tasks τ1
and τ2 execute sequentially without SMT on a processor that can accept two instructions
per cycle. When fewer than two instructions are available for execution, as at times 2, 3, and
elsewhere, processor cycles are lost. τ1 finishes at time 6 and τ2 at time 12. At the bottom of
the figure, the same tasks execute in parallel with SMT enabled, reducing the number of lost
processor cycles. Both tasks finish at time 9. In this case, SMT has the effect of delaying the
completion of τ1, but speeding up the completion of τ2, since it does not have to wait for τ1
to complete before beginning its own execution. 
Fig. 3 gives a more task-centric view of the two tasks seen in Fig. 2. For the remainder
of this paper, we will conceptualize tasks as seen in Fig. 3; we are interested in how long a
task takes to execute and how much of a core it uses, not an exact cycle-by-cycle accounting.
As shown in Fig. 3, SMT can cause individual tasks to take longer to complete, but total
throughput is potentially increased, since the number of wasted instruction slots can be
decreased. The challenge for real-time scheduling is to take advantage of this increased
throughput without allowing increased execution costs to render the system unschedulable.
The effect of SMT on task execution times is not constant across tasks; how much a task’s
execution time is increased by SMT depends on both the task itself and on other tasks that
might be executing on the same core.
To discuss SMT more easily, we make a distinction between a core and a processor. A
core is the hardware unit responsible for executing instructions. A processor is a single
instruction context on a core. Every computer core, by definition, supports at least one
processor, but computer cores capable of SMT may support multiple processors. We define
a physical processor as a processor that occupies an entire core, while a threaded processor
corresponds to a single hardware thread. Different threaded processors on the same core are
sibling processors. Tasks scheduled on sibling processors are said to be co-scheduled.
Figure 4 Example of cores supporting threaded processors only, physical processors only, or both.
We focus on a platform π that has m cores where every core supports one physical
processor or two threaded processors at a time. For example, Fig. 4 shows a system of six
cores. Cores 1-3 have SMT enabled and support two threaded processors each. Cores 5
and 6 have SMT disabled and support one physical processor each. Core 4 initially has
SMT disabled and supports one physical processor, but at time 1, SMT is enabled on core 4,
causing the single physical processor to be replaced by two threaded processors. We only
consider two threads per core because this is what Intel currently supports.
2.2 Task Model
In the traditional implicit-deadline sporadic task model, a task τi = (Ti, Ci) is defined by its
period, Ti, and its worst-case execution cost, Ci. The utilization of τi is given by $u_i = C_i / T_i$.
Every task releases an unlimited number of jobs, with the kth job released by τi denoted by
τi,k. Jobs of τi are released at least Ti units of time apart and have an implicit deadline of
Ti. If the jobs of each task τi are released exactly Ti units apart, then the task system is
periodic. We consider only SRT systems here, in which some deadline misses are acceptable.
In our model, a job’s tardiness is the difference between its completion time and deadline, if
the job completes after its deadline, and zero otherwise. A task’s tardiness is the maximum
tardiness of any of its jobs. We define an SRT system as being correctly scheduled if all tasks
have guaranteed bounded tardiness. A task system is SRT-schedulable under scheduling
algorithm A if it can be correctly scheduled by the specific algorithm A, and SRT-feasible if
it is SRT-schedulable by some algorithm A. An algorithm is SRT-optimal if it can schedule
all SRT-feasible task systems.
Given a platform π consisting of m identical cores and no SMT, a task system τ is
SRT-feasible if and only if
\[ \forall \tau_i \in \tau:\ u_i \le 1 \quad \text{and} \quad \sum_{i=1}^{n} u_i \le m \tag{1} \]
both hold [4].
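As a minimal illustration of condition (1), the following Python sketch checks SRT-feasibility for a sporadic task system without SMT. The function name `srt_feasible` and the sample (period, cost) pairs are hypothetical and serve only as an example.

```python
from fractions import Fraction

def srt_feasible(utilizations, m):
    """Condition (1): every u_i <= 1 and the total utilization is at most m."""
    return all(u <= 1 for u in utilizations) and sum(utilizations) <= m

# Hypothetical task system given as (period, cost) pairs.
tasks = [(8, 7), (4, 1), (4, 2), (8, 4)]
utils = [Fraction(c, t) for t, c in tasks]

# Total utilization is 17/8, so two cores are not enough but three are.
print(sum(utils), srt_feasible(utils, 2), srt_feasible(utils, 3))
```

Exact rational arithmetic is used here so that boundary cases such as a total utilization of exactly m are not affected by floating-point rounding.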
The SMART model. The shortcoming of the sporadic model in regard to SMT is that
it only allows one worst-case execution cost per task, and therefore cannot adequately
characterize a task system’s behavior in the presence of SMT. For example, it is not possible
to specify the task behavior seen in Fig. 3 using the sporadic model. To address this
shortcoming, we introduce the SMART model. In this model, every task is modeled as
τi = (Ti, (Ci:j)). All parameters must be rational. As in the sporadic model, Ti is the period
of τi. The parameter (Ci:j) is a list of costs that indicate the worst-case execution cost of a
job of τi given that the entire job is co-scheduled with one or more jobs of τj . We define Ci:i
to be τi’s cost when it executes on a normal physical processor. For all $i \ne j$, $C_{i:j} \ge C_{i:i}$.²
We define $u_{i:j} = C_{i:j} / T_i$.
Notice that we are implicitly making four simplifying assumptions here: (i) τi’s worst-case
execution time can be determined by examining how it is interfered with when co-scheduled
with each other task individually; (ii) when τi is co-scheduled with τj , every portion of τi
receives the same amount of interference from every portion of τj ; (iii) the two threads of a
given core are identical; and (iv) the hardware-level priority of τi and τj , when co-scheduled,
is fixed. In practice, (i) and (ii) will not necessarily hold, but we maintain that our model
is sufficient for non-safety critical SRT workloads. Currently, (iii) and (iv) hold on Intel
architectures [11]. We discuss (i) and (ii) further in Section 4 when we delve into the issue of
how to actually determine execution costs.
Definition 2. The execution rate of τi given that it is co-scheduled with τj is given by
$r_{i:j} = C_i / C_{i:j}$, where both $C_i$ and $C_{i:j}$ are maximum observed execution times.
We assume no relationship between ri:j and rj:i; in fact, as we show in our benchmark
experiments, the two can differ significantly. Our definition assumes two hardware threads
per core, but could be expanded to allow for additional threads. In general, ri:j > 0.5
indicates that τi could benefit from being co-scheduled with τj assuming that Ci:j ≤ Ti and
Cj:i ≤ Tj hold.
Example 3. Suppose Fig. 3 depicts one job each of SMART tasks τ1 and τ2. C1 = 6 and
C2 = 6, but C1:2 = 9 and C2:1 = 9, giving $r_{1:2} = r_{2:1} = 2/3$. Task τ2 benefits from SMT; the
job completes at time 9 with SMT as opposed to time 12 without. If both jobs are released
at time 0 and have a deadline at time 10, then SMT allows for both jobs to complete on
time, whereas without SMT, τ2’s job misses its deadline. 
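The execution rate of Definition 2 and the rule of thumb that follows it can be sketched in a few lines of Python. The function names are hypothetical, and the periods of 10 are an assumption taken from the deadline used in Example 3 under implicit deadlines.

```python
from fractions import Fraction

def execution_rate(c_alone, c_co):
    """Definition 2: r_{i:j} = C_i / C_{i:j} (both maximum observed times)."""
    return Fraction(c_alone) / Fraction(c_co)

def could_benefit(c_i, c_ij, t_i, c_j, c_ji, t_j):
    """Rule of thumb from the text: r_{i:j} > 0.5, provided that both
    co-scheduled costs still fit within their periods."""
    return (execution_rate(c_i, c_ij) > Fraction(1, 2)
            and c_ij <= t_i and c_ji <= t_j)

# Costs from Example 3: C_1 = C_2 = 6 and C_{1:2} = C_{2:1} = 9.
r12 = execution_rate(6, 9)
r21 = execution_rate(6, 9)
print(r12, r21, could_benefit(6, 9, 10, 6, 9, 10))
```

Because the two rates need not be equal in general, the sketch computes each direction separately even though they coincide in this example.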
Scheduling SMART tasks. We need to schedule n tasks that have n costs each; this
problem poses obvious difficulties. In the next section, we show how we can schedule SMART
tasks similarly to traditional sporadic tasks without sacrificing the advantages of SMT.
3 Scheduling Physical and Threaded Tasks
Not all tasks will benefit from SMT. We label tasks that should and should not use SMT as
threaded tasks and physical tasks, respectively. Physical tasks can execute only on physical
processors and threaded tasks only on threaded processors. To keep the task types separate,
we divide them into two task subsystems, τp and τh, that we schedule separately.
Definition 4. Subsystem τp is the set of all physical tasks in τ. np = |τp|. Subsystem τh
is the set of all threaded³ tasks in τ. nh = |τh|.
2 In the rare event that Ci:j < Ci:i holds, the two are likely close in value, and we can simply redefine
Ci:j to equal Ci:i.
3 We use h rather than t for threaded so as to avoid confusion with t for time.







threaded tasks have cost $C^h_i$ and utilization $u^h_i = C^h_i / T_i$. Costs for physical tasks are no different
than costs in a sporadic task system, but costs for threaded tasks are a function of how
the task system is divided. These cost parameters are a simplification of the full SMART
parameters; we will show how to obtain them in Section 3.2.




Definition 5. The total utilizations of τp and τh are $U^p = \sum_{i=1}^{n_p} u^p_i$ and $U^h = \sum_{i=1}^{n_h} u^h_i$,
respectively. To measure the total demand placed on the platform, we define effective
utilization, $U^E = U^p + U^h/2$. $U^h$ is halved in the sum to reflect the fact that each threaded
task requires only half a core at a time to execute.
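The effective-utilization definition can be checked numerically with a short Python sketch. The function name is hypothetical; the sample utilizations are those of the partition reached in Example 17 (physical utilizations 7/8 and 1/4, threaded utilizations 3/4 and 6/8).

```python
from fractions import Fraction as F

def effective_utilization(physical_utils, threaded_utils):
    """U^E = U^p + U^h / 2, where U^p and U^h are the summed utilizations."""
    return sum(physical_utils, F(0)) + sum(threaded_utils, F(0)) / 2

u_e = effective_utilization([F(7, 8), F(1, 4)], [F(3, 4), F(6, 8)])
print(u_e)  # 15/8, i.e., 1.875
```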
3.1 Sub-Problem 3: Scheduling Task Subsystems
In this section, we give conditions for scheduling τp and τh on π. We assume the decision of
which tasks should be physical and which should be threaded has already been made. Our
current problem is how to schedule those tasks, but the best way to do so is not clear.
Example 6. Suppose we attempt to schedule a task system τ using global earliest-deadline-first
scheduling (GEDF). Let τ1 be a threaded task and τ2 a physical task such that at time
t, a job of τ1 with a deadline of t + 1 is contending for a single core with a job of τ2 with a
deadline of t + 2. Following GEDF, the job of τ1 should be given priority over that of τ2.
However, if no other threaded task has an active job at time t, then doing so will cause
the second threaded processor of a core in π to be unused, negating any advantage gained
by having τ1 be threaded. If we avoid this problem by giving priority to τ2, then we are
not wasting processor capacity, but we are violating EDF priority rules. If we co-schedule
the tasks on threaded processors despite τ2 being a physical task, then unanticipated task
interference may ensue, potentially invalidating assigned per-task worst-case execution costs.
None of these approaches is particularly satisfactory. 
To address the problems raised in Example 6, we divide π into sub-platforms πp and πh.
Definition 7. πp is the sub-platform of π that schedules only tasks in τp. It includes
$m^p = \lfloor U^p \rfloor$ fully available cores and one partially available core. Given a length-W interval,
denoted a window, the partially available core belongs to πp for $a^p W$ time units per window,
where $a^p = U^p - \lfloor U^p \rfloor$. πp can exist only if $U^p \le m$.
Definition 8. πh is the sub-platform of π that schedules only tasks in τh. It has
$m^h = m - \lceil U^p \rceil$ fully available cores and one core available for $a^h W$ time units per window, where
$a^h = \lceil U^p \rceil - U^p$. Consequently, $m^h + a^h = m - U^p$. If $a^p > 0$, then $a^h = 1 - a^p$.
We refer to the core shared by both platforms as the shared core. If there is no shared
core, then $a^p = a^h = 0$. Note that $m^p + a^p + m^h + a^h = m$ must hold.
Example 9. In Fig. 4, πp is shown in dark gray and πh in light gray. The sub-platforms
are defined by $m^p = 2$, $m^h = 3$, $W = 3$, $a^p = 1/3$, and $a^h = 2/3$.
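The sub-platform parameters of Definitions 7 and 8 follow directly from the physical utilization and the core count, as the following Python sketch shows. The function name is hypothetical, and the value U^p = 7/3 is an assumption chosen because it reproduces the parameters listed in Example 9 for m = 6.

```python
from fractions import Fraction
from math import ceil, floor

def sub_platforms(u_p, m):
    """Return (m_p, a_p, m_h, a_h) per Definitions 7 and 8."""
    u_p = Fraction(u_p)
    if u_p > m:
        raise ValueError("pi^p can exist only if U^p <= m")
    m_p = floor(u_p)            # fully available physical cores
    a_p = u_p - m_p             # share of the shared core given to pi^p
    m_h = m - ceil(u_p)         # fully available threaded cores
    a_h = ceil(u_p) - u_p       # share of the shared core given to pi^h
    return m_p, a_p, m_h, a_h

m_p, a_p, m_h, a_h = sub_platforms(Fraction(7, 3), 6)
print(m_p, a_p, m_h, a_h)       # 2 1/3 3 2/3
```

When U^p is integral, both shares come out as zero, which corresponds to the case with no shared core.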
We now give schedulability results for τp and τh individually and then combine those
conditions to get an overall schedulability result. For the most part, we will focus on the
case where a shared core exists. Our results are based on Devi and Anderson’s EDF-high-low
(EDF-hl) algorithm [5]. EDF-hl gives schedulability conditions and tardiness bounds for
“low” SRT tasks that are scheduled according to GEDF but are subject to interruption from
periodic “high” hard real-time tasks, with at most one such task fixed on each processor.
For our purposes, we can view τp as a set of low tasks scheduled on $m^p + \lceil a^p \rceil$ processors
and subject to preemption by a single high task with period W and cost $a^h W$. This reflects
the fact that, from the perspective of τp, work on the shared core is periodically preempted.
Likewise, we can view τh as a set of low tasks scheduled on $2(m^h + 1)$ processors that are
periodically preempted by two high tasks, both with period W and cost $a^p W$. The following
definitions apply to the EDF-hl results.
Definition 10. Devi and Anderson define τH as the set of all high tasks, τL as the set of
all low tasks, umax(τL) as the highest utilization of any task in τL, Usum as the total utilization
of both τH and τL, UH as the sum of the utilizations of all tasks in τH, and UL as the sum
of the $\min(\lceil U_{sum} \rceil - 2, n)$ largest utilizations of tasks in τL.
We state an abridged version of Theorem 1 in [5] here. The full version defines the
tardiness bound B as a function of the task system and platform. We omit that portion of
the theorem due to space constraints.
Theorem 11. EDF-hl ensures a tardiness bound of at most B to every task τi of τL if
$|\tau_H| \le m$ and $U_{sum} \le m$ and at least one of (2) or (3) holds.
\[ m - |\tau_H| - U_L > 0 \tag{2} \]
\[ m - \max(|\tau_H| - 1, 0)\, u_{max}(\tau_L) - U_L - U_H > 0 \tag{3} \]
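The abridged conditions of Theorem 11 can be evaluated mechanically. The Python sketch below is a hedged illustration: the function name is hypothetical, utilizations are assumed to be given as exact rationals, and clamping the number of tasks counted in U_L at zero for very small U_sum is an assumption not stated in the text.

```python
from fractions import Fraction as F
from math import ceil

def edf_hl_conditions_hold(low_utils, high_utils, m):
    """Sufficient conditions of Theorem 11 (abridged). False means only that
    the theorem gives no guarantee, not that the system is unschedulable."""
    n_high = len(high_utils)
    u_h = sum(high_utils, F(0))
    u_sum = sum(low_utils, F(0)) + u_h
    if n_high > m or u_sum > m:
        return False
    # U_L: sum of the min(ceil(U_sum) - 2, n) largest low-task utilizations.
    k = max(0, min(ceil(u_sum) - 2, len(low_utils)))
    u_l = sum(sorted(low_utils, reverse=True)[:k], F(0))
    u_max = max(low_utils, default=F(0))
    cond2 = m - n_high - u_l > 0                                   # (2)
    cond3 = m - max(n_high - 1, 0) * u_max - u_l - u_h > 0         # (3)
    return cond2 or cond3

# Hypothetical instance: two low tasks, one high task, two processors.
print(edf_hl_conditions_hold([F(7, 8), F(1, 4)], [F(7, 8)], 2))
```

The tardiness bound B itself is omitted, as in the abridged theorem statement.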
Returning to our problem, our schedulability conditions rely on the following assumptions.
These assumptions allow us to schedule τp and τh as if they both consisted of standard
sporadic tasks. We will show how to support Assumptions 1 and 2 in Section 3.2.
Assumption 1. Tasks have been divided into threaded and physical tasks such that
$\forall \tau^p_i \in \tau^p,\ u^p_i \le 1$ and $\forall \tau^h_i \in \tau^h,\ u^h_i \le 1$ both hold. Without loss of generality, we assume that the
tasks in each of the sets τp and τh are indexed in decreasing-utilization order, e.g., $u^p_1$ (resp.,
$u^h_1$) is the largest utilization in τp (resp., τh).
Assumption 2. Worst-case costs for physical and threaded tasks have been determined.
Assumption 3. Physical tasks are not permitted to execute on threaded processors.⁴
Lemma 12. τp is schedulable on πp under GEDF such that all tasks have guaranteed
bounded tardiness if (4) holds.
\[ U^p \le m^p + a^p \tag{4} \]
Proof. If ap = 0, then the result restates the SRT feasibility condition for m identical, fully
available processors from (1). GEDF is known to be SRT-optimal [4], so the result follows.
If mp = 0, then it can easily be shown that the system is schedulable only if Up ≤ ap.
In the rest of the proof, we consider the remaining possibility, i.e., that ap > 0 and mp > 0
both hold. For this case, we show that Theorem 11 can be applied.
From the perspective of τp, there exists a set of low tasks τp with total utilization Up,
one high task with utilization ah, and mp+ 1 processors. Thus, we want to apply Theorem 11
with the substitutions m ← mp + 1, τL ← τp, Usum ← Up + ah, and |τH | = 1. With
4 When the shared core belongs to πp, it supports a physical processor, not a threaded processor.
these substitutions, (4), and Def. 8, it is straightforward to see that both |τH | ≤ m and
Usum ≤ m hold, as required by Theorem 11. We now show that (2) holds, from which
bounded tardiness for the tasks in τL, i.e., those in τp, follows. To see this, note that from










⇒ {because $u^p_i \le 1$ holds, by Assumption 1}
$U_L < m^p$.
From this inequality, we have $m - |\tau_H| - U_L = m^p + 1 - 1 - U_L > 0$, as required by (2). ∎
The schedulability condition for τh is slightly more complicated, due to it potentially
having two partially available processors.
Lemma 13. τh is schedulable on πh under GEDF such that all tasks have guaranteed
bounded tardiness if (5) and at least one of (6) or (7) hold, where umax(τh) denotes the
maximum task utilization in τh.









Proof. As in the prior proof, the proof is straightforward if either ah = 0 holds or mh = 0
holds, so we focus on the remaining possibility, i.e., mh > 0 and ah > 0 both hold; note that
the latter implies that ap > 0 holds as well. As before, we will use Theorem 11. In this case,
we are attempting to schedule a set of low tasks τh with total utilization Uh on 2(mh + 1)
processors given two high tasks, each with utilization ap. Thus, we want to apply Theorem 11
with the substitutions m ← 2(mh + 1), τL ← τh, Usum ← Uh + 2ap, and |τH | = 2. With
these substitutions, (5), and Def. 8, it is straightforward to see that both |τH | ≤ m and
Usum ≤ m hold, as required by Theorem 11. In the rest of the proof, we show that, with
these substitutions, (6) implies (2) and (7) implies (3), from which bounded tardiness for the
tasks in τL, i.e., those in τh, follows.
To see that (6) implies (2), first note that, because mh is an integer, we have $\lceil U_{sum} \rceil - 2 \le$










= {by the definition of UL in Def. 10}
2mh > UL,
i.e., 2mh − UL > 0 holds, which is equivalent to (2), since m = 2(mh + 1) and |τH | = 2.
To see that (7) implies (3), observe that




⇒ {reasoning as above}
$2(m^h + a^h) - u_{max}(\tau^h) > U_L$
= {because $a^h = 1 - a^p$}
$2(m^h + 1 - a^p) - u_{max}(\tau^h) > U_L$
= {in our context $u_{max}(\tau^h) = u_{max}(\tau_L)$, $|\tau_H| - 1 = 1$, $U_H = 2a^p$, and $m = 2(m^h + 1)$}
$m - \max(|\tau_H| - 1, 0)\, u_{max}(\tau_L) - U_H > U_L$,
which is equivalent to (3). ∎
A special case applies when there is no shared core.
Lemma 14. If $a^h = 0$, then τh is schedulable on πh under GEDF if and only if $U^h \le 2m^h$
holds.
Proof. With no shared core, the platform consists of $2m^h$ identical threaded processors. The standard SRT
feasibility test given by (1) applies. ∎
Our next step is to give a schedulability condition for τp and τh combined on π. This
condition is a straightforward extension of the preceding lemmas, but it has the benefit of
letting us focus on τ rather than on how π is partitioned.
Theorem 15. Platform π can be partitioned such that τp is schedulable on πp and τh is
schedulable on πh, both under GEDF, if (8) and at least one of (9) or (10) hold.









Proof. In order to define mp and ap so that mp + ap = Up holds, as in Def. 7, we merely
require Up ≤ m to hold, and by Def. 5, this is implied by (8). Note that mp + ap = Up
satisfies Condition (4) in Lemma 12.
Schedulability of τp on πp is implied by (8):
UE ≤ m




= {by Def. 7, mp + ap = Up}
Up = mp + ap,
which is the condition for τp per Lemma 12.
We next show that (8) implies Condition (5) of Lemma 13. To see this, observe that, by
Def. 5, $U^E \le m \Rightarrow U^h/2 \le m - U^p$. Also, by Def. 8, $m^h + a^h = m - U^p$. Putting these facts
together, we have $U^h \le 2(m^h + a^h)$, which is (5).
We conclude the proof by showing that (9) is equivalent to Condition (6) of Lemma 13,










Similarly, to see that (10) holds, note the following.




= {by Def. 8, $m - \lceil U^p \rceil = m^h$}




Having verified all conditions of Lemmas 12 and 13, we conclude that τp is schedulable on
πp and τh is schedulable on πh. ∎
Again, a special case applies if Up is integral.
Corollary 16. If Up is integral, then both τp and τh are schedulable on their respective
sub-platforms under GEDF so long as UE ≤ m holds.
Proof. Similar to the proof of Lemma 14. ∎
It is not strictly necessary that πp be defined as we do here. If we allow other design
considerations, such as maximizing cache affinity or minimizing tardiness, different platform
definitions may be preferable, but we defer those possibilities to future work.
By themselves, the results of this section are not very useful, since there are an exponential
number of possible ways to partition π. In the next section, we show how to efficiently find
τp and τh that will be schedulable under Theorem 15.
3.2 Sub-Problem 2: Dividing the Tasks
We have addressed how to schedule a task system τ for a given pair of subsystems τp and τh.
Here, we show how we arrive at Assumption 1 – τ has already been divided – and weaken
Assumption 2, which states that all execution costs have been determined, to the following:
Assumption 4. If τi is a threaded task, then $C^h_i = \max_{\tau_j \in \tau} C_{i:j}$.
Oblivious scheduling. We first work through a simple example of dividing a task system
and then formalize that approach into what we term symbiosis-oblivious partitioning.⁵ We
then show how our approach can be improved by modifying Assumption 4.
























6, 6, 5.3, 4
))
.







[…] = 2.125 and will require three cores to be feasibly scheduled (recall
that Ci:i gives τi’s cost with nothing co-scheduled, i.e., without SMT). Based on Assumption
4, we see that $C^h_1 = 10$ if τ1 is threaded. Because $T_1 = 8$, making τ1 threaded would give
$u^h_1 = 10/8$, making the system unschedulable. For τ2, $C^h_2$ would be at most τ2’s period, but
$C^h_2 = 4$ would be more than twice $C^p_2 = 1$. Part of the schedulability condition given in
Theorem 15 is that $U^E \le m$. Because $U^E$ is defined as $U^E = U^p + U^h/2$ (Def. 5), placing τ2
in τh would increase $U^E$ more than placing τ2 in τp, so we do not wish for τ2 to be threaded.
For both τ3 and τ4, $\max_j C_{i:j} \le T_i$ and $\min_j (C_i / C_{i:j}) \ge 0.5$ both hold, so letting those tasks be
threaded would decrease $U^E$ compared to placing them in τp without violating $u^h_i \le 1$, so
we allow those tasks to be threaded, giving $u^h_3 = 3/4$ and $u^h_4 = 6/8$. The resulting partition has
$U^p = 7/8 + 1/4$, $U^h = 3/4 + 6/8$, and $U^E = 1.875$. It can, per Theorem 15, be scheduled on only
two cores.
We formally state the steps we just took in Algorithm 1, which partitions τ into τp and τh
so as to minimize UE subject to uhi ≤ 1 for all threaded tasks and |τh| ≥ 2. The resulting
partition is then schedulable if the conditions of Theorem 15 hold. We require that |τh| ≥ 2 holds since
allowing a single threaded task will give no schedulability advantage compared to letting all
tasks be physical. We refer to partitions of τ that obey both these constraints as legal. We
will examine the effectiveness of Algorithm 1 in our schedulability study.
Definition 18. A partition of τ into τp and τh is legal if and only if $\forall \tau^h_i \in \tau^h,\ u^h_i \le 1$
and $|\tau^h| \ne 1$ hold.
A more complex cost model. Under Assumption 4, the only variable that influences the
cost of τi is whether τi is physical or threaded. However, Assumption 4, and consequently
Algorithm 1, is highly pessimistic with regard to assigning Chi values. Returning to Example
17, we declared $C^h_3 = 3$ on the grounds that $\max_j C_{3:j} = 3$ holds. However, there is a
limitation to that logic; Ch3 = 3 is based on the assumption that τ1 can interfere with τ3, but
in our example, we decided that τ1 should not be threaded. We can remove this limitation,
thereby improving our model, by replacing Assumption 4 with Assumption 5. The difference
is that under Assumption 5, Chi is only based on other threaded tasks, not on all tasks in τ.
5 The terms symbiosis-oblivious and symbiosis-aware scheduling were previously used by Jain et al. [15].
S.H. Osborne, J. J. Bakita, and J. H. Anderson 3:13
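Algorithm 1: Oblivious Partitioning.
for all $\tau_i \in \tau$ do
    $C_i^h \leftarrow \max_{j \le n} C_{i:j}$
    if $C_i^h \le T_i$ and $\frac{C_i}{C_i^h} \ge 0.5$ then
        $\tau^h \leftarrow \tau^h \cup \{\tau_i\}$
    else
        $C_i^p \leftarrow C_{i:i}$
        $\tau^p \leftarrow \tau^p \cup \{\tau_i\}$
    end if
end for
if $|\tau^h| < 2$ then
    $\tau^p \leftarrow \tau^p \cup \tau^h$
    $\tau^h \leftarrow \emptyset$
end if
return $\tau^p$, $\tau^h$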
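A minimal Python sketch of oblivious partitioning follows. The cost matrix is hypothetical: its entries were chosen only so that they agree with the values quoted in Example 17 (periods 8, 4, 4, 8), and the threading test uses the ratio condition $\frac{C_i}{C_{i:j}} \ge 0.5$ stated there. The names `oblivious_partition`, `COSTS`, and `PERIODS` are illustrative.

```python
from fractions import Fraction as F

def oblivious_partition(costs, periods):
    """Split task indices into (physical, threaded) in the style of Algorithm 1.

    costs[i][j] is C_{i:j}; costs[i][i] is the cost of task i running alone.
    """
    physical, threaded = [], []
    for i, period in enumerate(periods):
        c_h = max(costs[i])      # Assumption 4: worst case over all tasks
        c_p = costs[i][i]
        # Thread only if the task still fits in its period and its halved
        # threaded utilization does not exceed its physical utilization.
        if c_h <= period and F(c_p) / c_h >= F(1, 2):
            threaded.append(i)
        else:
            physical.append(i)
    if len(threaded) < 2:        # a lone threaded task gives no benefit
        physical, threaded = sorted(physical + threaded), []
    return physical, threaded

# Hypothetical cost matrix; index 0 corresponds to tau_1.
COSTS = [
    [F(7), F(10), F(9), F(9)],
    [F(4), F(1), F(2), F(2)],
    [F(3), F("2.6"), F(2), F("2.6")],
    [F(5), F(6), F(6), F(4)],
]
PERIODS = [8, 4, 4, 8]
print(oblivious_partition(COSTS, PERIODS))  # ([0, 1], [2, 3])
```

With these inputs the sketch reproduces the partition of Example 17: $\tau_1$ and $\tau_2$ physical, $\tau_3$ and $\tau_4$ threaded.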
▶ Assumption 5. If $\tau_i$ is threaded, then $C_i^h = \max_{\tau_j \in \tau^h} C_{i:j}$.
The difference is that while Assumption 4 considers interference from all tasks in $\tau$,
Assumption 5 considers interference only from other tasks in $\tau^h$. While this approach removes
some of the pessimism present in symbiosis-oblivious scheduling, it has the disadvantage
that every time a task is added to or removed from $\tau^h$, $C_i^h$ may change for all tasks in $\tau^h$.
We refer to task-partitioning algorithms that incorporate Assumption 5 as symbiosis-aware
partitioning. We give a brief demonstration of symbiosis-aware partitioning in Example 19,
using the same task set as in Example 17.
▶ Example 19. We first decide that $\tau_1$ must be physical, since $\forall j \ne 1$, $C_{1:j} > T_1$. Knowing
that no task will be co-scheduled with $\tau_1$, we have $C_2^h = 2$ and $C_3^h = 2.6$, giving $u_2^h = \frac{2}{4}$ and
$u_3^h = \frac{2.6}{4}$, but leaving $u_4^h$ unchanged. (In Example 17, we made $\tau_2$ a physical task and $\tau_3$ a
threaded task with $u_3^h = \frac{3}{4}$.) Now we make all of $\tau_2$, $\tau_3$, and $\tau_4$ threaded, with $\tau_3$ having a
lower utilization than before. We now get $U^p = \frac{7}{8}$ and $U^h = \frac{2}{4} + \frac{2.6}{4} + \frac{6}{8}$, so that $U^E \approx 1.83$,
a reduction from $U^E = 1.875$ in Example 17. Again, $\tau^p$ and $\tau^h$ are schedulable on two
cores per Theorem 15.
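The numbers in Example 19 can be reproduced with the sketch below, which computes threaded costs under Assumption 5. As before, the cost matrix is hypothetical and was chosen only to match the values quoted in Examples 17 and 19; the function names are illustrative.

```python
from fractions import Fraction as F

def threaded_cost(costs, i, threaded):
    """C_i^h under Assumption 5: worst case over tasks in the threaded set."""
    return max(costs[i][j] for j in threaded)

def effective_utilization(costs, periods, threaded):
    """U^E = U^p + U^h / 2, with threaded costs taken from Assumption 5."""
    total = F(0)
    for i, period in enumerate(periods):
        if i in threaded:
            total += threaded_cost(costs, i, threaded) / period / 2
        else:
            total += costs[i][i] / period
    return total

# Hypothetical cost matrix; index 0 corresponds to tau_1.
COSTS = [
    [F(7), F(10), F(9), F(9)],
    [F(4), F(1), F(2), F(2)],
    [F(3), F("2.6"), F(2), F("2.6")],
    [F(5), F(6), F(6), F(4)],
]
PERIODS = [8, 4, 4, 8]

threaded = {1, 2, 3}  # tau_2, tau_3, tau_4
print(threaded_cost(COSTS, 1, threaded), threaded_cost(COSTS, 2, threaded))  # 2 13/5
print(float(effective_utilization(COSTS, PERIODS, threaded)))  # 1.825
```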
A greedy approach to schedulability. We propose to use Algorithm 2 to partition τ. The
algorithm seeks to minimize UE by repeatedly moving a task from τp to τh, or vice versa,
to give the greatest decrease in UE . It does so until either a specified maximum number of
attempts has been made or it reaches a partition that cannot be improved by the movement
of any single task. The algorithm is not optimal, even given an unlimited number of attempts,
as there may exist partitions of τ that cannot be improved by moving any one task but can
be improved by moving two or more tasks.
The for loop of lines 3 through 16 determines, for every $\tau_i$ in $\tau^p$, the benefit of moving
that task to $\tau^h$. Line 4 tests what $C_i^h$ would be if $\tau_i$ were in $\tau^h$. Lines 10 through 13 calculate
the change to tasks already in $\tau^h$ caused by moving $\tau_i$, and line 15 gives the total change to
$U^E$ caused by moving $\tau_i$ to $\tau^h$.
Similarly, the for loop of lines 19 through 23 determines the benefit of moving τj to τp,
for every τj currently in τh. Line 20 gives the change to tasks remaining in τh caused by
moving τj , and line 22 gives the total change to UE caused by moving τj to τp. The if of
line 25 guarantees that no task will be moved unless moving that task will decrease UE ,
preventing the algorithm from placing τ into any one partition more than once.
ECRTS 2019
3:14 Simultaneous Multithreading Applied to Real Time
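Algorithm 2: Greedy Partitioning.
Require: $\tau$ partitioned such that $\forall \tau_i \in \tau^h$, $u_i^h \le 1$ and $|\tau^h| \ge 2$
1: for $\ell \leftarrow 1 \ldots maxLoops$ do
2:     ▷ Identify best move from $\tau^p$ to $\tau^h$
3:     for all $\tau_i \in \tau^p$ do
4:         compute what $C_i^h$ would be if $\tau_i$ were in $\tau^h$
5:         …
6:         if $u_i^h > 1$ then
7:             continue
8:         end if
9:         ▷ Calculate how adding $\tau_i$ to $\tau^h$ will affect tasks already in $\tau^h$
10:         if moving $\tau_i$ to $\tau^h$ will cause $u_j^h \ge 1$ for any $\tau_j \in \tau^h$ then
11:             continue
12:         end if
13:         $I(\tau_i^h) \leftarrow$ total increase in util. of tasks already in $\tau^h$ caused by moving $\tau_i$
14:         ▷ $\Delta(i)$ gives decrease to $U^E$ caused by moving $\tau_i$.
15:         $\Delta(i) \leftarrow$ total decrease in $U^E$ caused by moving $\tau_i$ to $\tau^h$
16:     end for
17:     ▷ Identify best move from $\tau^h$ to $\tau^p$
18:     if $|\tau^h| > 2$ then
19:         for all $\tau_j \in \tau^h$ do
20:             $D(\tau_j^h) \leftarrow$ total decrease in util. of tasks already in $\tau^h$ caused by moving $\tau_j$
21:             …
22:             $\Delta(j) \leftarrow$ total decrease in $U^E$ caused by moving $\tau_j$ to $\tau^p$
23:         end for
24:     end if
25:     if no task has a positive $\Delta$ value then
26:         break
27:     end if
28:     Move task with maximum $\Delta$ to other subsystem and update threaded costs
29: end for
30: return ($\tau^p$, $\tau^h$)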
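The greedy idea can be sketched in Python as follows. This is a simplified variant written under stated assumptions rather than a transcription of Algorithm 2: it recomputes $U^E$ from scratch for each candidate move instead of tracking the incremental terms $I(\cdot)$ and $D(\cdot)$, it treats $u_i^h \le 1$ as legal per Def. 18, and the cost matrix is hypothetical (chosen to agree with the values in Examples 17 and 19).

```python
from fractions import Fraction as F

def effective_utilization(costs, periods, threaded):
    """U^E under Assumption 5, or None if some threaded task has u_i^h > 1."""
    total = F(0)
    for i, period in enumerate(periods):
        if i in threaded:
            u_h = max(costs[i][j] for j in threaded) / F(period)
            if u_h > 1:
                return None
            total += u_h / 2
        else:
            total += costs[i][i] / F(period)
    return total

def greedy_partition(costs, periods, threaded, max_loops=100):
    """Repeatedly apply the single-task move that lowers U^E the most."""
    threaded = set(threaded)
    best = effective_utilization(costs, periods, threaded)
    if best is None or len(threaded) < 2:
        raise ValueError("starting partition must be legal with |tau^h| >= 2")
    for _ in range(max_loops):
        moves = []
        for i in range(len(periods)):
            trial = threaded ^ {i}       # move task i to the other subsystem
            if len(trial) < 2:           # keep at least two threaded tasks
                continue
            value = effective_utilization(costs, periods, trial)
            if value is not None and value < best:
                moves.append((value, i))
        if not moves:                    # no single move improves U^E
            break
        best, moved = min(moves)
        threaded ^= {moved}
    return threaded, best

# Hypothetical cost matrix; index 0 corresponds to tau_1.
COSTS = [
    [F(7), F(10), F(9), F(9)],
    [F(4), F(1), F(2), F(2)],
    [F(3), F("2.6"), F(2), F("2.6")],
    [F(5), F(6), F(6), F(4)],
]
PERIODS = [8, 4, 4, 8]
print(greedy_partition(COSTS, PERIODS, {1, 2, 3}))
```

Because a move is accepted only when it strictly decreases $U^E$, the loop cannot revisit a partition, which mirrors the role of the test on line 25.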
The algorithm returns a partition that can be tested for schedulability by Theorem 15.
Algorithm 2 assumes, and maintains as an invariant, that the partition is legal, as defined
in Def. 18. To begin Algorithm 2, τ must already be in a legal partition. We propose three
ways to achieve this. First, in the greedy-threaded approach, we begin with all tasks in τh
and then place into $\tau^p$ all tasks for which any possible $C_i^h$ value will give $u_i^h > 1$. Intuitively,
putting tasks in τh whenever possible should be beneficial, so we should start with as many
tasks in τh as possible.
Second, in the greedy-physical approach, we start with all tasks in $\tau^p$ apart from the
single pair of tasks that will give the greatest decrease to $U^E$. This can be done by defining
the decrease to $U^E$ associated with a single pair of tasks $(\tau_i, \tau_j)$ as

$$u_i^p + u_j^p - \frac{1}{2}\left(\frac{C_{i:j}}{T_i} + \frac{C_{j:i}}{T_j}\right).$$

When $\tau_i$ and $\tau_j$ are placed into $\tau^h$, $u_i^p$ and $u_j^p$ are no longer part of $U^p$ and can be subtracted
from $U^E$, while their threaded utilizations, halved per Def. 5, add
$\frac{1}{2}\left(\frac{C_{i:j}}{T_i} + \frac{C_{j:i}}{T_j}\right)$ to $U^E$. We expect
this approach will be more efficient than the first one in task systems where $u_i^p$ is typically
large or $\frac{C_i}{C_{i:j}}$ is typically small, since there will be relatively few tasks that can be placed in
$\tau^h$, making it more efficient to begin with the majority of tasks in $\tau^p$. If no satisfactory pair
of tasks exists, then we conclude that SMT should not be used with this task system.
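Selecting the starting pair for the greedy-physical approach can be sketched as below. The decrease for a pair is taken to be $u_i^p + u_j^p - \frac{1}{2}\left(\frac{C_{i:j}}{T_i} + \frac{C_{j:i}}{T_j}\right)$, which follows from Def. 5 when $\tau^h = \{\tau_i, \tau_j\}$ and $C_{i:j} \ge C_{i:i}$; that last inequality, the reading of "satisfactory" as a legal pair with positive decrease, the function names, and the cost matrix are all assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import combinations

def pair_decrease(costs, periods, i, j):
    """Decrease in U^E from threading only tasks i and j, or None if illegal."""
    u_h_i = F(costs[i][j]) / periods[i]
    u_h_j = F(costs[j][i]) / periods[j]
    if u_h_i > 1 or u_h_j > 1:
        return None
    u_p = F(costs[i][i]) / periods[i] + F(costs[j][j]) / periods[j]
    return u_p - (u_h_i + u_h_j) / 2

def best_starting_pair(costs, periods):
    """Return (decrease, (i, j)) for the best pair, or None if no pair helps."""
    scored = []
    for i, j in combinations(range(len(periods)), 2):
        d = pair_decrease(costs, periods, i, j)
        if d is not None and d > 0:
            scored.append((d, (i, j)))
    return max(scored) if scored else None

# Hypothetical cost matrix; index 0 corresponds to tau_1.
COSTS = [
    [F(7), F(10), F(9), F(9)],
    [F(4), F(1), F(2), F(2)],
    [F(3), F("2.6"), F(2), F("2.6")],
    [F(5), F(6), F(6), F(4)],
]
PERIODS = [8, 4, 4, 8]
print(best_starting_pair(COSTS, PERIODS))  # (Fraction(3, 10), (2, 3))
```

A return value of `None` corresponds to the case in which SMT should not be used with the task system.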
Third, in the greedy-mixed approach, we first run Algorithm 1 and use the partition given
by doing so as our starting point. Intuitively, Algorithm 1 by itself should give a partition
with a lower $U^E$ value than either of the other two approaches, so using it as a starting point
should yield better results. As with the greedy-physical approach, if Algorithm 1 places
no tasks in τh, then we conclude that SMT should not be used. We compare these three
approaches in our schedulability experiments, presented in Section 5. We found that for all
three versions of Algorithm 2, there existed task systems that were schedulable according to
that version alone. In fact, the greedy-physical approach seemed to find more schedulable
task systems than the other two.
4 Sub-Problem 1: SMT and Execution Times
Current literature does not address how SMT affects worst-case execution costs. While the
early 2000s saw multiple detailed analyses of the performance effects of SMT [1, 2, 25], little
work of this type has been done since then. While ongoing research into scheduling with SMT
exists outside of real time [7, 8, 10, 23], this current research does not suit our needs for two
reasons. First, it tends to be oriented towards total throughput and average execution costs,
whereas we need information on worst-case execution costs. Second, the current works we
are aware of compare different methods of implementing SMT, but do not compare systems
that use SMT to those that do not use it.
4.1 Benchmark Experiments
To analyze the effects of SMT on worst-case execution costs, we ran a series of experiments
using the TACLeBench sequential benchmarks [9], which consist of 23 C implementations
of functions commonly found in embedded and real-time systems. All of our experiments
were conducted under Linux on an Intel Xeon Silver 4110 2.1 GHz CPU with eight cores, each
capable of supporting two threaded processors.⁶
To get baseline results for execution times without SMT enabled, we looped each benchmark
1,000 to 100,000 times (lower-cost benchmarks got more loops) and timed the
execution of each loop using a nanosecond-resolution timer. Between loops, an array the
size of the L3 cache was allocated and set, so that every execution started with a cold cache.
Benchmarks were assigned a Linux real-time priority, prioritizing them above all normal tasks,
pinned to a single processor, and executed sequentially. We excluded four benchmarks from
the set – anagram, audiobeam, g723_enc, and huff_dec – as they would not correctly execute
in a loop. Results of our baseline experiments are summarized in Table 1. The last column
gives the coefficient of variation, defined as the standard deviation divided by the mean.
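The measurement loop and the coefficient of variation can be illustrated with the Python sketch below. It is not the harness used for the experiments: it neither pins the process nor assigns a real-time priority, interpreter overhead dominates short workloads, and touching a scratch buffer only approximates starting with a cold cache. The function names, the `evict_bytes` parameter, and the use of the population standard deviation are assumptions.

```python
import statistics
import time

def time_runs(fn, runs, evict_bytes):
    """Time `runs` calls of fn in nanoseconds, writing a scratch buffer between calls."""
    samples = []
    for _ in range(runs):
        scratch = bytearray(b"\x01") * evict_bytes  # allocate and write the buffer
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
        del scratch
    return samples

def summarize(samples):
    """Return (maximum, mean, coefficient of variation) of the samples."""
    mean = statistics.mean(samples)
    return max(samples), mean, statistics.pstdev(samples) / mean

# Example use with a stand-in workload and a 1 MiB scratch buffer.
samples = time_runs(lambda: sum(range(1000)), runs=20, evict_bytes=1 << 20)
print(summarize(samples))
```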
6 Code for these experiments is available at https://github.com/JoshuaJB/SMART-ECRTS19,
https://jamesanderson.web.unc.edu/papers/, and https://doi.org/10.4230/DARTS.5.1.8.
Table 1 Baseline Execution Times (ns).
Benchmark               (ns)          (ns)   Coeff. of variation
adpcm_dec            167,380       151,914   0.006659
adpcm_enc            158,053       147,394   0.006463
ammunition        47,979,870    47,899,553   0.001589
cjpeg_transupp       844,791       827,661   0.002087
cjpeg_wrbmp           32,420        26,712   0.010552
dijkstra          15,740,782    15,719,309   0.000445
epic                 665,837       649,170   0.002284
fmref                154,776        99,280   0.068863
gsm_dec              470,193       463,592   0.002546
gsm_enc            1,337,465     1,320,787   0.001934
h264_dec              93,361        82,045   0.006340
huff_enc             247,232       234,213   0.005431
mpeg2            135,009,849   134,898,300   0.000248
ndes                  21,600        15,426   0.015071
petrinet               3,682            62   0.215268
rijndael_dec         965,022       940,081   0.007688
rijndael_enc         872,400       858,645   0.002224
statemate             11,928         6,495   0.026602
susan             10,958,260    10,932,188   0.000379
For threaded execution times, every task was executed alongside every other task. For
each pair, the measured task was executed the same number of times as in the baseline
experiments while an interfering task executed continuously at equal priority on the second
thread of the same core. Our results are summarized in Fig. 5, which shows $r_{i:j}$ for every
pair of tasks, with the measured task as $\tau_i$ and the interfering task as $\tau_j$. Observed rates
ranged from 0.51 (mpeg2 interfering with epic) to 1.00, with the exception of values involving
petrinet. Petrinet has an extremely short execution time, as indicated in Table 1; we suspect
its strange behavior is merely random noise.
We cannot guarantee that our experiments captured the maximum interference to τi
caused by τj . However, the low coefficients of variation recorded in Fig. 5 imply that different
interleavings of τi and τj will cause only minor variations in the cost of τi. As discussed in
Section 4.3 below, SRT systems may tolerate some cost overruns.
While we have defined Ci:i as the cost of τi with no co-schedule, the main diagonal of
Fig. 5 shows how much slower a task runs when executed with a second copy of itself. This
is irrelevant for real-time systems in which task parallelism is forbidden, but is relevant to
systems in which different jobs of the same task may execute in parallel, as discussed by
Voronov, Anderson, and Yang [27]. Prior to performing our experiments, we had expected
that tasks executed alongside copies of themselves would have very low $r_{i:j}$ values, due to
competing for the same resources, but our experiments show this is not necessarily the case.
4.2 Benchmark Characterization
In our results, we observe that tasks are relatively consistent both in how vulnerable they are
to interference from other tasks and in how much interference they cause to other tasks. This
is similar to other results in the literature [1, 2, 14, 25]. We say that tasks that experience
little interference from other tasks, i.e., tasks $\tau_i$ for which $r_{i:j}$ tends to be high, are strong,
and that tasks which cause little interference to other tasks, i.e., $\tau_i$ for which $r_{j:i}$ tends to
be high, are friendly. When we define a strength score $s_i = \mathrm{mean}_j(r_{i:j})$ and friendliness
score $f_i = \mathrm{mean}_j(r_{j:i})$, no task has a Pearson correlation⁷ with absolute value greater than
0.14 between $s_i$ and $f_i$ values. Bulpin's work on the behavior of threaded tasks discusses
this lack of correlation further [1, 2].
Figure 5 Effect of SMT on execution times. Measured benchmarks execute with the listed $r_{i:j}$
values when sharing a thread with a given interfering benchmark. Shading is darkest on smallest
values. Right column shows the maximum coefficient of variation experienced by each measured
benchmark over all interfering benchmarks.
For both values, we centered and standardized each row and column before fitting them
to several common statistical distributions via a log-likelihood maximization. We found the
Gaussian distribution to best approximate the results from our experiments. Applying
maximum likelihood estimation (MLE), we found that mean 0.72 and standard deviation
0.13 were the best for $s_i$ while mean 0.72 and standard deviation 0.04 were best for $f_i$.
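The strength and friendliness scores defined in this section are row and column means of the rate matrix. The sketch below computes them for a small hypothetical matrix; whether the diagonal entries are included in the means is not stated in the text, so including them here is an assumption, as are the function names and the matrix values.

```python
from statistics import mean

def strength_scores(r):
    """s_i = mean over j of r[i][j] (row means)."""
    return [mean(row) for row in r]

def friendliness_scores(r):
    """f_i = mean over j of r[j][i] (column means)."""
    return [mean(col) for col in zip(*r)]

# Hypothetical 3x3 rate matrix: r[i][j] is the rate of measured task i
# when task j interferes.
R = [
    [1.0, 0.8, 0.6],
    [0.9, 1.0, 0.7],
    [0.5, 0.6, 1.0],
]
print(strength_scores(R))
print(friendliness_scores(R))
```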
4.3 Reliability of Measured Worst-Case Costs
We stated in Assumption 5 that $C_i^h$ is no more than $\max_{\tau_j \in \tau^h} C_{i:j}$. While we are confident
that violations will be rare, we cannot guarantee there will not be any. In particular, our
assumption that all portions of $\tau_i$ receive the same amount of interference from all portions of
$\tau_j$ is a potential source of timing violations. For example, let $\tau^h = \{\tau_1^h, \tau_2^h, \tau_3^h\}$ be such that
$C_{1:2} = C_{1:3} = 6$. Under Assumption 5, the worst-case execution time for $\tau_1$ is 6. Suppose $\tau_1$
can be broken into two segments, $\tau_{1a}$ and $\tau_{1b}$, such that $C_{1a:2} = 4$, $C_{1b:2} = 2$, $C_{1a:3} = 2$, and
$C_{1b:3} = 4$. If $\tau_{1a}$ is co-scheduled with $\tau_2$ and $\tau_{1b}$ is co-scheduled with $\tau_3$, $\tau_1$'s total execution
time would be 8, violating our stated worst-case execution costs. At present, our benchmark
tests and model do not discover or account for task interleavings as in this scenario. In the
future, we would like to resolve this with finer-grained timing analysis and a model that does
not assume task interference is independent from location within the task. In particular,
breaking tasks into segments, determining execution costs per segment, as in our example,
and conducting an analysis similar to this paper, but at a finer granularity, seems like a
promising way forward. For now, we reiterate that we are only considering applications that
are not safety-critical and where some tardiness is acceptable.
7 A Pearson correlation of ±1 indicates total positive or negative linear correlation; 0 indicates no
correlation.
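The gap between the whole-task bound and the interleaved worst case in the example above can be made concrete with a short sketch. It assumes that $C_{1:j}$ is the sum of the segment costs measured against interferer $\tau_j$, which matches the numbers given; the function names are illustrative.

```python
# Per-segment costs from the example: SEGMENT_COSTS[segment][interferer].
SEGMENT_COSTS = {
    "1a": {"tau_2": 4, "tau_3": 2},
    "1b": {"tau_2": 2, "tau_3": 4},
}

def whole_task_bound(segment_costs):
    """max over j of C_{1:j}, where C_{1:j} sums the segments run against one interferer."""
    interferers = next(iter(segment_costs.values())).keys()
    return max(sum(seg[j] for seg in segment_costs.values()) for j in interferers)

def per_segment_bound(segment_costs):
    """Sum over segments of each segment's worst interferer."""
    return sum(max(seg.values()) for seg in segment_costs.values())

print(whole_task_bound(SEGMENT_COSTS), per_segment_bound(SEGMENT_COSTS))  # 6 8
```

The per-segment bound is never smaller than the whole-task bound, which is why an analysis at segment granularity would cover the interleaving described here.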
Generally, precise timing analysis on multicore is hard and contains uncertainty regardless
of the added SMT challenge. Fortunately, Mills and Anderson have shown SRT systems to
have expected tardiness bounds based on average rather than worst-case execution times
[19]. Their approach relies on designating per-task execution budgets so that if any one job
overruns its budget, it will not receive further execution time until a subsequent job of the
same task could have been executed had the first job completed. These budgets come from
average execution times. Therefore, so long as our costs are greater than the true average
costs, any system τ that can be scheduled as we have described will remain so, though
possibly with increased tardiness, even if our stated costs are not true worst-case costs.
Concerning our results here, our true interest is not in these specific times, but rather in
developing a sense of how SMT-enabled tasks behave so that we can create synthetic tasks
for our schedulability study that are good representations of reality.
5 Schedulability Experiments
Having shown how to schedule SMT-enabled systems and analyzed the behavior of our
benchmark tasks when using SMT, it remains to be seen whether we can schedule otherwise
unschedulable systems. To answer this question, we ran a series of schedulability experiments.
5.1 Experimental Procedure
To run our experiments, we created synthetic task systems to be scheduled on platforms
with m cores, m ∈ {4, 8, 16} such that the total system utilization ranged from m to 2m.
For each task system, we partitioned the system into τp and τh using Algorithm 1 and
all three versions of Algorithm 2. We then tested for schedulability per Theorem 15. We
created enough task systems that each data point in our graphs represents the composite
schedulability of approximately 1,000 task systems. We created over 300 graphs, with a few
thousand to hundreds of thousands of task systems per graph. Creating task sets, partitioning
task sets, and testing for schedulability consumed over 30 days of CPU time.
We plotted our results on a series of schedulability graphs with total utilizations on the
horizontal axis and the proportion of systems that were schedulable on the vertical axis.
Since we started at utilization m, and the standard SRT feasibility condition given by (1)
requires that $\sum_{i=1}^{n} u_i \le m$ hold, every system we created was infeasible without using SMT.
Every system that we could schedule is an argument for adopting SMT in real-time systems.
Each graph shows results for tasks created using a common set of utilization and ri:j
values. Task utilizations were assigned from one of four ranges: the uniform distributions
(0, .4], [.3, .7], [.6, 1], and (0, 1]. We used two approaches for determining ri:j values. In
the Gaussian-average approach, we drew $s_i$ and $f_i$ from Gaussian distributions with
mean 0.72 for both values and standard deviations ranging from 0.13 to 0.39 for $s_i$ and from
0.04 to 0.12 for $f_i$. These parameters are based on distributions we fitted to our benchmark results, as
discussed in the previous section. We allowed larger standard deviations than we obtained
from our benchmarks to make our results more widely applicable.
Figure 6 (4 cores; task utilization in (0.0, 0.4]; s ~ Gauss(0.72, 0.13), f ~ Gauss(0.72, 0.04)) Graph shape is similar to Fig. 1, which has more cores.
Figure 7 (16 cores; task utilization in [0.3, 0.7]; s ~ Gauss(0.72, 0.13), f ~ Gauss(0.72, 0.04)) Schedulability similar to Figs. 1 and 6, despite higher task utils.
Figure 8 (16 cores; task utilization in (0.0, 1.0); s ~ Gauss(0.72, 0.13), f ~ Gauss(0.72, 0.04)) Despite same expected per-task util. as Fig. 7, schedulability is reduced.
Figure 9 (16 cores; task utilization in [0.6, 1.0); s ~ Gauss(0.72, 0.13), f ~ Gauss(0.72, 0.04)) Given high per-task utilizations, only small schedulability gains can be achieved.
Curves in Figures 6 to 9: [1] oblivious, [2] greedy-threaded, [3] greedy-physical, [4] greedy-mixed.
In the uniform-normal approach, both si and fi come from one of four uniform distri-
butions: [.65, 1], [.7, 1], [.75, 1], or [.8, 1]. The two ranges may differ for a given graph.
Each ri:j value was then chosen from a normal distribution with mean sifi and standard
deviation σ, where σ is .01, .05, or .1. Negative values or those greater than 1 are clamped
to 0 or 1 respectively. The intuition behind the uniform-normal approach is to create ri:j
values broadly similar to the benchmark values we obtained, but via different methods than
Gaussian-average so as to avoid having our results be overly dependent on that model. While
high si values in this context still indicate tasks that receive little interference from other
tasks, and high fi values indicate tasks that cause little interference to others, they are used
differently here than in the Gaussian-average approach and should not be directly compared.
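The uniform-normal generation of rate values can be sketched as follows. Two points are assumptions rather than details taken from the text: the mean for $r_{i:j}$ is read as the strength of the measured task times the friendliness of the interfering task (`s[i] * f[j]`), and diagonal entries are generated like any other entry. The function name, parameters, and seed handling are illustrative.

```python
import random

def uniform_normal_rates(n, s_range, f_range, sigma, seed=0):
    """Draw s and f uniformly, then each rate from a clamped normal distribution."""
    rng = random.Random(seed)
    s = [rng.uniform(*s_range) for _ in range(n)]
    f = [rng.uniform(*f_range) for _ in range(n)]
    rates = []
    for i in range(n):
        row = []
        for j in range(n):
            # ASSUMPTION: mean is s_i * f_j; change this line for another reading.
            mean = s[i] * f[j]
            value = rng.gauss(mean, sigma)
            row.append(min(1.0, max(0.0, value)))  # clamp to [0, 1]
        rates.append(row)
    return s, f, rates

s, f, rates = uniform_normal_rates(4, (0.65, 1.0), (0.8, 1.0), sigma=0.05, seed=1)
for row in rates:
    print([round(x, 2) for x in row])
```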
5.2 Results
Due to space constraints, we present only a small portion of our graphs to highlight general
trends. A full set of graphs is available in our online appendix.⁸ For all of our graphs, the
horizontal axis begins at m; all of our task systems would be infeasible without SMT.
▶ Observation 1. Given favorable task parameters, virtually all task systems with utilizations
as high as 1.25m, and roughly half of task systems with utilizations of 1.33m, are schedulable.
Favorable task parameters are high means and low standard deviations for friendliness and
strength values combined with low per-task utilizations. Examples of these results are seen in
Figs. 1, 6, and 7.
8 Appendix and code are available at https://jamesanderson.web.unc.edu/papers/. Code is also
available at https://github.com/JoshuaJB/SMART-ECRTS19 and https://doi.org/10.4230/DARTS.5.1.8.
Figure 10 (16 cores; task utilization in (0.0, 0.4]; σ = 0.055; s ~ Uniform(0.65, 1.00), f ~ Uniform(0.65, 1.00)) Uniform-normal $r_{i:j}$ values on 16 cores. Note variations in algorithm performance.
Figure 11 (4 cores; task utilization in (0.0, 0.4]; σ = 0.055; s ~ Uniform(0.65, 1.00), f ~ Uniform(0.65, 1.00)) Uniform-normal $r_{i:j}$ on 4 cores. Unlike the Gaussian model, core count influences gains from SMT here.
Figure 12 (8 cores; task utilization in (0.0, 0.4]; s ~ Gauss(0.72, 0.39), f ~ Gauss(0.72, 0.13)) Gaussian approach with higher variance. Gains from SMT are reduced compared to Figs. 1, 6, and 7.
Figure 13 (8 cores; task utilization in [0.3, 0.7]; s ~ Gauss(0.72, 0.39), f ~ Gauss(0.72, 0.13)) Underperformance of greedy-threaded as in Fig. 12 disappears as utilizations increase.
Curves in Figures 10 to 13: [1] oblivious, [2] greedy-threaded, [3] greedy-physical, [4] greedy-mixed.
▶ Observation 2. Task systems with low per-task utilization received the greatest improvement
in schedulability, and task systems with high utilization saw the least. Since threading tasks
increases individual execution costs, it will typically not be possible to thread tasks that already
have high utilizations. Fig. 6, in our introduction, shows schedulability for task systems
with individual utilizations drawn from the uniform distribution (0, 0.4], and shows that the
majority of systems considered are schedulable with utilizations as high as 5.34. Fig. 9 has
the same parameters as Figs. 1 and 6, but draws utilizations instead from the range [.6, 1].
This graph shows virtually no improvement when run with SMT.
▶ Observation 3. Algorithm 1, oblivious partitioning, competes with the more complex
algorithms. In our best results, such as Figs. 1, 6, 7, and 13, Algorithm 1 is indistinguishable
from the greedy algorithms. When Algorithm 1 does not perform as well as the variants of
Algorithm 2, the difference is small enough that the lower algorithm complexity might still
make it a better choice.
▶ Observation 4. Lower $r_{i:j}$ variability yields improved schedulability. In Fig. 12, the task
systems sample from the same utilization range as those of Figs. 1 and 6, but here the
standard deviation of the distribution from which $s_i$ and $f_i$ are sampled is larger. This
increased variance causes fewer task sets to be schedulable in Fig. 12 than in Figs. 1 and 6.
▶ Observation 5. Schedulability benefits of our methods are not limited to task systems
generated using a single model. While the Gaussian approach created systems that saw more
improvement from SMT, the benefits of SMT are not limited to task systems created under
that model, suggesting that SMT can benefit a wide variety of task systems.
6 Conclusion
We have given a task model, SMART, that allows for reasoning about SMT-enabled task
systems by defining multiple cost parameters per task. We have shown how to decide which
tasks should and should not use SMT and how to take advantage of SMT to schedule
otherwise unschedulable task systems. We measured the execution times of benchmark
tasks with and without SMT enabled, with the SMT-enabled case covering interference from
all other tasks in the set. We conducted an extensive schedulability study using synthetic
tasks modeled on our benchmark tasks and showed that for task systems consisting of low
utilization tasks, it is possible to schedule virtually all systems with utilization as large as
1.25m and to schedule many task systems with utilizations approaching 1.33m.
In the future, we plan to improve our timing analysis to the point that hard real-time
systems, where no tardiness is permitted, become an option. In addition, we want to expand
our soft real-time work by partitioning both task systems and hardware platforms to minimize
tardiness, rather than simply maximizing schedulability. Making tasks threaded tends to
decrease demand on the platform, potentially reducing tardiness, but will increase execution
costs, potentially increasing tardiness [4, 5, 16]. While the potential gains shown in this paper
are substantial, we have only begun to expose the potential of hardware multithreading.
References
1 J. Bulpin. Operating system support for simultaneous multithreaded processors. PhD thesis,
University of Cambridge, King’s College, 2005. URL: http://www.cl.com.ac.uk/TechReports/.
2 J. Bulpin and I. Pratt. Multiprogramming Performance of the Pentium 4 with Hyperthreading.
In Third Annual Workshop on Duplicating, Deconstruction and Debunking, pages 53–62, June
2004.
3 F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero.
Predictable performance in SMT processors: synergy between the OS and SMTs. IEEE
Transactions on Computers, 55(7):785–799, July 2006. doi:10.1109/TC.2006.108.
4 U. M. C. Devi and J. H. Anderson. Tardiness bounds under global EDF scheduling on a
multiprocessor. In RTSS’05, pages 330–341, December 2005. doi:10.1109/RTSS.2005.39.
5 U. M. C. Devi and J. H. Anderson. Flexible tardiness bounds for sporadic real-time task systems
on multiprocessors. In 20th IEEE International Parallel Distributed Processing Symposium,
pages 10 pp.–, April 2006. doi:10.1109/IPDPS.2006.1639265.
6 S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous
multithreading: a platform for next-generation processors. IEEE Micro, 17(5):12–19, September
1997. doi:10.1109/40.621209.
7 S. Eyerman and L. Eeckhout. The Benefit of SMT in the Multi-core Era: Flexibility Towards
Degrees of Thread-level Parallelism. In Proceedings of the 19th International Conference on
Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages
591–606, New York, NY, USA, 2014. ACM. doi:10.1145/2541940.2541954.
8 S. Eyerman, P. Michaud, and W. Rogiest. Revisiting symbiotic job scheduling. In 2015 IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS), pages
124–134, March 2015. doi:10.1109/ISPASS.2015.7095791.
9 H. Falk, S. Altmeyer, P. Hellinckx, B. Lisper, W. Puffitsch, C. Rochange, M. Schoeberl,
R. B. Sorensen, P. Wagemann, and S. Wegener. TACLeBench: A Benchmark Collection to
Support Worst-Case Execution Time Research. In Martin Schoeberl, editor, 16th International
Workshop on Worst-Case Execution Time Analysis (WCET 2016), volume 55 of OpenAccess
Series in Informatics (OASIcs), pages 2:1–2:10, Dagstuhl, Germany, 2016. Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik. doi:10.4230/OASIcs.WCET.2016.2.
10 J. Feliu, J. Sahuquillo, S. Petit, and J. Duato. Perf Fair: A Progress-Aware Scheduler to
Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers,
66(5):905–911, May 2017. doi:10.1109/TC.2016.2620977.
11 A. Fog. The Microarchitecture of Intel, AMD, and VIA CPUs: an optimization guide for
assembly programmers and compiler makers. Technical University of Denmark, 2018. URL:
https://www.agner.org/optimize/microarchitecture.pdf.
12 T. Gomes, P. Garcia, S. Pinto, J. Monteiro, and A. Tavares. Bringing Hardware Multithreading
to the Real-Time Domain. IEEE Embedded Systems Letters, 8(1):2–5, March 2016. doi:
10.1109/LES.2015.2486384.
13 T. Gomes, S. Pinto, P. Garcia, and A. Tavares. RT-SHADOWS: Real-time system hardware
for agnostic and deterministic OSes within softcore. In ETFA ’15, pages 1–4, September 2014.
doi:10.1109/ETFA.2015.7301572.
14 W. Huang, J. Lin, Z. Zhang, and J.M. Chang. Performance Characterization of Java
Applications on SMT Processors. In ISPASS ’05., pages 102–111, March 2005. doi:
10.1109/ISPASS.2005.1430565.
15 R. Jain, C. J. Hughes, and S. V. Adve. Soft real-time scheduling on simultaneous multithreaded
processors. In RTSS ’02, pages 134–145. Institute of Electrical and Electronics Engineers Inc.,
2002. doi:10.1109/REAL.2002.1181569.
16 H. Leontyev and J. H. Anderson. Generalized tardiness bounds for global multiprocessor
scheduling. Real-Time Systems, 44(1):26–71, March 2010. doi:10.1007/s11241-009-9089-2.
17 S. Lo, K. Lam, and T. Kuo. Real-time task scheduling for SMT systems. In RTCSA’05, pages
5–10, August 2005. doi:10.1109/RTCSA.2005.77.
18 D. Marr, F. Binns, D. Hill, G. Hinton, K. Koufaty, J. Miller, and M. Upton. Hyper-Threading
Technology Architecture and Microarchitecture. In Intel Technology Journal, volume 6, pages
4–15, February 2002.
19 A. F. Mills and J. H. Anderson. A Stochastic Framework for Multiprocessor Soft Real-Time
Scheduling. In RTAS ’10, pages 311–320, April 2010. doi:10.1109/RTAS.2010.33.
20 J. Mische, S. Uhrig, F. Kluge, and T. Ungerer. Using SMT to Hide Context Switch Times of
Large Real-Time Tasksets. In RTAS ’10, pages 255–264, August 2010.
doi:10.1109/RTCSA.2010.33.
21 B. Ocker. FAA special topics. In Collaborative Workshop: Solutions for Certification of
Multicore Processors, November 2018.
22 S. Osborne and J. H. Anderson. Work in Progress: Combining Real Time and Multithreading.
In 2018 IEEE Real-Time Systems Symposium (RTSS), pages 139–142, December 2018. doi:
10.1109/RTSS.2018.00024.
23 P. Radojković, P. M. Carpenter, M. Moretó, V. Čakarević, J. Verdú, A. Pajuelo, F. J. Cazorla,
M. Nemirovsky, and M. Valero. Thread Assignment in Multicore/Multithreaded Processors:
A Statistical Approach. IEEE Transactions on Computers, 65(1):256–269, January 2016.
doi:10.1109/TC.2015.2417533.
24 A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreaded
Processor. In ASPLOS ’2000, ASPLOS IX, pages 234–244, New York, NY, USA, 2000. ACM.
doi:10.1145/378993.379244.
25 N. Tuck and D. M. Tullsen. Initial Observations of the Simultaneous Multithreading Pentium
4 Processor. In PACT ’03, PACT ’03, pages 26–35, Washington, DC, USA, 2003. IEEE
Computer Society. URL: http://dl.acm.org/citation.cfm?id=942806.943857.
26 D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing
on-chip parallelism. In ISCA ’95, pages 392–403, 1995.
27 S. Voronov, J. H. Anderson, and K. Yang. Tardiness Bounds for Fixed-Priority Global
Scheduling Without Intra-Task Precedence Constraints. In RTNS ’18, RTNS ’18, pages 8–18,
New York, NY, USA, 2018. ACM. doi:10.1145/3273905.3273913.
28 M. Zimmer, D. Broman, C. Shaver, and E. A. Lee. FlexPRET: A processor platform for mixed-
criticality systems. In RTAS ’14, pages 101–110, April 2014. doi:10.1109/RTAS.2014.6925994.
