Architectural support for real-time task scheduling in SMT processors by Cazorla Almeida, Francisco Javier et al.
Architectural Support for Real-Time
Task Scheduling in SMT Processors
Francisco J. Cazorla1, Peter M.W. Knijnenburg2, Rizos Sakellariou3,
Enrique Fernandez4, Alex Ramirez1,5, Mateo Valero1,5
1 DAC, UPC, Spain, {fcazorla,aramirez,mateo}@ac.upc.es.
2 LIACS, Leiden University, the Netherlands, peterk@liacs.nl.
3 University of Manchester, UK, rizos@cs.man.ac.uk.
4 University of Las Palmas de Gran Canaria, Spain, efernandez@dis.ulpgc.es.
5 Barcelona Supercomputing Center, Spain.
Abstract
In Simultaneous Multithreaded (SMT) architectures most hardware re-
sources are shared between threads. This provides a good cost/performance
trade-off which renders these architectures suitable for use in embedded sys-
tems. However, since threads share many resources, like caches, they also
interfere with each other. As a result, execution times of applications become
highly unpredictable and highly dependent on the context in which an appli-
cation is executed. Obviously, this poses problems if an SMT is to be used in
a (soft) real time system. In this paper, we propose two novel hardware mech-
anisms that can be used to reduce this performance variability. In contrast to
previous approaches, our proposed mechanisms do not need any information
beyond the information already known by traditional job schedulers. Neither
do they require extensive profiling of workloads to determine optimal sched-
ules. Our mechanisms are based on dynamic resource partitioning. The OS
level job scheduler needs to be slightly adapted in order to provide the hard-
ware resource allocator some information on how this resource partitioning
needs to be done. We show that our mechanisms provide high stability for
SMT architectures to be used in real time systems: the real time benchmarks
we used meet their deadlines in more than 98% of the cases considered while
the other thread in the workload still achieves high throughput.
DAC, UPC. Technical Report. May 2005. Cazorla et al.
1 Introduction
Current processors take advantage of Instruction Level Parallelism (ILP) to exe-
cute in parallel several independent instructions from a single instruction stream
(thread). However, there is only a limited amount of parallelism available in each
thread due to data and control dependences [11], which degrades performance. In
order to alleviate these problems many hardware resources are required, degrading
the performance/cost ratio of these processors.
A solution to improve the performance/cost ratio of processors is to allow threads
to share hardware resources. In current processors, resource sharing can occur in
different ways. At one extreme of the spectrum, there are multiprocessors (MPs)
that only share some levels of the memory hierarchy. At the other extreme, there
are ‘full-fledged’ simultaneous multithreaded processors (SMTs) that share many
hardware resources, improving their performance/cost ratio [8].
However, in SMT processors, threads may also interfere because they share many
resources. This implies that the speed a thread obtains in one workload can be very
different from the speed it has in another workload [4]. We refer to this by saying that
an SMT processor has a high variability. Obviously, high variability is an undesirable
property in a real-time environment. Not only must the job scheduler take into
account Worst Case Execution Times (WCETs) and deadlines when selecting a workload,
it must also know how the workload can affect the WCET.
Note that redefining the WCET as the longest execution time in an arbitrary workload
is not an option: by carefully selecting a workload, the WCET could be made
arbitrarily large. Moreover, analytical approaches to WCET would fail if
they needed to take the context into consideration.
The execution time of a thread depends on the amount of execution resources
given to it. In a system with single-thread processors, a thread has exclusive access
to all available hardware resources, so that this is not a problem. In an SMT pro-
cessor the amount of resources given to a thread varies dynamically. An instruction
fetch policy, e.g., icount [10], decides how instructions are fetched from the threads,
thereby implicitly determining the way internal processor resources are allocated to
the threads. The key point is that current fetch policies are designed with the main
objective of increasing processor throughput, which may differ from the objective of
the overall system, in our case a real-time environment. Unless countermeasures are
taken, this could compromise the real-time system's objective of meeting task
deadlines [2]. In addition, current OS level job schedulers perceive the different contexts
of an SMT as independent processing units and assume that threads have exclusive
access to the hardware resources when scheduled on a processing unit. These as-
sumptions are inherited from multiprocessors, but are not valid for SMTs. Threads
in an SMT share the same hardware resources, so that the behavior and the resources
used by one thread affect all other threads in the workload.
To sum up, in current systems the performance/cost versus variability trade-off
is clear. MPs have lower variability but a poor performance/cost ratio. SMTs offer
a good performance/cost ratio but high variability in the execution time of an
application depending on the context in which it is executed. In the literature, several
solutions have been proposed in order to improve this trade-off [5][7][9]. The common
characteristic of these solutions is that they assume knowledge of the average number
of Instructions Per Cycle (IPC) of applications when they are executed in isolation,
IPCalone. In other words, these solutions are IPC based. This implies, as we show
later, that these solutions are applicable only to a subset of real-time applications,
namely those for which the IPCalone can be obtained a priori.
As far as we know, no proposal deals with this problem when the
IPCalone of applications is not known. In such a case, time-critical threads are given
the full performance of the SMT [2]. This, of course, solves the problem but provides
low throughput. In this paper we propose a novel mechanism to enforce real-time
constraints in an SMT based system. This mechanism consists of a small extension of
the OS level job scheduler and an extension of the SMT hardware, called a Resource
Allocator. Our approach is resource based instead of IPC based. By this we mean
that it relies on the amount of resources given to the time-critical thread. The job
scheduler assembles a workload for the SMT processor and instructs the Resource
Allocator to dedicate at least a certain amount of resources to the time critical thread
so that it is guaranteed to meet its deadline. Apart from this, the Resource Allocator
tries to adjust the resource allocation in order to maximize performance. The current
paper is focused on the Resource Allocator. In future work, we hope to give a working
implementation of the job scheduler as well. Our approach is feasible for real-time
applications. With our method, time-critical applications meet their deadlines in more
than 98% of the cases, while the non-critical applications still achieve high performance.
This paper is structured as follows. Section 2 presents some background on SMTs
and real-time scheduling. In Section 3 we present existing approaches to solve the
high variability of SMTs. In Section 4, we explain the experimental environment.
Section 5 presents our two mechanisms. Section 6 presents the simulation
results. In Section 7 we explain the hardware/software changes required to implement
our mechanism. Finally, conclusions are given in Section 8.
2 Background on SMTs and Real-Time Scheduling
In this section we give some background on SMT processors and real-time scheduling.
In particular, we discuss some of the challenges to develop a real-time scheduler for
SMT processors.
2.1 SMT processors
In an SMT, the front end of a superscalar processor is adapted in order to be able to
fetch from several threads while the back end is shared between threads. Thus, the
same physical resources, like functional units or the branch predictor, are used by
several threads at the same time, as schematically depicted in Figure 1. Several Program
Counters (PCs) are used in the instruction fetch stage and the other stages are
shared between the threads. A fetch policy, like icount [10], determines from which
threads the next instructions are fetched. Moreover, by doing so, the fetch policy
also implicitly determines the way internal processor resources, like physical registers
or window entries, are allocated to threads. This is a major cause of performance
variability in SMT processors: the speed of a thread highly depends on the context
in which it is executed.
Figure 1: Baseline architecture (blocks shown: PCs, Instruction Cache, Fetch, Decode,
Rename, IQ, FPQ, LSQ, IREG Read, FREG Read, Execution, Data Cache, IREG Write,
FREG Write, ROB).
Next, instructions are decoded and renamed in order to track data dependences.
When an instruction is renamed, it is allocated an entry in the window or issue
queues (integer, floating point and load/store) until all its operands are ready. Each
instruction also allocates one Re-Order Buffer (ROB) entry and a physical register
in the register file, if required. ROB entries are assigned in program order and
instructions wait in this buffer until all earlier instructions are resolved. When an
instruction has all its operands ready, it is issued: it reads its operands, executes,
writes its results, and finally commits.
2.2 Real-Time Scheduling
Real-time systems are characterized by a group of repetitive tasks, called a task set.
For each task, the scheduler knows three main parameters. First, the period, that is,
the interval at which new instances of a task are ready for execution. Second, the
deadline, that is, the time before which an instance of the task must complete. For
simplicity, the deadline is often set equal to the period. This means that a task has
to be executed before the next instance of the same task arrives in the system. Third,
the Worst Case Execution Time (WCET) is an upper bound on the time required to
execute any instance of the task; it is guaranteed never to be exceeded.
In soft real-time scheduling, the main purpose of the scheduler consists of finding
a feasible schedule for the task set. Many algorithms have been proposed to solve this
problem in single-threaded systems (e.g., EDF or LLF). However, these algorithms
are no longer sufficient in an SMT processor, since the execution time of a thread is
unpredictable when this thread is scheduled with other threads. Algorithms should
be adapted to meet this new situation.
As pointed out in [7], the problem of scheduling a set of tasks turns into two
different problems in SMT systems. The first problem is the same as for multipro-
cessors, namely, to select the set of tasks to run. This problem is called the workload
selection problem. The second problem consists of determining how resources are
shared between threads. In this paper, we focus on the latter problem that is also
known in the literature as the resource sharing problem.
The high variability of SMT processors implies that the task of a real-time job-
scheduler for SMT processors is much more complex and challenging than for single-
threaded processors. When scheduling a job, the job-scheduler must take into ac-
count the amount of resources given to a thread, which is implicitly decided by the
instruction fetch policy, in order to ensure that it meets its deadline.
3 Existing Approaches
In [2], the authors propose an approach where the WCET is specified assuming a
virtual simple architecture (VISA). At execution time, a task is executed on the
actual processor. Intermediate virtual deadlines are established based on the VISA.
If, during execution, a task fails to meet its intermediate deadlines, the processor
is reconfigured to implement the VISA, bounding the execution time of the task.
If the actual processor is an SMT using a fetch policy that attempts to maximize
throughput, and a task fails to meet its intermediate deadlines, the SMT is switched
to single-threaded mode. The authors conclude that fetch policies that attempt
to maximize throughput, like icount, should be “balanced” for minimum forward
progress of real-time tasks. This is precisely the target of our paper: we ensure a
minimum amount of resources for a given time-critical thread so that it meets its
deadline regardless of the other threads executed in its workload. Our approach is
orthogonal to the VISA framework: a time-critical thread is executed on the actual
SMT processor that provides the thread with a given percentage of resources. In the
event that the task does not meet its intermediate deadlines, instead of switching the
processor to single-threaded mode, we can increase the amount of resources given to
it, so that it meets its deadline and total performance does not drop drastically.
As far as we know, there are three main studies dealing with real-time scheduling
for SMTs. Two of these focus on real-time systems [5][7] while the third focuses on
general-purpose systems [9].
In [7], the authors focus on workload selection in soft real-time systems, although
they also briefly discuss the resource sharing problem. The authors propose a method
to solve the problem of high variability of SMTs that profiles all possible combinations
of tasks. By comparing the IPC of a thread when it is executed in a given
workload, IPCSMT, with the IPC that the thread achieves when it is run in isolation,
IPCalone, the slowdown that the thread suffers from being executed in a context is
determined. This information is given as additional input to the scheduler, which uses
it to maximize performance by selecting those workloads
that lead to the highest symbiosis among threads and thus the highest performance.
The main drawback of this solution is the prohibitively large number of profiles
required. For a task set of K tasks and a target processor with N contexts, we have
to profile all $\binom{K}{N} = \frac{K!}{N!\,(K-N)!}$ possible combinations.
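To give a feeling for how quickly this number grows, the following minimal sketch counts the profiles, assuming that each profile corresponds to one way of choosing N co-scheduled tasks out of a task set of K. The function name `profiles_needed` and the task-set sizes are illustrative only and do not come from [7].

```python
from math import comb


def profiles_needed(num_tasks: int, num_contexts: int) -> int:
    """Count the distinct workloads of `num_contexts` tasks drawn from `num_tasks`.

    Assumption: one profile is needed per unordered combination of tasks.
    """
    return comb(num_tasks, num_contexts)


if __name__ == "__main__":
    # Hypothetical task-set sizes, for a 2-context and a 4-context SMT.
    for k in (8, 16, 32):
        print(k, profiles_needed(k, 2), profiles_needed(k, 4))
```

Under this assumption, 32 tasks on a 4-context SMT would already require tens of thousands of profiling runs.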
A similar solution is proposed in [9]. The authors propose several OS level job
schedulers to enforce priorities in a general-purpose system. Mostly, these schedulers
find co-schedules from a pool of runnable jobs that is larger than the number of
hardware contexts. Their SOS policy runs jobs alone on the machine to determine
their full speed, runs several job mixes in order to determine the best mix that
exhibits symbiosis, and finally runs jobs alone in order to meet priorities.
Finally, in [5] we have proposed a hardware mechanism to run a given thread at a
given percentage of its full speed, IPCalone, in an arbitrary workload. If it is required
to run a thread A at a target IPC that is X% of the IPCalone of that thread, then
the IPC of the critical thread is periodically measured and the mechanism tries to
run that thread at X% of the last measured speed. It has been shown that this
approach can realize an arbitrary required percentage of the IPCalone of a critical
thread in widely different workloads.
A common characteristic of these studies is that they are IPC based, that is, they
require the IPCalone of threads. By comparing the IPC of a thread in a workload
with its IPCalone, these methods converge to a solution. In this paper, we propose a
different way of approaching the problem. Instead of using the IPC of applications
to drive the solution, we use resource allocation that normally is implicitly driven by
the instruction fetch policy. Our method makes explicit to the scheduler the amount
of resources used by each thread. The scheduler adjusts this allocation to guarantee
that applications meet their deadlines.
The main advantage of our method is two-fold. First, it is well known that IPC
values can be highly dependent on the input of an application. For some types of
real-time applications, such as multimedia applications [6], this dependence is weak:
the IPC of such an application is roughly independent from the input. But for
other types of applications this is not the case. Our method does not require this
information so that it is applicable to all types of real-time applications. Second, we
achieve a similar or even better success rate than the approaches discussed above,
while improving overall performance.
4 Experimental Environment
In this section, we discuss the baseline architecture used to run our experiments,
the benchmarks we use, and the metrics we employ to compare the different
proposals.
4.1 SMT Simulator
In order to evaluate the performance of the different policies, we use a trace driven
SMT simulator derived from smtsim [10]. The simulator consists of our own trace
driven front-end and an improved version of smtsim’s back-end. The simulator allows
executing wrong path instructions by using a separate basic block dictionary that
contains all static instructions.
We use an aggressive configuration, shown in Table 1: many shared resources
(issue queues, registers, functional units, etc.), a very wide superscalar core, and a deep
pipeline for a high clock rate. These features cause the processor performance to be
very unstable, depending on the mix of threads. Thus, this configuration represents
an unfavorable scenario in which to evaluate our proposals. If our
proposals work in this demanding configuration, we expect them to work better in
narrower processors with fewer shared resources.
Table 1: Baseline configuration

Processor Configuration
  Default fetch policy          icount 2.8
  Pipeline depth                12 stages
  Fetch/Issue/Commit width      8
  Queue entries                 64 int, 64 fp, 64 ld/st
  Execution units               6 int, 3 fp, 4 ld/st
  Physical registers (shared)   256 integer, 256 fp
  ROB size                      512 entries
  Branch predictor              16K entries gshare
  Branch target buffer          256-entry, 4 ways
  Return address stack          256 entries

Memory Configuration
  Icache, Dcache                64 Kbytes, 2-way, 8-bank, 64-byte lines, 1 cycle access
  L2 cache                      2048 Kbytes, 8-way, 8-bank, 64-byte lines, 20 cycle access
  Main memory latency           300 cycles
  TLB miss penalty              160 cycles
4.2 Benchmarks
We use workloads consisting of two threads. The first thread is the critical thread
(CT) that represents the thread with the most critical time restriction or the soft
real time thread. The second thread is a non-critical thread (NCT) that is assumed
either to have less-critical time restrictions or to have no time restrictions at all. As
critical threads we use programs from the MediaBench Benchmark suite, namely,
adpcm, epic, g721, gsm and mpeg2. We used both the coder and the decoder of
these media applications. Hence, we use 10 media applications as critical threads.
Table 2 shows the inputs for each of the MediaBench benchmarks.

Table 2: MediaBench benchmarks used in this paper.

  Benchmark   Media    Language   Input
  adpcm       speech   C          clinton.pcm
  epic        image    C          test image.pgm
  g721        speech   C          clinton.pcm
  mpeg2       video    C          test2.mpeg

We want to check the efficiency of the resource allocator under scenarios where
the NCT requires many resources, and thus where the performance of the CT could
be more affected. For this reason, we use as non-critical threads benchmarks from
the SPEC2000 integer and fp benchmark suites, which require more resources than
media applications. Each of the ten media applications is executed with 8 different
benchmarks from the SPEC2000 benchmark suite as non-critical thread. We have
used gzip, mesa, perlbmk, wupwise, mcf, twolf, art and swim. These benchmarks were
selected because they exhibit widely varying behavior. Some are memory bound,
which means that they generate many cache misses. Others are not, but consume
many computational resources. In our experiments below, we have used all pairs of
media and general purpose applications, giving us a total of 80 workloads.
In order to check the efficiency of our Resource Allocator, we consider three
different scenarios that differ in the stress that is put on our mechanism. The worst
utilization $U_w$ is defined as the fraction $U_w = \mathrm{WCET}/P$, where WCET is the Worst Case
Execution Time and P is the period of an application. If the utilization is low, then
WCET is much smaller than the period and hence it should be relatively easy to
guarantee deadlines. If, on the other hand, the utilization is high, then the critical
thread must be given many or even all resources. In this case, it may happen more
frequently that a critical thread misses a deadline.
In this paper, the WCET of an application is set equal to its real execution time
when it is run in isolation on the SMT processor, $\mathrm{ExecTime}_i$. We
consider three worst utilization factors, called low, medium, and high. In the first
case, we model a situation where the job scheduler has to schedule one task with a low
worst utilization of 30%: we establish as the deadline for each task
$3.3 \times \mathrm{ExecTime}_i$, so that
$U_i = \mathrm{WCET}_i / P_i = \mathrm{ExecTime}_i / (3.3 \times \mathrm{ExecTime}_i) \approx 30\%$.
In the second scenario, we model a medium utilization of 50%, so that the deadline of
each task is $2 \times \mathrm{ExecTime}_i$. Finally, in the
worst scenario, we use a high utilization of 80%. In this case, the deadline for each
task is $1.25 \times \mathrm{ExecTime}_i$.
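The three scenarios can be reproduced with a few lines of code. The sketch below is only an illustration: it assumes, as stated above, that the WCET equals the execution time in isolation and that the deadline equals the period, so that the deadline is the isolated execution time divided by the utilization. The names `deadline_for` and `SCENARIOS` are hypothetical.

```python
# Worst utilization factors of the three scenarios described in the text.
SCENARIOS = {"low": 0.30, "medium": 0.50, "high": 0.80}


def deadline_for(exec_time_alone: float, utilization: float) -> float:
    """Deadline (= period) such that WCET / period equals `utilization`.

    Assumption: WCET is the execution time of the task run in isolation.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return exec_time_alone / utilization


def scenario_deadlines(exec_time_alone: float) -> dict:
    """Deadlines of one task under the low, medium and high scenarios."""
    return {name: deadline_for(exec_time_alone, u) for name, u in SCENARIOS.items()}


if __name__ == "__main__":
    # With a normalized execution time of 1.0 the deadlines are the
    # multipliers themselves (the text rounds the first one to 3.3).
    print(scenario_deadlines(1.0))
```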
4.3 Metrics
In all our experiments, we run the CT until completion. If the NCT finishes earlier,
it is started again. When the CT finishes, we measure three values. First, the
success rate (SR), which indicates how often the CT finishes before its deadline.
In previous real-time systems, it is the responsibility of the OS level job scheduler to
provide a high success rate. In our approach, this responsibility is shared between the
job scheduler and the resource allocator. Second, we measure the performance of the
non-critical thread. We want to give a minimum amount of resources to the critical
thread to meet its deadline. The remaining resources are given to the non-critical
thread in order to maximize its throughput.
Both these values are required to quantify the efficiency of our approach. For
example, if a given CT has a utilization of 30% and the scheduler orders the resource
allocator to assign it 100% of the resources, the thread will meet its deadline. This
provides a success rate of 100% but neither provides high throughput nor shows the
efficiency of the resource allocator. Analogously, if a thread has a utilization of 90%
and the scheduler orders the allocator to give it 10% of the resources, the thread likely
misses its deadline and we learn nothing about the efficiency of the resource
allocator.
As a third measure, in addition to the success rate, we measure the extra time
required to finish the CT in those cases in which the CT misses its deadline.
Assume that the time required to execute the CT in a given workload, denoted by
$\mathrm{Exectime}_{CT}$, is larger than its deadline, so that the CT misses its deadline. Then
the variance is computed as:

$$\mathrm{variance}_{CT} = \frac{\mathrm{Exectime}_{CT} - \mathrm{deadline}_{CT}}{\mathrm{deadline}_{CT}} \cdot 100\%$$

where $\mathrm{deadline}_{CT}$ denotes the deadline of the critical thread. For a given policy, we
take the five cases in which the variance is highest and compute the average of these
variances. We call this metric Mean5WorstVariance. If a policy has a success rate
of 100%, we have that $\mathrm{Exectime}_{CT} \le \mathrm{deadline}_{CT}$ for each workload and in this case the
variance is zero.
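The success rate and the Mean5WorstVariance metric can be sketched as follows. This is an illustration under two assumptions that the text does not spell out: a run that meets its deadline contributes a variance of zero, and when fewer than five runs are available the average is taken over the runs that exist. The function names and the sample numbers are hypothetical.

```python
def variance_pct(exec_time_ct: float, deadline_ct: float) -> float:
    """Extra time beyond the deadline, as a percentage of the deadline."""
    if exec_time_ct <= deadline_ct:
        return 0.0  # assumption: a met deadline counts as zero variance
    return (exec_time_ct - deadline_ct) / deadline_ct * 100.0


def success_rate(runs) -> float:
    """Fraction of (exec_time, deadline) pairs in which the deadline is met."""
    met = sum(1 for exec_time, deadline in runs if exec_time <= deadline)
    return met / len(runs)


def mean5_worst_variance(runs) -> float:
    """Average of the five highest variances over all runs of one policy."""
    variances = sorted((variance_pct(e, d) for e, d in runs), reverse=True)
    worst = variances[:5]  # assumption: fewer than five runs are averaged as-is
    return sum(worst) / len(worst)


if __name__ == "__main__":
    # Hypothetical (execution time, deadline) pairs for one policy.
    runs = [(110, 100), (120, 100), (90, 100), (100, 100),
            (105, 100), (130, 100), (101, 100)]
    print(success_rate(runs), mean5_worst_variance(runs))
```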
5 Dynamic Resource Partitioning
In this section, we discuss the extensions to the OS level job scheduler and the SMT
hardware for implementing our scheme.
5.1 Overview of our approach
The basis of our mechanism is to partition the hardware resources between the critical
and the non-critical thread and to reserve a minimum fraction of the resources for
the CT that enables it to meet its deadline. In this way, we can also satisfy our
second objective, namely, to increase as much as possible the IPC of the NCT. It is
the responsibility of the job scheduler to provide the resource allocator with some
information so that it can reserve this fraction for the critical thread.
When the WCET of a task is determined, it is assumed that this task has full
access to all resources of the platform it should run on. However, when this task is
executed in a multithreaded environment together with other tasks, it uses a certain
fraction of the resources. It is obvious that when the amount of resources given to
a thread is reduced, its performance decreases as well. The relation between the
amount of resources and performance is different for each program and may vary for
different inputs of the same program. For the benchmarks used in this paper, we have
plotted this relation in Figure 2. This figure shows the relative IPC¹ of each multimedia
application when it is executed alone on the SMT as we vary the amount of
window entries and physical registers given to it. This relative IPC is the percentage
of the IPC the application achieves when executed alone on the machine and given
all the resources, called IPCalone. From this figure, we can see that if we dedicate
10% of the resources to the epic decoder, we obtain 50% of the speed it would have
were it given the entire machine. Likewise, 10% of the resources dedicated to the
adpcm decoder gives 95% of its IPCalone.
Figure 2: Relation between the amount of resources given to a task and its IPC.
Each panel plots resources (%) (r) against relative IPC (%) (p), both from 0 to 100:
(a) first five applications (adpcm_c, adpcm_d, epic_c, epic_d, g721_c);
(b) last five applications (g721_d, gsm_c, gsm_d, mpeg2_c, mpeg2_d).
¹ Recall that execution time is inversely proportional to IPC:
ExecutionTime = CycleTime × #Instructions × (1/IPC).
Our proposed method exploits the relation between the amount of resources
given to the critical thread and the performance it obtains. When the OS level
job scheduler wants to execute a critical thread, given its WCET and a period P,
it simply computes the allowable performance slowdown, S, given by
$S = P / \mathrm{WCET}$.
For such a value of S, each instance of this job finishes before its deadline. Suppose
the real execution time of this instance is $T_i$. Then $T_i \le \mathrm{WCET}$, and hence
$S \cdot T_i = \frac{P}{\mathrm{WCET}} \cdot T_i \le \frac{P}{\mathrm{WCET}} \cdot \mathrm{WCET} = P$.
Hence, the value of S is critical information needed to establish a resource partitioning.
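As a small illustration of this computation, the sketch below derives S from the WCET and the period. It also reports the minimum relative IPC that the critical thread must sustain, which follows from the fact that execution time is inversely proportional to IPC, assuming that the instruction count and the cycle time do not change. The function names are hypothetical and not part of the proposed scheduler.

```python
def allowable_slowdown(wcet: float, period: float) -> float:
    """Allowable performance slowdown S = P / WCET."""
    if wcet <= 0 or period < wcet:
        raise ValueError("need 0 < WCET <= period")
    return period / wcet


def min_relative_ipc(wcet: float, period: float) -> float:
    """Smallest percentage of IPCalone at which every instance still meets P.

    Assumption: execution time scales as 1/IPC, so a slowdown of S
    corresponds to running at 100/S percent of the isolated IPC.
    """
    return 100.0 / allowable_slowdown(wcet, period)


if __name__ == "__main__":
    # Medium-utilization scenario: the period is twice the WCET.
    print(allowable_slowdown(1.0, 2.0), min_relative_ipc(1.0, 2.0))
```

For the medium-utilization scenario this gives S = 2, so under the stated assumption the critical thread may run at half of its isolated speed.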
There are the following two issues we need to address. First, we need to determine
which resources are being controlled by the resource allocator. Second, we need
to decide whether the job scheduler or the resource allocator determines the exact
amount of resources given to the critical thread. In the first case, a resource allocation
is fixed for the entire period a critical thread is executing. We call this approach
the static approach. In the second case, the resource allocator can dynamically vary
the amount of resources dedicated to the critical thread. We call this approach the
dynamic approach.
5.2 Resource allocator
The resource allocator controls the amount of resources that can be used by appli-
cations. It consists of a number of resource usage counters that track the amount
of resources used by each application, one counter per resource. These counters are
incremented each time a thread needs an additional instance of a resource and they
are decremented each time an instruction releases an instance of a resource. For each
thread in the SMT, there are also limit registers for each resource that contain the
maximum number of instances the thread is allowed to use. These limit registers can
be written by either the job scheduler in the static method or the resource allocator
itself in the dynamic method. If an application tries to use more resources than it
is assigned, its instruction fetch is stalled until resources are freed. In Section 7 we
further explain the hardware cost of the resource allocator.
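The bookkeeping described above can be modelled in software as follows. This is a behavioural sketch only, not the hardware design: the class name, the method names and the resource label `int_regs` are hypothetical, and a refused allocation stands for the fetch stall of the corresponding thread.

```python
class ResourceAllocator:
    """Software model of per-thread usage counters and limit registers."""

    def __init__(self, limits):
        # limits[thread][resource] -> maximum number of instances allowed
        self.limits = {t: dict(per_res) for t, per_res in limits.items()}
        # One usage counter per thread and per resource, initially zero.
        self.usage = {t: {r: 0 for r in per_res} for t, per_res in limits.items()}

    def set_limit(self, thread, resource, limit):
        """Write a limit register (by the job scheduler or the allocator)."""
        self.limits[thread][resource] = limit
        self.usage[thread].setdefault(resource, 0)

    def try_allocate(self, thread, resource):
        """Count one more instance if the limit allows it.

        Returns False when the thread is already at its limit, which models
        the fetch of that thread being stalled until resources are freed.
        """
        if self.usage[thread][resource] >= self.limits[thread][resource]:
            return False
        self.usage[thread][resource] += 1
        return True

    def release(self, thread, resource):
        """Decrement the counter when an instruction frees an instance."""
        if self.usage[thread][resource] == 0:
            raise ValueError("nothing to release")
        self.usage[thread][resource] -= 1


if __name__ == "__main__":
    alloc = ResourceAllocator({"CT": {"int_regs": 2}, "NCT": {"int_regs": 1}})
    print([alloc.try_allocate("CT", "int_regs") for _ in range(3)])
```

In the static method the limits would be written once by the job scheduler, whereas in the dynamic method `set_limit` would be called repeatedly by the allocator itself.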
5.3 Resources
The first step in our approach is to determine the set of shared resources that has to
be controlled to provide stability. In our architecture, the shared resources are the
following: the fetch bandwidth, the issue queues, the issue bandwidth, the physical
registers, the instruction cache, the L1 data cache, the unified L2 cache, and the
TLBs. We have conducted a number of experiments to see what the influence on
variability is when we partially dedicate each of these resources to the CT.
Figure 3: Cache interference introduced by the NCT as we vary the amount of resources
given to the CT. Panels: (a) instruction cache, (b) L1 data cache, (c) L2 cache. Each
panel plots misses for the ten media applications (adpcm_c, adpcm_d, epic_c, epic_d,
g721_c, g721_d, gsm_c, gsm_d, mpeg2_c, mpeg2_d), with one series for the application
running alone and one for each resource share shown (80%, 70%, 60%, 50%, 30%, 20%, 10%).
5.3.1 Caches and TLBs
Regarding TLBs, on average over all the experiments made in this paper, the number
of data TLB misses per instruction is $3.6 \times 10^{-4}$ and the number of instruction TLB
misses is $8.2 \times 10^{-7}$. Hence, the influence on the execution time of the CT is small.
For this reason we do not control TLBs.
Regarding caches, we measure for each multimedia application in all ten workloads
the average number of misses in each cache with respect to the number of
committed instructions as we vary the amount of resources given to it (see Figure 3).
We observe that there is an increase in cache miss rate caused by interference by
another application in the workload. We observe as well that this interference is
lower when the amount of resources given to the CT is higher and vice-versa. This is
caused by the fact that if the CT is allowed to use many resources, the NCT executes
slower and hence uses caches less frequently and thus produces less interference.
The absolute number of misses per committed instruction is low: lower than
$2 \times 10^{-4}$ for the icache, $7 \times 10^{-3}$ for the data cache, and $4 \times 10^{-3}$ for the L2 cache.
We can draw two conclusions from these figures. First, in the icache there is almost
no interference between the CT and the NCT. Second, the interference introduced
in the caches by a non-critical thread in a workload is so small that we expect that
this only slightly affects the execution time of multimedia applications. As a result,
we do not need to control how caches are shared between the threads.
5.3.2 Other resources
We systematically measured the effect of controlling the resources other than the
caches. We looked at the following resource partitions. Nothing means that we do not
control any resource inside the SMT: resources are implicitly shared as determined
by the default fetch policy. Fetch means that we prioritize the CT when fetching
instructions from the instruction cache. Queues and Registers mean that we give
a fixed number of entries of that resource to the CT. Furthermore, we evaluated all
combinations of these resources (see footnote 2).
Footnote 2: Controlling the issue bandwidth changes the results only slightly. For this
reason we do not show its results.
In Figure 4(a), we show for the gsm decoder benchmark its actual IPC values
for two possible ways to partition resources: when we prioritize instruction fetch and
when we moreover partition the registers and the issue queues. From this figure, it
is immediately clear that controlling the instruction fetch alone gives little control
over the speed of the CT and the variability in IPC is large. On the other hand,
controlling queues, registers and fetch does give much control over the speed of the
CT and hence the variability is low.
In order to measure the sensitivity of the variability to resource partitioning more
systematically, we proceed as follows. We used all pairs of media benchmarks as CT
and SPEC2000 benchmarks as NCT, and measured the execution times of the CT. For
each CT from the MediaBench suite, we obtained 8 numbers, one for each SPEC
benchmark. We computed the mean and the standard deviation of these numbers,
and the ratio of standard deviation to mean. In this way, we obtain a measure of the
variability in execution time of a CT as we change the NCT, expressed relative to the
average execution time. Because this measure is relative, we can average it over all
critical threads. This final average is the overall measure of variability used in this study.
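The variability measure described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample execution times are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def relative_variability(exec_times):
    """Standard deviation of one CT's execution times divided by their mean."""
    return pstdev(exec_times) / mean(exec_times)

def overall_variability(times_by_ct):
    """Average the per-CT ratio over all critical threads."""
    return mean(relative_variability(t) for t in times_by_ct.values())

# Hypothetical execution times (arbitrary units), one entry per NCT partner.
example = {
    "ct_a": [100, 100, 100, 100],  # unaffected by the NCT
    "ct_b": [90, 110, 90, 110],    # deviation of 10 around a mean of 100
}
print(overall_variability(example))
```

Because each ratio is dimensionless, critical threads with very different absolute execution times contribute equally to the final average.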
In Figure 4(b) we show these results. We can immediately observe that when we
do not control resource allocation, we get a high variability of 40%. This can be
interpreted as that in many cases the difference in execution time of a CT in an
arbitrary context can be as high as 40% of the total execution time or even higher.
If we only prioritize the instruction fetch of the CT, this variability is hardly reduced.
The most important resources to control are the registers and the issue queue entries.
The best results are obtained when we control everything: we give the CT priority
in instruction fetch and reserve a certain amount of registers and issue queue entries
for it. These are the resources controlled by our Resource Allocator below.
Regarding the percentage of resources given to the CT, we show that as we
decrease this amount the variability increases. For low percentages (10%, 20%)
variability is even higher than when no control is carried out. This is mainly because,
when the CT uses few resources, the NCT executes more instructions, causing more
interference. In addition, every time the CT misses in the cache it does not have enough
resources to hide the latency, even for L1 data cache misses. Hence, we conclude that
the minimum amount of resources reserved for the CT is 20%.
[Figure 4(a): plot not reproduced. x-axis: workloads I21, I22, I23, I24, M21, M22, M23, M24, shown once for Fetch Control and once for Fetch, IQs & Regs Control; y-axis: IPC of the gsm_d benchmark, 0 to 5.]
(a) Raw data when we control only the fetch, and when we control the fetch, the IQs and registers.
[Figure 4(b): plot not reproduced. x-axis: no control, Fetch (F), Queues (Q), Q&F, Registers (R), R&F, R&Q, R&Q&F; y-axis: Average (Average Absolute Deviation / Mean), 0 to 50; legend: 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%.]
(b) Average of the fraction (Standard Deviation / Mean).
Figure 4: Variability in the IPC of the CT for different workloads as we vary the
amount of resources under control.
5.4 Static approach
In this section, we discuss our static approach to resource partitioning. In this
approach, the job scheduler computes a priori the resource partitioning that is used
throughout the entire period of the critical thread. In Figure 2, we have plotted the
relation between the amount of resources dedicated to the CT and the performance
it obtains. It clearly follows that, for all benchmarks considered in this paper, the
relation between performance and amount of dedicated resources is super-linear.
That is, if we dedicate X% of the resources, we obtain more than X% of $IPC_{alone}$,
and in some cases much more. Since the job scheduler knows the WCET of the
critical thread and the period in which it should execute, it knows the slowdown the
CT can suffer: the slowdown factor $S = P / WCET$ discussed above. Hence, given this
fraction S of the performance of the CT, it needs to compute a function f(S) = Y
to determine that the CT needs Y% of the resources to obtain this performance. We
call such a function a performance/resource function or p/r function. In this paper,
we have experimented with different p/r functions. Hence, when the job scheduler
assembles a workload with a certain application as critical thread, it computes the
value of S and determines the corresponding value f(S) for a p/r function f. Then
it instructs the resource allocator to reserve f(S)% of the resources for this critical
thread. In the next subsection, we discuss performance/resource functions in more
detail.
5.4.1 Performance/resource functions
Figure 5(a) shows the actual p/r relation of each thread and several p/r functions
that approximate this actual p/r relation, which can be used by the job scheduler.
These functions are plotted as circles in the figure. We show several functions that
are given by $r = f(p) = p^{1/value}$ for value equal to 1 (linear), 0.7, and 0.4. For
lower values, the amount of resources given to the CT is reduced and the actual p/r
relation is better approximated. This may be positive since we allow the NCT to use
more resources. However, this may also compromise the success rate.
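As a minimal sketch of this family of p/r functions, the snippet below evaluates $r = p^{1/value}$ for the three values mentioned in the text. The function name `power_pr` and its argument checks are illustrative assumptions; p and r are expressed as fractions in [0, 1], as on the axes of Figure 5(a).

```python
def power_pr(p, value):
    """Resource share r = p ** (1 / value) for a target relative IPC p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if value <= 0.0:
        raise ValueError("value must be positive")
    return p ** (1.0 / value)

# Resource share reserved for a CT that must reach half of its isolated IPC.
for value in (1.0, 0.7, 0.4):
    print(value, round(power_pr(0.5, value), 3))
```

For p = 0.5 the three functions reserve roughly 50%, 37%, and 18% of the resources, which illustrates why lower values leave more resources to the NCT while putting the success rate at risk.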
For our experiments discussed in the next section, we use the p/r functions described
above. We moreover use another p/r function that is more directly based on
the graphs shown in Figure 2, called adhoc. In order to construct this function, we
determine for each multimedia benchmark and for each possible value of p the value
of r by reading the curves from x-axis to y-axis. Figure 5(b) shows two examples of
how the function adhoc is determined. The diamonds show the actual p/r relation
for the adpcm_c and gsm_d benchmarks. The circles show the approximation
we used. Note that this approximation is slightly larger than the corresponding
value in the curve in Figure 2 in order to take into account interference by the NCT,
mainly for low values of p.
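One possible way to mechanize this curve reading is sketched below. It is not the authors' procedure: the function `adhoc_pr`, the constant safety margin, and the sample curve are all hypothetical and stand in for values that would be read from Figure 2.

```python
def adhoc_pr(curve, p, margin=0.05):
    """Smallest measured resource share that reaches relative IPC p, plus a margin.

    curve: (r, p_measured) pairs sorted by increasing r, both in [0, 1].
    margin: assumed allowance for interference by the NCT.
    """
    for r, p_measured in curve:
        if p_measured >= p:
            return min(1.0, r + margin)
    return 1.0  # target not reachable on the measured curve: reserve everything

# Made-up curve for one benchmark running alone: (resources, relative IPC).
curve = [(0.1, 0.30), (0.2, 0.55), (0.4, 0.80), (0.7, 0.95), (1.0, 1.00)]
print(adhoc_pr(curve, 0.5))
```

A per-benchmark table of this kind is cheap to evaluate in the job scheduler; a margin that grows for small p would follow the text more closely than the constant used here.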
5.5 Dynamic approach
In this approach, the resource allocator dynamically determines the amount of resources
given to the critical thread. The main advantage of this method is that it
adapts to program execution phases, increasing overall performance. In the next
section, we show that the dynamic approach provides better results than the static
approach, but it is only suitable if the application under consideration has a number
of characteristics that we discuss in more detail below. The main advantage of the
static approach is that it can always be used.
The mechanism is based on the observation that in order to realize X% of the overall
IPC for a given job, it is sufficient to realize X% of the maximum possible IPC at
every instant throughout its execution. In the present case, if we want to slow down
a task by a factor S, it is sufficient to slow it down by a factor S at every
instant. The dynamic approach that exploits this observation is a simplification of
the method we proposed in [5], called Predictable Performance or PP.
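The observation can be checked with a short derivation. The symbols are introduced here for illustration only and are assumptions, not notation from the paper: the task has $N$ instructions, $a_n$ is the IPC it would achieve alone around instruction $n$, and $x = X/100$ is the fraction of that IPC realized at every point of the execution.

```latex
\begin{align*}
T_{alone} &= \sum_{n=1}^{N} \frac{1}{a_n}, &
IPC_{alone} &= \frac{N}{T_{alone}}, \\
T_{x} &= \sum_{n=1}^{N} \frac{1}{x \, a_n} = \frac{T_{alone}}{x}, &
IPC_{x} &= \frac{N}{T_{x}} = x \cdot IPC_{alone}.
\end{align*}
```

Under these assumptions, scaling the instantaneous IPC by $x$ everywhere stretches the execution time by $1/x$ and therefore scales the overall IPC by exactly $x$.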
The resource allocator distinguishes two phases that are executed in alternate
fashion. We briefly describe these phases below. For more information, please
consult [5].
During the first phase, the sample phase, all resources under control are given to
the CT and the NCT is temporarily stopped. As a result, we obtain an estimate of
the current $IPC_{alone}$ of the CT, which we call the local $IPC_{alone}$. The sample phase
starts with a warm-up period of 50,000 cycles that is used to remove pollution by
the NCT from the shared resources. Next, we measure the $IPC_{alone}$ of the critical
thread over a period of 10,000 cycles.
[Figure 5(a): plot not reproduced. x-axis: (p) relative IPC, 0.0 to 1.0; y-axis: (r) amount of resources, 0.0 to 1.0; series: adpcm_c, adpcm_d, epic_c, epic_d, g721_c, g721_d, gsm_c, gsm_d, mpeg2_c, mpeg2_d, linear (power 1), power 0.7, power 0.4.]
(a) Different arguments for the Power function
[Figure 5(b): plot not reproduced. x-axis: (p) relative IPC (%), 0 to 100; y-axis: (r) amount of resources (%), 0 to 100; series: adpcm_c real, adpcm_c approx., gsm_d real, gsm_d approx.]
(b) Adhoc function
Figure 5: Different performance/resources functions
During the second phase, the tune phase, the NCT is allowed to run as well. Our
mechanism dynamically varies the amount of resources given to the CT to achieve
an IPC that is equal to the local $IPC_{alone} \times S$. The tune phase lasts 300,000 cycles.
It is divided into periods of 15,000 cycles during which the realized IPC of the critical
thread is measured. If this measured value is lower than required, the CT is assigned
more resources. If, on the other hand, it is higher than required, resources are taken
away from the CT and given to the NCT.
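The feedback rule of the tune phase can be sketched as follows. The step size, the starting share, and the toy performance model `toy_ipc` are assumptions made purely for illustration, since the text does not specify them. The lower bound of 20% reuses the minimum reservation concluded in Section 5.3.2.

```python
def tune_step(measured_ipc, target_ipc, ct_share, step=0.05,
              min_share=0.2, max_share=1.0):
    """One tune period: return the CT's new resource share."""
    if measured_ipc < target_ipc:
        ct_share += step  # CT too slow: assign it more resources
    elif measured_ipc > target_ipc:
        ct_share -= step  # CT faster than required: give resources to the NCT
    return max(min_share, min(max_share, ct_share))

def toy_ipc(share, ipc_alone=2.0):
    # Made-up concave model of CT performance versus resource share.
    return ipc_alone * share ** 0.5

def run_tune_phase(s, periods=20, ipc_alone=2.0):
    target = ipc_alone * s  # local IPC_alone x S, as in the text
    share = 1.0             # assumed start: all resources on the CT
    for _ in range(periods):  # 300,000 / 15,000 = 20 periods
        share = tune_step(toy_ipc(share, ipc_alone), target, share)
    return share, toy_ipc(share, ipc_alone), target

print(run_tune_phase(0.5))
```

With this toy model the share drifts down from 1.0 and then hovers around the value that yields the target IPC, moving by at most one step per period.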
The main difference between the present dynamic approach and the Predictable
Performance mechanism from [5] is that the present approach controls fewer resources.
In particular, we do not exercise control over the caches, in contrast to PP
in which L2 cache miss rates of the critical thread are monitored and this information
is used to dedicate part of the L2 cache exclusively to the critical thread. Moreover,
in PP the real value of $IPC_{alone}$ of the CT is used to compute resource allocations in
the tune phase. This value has to be provided by the OS, in contrast to the present
approach that does not require this value.
The main difficulty in our dynamic method is to measure accurately the local
$IPC_{alone}$ of the CT, due to the pollution created by the NCT in the shared resources.
As we have shown in [5], the main source of interaction between the CT and the NCT
is the L2 cache. This pollution stays for a long time, up to 5 million cycles. We have
analyzed the pollution caused by the NCT in this resource in detail. We found that
for multimedia applications the measured value of the $IPC_{alone}$ is 1% lower than the
real value. For the SPEC benchmarks used in [5], it is 8% lower. The main reason for
this is that media applications have a smaller working set than SPEC benchmarks. For
this reason, an NCT does not interfere as much with media applications as with SPEC
benchmarks. As a result, we can use a simpler resource partitioning algorithm
for media applications than the algorithm from [5] that is geared toward general
purpose applications.
We conclude that, if applications under consideration have a small working set
in comparison with the L2 cache size, then they are unlikely to be affected by an
NCT with a much larger working set. As a result, the IPC measured in the sample
phase is closer to the actual $IPC_{alone}$, which allows us to leave the L2 miss rate out
of consideration, thereby considerably simplifying the mechanism. If this condition
is not satisfied, the dynamic approach cannot be applied and we have to resort to
either the static approach discussed above or to the expensive mechanism described
in [5].
To summarize, in the dynamic approach, the job scheduler provides the value
S to the resource allocator. Next, the resource allocator determines the $IPC_{alone}$
of that instance of the task during a sample phase and reduces its IPC by a factor
of S during the subsequent tune phases. This implies that the CT can meet its
deadline and that we minimize the amount of resources given to the CT, enabling
high performance of the NCT.
6 Simulation Results
In this section, we present the results of the static and dynamic approaches. Moreover,
we show the results obtained using the Predictable Performance mechanism we
presented in [5] and a fetch-control-like mechanism [2][9].
6.1 Static method
Figure 6 shows the success rate and the performance for the different p/r functions
used in the static method. In Figure 6(a), bars show the success rate, measured
on the left y-axis, and lines show the Mean5WorstVariance, measured on the right
y-axis.
In Figure 6(a), we can see that the linear relation provides the best success rate.
We also observe that when the p/r function is more aggressive, the success rate
decreases. This is intuitively clear, since we reduce the amount of resources given
to the CT. All functions, except the function $f(p) = p^{1/0.4}$, achieve a good success
rate and Mean5WorstVariance. As we move from high to low utilization scenarios,
the success rate improves. On average, the success rate is 0.9875, 0.979, and 0.671
for the linear, $f(p) = p^{1/0.7}$, and $f(p) = p^{1/0.4}$ functions, respectively. For the adhoc
function the success rate is 0.975. The Mean5WorstVariance is 1%, 2.6%, 87%,
and 2.4% for the linear, $p^{1/0.7}$, $p^{1/0.4}$, and adhoc functions, respectively.
Figure 6(b) shows the average IPC of the NCT, averaged over all experiments.
We see that the function $f(p) = p^{1/0.4}$ achieves the best performance results. However,
this is at the cost of success rate. Hence, we conclude that this function is
too aggressive. From the other p/r functions, we observe that as we increase the
aggressiveness of the p/r function, we obtain more performance for the NCT. This
is because more aggressive p/r functions provide the CT with fewer resources.
[Figure 6(a): plot not reproduced. x-axis: high, medium, low; left y-axis: Success Rate, 0.0 to 1.0; right y-axis: variation with respect to deadline (%), 0 to 200; bars: linear (power 1), power 0.7, power 0.4, adhoc; lines: Mean5WorstVariance.]
[Figure 6(b): plot not reproduced. x-axis: high, medium, low; y-axis: NCT Performance, 0.0 to 2.0; series: linear, power 0.7, power 0.4, adhoc.]
(a) success rate and Mean5WorstVariance (b) Throughput
Figure 6: Success rate and throughput for different p/r functions
We conclude that the adhoc performance/resource function performs best. If this
function is too difficult to obtain in some circumstances, the p/r function $f(p) = p^{1/0.7}$
performs only slightly worse. Recall that a p/r function is computed by the OS level
job scheduler and hence is implemented in software. Hence we can easily use complex
functions like the functions discussed above.
6.2 Dynamic method
In this section, we present the results of the dynamic method and moreover compare
this mechanism with previous approaches. For this experiment, we have also considered
the Predictable Performance mechanism used in [5] and a prioritization-aware
fetch policy [2][9] which we call fetch control in this section. This mechanism always
prioritizes the CT when fetching instructions. We also compare our results with the
adhoc p/r function for the static method.
Figure 7(a) shows the success rate for the different approaches. If we just control
fetch, we see that we obtain a low success rate and a high Mean5WorstVariance, even for
the low utilization scenario. The Predictable Performance approach obtains a success
rate of 1, and hence a Mean5WorstVariance of 0. This is mainly due to the fact that
this policy requires knowledge of the $IPC_{alone}$ of each CT, which allows it to compute
dynamically how far the current IPC of the CT is from the target IPC. In this way,
this mechanism can converge to the target IPC. Our mechanism does not require
this information but achieves a success rate of 0.9875 nevertheless. It also has a low
Mean5WorstVariance of 1.63%. In the static method using the adhoc p/r function,
the success rate is 0.975 and the Mean5WorstVariance is 2.4%.
[Figure 7(a): plot not reproduced. x-axis: high, medium, low; left y-axis: Success Rate, 0.0 to 1.0; right y-axis: variation with respect to deadline (%), 0 to 350; bars: Fetch control, Resource Allocator (static), Resource Allocator (dynamic), Predictable Performance; lines: Mean5WorstVariance.]
[Figure 7(b): plot not reproduced. x-axis: high, medium, low; y-axis: NCT IPC, 0.0 to 2.0; series: Fetch control, Resource Allocator (static), Resource Allocator (dynamic), Predictable Performance.]
(a) success rate and Mean5WorstVariance (b) Throughput
Figure 7: Success Rate and throughput of our approach and previous approaches
Regarding throughput, we can see that our dynamic method achieves the same
performance as Predictable Performance. Our static method achieves 9% less performance,
but still much more performance than the fetch control method, up to 56%
more in the low utilization scenario.
6.3 Discussion
In this section, we briefly summarize the results presented above and discuss some
other issues, namely, the cost and the applicability of the mechanisms discussed
above.
Concerning the cost of the different mechanisms, the fetch control mechanism
proposed in [3][9] only prioritizes the fetch of the critical thread and hence has the lowest
cost. Our static mechanism needs to keep track of how many resources are used by
each thread and hence is more expensive. However, the cost of doing this is not
high, as we show in Section 7. Since the required percentage that the critical thread
should receive is provided by the job scheduler, the base resource allocator is enough
to implement our static method. Our dynamic method is more complex. Apart from
the resource allocator that is required to ensure that threads do not exceed their
share of the resources, a mechanism in hardware is required to sample the $IPC_{alone}$
of the critical thread during the sample phase and to periodically determine the
IPC of this thread during the tune phase. Moreover, as we show in Section 7, logic
is required to suspend the NCT during the sample phase and to adjust resource
allocation during the tune phase. Finally, Predictable Performance requires all this
plus extra logic to monitor the L2 cache. Moreover, the logic required to determine
resource allocation after the sample phase is more complex than the logic required
by our dynamic mechanism proposed in the present paper.
Table 3: Comparing approaches shown in this paper
Metric          fetch control  static     dynamic  PP
Success rate    -- (<75%)      ++ (>95%)  ++       ++
Throughput NCT  --             +          ++       ++
Applicability   ++             ++         +        -
Cost            --             -          -        +
Concerning the applicability of the various approaches, both fetch control and our
static method can be used for all applications. Our dynamic method requires that
the non-critical thread does not interfere too much with the critical thread in the
L2 cache. Predictable Performance requires that all instances of an application have
more or less the same IPC. Fortunately, it has been shown [6] that media applications
have these properties so that all approaches discussed in this paper can be applied.
Hence we can summarize all aspects in Table 3. Symbols used in this table mean:
(++) very high, (+) high, (-) low, (--) very low.
Depending on the properties of applications that need to run on the system,
the amount of hardware available to provide soft real-time functionality, and the
required success rate, a designer of an embedded, real-time system can choose one of
the alternatives discussed in this paper. If there is hardly any room to implement
a real-time mechanism, fetch control can be used, which has a poor success rate but
costs next to nothing. If there is a modest amount of real estate available and the
success rate must be reasonably high, our static method is best suited. If, on the
other hand, the success rate must be 1, then Predictable Performance can be used,
at the cost of a complex implementation. In situations in between, our dynamic
method may be a good candidate.
7 Implementation
The two proposals shown in this paper require some hardware to control the amount
of resources given to the CT and NCT. Next, we present the hardware changes to
our baseline architecture that provide this functionality. Finally, we show how the OS
and the hardware collaborate in order to meet the timing requirements.
7.1 Hardware to control resource allocation
The objective of this hardware is to ensure that the CT is allowed to use at least a
given amount of each shared resource. This hardware performs three tasks: track,
compare, and stall.
• Track: in order to track the amount of resources used by each thread we need
a resource usage counter for each resource under control, for both the CT and
the NCT. Each counter tracks the number of slots of that resource that the
thread holds. Figure 8 shows the counters required for a 2-context SMT if
we track the physical registers. Resource usage counters are incremented in
the decode stage (indicated by (1) in Figure 8). Register usage counters are
decremented when the instruction commits (2). All added registers are special
purpose registers. They do not belong to the register file, whose design is left
unchanged with respect to the baseline architecture. The implementation cost of
these counters depends on the particular architecture. However, we think that it
is low, since current processors already have several tens of performance and event
counter registers; e.g., the Intel Pentium 4 has more than 80 performance and
event counters [1].
• Compare: for each resource under control, our mechanism also needs two registers
that contain the maximum number of entries that the CT and the NCT are
entitled to use. We call these registers limit registers. These registers are
modified by the OS periodically, as we will see in Section 7.2. In the example
shown in Figure 8 we need 4 counters: one for the fp registers and one for the
integer registers for both the CT and NCT. Every cycle we compare the resource
usage counters of each thread with the limit registers. If a thread is using
more slots than given to it, a signal is sent to the fetch stage.
• Stall: in the fetch stage we receive one signal for each thread from the compare
logic indicating whether or not this thread is using more slots of any resource
than given to it. If this signal is activated, the fetch mechanism does not fetch
instructions from that thread until the number of entries used by this thread
decreases. Otherwise the thread is allowed to compete for the fetch bandwidth
as determined by the fetch policy.
[Figure 8: block diagram not reproduced. It shows the PCs, the Instruction Cache, the Fetch, Decode 1, Decode 2 and Rename stages, and the Instruction Queues; the CT's and NCT's usage counters (int, fp); the CT's and NCT's limit registers (int, fp), updated by the OS; four comparators (CMP) feeding two OR gates; and the "Instruction is committed" signal. Markers (1) to (4) label the steps.]
Figure 8: Hardware required to implement our mechanism
We also control the occupancy of the issue queues. Hence, 3 limit registers and
3 queue usage counters are required (one for each queue: integer, floating point, and
load/store issue queue). In this case the queue usage counters are decremented when
instructions are issued from the issue queues.
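The track, compare, and stall tasks described above can be mimicked by a small software model. This is only a behavioural sketch, not the hardware design: the resource names, the class `ThreadState`, and the limit values are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical names for the controlled resources: two register files, three queues.
RESOURCES = ("int_regs", "fp_regs", "int_queue", "fp_queue", "ls_queue")

@dataclass
class ThreadState:
    limit: dict  # limit registers, written by the OS
    usage: dict = field(default_factory=lambda: {r: 0 for r in RESOURCES})

    def allocate(self, resource):
        # Track: a usage counter is incremented in the decode stage.
        self.usage[resource] += 1

    def release(self, resource):
        # Track: decremented at commit (registers) or at issue (queues).
        self.usage[resource] -= 1

    def over_limit(self):
        # Compare: is any resource used beyond its limit register?
        return any(self.usage[r] > self.limit[r] for r in RESOURCES)

def fetchable(threads):
    """Stall: threads over their limit are excluded from fetch this cycle."""
    return [name for name, t in threads.items() if not t.over_limit()]

threads = {
    "CT": ThreadState(limit={r: 8 for r in RESOURCES}),
    "NCT": ThreadState(limit={r: 2 for r in RESOURCES}),
}
for _ in range(3):
    threads["NCT"].allocate("int_regs")  # NCT now holds 3 > 2 integer registers
print(fetchable(threads))                # the NCT is stalled
threads["NCT"].release("int_regs")       # one NCT instruction commits
print(fetchable(threads))                # both threads compete for fetch again
```

Setting every limit to the full size of its resource makes `over_limit` always false, so the model degenerates to an ordinary SMT front end driven only by the fetch policy.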
7.2 OS/hardware collaboration
For the static approach the OS only has to update the values of the (special purpose)
limit registers in order to meet the tasks' deadlines. When the OS provides
the task to be executed it also sets the value of these registers. Note that, if the limit
registers of the CT and the NCT are set to the maximum number of resources, no
thread is stalled. That is, we would have a standard SMT processor guided by the
fetch policy.
For the dynamic approach the OS sets the percentage of the $IPC_{alone}$ of the CT
that the hardware has to achieve. This requires the addition of one register. In
addition to the previous hardware we use a Finite State Machine (FSM) to dynamically
change the resource allocation so that the IPC of the CT converges to the target IPC.
This FSM is quite simple and can be implemented with four counters and simple control logic.
The FSM starts by giving all resources to the CT. This is done by simply setting its
entries in the limit registers to the number of entries of each resource, and resetting
the entries of the NCT. Next, at the end of the warm-up phase, we begin to compute
the IPC of the CT. At the end of the actual sample phase, we compute the local
target IPC and set the resource allocation to converge to the local target IPC. At
the end of each tune sub-phase, we vary the resource allocation again so that the
IPC of the CT converges to the target IPC.
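The phase sequencing of such an FSM can be sketched with the cycle counts given in Section 5.5. The function `phase_at` and the state names are hypothetical, and the sketch assumes that a new warm-up starts immediately after each tune phase, in line with the alternation of phases described earlier.

```python
WARMUP, SAMPLE, TUNE = "warm-up", "sample", "tune"

WARMUP_CYCLES = 50_000   # remove NCT pollution from the shared resources
SAMPLE_CYCLES = 10_000   # measure the local IPC_alone of the CT
TUNE_CYCLES = 300_000    # NCT runs; allocation is adjusted
TUNE_PERIOD = 15_000     # length of one tune sub-phase

def phase_at(cycle):
    """Return (phase, tune sub-phase index or None) for a cycle count since start."""
    round_length = WARMUP_CYCLES + SAMPLE_CYCLES + TUNE_CYCLES
    offset = cycle % round_length
    if offset < WARMUP_CYCLES:
        return WARMUP, None
    if offset < WARMUP_CYCLES + SAMPLE_CYCLES:
        return SAMPLE, None
    return TUNE, (offset - WARMUP_CYCLES - SAMPLE_CYCLES) // TUNE_PERIOD

for cycle in (0, 50_000, 60_000, 359_999, 360_000):
    print(cycle, phase_at(cycle))
```

With these constants each round lasts 360,000 cycles, of which 60,000 are spent with the NCT stopped, and the tune phase contains 20 sub-phases.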
8 Conclusions
In this paper, we have proposed two novel approaches to the problem of enabling SMT
processors for soft real-time systems. The main problem of using SMT processors
in real-time systems is that in an SMT processor threads share almost all hardware
resources. This may cause interference between threads, which implies that the
speed a thread obtains in one workload can be very different from the speed it has
in another workload [4]. In contrast to previous approaches, our methods do not
require any knowledge beyond information that is traditionally used by the OS level
job scheduler, namely, the Worst Case Execution Time and the Period of the time-critical
thread. Neither do our methods require extensive profiling of candidate workloads
like some other methods do [7][9]. Unlike previous approaches, which are IPC based,
our methods are based on resource partitioning: they reserve a minimum fraction
of all resources for the critical thread so that it can just reach its deadline. In this
way, the non-critical threads also receive as many resources as possible so that their
throughput is maximized at the same time. In the first method, the job scheduler
determines the fraction of resources dedicated to the critical thread and this fraction
is fixed during the entire period. In the second method, the resource allocator, a
hardware extension of the SMT, dynamically adjusts the amount of resources for the
critical thread, thereby adapting to program phases, which increases the throughput of
the non-critical threads even more. We have compared our approaches to two previously
published mechanisms, namely, fetch control [2][9] and Predictable Performance [5].
We have shown that we significantly outperform fetch control and are almost as good
as Predictable Performance while using a much less complex mechanism. On average, the
critical thread meets its deadline in 98% of the cases considered in this paper. We
have discussed the pros and cons of all 4 mechanisms to explore the design space of
real-time embedded SMT processors in detail.
Acknowledgments
This work has been supported by the Ministry of Science and Technology of Spain
under contracts TIC-2001-0995-C02-01, TIC-2004-07739-C02-01, and grant FP-2001-
2653 (Francisco J. Cazorla), the HiPEAC European Network of Excellence, and an
Intel fellowship. The authors would like to thank Peter Knijnenburg for his comments
and Oliverio J. Santana, Ayose Falcón, and Fernando Latorre for their work in the
simulation tool. The authors also would like to thank Jim Smith for his help and
technical comments.
References
[1] IA-32 Intel Architecture Software Developer's Manual. Volume 3: System
Programming Guide.
[2] A. Anantaraman, K. Seth, K. Patil, E. Rotenberg, and F. Mueller. Virtual
simple architecture (VISA): Exceeding the complexity limit in safe real-time
systems. Proc. of the 30th Annual Intl. Symposium on Computer Architecture,
pages 350–361, June 2003.
[3] A. Anantaraman, K. Seth, K. Patil, E. Rotenberg, and F. Mueller. Exploiting
VISA for higher concurrency in safe real-time systems. Technical Report
TR-2004-15, North Carolina State University, 2004.
[4] F.J. Cazorla, P.M.W. Knijnenburg, E. Fernandez, R. Sakellariou, A. Ramirez,
and M. Valero. Implicit vs. explicit resource allocation in SMT processors. In
Symposium on Digital System Design. Invited Paper, 2004.
[5] F.J. Cazorla, P.M.W. Knijnenburg, E. Fernandez, R. Sakellariou, A. Ramirez,
and M. Valero. Predictable performance in SMT processors. ACM International
Conference on Computing Frontiers, pages 171–182, 2004.
[6] C.J. Hughes, P. Kaul, S.V. Adve, R. Jain, C. Park, and J. Srinivasan. Variability
in the execution of multimedia applications and implications for architecture.
Proc. of the 28th Annual Intl. Symposium on Computer Architecture,
pages 254–265, 2001.
[7] R. Jain, C.J. Hughes, and S.V. Adve. Soft real-time scheduling on simultaneous
multithreaded processors. Proc. of the 23rd IEEE Real-Time Systems
Symposium, pages 134–145, Dec 2002.
[8] M. Levy. Multithreaded technologies disclosed at MPF. Microprocessor
Report, Nov 2003.
[9] A. Snavely, D.M. Tullsen, and G. Voelker. Symbiotic job scheduling with
priorities for a simultaneous multithreaded processor. ACM SIGMETRICS, pages
234–244, June 2002.
[10] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting choice:
Instruction fetch and issue on an implementable simultaneous multithreading
processor. Proc. of the 23rd Annual ISCA, April 1996.
[11] D.W. Wall. Limits of instruction-level parallelism. Proc. of the 4th Intl.
Conf. on Architectural Support for Programming Languages and Operating
Systems, pages 176–188, April 1991.
