ITCA: Inter-Task Conflict-Aware CPU accounting for CMP by Luque, Carlos et al.
ITCA: Inter-Task Conﬂict-Aware CPU Accounting for CMPs
Carlos Luque1, Miquel Moreto1,2, Francisco J. Cazorla1,3,
Roberto Gioiosa1 and Mateo Valero1,2
1Barcelona Supercomputing Center (BSC), Barcelona, Spain. {carlos.luque, roberto.gioiosa}@bsc.es
2Universitat Politècnica de Catalunya (UPC), Barcelona, Spain. {mmoreto, mateo}@ac.upc.edu
3Spanish National Research Council (IIIA-CSIC), Spain. {francisco.cazorla}@bsc.es
Abstract
Chip-MultiProcessors (CMP) introduce com-
plexities when accounting CPU utilization to
processes because the progress done by a pro-
cess during an interval of time highly depends
on the activity of the other processes it is co-
scheduled with. We propose a new hardware
CPU accounting mechanism to improve the
accuracy when measuring the CPU utilization
in CMPs and compare it with previous ac-
counting mechanisms. Our results show that
currently known mechanisms lead to a 16%
average error when it comes to CPU utilization
accounting. Our proposal reduces this error to
less than 3% in a modeled 8-core processor
system.
1. Introduction
The Operating System (OS) provides the user
with an abstraction of the hardware resources.
The user application perceives this abstraction
as if it is using the complete hardware while, in
fact, the OS shares hardware resources among
the user applications. Hardware resources can
be shared temporally and spatially. Hardware
resources are time shared between users when
each task can make use of a resource for a lim-
ited amount of time (for example, the exclu-
sive use of a CPU). Orthogonally, hardware re-
sources can be shared spatially when each task
makes use of a limited amount of resources,
such as the cache memory or the I/O band-
width.
Figure 1: Total (real) and accounted
(sys+user) time of swim in different workloads
running on an Intel Xeon Quad-Core CPU
The execution time of an application is
influenced by the amount of hardware resources
shared with the other running applications.
It is also affected by how long the application
runs with other applications. However, the
time accounted to that application is always the
same regardless of the workload¹ in which it is
executed, i.e., regardless of how many applications
are sharing the hardware resources at any
given time. We call this principle the Principle
of Accounting. Unix-like systems differentiate
the real execution time and the time an ap-
plication actually is running on a CPU. Com-
mands such as time or top provide three val-
ues: real, user and sys. Real is the elapsed wall
clock time between invocation and termina-
tion of the application; user is the time spent
by the application in the user mode; and sys
is the time spent in the kernel mode on behalf
of the application. In these systems, sys+user
time is the execution time accounted to the
application.
¹ A workload is a set of applications running
simultaneously on a CPU.
Figure 1 shows the total (real) and the
accounted execution time (sys+user) of the
171.swim (or simply swim) SPEC CPU 2000
benchmark [1] when it runs in diﬀerent work-
loads. In this ﬁgure, the time results are nor-
malized to the real execution time of swim
when it runs in isolation (ISOL). For this ex-
periment, we use an Intel Xeon Quad-Core
processor at 2.5 GHz (though the general
trends drawn from Figure 1 apply to all cur-
rent CMPs), which has four cores in the chip
on which we run Linux 2.6.18. We move all the
OS activity to the ﬁrst core, leaving the oth-
er cores as isolated as possible from OS noise.
When swim runs alone in one of the isolated
cores, it completes its execution in 117 sec-
onds. However, when swim runs together with
other applications in the same core, its real
execution time increases up to 4x due to con-
text switches done by the OS (black triangles
in Figure 1). Nevertheless, swim is accounted
roughly the same time (grey triangles), which
is the time the application actually uses the
CPU. Applications may suﬀer some delay be-
cause they lose part of the cache and TLB con-
tents on every context switch, but this eﬀect
is small in this case. Hence, even if swim's to-
tal execution time increases depending on the
other applications it is co-scheduled with, the
time accounted to swim is always the same.
However, processors with shared on-chip resources,
such as CMPs [2], make CPU accounting
more complex because the progress of an
application depends on the activity of the other
applications running at the same time. Current
OSs still use this classical accounting approach
(the CA) for multicore processors, which can lead
to inaccuracy in the
time accounted to each application. In order
to show this inaccuracy, in a second experi-
ment, we use all the cores in the Intel Xeon
Quad-Core processor. Next, we execute swim
with several workloads as shown by the x-axis
in Figure 1. In this case, swim suffers no time
sharing and real time is roughly the same as
sys+user because the number of running tasks
is equal to or less than the number of
virtual CPUs (cores) in the system. In Figure 1,
the grey circles show a variation of up to
2x in the time accounted to swim depending
on the workload in which it runs. This means
that (at least with current known open source
OSs such as Linux) an application running on
a CMP processor may be accounted diﬀerent
CPU utilization according to the other appli-
cations running on the same chip at the same
time. From the user point of view this is an
undesirable situation, as the same application
with the same data input set is accounted dif-
ferently depending on the applications it is co-
scheduled with.
CPU accounting aﬀects several key compo-
nents of a computing system: First, if the OS
scheduler does not properly account the CPU
utilization of each application, the OS schedul-
ing algorithm will fail to maintain fairness
between applications. As a consequence, the
scheduling algorithm cannot guarantee that an
application progresses with its work as expected.
Second, CPU accounting can also be used
in data centers to charge users, together with
other factors such as the amount of memory used,
disk space, I/O activity, etc.
The main contributions of this paper are:
For the ﬁrst time, we provide a comprehensive
analysis of the CPU accounting accuracy of
the CA. To the best of our knowledge, the CA
is the only accounting mechanism for CMPs
for open source OSs such as Linux. Next,
we propose a hardware mechanism, Inter-Task
Conﬂict-Aware (ITCA) accounting [3], which
improves the accuracy of the CA for CMPs.
When running on a modeled 2-, 4-, and 8-
core CMP, ITCA reduces the oﬀ estimation of
the CA from 7.0%, 13%, and 16%, to 2.4%,
3.7%, and 2.8%, respectively.
The rest of the paper is organized as follows.
Section 2 analyses and formalizes the CPU ac-
counting problem. Section 3 describes our pro-
posed CPU accounting mechanism with im-
proved accuracy. The experimental methodol-
ogy and the results of our simulations are pre-
sented in Section 4. Section 5 discusses relat-
ed work and, ﬁnally, Section 6 concludes the
paper.
2. Formalizing the problem
Currently, the OS perceives diﬀerent cores in
a CMP as multiple independent virtual CPUs.
The OS does not consider the interaction be-
tween tasks caused by shared hardware re-
sources in the CA. However, the time running
on a virtual CPU is not an accurate measure
of the amount of CPU resources the task has
received. The CPU time to account to a task in
a CMP processor does not only depend on the
time that task is scheduled onto a CPU, but
also on the progress it makes during that time.
In our view, CMP processors have to maintain
the same principle of accounting that rules today
in uniprocessor and Symmetric MultiProcessor
(SMP) systems²: the CPU
accounting of a task should be independent
of the workload in which this task runs. For
example, let's assume that a task X runs for a
period of time in a CMP ($TR^{CMP}_{X,I_X}$), in which
it executes $I_X$ instructions. It is our position
that the actual time to account to this task,
denoted $TA^{CMP}_{X,I_X}$, should be the time it would
take this task to execute these $I_X$ instructions
in isolation, denoted $TR^{ISOL}_{X,I_X}$.
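The relative progress that task X has in
this interval of time ($P^{CMP}_{X,I_X}$) can be expressed
as $P^{CMP}_{X,I_X} = TR^{ISOL}_{X,I_X} / TR^{CMP}_{X,I_X}$. The relative
progress can also be expressed as
$P^{CMP}_{X,I_X} = IPC^{CMP}_{X,I_X} / IPC^{ISOL}_{X,I_X}$,
in which $IPC^{CMP}_{X,I_X}$ and
$IPC^{ISOL}_{X,I_X}$ are the IPC of task X when
executing the same $I_X$ instructions in the
CMP and in isolation, respectively. The time to
account to task X is then
$TA^{CMP}_{X,I_X} = TR^{CMP}_{X,I_X} \cdot P^{CMP}_{X,I_X}$,
from which we conclude
$TA^{CMP}_{X,I_X} = TR^{ISOL}_{X,I_X}$. This follows our
principle of workload-independent accounting.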
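As an illustration of the relations above, the following Python sketch computes the relative progress both from times and from IPCs and then derives the time to account. The function names and the cycle and instruction counts are hypothetical and chosen only for this example; they are not measurements from our experiments.

```python
def relative_progress(tr_isol: float, tr_cmp: float) -> float:
    """P = TR_ISOL / TR_CMP for the same I_X instructions."""
    if tr_cmp <= 0:
        raise ValueError("tr_cmp must be positive")
    return tr_isol / tr_cmp


def relative_progress_from_ipc(ipc_cmp: float, ipc_isol: float) -> float:
    """P = IPC_CMP / IPC_ISOL for the same I_X instructions."""
    if ipc_isol <= 0:
        raise ValueError("ipc_isol must be positive")
    return ipc_cmp / ipc_isol


def accounted_time(tr_cmp: float, progress: float) -> float:
    """TA_CMP = TR_CMP * P."""
    return tr_cmp * progress


# Hypothetical task X: the same I_X instructions take 1.0M cycles in
# isolation and 1.5M cycles when co-scheduled in the CMP.
instructions = 1_200_000
cycles_isol = 1_000_000
cycles_cmp = 1_500_000

p_time = relative_progress(cycles_isol, cycles_cmp)
p_ipc = relative_progress_from_ipc(instructions / cycles_cmp,
                                   instructions / cycles_isol)
ta = accounted_time(cycles_cmp, p_time)

print(f"progress from times: {p_time:.4f}")
print(f"progress from IPCs:  {p_ipc:.4f}")
print(f"time to account:     {ta:.0f} cycles")
```

Both definitions of progress agree, and the time to account equals the isolated execution time, regardless of how long the task actually ran in the CMP.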
When using this approach to measure CPU
accounting, the main issue to address is how
to determine dynamically, on each context switch
and while task X is running simultaneously with
other tasks, the time (or IPC) it would
take X to execute the same instructions if it
were alone in the system. An intuitive solution
to this problem is to provide hardware mechanisms
to determine the IPC in isolation of
each task running in a workload by periodically
running each task in isolation [4][5]. By averaging
the IPC in the different isolation phases,
an accurate measurement of the IPC in isolation
of the task can be obtained. However, as
the number of tasks simultaneously executing
in a multicore processor increases to dozens or
even hundreds, this solution will not scale, as
the number of isolation phases increases linearly
with the number of tasks in the workload.
As a consequence, the time the tasks run
concurrently in the CMP is reduced, affecting
system performance.
² SMPs are systems with several single-thread,
single-core chips.
Throughout this paper, we use the term
inter-task resource conflicts for those resource
conflicts that a task suffers due to the interference
of the other tasks running at the same time.
For example, a given task X suﬀers an inter-
task L2 cache miss when it accesses a line that
was evicted by another task, but would have
been in cache, if X had run in isolation. Like-
wise, intra-task resource conﬂicts denote those
resource conﬂicts that a task suﬀers even if it
runs in isolation. These are conﬂicts inherent
to the task.
The CA accounts tasks based on the time
they run on a CPU, instead of the progress
each task makes. Therefore, the CA implicitly
assumes that running tasks have full access
to the processor resources. However, each task
shares resources with other tasks when running
in a CMP, which leads to inter-task conflicts.
As a consequence, a task takes longer to
finish its execution than when it runs in isolation,
resulting in a longer accounted time. For
this reason, for a task X the CA leads to
over-estimation:
$TA^{CMP}_{X,I_X} = TR^{CMP}_{X,I_X} \geq TR^{ISOL}_{X,I_X}$.
A task has no over-estimation only if it
executes with no slowdown in the CMP with
respect to its execution in isolation, in which
case $TR^{CMP}_{X,I_X} = TR^{ISOL}_{X,I_X}$.
The main source of over-estimation in our
CMP baseline architecture is inter-task conflicts
and, in particular, inter-task L2 misses.
3. Inter-Task Conﬂict-Aware
Accounting
The target of our proposal is to accurately es-
timate the time accounted to a task in CMPs.
The basic idea of ITCA is to account to a
task only those cycles in which the task is not
stalled due to an inter-task L2 cache miss. In
other words, a task is accounted CPU cycles
when it is progressing or when it is stalled due
to an intra-task L2 miss. The next paragraphs
provide a detailed discussion of when the ac-
counting of a task is stopped and resumed.
L2 data misses: We consider a task is in one
of the following states: (s1) It has no L2 (data)
cache misses or it has only intra-task L2 misses
in ﬂight; (s2) It has only inter-task L2 misses
in ﬂight; and (s3) It has both inter-task and
intra-task L2 misses in ﬂight simultaneously.
In state (s1) we account normally, because the
task has either no L2 miss in flight or only
intra-task L2 misses, which it would also suffer
in isolation. We consider a
task is not progressing, and hence, it should
not be accounted in state (s2). In other words,
accounting is stopped when the task experi-
ences an inter-task L2 miss and it cannot over-
lap its stall with any other intra-task L2 miss.
We resume accounting the task when the inter-
task L2 miss is resolved or the task experiences
an intra-task L2 miss, in which case the task
is able to overlap the memory latency of the
inter-task L2 miss with at least one intra-task
L2 miss.
In the state (s3), we have that the inter-
task L2 miss overlaps with another intra-task
L2 miss. As a consequence, in general we do
a normal accounting to the task in that state.
However, when an inter-task L2 miss becomes
the oldest instruction in the Reorder Buﬀer
(ROB) and the register renaming is stalled,
the task loses an opportunity to extract more
Memory Level Parallelism (MLP). For exam-
ple, let's assume that there are S instructions
between the inter-task L2 miss in the top of
the ROB and the next intra-task L2 miss in the
ROB. In this situation, if the task had not ex-
perienced the inter-task L2 miss it would have
executed the S instructions after the last in-
struction currently in the ROB. Any L2 miss
in those S instructions would have been sent
to memory, increasing the MLP. We take care
of this lost opportunity of extracting MLP by
stopping the accounting of a task if the in-
struction in the top of the ROB is an inter-
task L2 miss and the ROB is full. We call this
condition state (s4).
L2 instruction misses: Another condition
in which we stop the accounting of a task, is
when the ROB is empty because of an inter-
task L2 cache instruction miss (s5). In our pro-
cessor setup instruction cache misses do not
overlap with other instruction cache misses.
That is, at every instant, we have only 1 in
ﬂight instruction miss per task at most. Hence,
on an inter-task instruction L2 miss we consid-
er that the task is not progressing because of
an inter-task conﬂict, and hence, we stop its
accounting.
3.1. Implementation
Figure 2 shows a sketch of the hardware im-
plementation of our proposal, which makes use
of several hardware resource status indicators.
Next, we explain in depth the diﬀerent parts
of our approach.
Detecting inter-task misses: We keep an
Auxiliary Tag Directory (ATD) [6] for each
core (see Figure 2 (a)). The ATD has the same
associativity and size as the tag directory of
the shared L2 cache and uses the same replace-
ment policy. It stores the behavior of memory
accesses per task in isolation. While the tag di-
rectory of the L2 cache is accessed by all tasks,
the ATD of a given task is only accessed by the
memory operations of that particular task. If
the task misses in the L2 cache and hits in its
ATD, we know that this memory access would
have hit in cache if the task had run in isola-
tion [7]. Thus, it is identiﬁed as an inter-task
L2 miss.
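The detection rule above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the hardware design: the class names `TagDirectory` and `InterTaskMissDetector` are hypothetical, replacement is plain LRU, addresses are cache-line numbers, and tasks are assumed to use disjoint address ranges.

```python
from collections import OrderedDict


class TagDirectory:
    """Set-associative tag directory with LRU replacement (tags only)."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, line_addr: int) -> bool:
        """Look up a line, update the LRU state, and return True on a hit."""
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        lru = self.sets[index]
        if tag in lru:
            lru.move_to_end(tag)       # mark as most recently used
            return True
        if len(lru) >= self.ways:
            lru.popitem(last=False)    # evict the least recently used tag
        lru[tag] = None
        return False


class InterTaskMissDetector:
    """Shared L2 tag directory plus one ATD per task with the same geometry."""

    def __init__(self, num_tasks: int, num_sets: int, ways: int):
        self.l2 = TagDirectory(num_sets, ways)
        self.atd = [TagDirectory(num_sets, ways) for _ in range(num_tasks)]

    def access(self, task: int, line_addr: int) -> str:
        # Both directories are probed on every access of the task.
        l2_hit = self.l2.access(line_addr)
        atd_hit = self.atd[task].access(line_addr)
        if l2_hit:
            return "hit"
        # L2 miss that would have hit in isolation: inter-task miss.
        return "inter-task miss" if atd_hit else "intra-task miss"


# Tiny example: one set, two ways. Task 1 evicts the line of task 0.
detector = InterTaskMissDetector(num_tasks=2, num_sets=1, ways=2)
trace = [(0, 0), (1, 100), (1, 101), (0, 0), (0, 0)]
results = [detector.access(task, line) for task, line in trace]
print(results)
```

In this trace, the second access of task 0 to line 0 misses in the shared directory but hits in the ATD of task 0, so it is classified as an inter-task miss.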
Tracking inter-task misses: We add one bit,
called ITdata_i, to each entry i of the Miss
Status Holding Register (MSHR). The ITdata bit
is set to one when we detect an inter-task data
miss. Each entry of the MSHR keeps track of
an in ﬂight memory access from the moment it
misses in the data L1 cache until it is resolved.
On a data L1 cache miss, we access the L2
tag directory and the ATD of the task in par-
allel. If we have a hit in the ATD and a miss
in the L2 tag directory, we know that this is
an inter-task L2 cache miss. Then, the ITdata
bit of the corresponding entry in the MSHR is
set to 1. Once the memory access is resolved,
we free its entry in the MSHR.
Figure 2: Hardware required for ITCA.
(a) Baseline processor architecture.
(b) Logic to stop accounting.
When the ROB is empty due to an inter-
task L2 cache instruction miss, we stop ac-
counting cycles to this task. For our purpose,
we use a bit called ITinstruction that indicates
whether the task has an inter-task L2 cache in-
struction miss or not.
Accounting CPU time: We stop the accounting
of a given task when: First, the ROB
is empty because of an inter-task L2 cache
instruction miss (gate (1) in Figure 2 (b), which
implements condition (s5)). RobEmpty is a signal
that is already present in most processor
architectures, while ITinstruction indicates whether
or not a task has an inter-task L2 cache
instruction miss.
Second, the oldest instruction in the ROB is
an inter-task L2 cache data miss and we have
a Stall in the Register Renaming (SRR), in
which case SRR equals 1 (gate (2) in Figure 2 (b),
which implements condition (s4)).
Tracking which ROB entries are inter-task L2
misses might require one bit per ROB entry. Third,
all the occupied MSHR entries belong to inter-task
misses. To determine this condition, we
check whether every entry i of the MSHR is
not empty (mshr_entry_empty_i = 0) and
contains an inter-task L2 miss (ITdata_i = 1)
(gates (3.1) and (3.2) in Figure 2 (b) implement
condition (s2)). By making an AND operation
of the ITdata_mshr_i signals and a signal showing
whether the entire MSHR is empty, EmptyMSHR
(3.1), we determine if we have to stop
the accounting for the task. Finally, if any of
the gates (1), (2) or (3.1) returns 1, we stop
the accounting. Otherwise, we account the cycle
normally to the task, as occurs in states (s1)
and (s3).
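The three stop conditions can be summarized as a per-cycle Boolean function. The Python sketch below is one possible reading of the logic in Figure 2 (b), not a description of the actual circuit: the names `MshrEntry` and `stop_accounting` are hypothetical, and gate (3) is interpreted here as "the MSHR is not empty and every occupied entry holds an inter-task miss", which is an assumption about how the EmptyMSHR signal is combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MshrEntry:
    empty: bool = True       # mshr_entry_empty_i
    it_data: bool = False    # ITdata_i: entry holds an inter-task L2 data miss


def stop_accounting(rob_empty: bool,
                    it_instruction: bool,
                    oldest_is_inter_task_miss: bool,
                    srr: bool,
                    mshr: List[MshrEntry]) -> bool:
    """Return True if the current cycle should NOT be accounted to the task."""
    # Gate (1), condition (s5): ROB empty due to an inter-task L2 instruction miss.
    gate1 = rob_empty and it_instruction
    # Gate (2), condition (s4): inter-task L2 data miss at the head of the ROB
    # while register renaming is stalled.
    gate2 = oldest_is_inter_task_miss and srr
    # Gate (3), condition (s2): only inter-task misses are in flight
    # (assumed reading: at least one entry is occupied).
    mshr_is_empty = all(entry.empty for entry in mshr)
    only_inter_task = all(entry.empty or entry.it_data for entry in mshr)
    gate3 = (not mshr_is_empty) and only_inter_task
    return gate1 or gate2 or gate3


free = MshrEntry()
intra = MshrEntry(empty=False, it_data=False)
inter = MshrEntry(empty=False, it_data=True)

s1 = stop_accounting(False, False, False, False, [intra, free])
s2 = stop_accounting(False, False, False, False, [inter, free])
s3 = stop_accounting(False, False, False, False, [inter, intra])
s4 = stop_accounting(False, False, True, True, [inter, intra])
s5 = stop_accounting(True, True, False, False, [free, free])
print(s1, s2, s3, s4, s5)
```

Under this reading, the cycle is accounted in states (s1) and (s3) and not accounted in states (s2), (s4) and (s5).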
In a 2-core CMP, ITCA accounts for every
spent cycle in three possible ways: (1) Each
task is accounted for the cycle when both tasks
progress (the cycle is accounted twice, once for
each task). (2) Only one task is progressing
and the cycle is accounted only to it. (3) The
cycle is not accounted to any task when none
of them is progressing.
The cycles accounted to each task in each
core are saved in a special-purpose register
per core, the Accounting Register or AR (see
Figure 2 (a)), which can be communicated to the
OS. This register is read-only, like the
Time Stamp Counter in Intel architectures.
From the OS point of view, working with ITCA
is similar to working with the CA. On every
context switch, the OS reads the Accounting
Register of each task i (AR_i), where AR_i
reports the time to account to this task. With this
information, the OS updates the system metrics
and carries out its scheduling tasks.
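The interaction between the Accounting Register and the OS can be sketched as follows. This is a software model under assumptions of our own: the classes `AccountingRegister` and `OsAccounting` are hypothetical, the register is modeled as a free-running counter, and the OS is assumed to charge the outgoing task the difference between two consecutive reads.

```python
class AccountingRegister:
    """Per-core counter of the cycles accounted to the running task."""

    def __init__(self) -> None:
        self._cycles = 0

    def tick(self, stop: bool) -> None:
        """Advance one cycle; count it only if accounting is not stopped."""
        if not stop:
            self._cycles += 1

    def read(self) -> int:
        """Read-only access, as seen by the OS."""
        return self._cycles


class OsAccounting:
    """Charges each task the AR delta observed at its context switch."""

    def __init__(self) -> None:
        self.accounted = {}    # task name -> accounted cycles
        self._last_read = {}   # core id -> AR value at the previous switch

    def context_switch(self, core: int, outgoing_task: str,
                       ar: AccountingRegister) -> None:
        now = ar.read()
        delta = now - self._last_read.get(core, 0)
        self._last_read[core] = now
        self.accounted[outgoing_task] = (
            self.accounted.get(outgoing_task, 0) + delta)


# Four cycles on a 2-core CMP: the first is accounted to both tasks, the
# second only to X, the third only to Y, and the fourth to neither.
ar0, ar1 = AccountingRegister(), AccountingRegister()
stop_core0 = [False, False, True, True]
stop_core1 = [False, True, False, True]
for stop0, stop1 in zip(stop_core0, stop_core1):
    ar0.tick(stop0)
    ar1.tick(stop1)

os_model = OsAccounting()
os_model.context_switch(core=0, outgoing_task="X", ar=ar0)
os_model.context_switch(core=1, outgoing_task="Y", ar=ar1)

# A third task Z then runs for three cycles on core 0 without conflicts.
for _ in range(3):
    ar0.tick(False)
os_model.context_switch(core=0, outgoing_task="Z", ar=ar0)
print(os_model.accounted)
```

The four simulated cycles cover the three cases listed above for a 2-core CMP: a cycle accounted to both tasks, a cycle accounted to one task only, and a cycle accounted to none.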
4. Experimental results
4.1. Experimental environment
We use MPsim [8], a trace driven CMP simu-
lator to model three processor setups: a 2-core
CMP with a 2MB L2 cache, a 4-core CMP
with a 4MB L2 cache and an 8-core CMP
with an 8MB L2 cache. Each core is single
threaded, has an 11-stage-deep pipeline and
can fetch up to 8 instructions each cycle. Each
core has 6 integer (I), 3 ﬂoating point (FP),
and 4 load/store functional units; 64-entry I,
FP, and load/store instruction queues; a 512-entry
reorder buffer; and 196 I/FP physical registers.
We use a two-level cache hierarchy with
128B lines: a separate 64KB, 2-way instruction
cache and 32KB, 4-way data cache at the first
level, and a 16-way L2 cache that is shared among
all cores. The latency from L1 to L2 is 12 cycles,
and from L2 to memory 300 cycles.
We feed our simulator with traces collected
from the whole SPEC CPU 2000 benchmark
suite using the reference input set. Each trace
contains 300 million instructions, selected us-
ing SimPoint [9]. From these benchmarks, we
generate 2-task, 4-task and 8-task workloads.
In each workload, the first thread in the tuple
is the Principal Thread (PT) and the remaining
threads are considered Secondary Threads
(STs). In every workload, we execute the
PT until completion. The other threads are
re-executed until the PT completes. We characterize
the results of our proposal based on
the type of the PT and STs. We distinguish
three combinations of STs: ILP, MEM
and MIX. The ILP combination contains only ILP
benchmarks, the MEM combination contains only
memory-bound benchmarks and the MIX combination
contains a mixture of both.
As the main metric, we measure how far
off the estimation provided by each accounting
mechanism is. The off estimation
(relative error of the approximation) compares
the time accounted to the PT by a particular
accounting approach with the
actual time it should be accounted. The
expression
$\left| 1 - TA^{CMP}_{PT,I_{PT}} / TR^{ISOL}_{PT,I_{PT}} \right|$
gives the off estimation for a given accounting
mechanism. The expression
$\left| 1 - TR^{CMP}_{PT,I_{PT}} / TR^{ISOL}_{PT,I_{PT}} \right|$
gives the off estimation for the CA. For
each accounting policy, we also report the
average value of the five workloads with the
worst off estimation, denoted Avg5WOE.
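The two metrics can be written down directly. The following Python sketch is only an illustration: the function names are hypothetical and the numbers in the example are made up, not taken from our results.

```python
from typing import Sequence


def off_estimation(accounted_time: float, isolated_time: float) -> float:
    """Relative error |1 - TA / TR_ISOL| of an accounting mechanism."""
    if isolated_time <= 0:
        raise ValueError("isolated_time must be positive")
    return abs(1.0 - accounted_time / isolated_time)


def avg5woe(off_estimations: Sequence[float]) -> float:
    """Average of the five workloads with the worst off estimation."""
    if not off_estimations:
        raise ValueError("at least one workload is required")
    worst = sorted(off_estimations, reverse=True)[:5]
    return sum(worst) / len(worst)


# Hypothetical PT: 100 time units in isolation, 116 in the CMP.
# The CA accounts the full CMP time; suppose a mechanism accounts 97.
tr_isol, tr_cmp, ta_mechanism = 100.0, 116.0, 97.0
oe_ca = off_estimation(tr_cmp, tr_isol)
oe_mechanism = off_estimation(ta_mechanism, tr_isol)
worst_five = avg5woe([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

print(f"off estimation (CA):        {oe_ca:.2%}")
print(f"off estimation (mechanism): {oe_mechanism:.2%}")
print(f"Avg5WOE:                    {worst_five:.2%}")
```

Because the metric is an absolute value, it penalizes both over-estimation and under-estimation of the isolated execution time.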
Figure 3: Off estimation of the CA and ITCA
for 2-, 4- and 8-core CMPs with a shared 2MB,
4MB and 8MB L2 cache, respectively
4.2. Accuracy results
Figure 3 shows the oﬀ estimation of ITCA and
the CA for our 3 processor setups. In this case,
we show the average results of each group as
we described in Section 4.1. The bars labeled
AVG represent the average of each CMP configuration
for all the groups. While on average
the CA has an off estimation of 7.0% (2 cores),
13% (4 cores) and 16% (8 cores), ITCA reduces
it to 2.4% (2 cores), 3.7% (4
cores) and 2.8% (8 cores). These results indicate
that ITCA provides a good measure of the
progress each task makes with respect to its
execution in isolation, since ITCA takes into
account inter-task L2 misses. Moreover, ITCA
reduces the inaccuracy in the worst ﬁve cases:
the Avg5WOE metric is 117% (2 cores), 91%
(4 cores) and 94% (8 cores) for the CA and
only 32% (2 cores), 35% (4 cores) and 20%
(8 cores) for ITCA.
Next, we observe that the accuracy of the
CA is worse when the PT is in the ILP group
and any of the STs is in the MEM group. This
is due to the fact that some of the ILP tasks
experience a lot of hits in the L2 cache when
they run in isolation, and when they run with
MEM tasks, which make an intensive use of
the L2 cache, the ILP tasks suﬀer a lot of inter-
task misses. As a consequence, the ILP task
suﬀers an increase in its execution time, which
aﬀects the accuracy of the CA. When the PT
is in the MEM group, it already suﬀers a lot
of L2 misses in isolation, so that the increase
in the number of L2 misses when it runs with
other MEM tasks is relatively lower.
Next, we observe that the inaccuracy of the
CA for a given group increases with the num-
ber of cores. Even if in our processor setups for
2, 4, and 8 cores the average cache space per
task is kept the same (1MB per task), the av-
erage oﬀ estimation of the CA increases from
7.0% (2 cores) to 16% (8 cores). The main
reason for that behavior is that having more
tasks sharing the cache increases the probabil-
ity that one of them thrashes the other tasks,
which will lead to higher oﬀ estimations in the
CA. The capacity of the L2 cache is not enough
to store all the data of the tasks running
simultaneously. For example, the off estimation
of the I_MEM group is 14% with 2 cores but
29% with 8 cores.
4.3. Hardware proposals to provide
fairness
Several hardware approaches deal with the
problem of providing fairness in multicore
architectures. Although fairness is a desirable
characteristic of a system, we show next that
it cannot be used to provide accurate CPU
accounting. There are two main flavors of fairness.
First, it is assumed that an architecture
is fair when it gives the same amount of resources
to each running task. However, ensuring
a fixed amount of resources to a task
[10, 5] does not translate into a CPU utilization
that can be computed for that task. This
is due to the fact that the relation between the
amount of resources assigned to a task and
its performance can be different for each task.
The second ﬂavor of fairness considers that an
architecture is fair when all tasks running on
that architecture make the same progress. For
example, let's assume a 2-core CMP with tasks
X and Y. The system is said to be fair if in a
given period of time, the progress made by X
and Y is the same, $P_X = P_Y$. However, the fact
that $P_X = P_Y$ does not provide a quantitative
value that can be given to the OS so that
it can account CPU time to each task. In other
words, knowing that $P_X = P_Y$ does not provide
any information about CPU accounting,
since $P_X$ can be any value lower than 1. Therefore,
systems providing fairness require an accurate
CPU accounting mechanism as well.
5. Related work
We are not aware of any other work which
studies CPU accounting for CMP architec-
tures. Thus, ITCA is the ﬁrst accounting
mechanism designed speciﬁcally for CMP
processors. For SMT processors [11], oth-
er proposals have been made. The IBM
POWER5™ processor (a dual-core, 2-context
SMT processor) includes a per-task
accounting mechanism called Processor Uti-
lization of Resources Register (PURR) [12].
The PURR approach estimates the time of a
task based on the number of cycles the task
can decode instructions: each POWER5 core
can decode instructions from up to one task
each cycle. The PURR accounts a given cy-
cle to the task that decodes instructions that
cycle. If no task decodes instructions on a
given cycle, both tasks running on the same
core are accounted one half of the cycle. An
improvement of PURR, denoted scaled PURR
(SPURR) [13], is implemented in the IBM
POWER6™ chip, which uses pipeline throttling
and DVFS. SPURR provides a scaled
count that compensates for the impact of throttling
and DVFS. ITCA can work in environments
in which cores run at different frequencies
with no change in its philosophy. The only
effect seen by ITCA is a difference in the
memory latency. Throttled cycles are simply
not accounted to any task.
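For comparison, the PURR rule as described in this section can be sketched in Python. This models only the two cases stated above (one task decodes, or no task decodes) for a 2-context core; the function name `purr_account` and the trace are hypothetical, and the sketch is not a description of the POWER5 implementation.

```python
from typing import Optional, Sequence, Tuple


def purr_account(decoding_task: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Account cycles to the two tasks of one core following the PURR rule.

    decoding_task[c] is 0 or 1 if that task decodes in cycle c,
    or None if no task decodes in that cycle.
    """
    purr = [0.0, 0.0]
    for task in decoding_task:
        if task is None:
            # No task decodes: each task is accounted half of the cycle.
            purr[0] += 0.5
            purr[1] += 0.5
        else:
            # The whole cycle goes to the task that decodes instructions.
            purr[task] += 1.0
    return purr[0], purr[1]


# Hypothetical 4-cycle trace: task 0 decodes twice, task 1 once,
# and nobody decodes in the last cycle.
purr_task0, purr_task1 = purr_account([0, 0, 1, None])
print(purr_task0, purr_task1)
```

Under this rule the accounted cycles of both tasks always add up to the number of elapsed cycles, whereas ITCA may account a cycle to both tasks or to none, as described in Section 3.1.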
As a part of our future work we plan to ex-
plore CMP architectures in which each core
is SMT. In this type of architectures we need
to combine some of the solutions mentioned
above for SMT architectures with our ITCA
proposal for CMP processors.
6. Conclusions
CMP architectures introduce complexities in
the CPU accounting because the progress done
by a task varies depending on the activity of
the other tasks running at the same time. The
current accounting mechanism, the CA, intro-
duces inaccuracies when applied in CMP pro-
cessors. This accounting inaccuracy may aﬀect
several key elements of the system such as the
OS task scheduling or the charging mechanism
in data centers. In this paper, we present a
hardware support for a new accounting mecha-
nism called Inter-Task Conﬂict-Aware (ITCA)
accounting that improves the accuracy of the
CA. In a 2-, 4- and 8-core CMP architecture,
ITCA reduces the oﬀ estimation down to 2.4%
(2 cores), 3.7% (4 cores) and 2.8% (8 cores),
while the CA presents a 7.0%, 13% and 16%
oﬀ estimation, respectively.
Acknowledgments
This work has been supported by the Ministry
of Science and Technology of Spain under con-
tract TIN-2007-60625 and grant BES-2008-
003683, by the HiPEAC Network of Excel-
lence (IST-004408) and a Collaboration Agree-
ment between IBM and BSC with funds from
IBM Research and IBM Deep Computing or-
ganizations. The authors are grateful to Alper
Buyuktosunoglu who co-authored the original
version of this paper and to Enrique Fernán-
dez from the University of Las Palmas of Gran
Canaria for his help setting up the Intel Xeon
Quad-Core processor.
References
[1] Standard Performance Evaluation Cor-
poration, SPEC CPU 2000 benchmark
suite, http://www.spec.org.
[2] L. Hammond, B. A. Nayfeh, and
K. Olukotun, A Single-Chip Multipro-
cessor, in IEEE Computer, vol. 30, no. 9,
1997.
[3] C. Luque, M. Moreto, F. J. Cazor-
la, R. Gioiosa, A. Buyuktosunoglu, and
M. Valero, ITCA: Inter-Task Conﬂict-
Aware CPU Accounting for CMPs, in
PACT, 2009.
[4] F. J. Cazorla, P. M. W. Knijnenburg,
R. Sakellariou, E. Fernandez, A. Ramirez,
and M. Valero, Predictable Performance
in SMT Processors: Synergy between the
OS and SMTs, in IEEE Trans. Comput-
ers, vol. 55, no. 7, 2006.
[5] R. R. Iyer, L. Zhao, F. Guo, R. Illikkal,
S. Makineni, D. Newell, Y. Solihin, L. R.
Hsu, and S. K. Reinhardt, QoS poli-
cies and architecture for cache/memory in
CMP platforms, in SIGMETRICS, 2007.
[6] M. K. Qureshi and Y. N. Patt, Utility-Based
Cache Partitioning: A Low-Overhead,
High-Performance, Runtime
Mechanism to Partition Shared Caches,
in MICRO, 2006.
[7] J. Gecsei, D. R. Slutz, and I. L. Traiger,
Evaluation techniques for storage hierar-
chies, IBM Syst. J., vol. 9, no. 2, 1970.
[8] C. Acosta, F. J. Cazorla, A. Ramirez, and
M. Valero, The MPsim Simulation Tool,
in UPC, Tech. Rep. UPC-DAC-RR-CAP-
2009-15, 2009.
[9] T. Sherwood, E. Perelman, and
B. Calder, Basic Block Distribution
Analysis to Find Periodic Behavior and
Simulation Points in Applications, in
IEEE PACT, 2001.
[10] F. J. Cazorla, P. M. W. Knijnenburg,
R. Sakellariou, E. Fernandez, A. Ramirez,
and M. Valero, Architectural Support for
Real-Time Task Scheduling in SMT Sys-
tems, in CASES, 2005.
[11] D. Tullsen, S. J. Eggers, and H. M. Levy,
Simultaneous Multithreading: Maximiz-
ing On-Chip Parallelism, in ISCA, 1995.
[12] P. Mackerras, T. S. Mathews, and R. C.
Swanberg, Operating system exploita-
tion of the POWER5 system, in IBM J.
Res. Dev., vol. 49, no. 4/5, 2005.
[13] M. S. Floyd, S. Ghiasi, T. W. Keller,
K. Rajamani, F. L. Rawson, J. C. Ru-
bio, and M. S. Ware, System power man-
agement support in the IBM POWER6
microprocessor, in IBM J. Res. Dev.,
vol. 51, no. 6, 2007.
