A case for increased operating system support in chip multi-processors by Balasubramonian, Rajeev & Brunvand, Erik L.
A Case for Increased Operating System Support in Chip Multi-Processors
David Nellans, Rajeev Balasubramonian, Erik Brunvand
School of Computing, University of Utah
Salt Lake City, Utah 84112
{dnellans, rajeev, elb}@cs.utah.edu
Abstract
We identify the operating system as one area where a 
novel architecture could significantly improve on current 
chip multi-processor designs, allowing increased perfor­
mance and improved power efficiency. We first show that 
the operating system contributes a non-trivial overhead to 
even the most computationally intense workloads and that 
this OS contribution grows to a significant fraction of total 
instructions when executing interactive applications. We 
then show that architectural improvements have had little 
to no effect on the performance of the operating system 
over the last 15 years. Based on these observations we 
propose the need for increased operating system support 
in chip multiprocessors. Specifically we consider the po­
tential of a separate Operating System Processor (OSP) 
operating concurrently with General Purpose Processors 
(GPP) in a Chip Multi-Processor (CMP) organization.
1 Introduction
The performance of computer systems has scaled well due 
to a synergistic combination of technological advance­
ment and architectural improvement. In the last 15 years, 
high performance workstations have progressed from the 
venerable single core 486/33, released in 1989, to the cur­
rent IBM Power 5 and Intel dual core Pentium 4 EE. Pro­
cess fabrication size has shrunk from 1 micron (486/33) 
down to 90 nanometers (Pentium Prescott) and is expected
to continue down to 65 nanometers and below. Dramatically
decreased transistor sizes have helped enable
a roughly 100-fold increase in clock frequency from 33MHz to
3.8GHz during this same period. Simultaneously, the
number of pipeline stages has increased from the classic
5 stage pipeline all the way up to 31 stages [1]. Transistor
count and resultant die size have exploded from 1.2
million transistors on the 486/33 to more than 275 million
in the Power 5. At the same time, as technology
has improved, architectural improvements such as deep 
pipelines, caching, and out of order execution have al­
lowed us to take advantage of increased transistor counts 
to improve system performance at a breakneck pace. 
What is striking about these performance improvements, 
however, is that while general application performance 
has improved approximately 200x in this time period, we 
find that the performance of the operating system in the 
same period has seen a significantly smaller improvement 
on the order of 50x. For domains where the operating sys­
tem contributes a significant percentage of cycles, this can 
have a serious impact on overall system performance.
One contributor to this effect is that the operating sys­
tem is not typically considered when architectural en­
hancements are proposed for general purpose processors. 
Modern processors, especially in an academic setting, 
are typically simulated using a cycle accurate architec­
tural model, such as Simplescalar [2], to evaluate perfor­
mance. Typically these cycle accurate simulators do not 
execute the operating system because doing so requires 
accurate I/O models and slows down simulation speed 
substantially. Full system simulators are becoming more 
prevalent [3, 4, 5, 6] but often require modification to the 
operating system or do not provide cycle accurate num­
bers. While several studies suggest that the operating sys­
tem has a significant role in commercial workload per­
formance [7, 8, 9, 10], the effect of the operating system 
on varying workloads is still largely unexplored. Our ap­
proach in this paper is twofold: we simulate a variety of 
applications with a full system simulator to measure the 
portion of system cycles spent in OS tasks, and we take 
actual measurements on real machines to try to understand 
how performance has scaled across different implementa­
tions of the same instruction set architecture.
While Redstone et al. [7] have shown the OS can account
for over 70% of the cycles in commercial workloads, we
contribute to this body of work by examining a variety of
workloads. We examine computationally intensive workloads
to determine the minimum overhead the operating
system contributes to workloads. This provides a baseline
estimate of the potential inaccuracy of architectural models
such as Simplescalar compared to Simics coupled with the
University of Wisconsin’s Multifacet GEMS Project [11].
We show that minimally, the operating system will contribute
5% overhead to almost all workloads. Workloads
which utilize system calls or trigger device interrupts have
increased OS overhead, with the suite of interactive applications
in our study requiring at least 38% OS overhead.
I/O intensive applications often have operating system
overheads which overshadow their userspace components.
[Table 1: Operating System Contribution In Various Workloads. Columns: Benchmark; % Instructions contributed by OS.]
With operating system overhead being a significant pro­
portion of instructions executed in many workloads, the 
performance of the operating system is a critical factor in 
determining total system throughput. It is currently hard 
to measure operating system performance due to the lack 
of cycle accurate simulator support. Instead we measure 
the effects of architectural improvements by measuring 
performance improvement on real machines for key com­
ponents such as context switch time, and system call la­
tency. We find that compared to user codes (which are typ­
ically the codes targeted by processor optimizations), op­
erating system performance improvement is consistently 
lower by a factor of 4 or 5.
We also briefly examine the issue of operating system
interference when powering down processors on modern
machines. We show that offloading interrupt handling
to specialized hardware has the potential to allow more
aggressive power saving techniques for longer durations
than currently possible.
Having shown that operating systems significantly
under-utilize modern processors and can contribute a large
portion of executed instructions to workloads, we explore the
possibility of executing the operating system on a
semi-custom processor in a heterogeneous chip multi-processor
organization. First we identify three key criteria that make
the operating system a good candidate for hardware offload.
We then discuss the performance implications of
such an organization, followed by situations in which
significant power savings could also be achieved.
2 OS Contribution in Workloads
Previous work [7, 9, 10] has shown that the operating system
can contribute a significant amount of overhead to
commercial workloads. To our knowledge no work has
been done to survey the amount of operating system overhead
that is present in a larger variety of workloads.
Determining the minimal amount of operating system overhead
present in all workloads is useful in estimating the
amount of possible error in simulations that do not model
these effects. In workloads that have significant operating
system overhead, such as databases and web server workloads,
performance can depend as much on the operating
system as on user code. For these applications, knowing
operating system overhead is critical for performance tuning.
Finally, interactive applications are very rarely examined
when architecting performance improving features
because they typically under-utilize modern processors.
Because of their prevalence, it helps to explore this
application space to determine if architectural improvements
can reduce power consumption while maintaining current
performance levels.
2.1 Methodology
Most architectural simulators do not execute operating
system code because instrumenting external interrupts
and device I/O models is a non-trivial task that also
significantly slows down the overall simulation speed.
Simics, a full system simulator from Virtutech [4],
provides full system simulation by implementing such device
models, allowing it to execute an unmodified Linux
kernel. To maintain simulation speed Simics implements
functional models of these devices as well as the processor.
The lack of an accurate timing model limits the performance
evaluation possible using this simulator. Functional
modeling is advantageous, however, because it provides
execution fast enough to model not only the operating
system but also interactive applications using X-Windows.
Simics’ functional model of a Pentium processor allows
us to examine the processor state bit to determine if
the processor model was in privileged mode (supervisor
mode) or user mode when executing each instruction.
This privilege separation allows us to track which instructions
were executed within the operating system. This
tracking method also allows us to track system calls which
execute within the user process’ address space but are
considered operating system functionality.
For our simulated system we used the default Redhat 7.3
disk image provided by Virtutech and left all the default
system daemons running. Running with background daemons
disabled would not portray a typical system, which
often requires remote filesystem services, remote
procedure call invocation, and remote login.
We recognize that computing installations often heavily
optimize the set of system daemons running on their machines,
but we do not attempt to model the vastly different
subsets of services utilized in enterprise deployments.
While system daemons typically do not contribute heavily
to overall instruction counts, they do cause sporadic
interrupts which can cause scheduling changes and I/O
activity.
SPECfp    Instructions    SPECint   Instructions
Wupwise   5.96%           Gzip      4.65%
Swim      19.42%          Vpr       4.43%
Mgrid     0.59%           Gcc       5.04%
Applu     4.29%           Mcf       4.86%
Mesa      0.71%           Crafty    4.54%
...
Average   5.24%           Average   5.15%
Table 2: SPECcpu2000 Instructions Executed in Supervisor Mode
Typically architectural level simulations are only run for 
several million cycles due to the slowdown of simulation. 
We found significant variation when looking at sample 
sizes of only 10 million instructions due to the OS per­
forming periodic maintenance operations such as clear­
ing write buffers, even when the workload had reached 
a steady state. To remedy this variation we allowed all 
benchmarks to run for 10 billion instructions prior to tak­
ing measurements. All numbers cited are the average of 
10 sequential runs of 100 million instructions which re­
duced statistical variation to negligible levels.
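The accounting described in this section can be illustrated with a minimal sketch. It does not use the Simics API; the trace format (one boolean per executed instruction, true when the privilege bit indicates supervisor mode) and the function names are assumptions made only for illustration.

```python
def supervisor_fraction(privilege_trace):
    """Fraction of instructions executed in supervisor mode.

    privilege_trace is a hypothetical iterable with one boolean per
    executed instruction: True if the processor was in supervisor mode.
    """
    total = 0
    supervisor = 0
    for in_supervisor in privilege_trace:
        total += 1
        if in_supervisor:
            supervisor += 1
    return supervisor / total if total else 0.0


def average_over_runs(run_fractions):
    """Average the per-run fractions, mirroring the averaging of
    sequential sample runs described in the text."""
    return sum(run_fractions) / len(run_fractions)


# Illustrative data only: two tiny traces standing in for sample runs.
runs = [
    supervisor_fraction([True, False, False, False]),
    supervisor_fraction([True, True, False, False]),
]
print(average_over_runs(runs))  # 0.375
```

In the study each sample run would cover 100 million instructions rather than a handful, which is why a streaming count is used here instead of building a list.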
2.2 CPU Intensive Workloads
Computationally intense workloads are used to evaluate
micro-architectural improvements because they maximize
computation and minimize the effects of memory and I/O
performance as much as possible. These benchmarks often
represent scientific workloads which require minimal
user intervention. They are ideal for determining
the minimal amount of operating system overhead
that will be present on a machine because they make very
little use of operating system functionality. Thus overhead
is likely to come from timer interrupts resulting
in context switching, interrupt handling, and system
daemon overhead.
We chose to use SPECint, SPECfp, and ByteMark [12] as
our computationally intense benchmarks. We also chose 
to time a Linux kernel compile to provide a computation­
ally intense workload that also provides some file system 
I/O. Table 1 shows the percentage of instructions executed 
in supervisor mode by these benchmarks. Table 2 shows 
the individual breakdown and variation between the SPEC 
benchmarks. SPEC benchmarks, particularly those re­
quiring Fortran, for which simulation was unable to com­
plete due to unidentifiable internal Simics errors, are not 
shown. The average operating system overhead when ex­
ecuting these four benchmarks is 9.43%. The ByteMark 
benchmark skews these results strongly however and we 
believe the average of 5.19% for the SPEC benchmarks is 
a more realistic minimal overhead.
2.3 I/O Intensive Workloads
Computationally intense benchmarks are a good way to
evaluate architectural improvements but
rarely do they capture total system performance because
many applications involve a significant amount of file or
network traffic. We chose to use Bonnie++, an open
source file-system performance benchmark [13], and netperf,
a TCP performance benchmark, to measure OS contribution
in what we expected to be OS dominated workloads.
Table 1 shows that our expectations were indeed
correct and that the OS contribution far outweighs
user code contribution for file-system and network operations.
We also used UnixBench as an I/O intensive
workload. UnixBench consists of several benchmarks
including Whetstone, Dhrystone, system call performance
monitoring, and typical UNIX command line usage.
UnixBench has a surprisingly high operating system
contribution of 97.16%. These benchmarks confirm that
I/O processing is almost entirely OS dominated and support
the work of Redstone et al. [7], who have shown that
for workloads such as Apache the operating system can
contribute more than 70% of the total cycles.
2.4 Interactive Workload
While CPU-intensive benchmarks provide useful data
when measuring processor innovations, these workloads
rarely represent the day to day usage of many computers
throughout the world. One key difference between these 
workloads and typical workstation use is the lack of in­
teraction between the application and the user. X Win­
dows, keyboard, and mouse use generates a workload that 
is very interrupt driven even when the user applications 
require very little computation, a common case within 
the consumer desktop domain. When a high performance 
workstation is being utilized fully, either locally or as part 
of a distributed cluster, the processor must still handle 
these frequent interrupts from user interaction, thus slow-
We modeled an interactive workload by simultaneously
decoding an MP3, browsing the web and composing email
in a text editor. While this type of benchmark is not
reproducible across multiple runs due to spurious interrupt
timing, we believe smoothing the operating system effects
across billions of instructions accurately portrays the
operating system contribution. This interactive workload
typically utilized only about 32.91% of the processor’s cycles
and executed 38.04% of its instructions within the OS.
3 Decomposing Performance Improvements
The performance increase in microprocessors over the 
past 15 years has come from both architectural improve­
ments as well as technological improvements. Faster tran­
sistors have helped drive architectural improvements such 
as deep pipelining which in turn caused an enormous in­
crease in clock frequencies. Significant changes have oc­
curred in architecture as well, moving from the classic five 
stage pipeline to a multiple issue, deeply pipelined archi­
tecture, where significant resources are utilized in branch 
prediction and caching. Because we wish to distill the 
architectural improvement in the last 15 years, and disre­
gard technological improvement as much as possible, we 
chose to take measurements from real machines, a 486 @ 
33MHz and a Pentium 4 @ 3.0GHz, instead of relying on 
architectural simulations which can introduce error due 
to inaccurate modeling of the operating system. For this 
purpose we define total performance improvement (P) as
technology improvement (T) times architectural improvement
(A), or P = T × A.
A metric commonly used to compare raw circuit performance
in different technologies is the fan-out of four
(FO4) [14, 15]. This is defined as the delay of
a single inverter driving four copies of itself. This metric
scales roughly linearly with process feature size. Thus
a processor scaled from a 1 micron process, our 486, to
90nm, our Pentium 4, would have approximately an 11
fold decrease in FO4 delay. We set T, our technological
improvement, to 11 for the remainder of our calculations.
To determine the architectural improvement, A, when 
moving from the 486 to the Pentium 4 we must accurately 
determine the total performance improvement P and then 
factor out the technological improvement T. To achieve 
accurate values for P we installed identical versions of the 
Linux kernel and system libraries, kernel 2.6.9 and gcc 
3.4.4, on both the machines. Each machine was individually
tuned with compiler optimizations for best performance.
Using the same kernel and system libraries helps
minimize any possible performance variation due to operating
system changes even from minor revisions.
[Table 3: Speedup from 486 33MHz to Pentium 4 3.0GHz]
Attempting to discern the architectural improvements of 
only the microprocessor required that we use benchmarks 
that minimize the use of external hardware in a perfor­
mance critical manner. We also chose benchmarks with 
working sets that fit in the limited main memory of our 
486 test machine. This ensured that the machines would 
not regularly access the swap partition which could pro­
vide misleading performance numbers. We chose the 
Linux kernel compile because there is enough parallelism 
present by creating multiple jobs that compilation can 
fully utilize the processor while other jobs are waiting on 
data from the I/O system. Care has to be taken to create 
only the minimum number of jobs necessary to fully uti­
lize the processor or we can introduce significantly more 
context switching than necessary. N + 1 jobs is typically 
the ideal number of jobs, where N is the number of pro­
cessors available when compiling source code.
Table 3 shows the performance difference when execut­
ing the same workload on both machines using the UNIX 
time command. The total performance improvement P is 
taken by examining the Real time. We found the average 
speedup, including both application and operating system 
effects, when moving from the 486 to the Pentium 4 was
192.2. Using 192.2 and 11, for P and T respectively, we 
can then calculate that our architectural improvement, A, 
is 17.47 or roughly 60% more improvement than we have 
obtained from technological improvements. This archi­
tectural improvement comes from increased ILP utiliza­
tion due to branch prediction, deeper pipelines, out of or­
der issue, and other features not present in a classic five 
stage pipeline.
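The decomposition above is simple arithmetic and can be checked with a short sketch. The values of P and T are those quoted in the text; the function name is ours and is only illustrative.

```python
def architectural_improvement(total_speedup, technology_factor):
    """Solve P = T * A for A."""
    return total_speedup / technology_factor


# Feature-size ratio between a 1 micron and a 90nm process.
feature_ratio = 1000 / 90
print(round(feature_ratio, 1))  # 11.1, which the text rounds to T = 11

P = 192.2  # measured total speedup, 486 to Pentium 4
T = 11     # technology improvement
A = architectural_improvement(P, T)
print(round(A, 2))          # 17.47
print(round(A / T - 1, 2))  # 0.59, i.e. roughly 60% more than T
```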
3.1 Operating System Performance
In section 2 we showed that the operating system contributes
overhead to almost all workloads and that
some workloads, such as those which are interactive or
I/O intensive, can have significant portions of their total
instruction counts occur within the operating system.
For these applications operating system performance is
critical to achieve good workload performance. Table 3
shows that while total application performance has increased
192.2 times in the last 15 years, operating system
performance has only increased 46.04 times. Stated another
way, application codes that can take full advantage
of modern architectural improvements achieve four times
more performance than operating system codes that cannot.
[Figure 1: Context Switching Speed Normalized for Clock Frequency]
Table 3 shows only a limited set of benchmarks because
our 486 was restricted to this set due to its limited memory.
Additionally the UNIX time command does not provide
the timing fidelity necessary to have good confidence
in the low number of seconds reported for the system time 
on the Pentium 4. To validate the OS speedup results in 
Table 3 we perform two other experiments in subsequent 
sections which each independently support the conclusion 
that architectural improvements have helped application 
codes significantly, but are from 4 to 10 times less effec­
tive on operating system codes.
3.2 Context Switch Performance
As single threaded applications are redesigned with mul­
tiple threads to take advantage of SMT and CMP based 
processors, the context switch, an already costly opera­
tion, will become even more critical. Operating system
designers have focused on reducing the cost of a context
switch for years, while much less attention has been
paid in the architecture community [16, 17]. To determine
the improvement in context switch time over the last
15 years we measure context switch cost using a POSIX
locking scheme similar to the one described in [18]. In
this method two threads are contending for a single POSIX
lock, similar in style to traditional methods of passing a
token between threads. The POSIX lock is used instead of
a traditional UNIX pipe because POSIX locks are one of
the lowest overhead synchronization primitives currently
available in Linux.
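The token passing structure of such a benchmark can be sketched as follows. This is not the benchmark used in this paper: it is written in Python, it substitutes a pair of semaphores for the single contended POSIX lock, and interpreter overhead will dominate the measured time. It is meant only to show how a token is handed back and forth so that every hand-off forces a switch between two threads. The function name and round count are assumptions.

```python
import threading
import time


def measure_handoffs(rounds=10000):
    """Pass a token between two threads; return seconds per hand-off."""
    ping = threading.Semaphore(0)  # main thread -> partner
    pong = threading.Semaphore(0)  # partner -> main thread

    def partner():
        for _ in range(rounds):
            ping.acquire()  # block until the token arrives
            pong.release()  # hand the token back

    worker = threading.Thread(target=partner)
    worker.start()
    start = time.perf_counter()
    for _ in range(rounds):
        ping.release()  # hand the token to the partner
        pong.acquire()  # block until it comes back
    elapsed = time.perf_counter() - start
    worker.join()
    # Each round contains two hand-offs, one in each direction.
    return elapsed / (2 * rounds)


if __name__ == "__main__":
    print(f"{measure_handoffs() * 1e6:.2f} microseconds per hand-off")
```

As in the text, only the relative change between machines would be meaningful; the absolute number includes the overhead of the hand-off mechanism itself.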
Figure 1 provides the average context switch time, nor­
malized to 3.0 GHz, over five runs of our benchmark on 
various machine configurations. All timings showed a 
standard deviation of less than 3% within the five runs. 
The absolute number for context switch time is much less 
important than the relative time between cases. Disregard­
ing the overhead of the token passing method lets us focus 
on the relative change in context switch cost between dif­
fering machine architectures. Context switching routines 
in the Linux kernel are hand tuned sections of assembly 
code that do not make use of the hardware context switch 
provided by the x86 instruction set. It has been shown that 
this software context switch has performance comparable 
to the hardware switch but provides more flexibility.
System Call          486        P4    Speedup    Arch Speedup
brk                   87         2         43            3.90
close                439        29         15            1.36
execve            14,406     1,954          7            0.63
fcntl64               62         1         62            5.63
fork              15,985     8,187          2            0.18
fstat64              183         1        183           16.63
getdents64           501        10         50            4.54
getpid                49         1         49            4.45
getrlimit             59         1         59            5.36
ioctl                728        25         29            2.63
mprotect             324         4         81            7.36
munmap               365        11         33            3
newuname              88         2         44            4
open                 860       899          0            0
pipe                 559         7         79            7.18
poll           9,727,367     9,280      1,048           95.27
read              44,304       185        239           21.72
rt_sigaction         206         1        206           18.72
rt_sigprocmask       178         1        178           16.18
select           403,042    11,849         34            3.09
sigreturn             75         1         76            6.90
stat64               264        53          5            0.45
time                  72     5,585          0            0
waitpid           99,917         3     33,305        3,027.72
write              2,164     3,418          0            0
Table 4: System Call Speedup from 486 33MHz to Pentium 4 3.0GHz - Time in microseconds
Our first experiment sought to determine how context
switch time scaled with clock frequency for a given architecture.
Our Pentium 4 running at 3.0GHz supports frequency
scaling, allowing us to scale down the frequency of
the processor in hardware and run our context switching
benchmark at multiple frequencies on the same processor.
The absolute context switch time for these frequency
scaled runs was then normalized back to 3.0GHz. This
normalization allows us to clearly see that context switch
time scales proportionally with clock frequency. Scaling
proportionally with clock frequency indicates that context
switching is a fairly memory independent operation. Thus
we can draw the conclusion that the number of cycles required
to context switch in a given microarchitecture is
independent of clock frequency.
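The frequency normalization can be written out explicitly. In the sketch below the 1.5GHz measurement is a made-up value chosen for illustration; only the 3.0GHz reference frequency comes from the text, and the function names are ours.

```python
REFERENCE_HZ = 3.0e9  # all timings are normalized to 3.0GHz


def normalize_to_reference(measured_seconds, clock_hz,
                           reference_hz=REFERENCE_HZ):
    """Scale a time measured at clock_hz to its equivalent at reference_hz."""
    return measured_seconds * clock_hz / reference_hz


def cycles(measured_seconds, clock_hz):
    """Number of clock cycles consumed by an operation."""
    return measured_seconds * clock_hz


# Hypothetical measurement: 12 microseconds at a scaled-down 1.5GHz clock.
normalized = normalize_to_reference(12e-6, 1.5e9)
print(f"{normalized * 1e6:.1f} microseconds at 3.0GHz")  # 6.0
```

If the cycle count is independent of clock frequency, the normalized times for every frequency setting of the same processor coincide, which is the behavior the paragraph above describes.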
To determine if architectural improvements have reduced
context switch time we ran this benchmark on a 486 @
33MHz with the identical kernel and tool-chain versions.
Scaled for frequency, Figure 1 shows that context switch
performance has actually decreased in the last 15 years,
requiring an increased number of cycles. This is likely
due to the increased cost of flushing a modern Pentium’s
31 pipeline stages versus the classic five stages in a 486.
This decrease in context switch performance undoubtedly
contributes to the lackluster performance of the operating
system that we saw in Table 3.
3.3 System Call Performance
To further validate our results that the OS performs any­
where from 4-10 times worse than user codes on modern
architectures we used the Linux Trace Toolkit (LTT) [19] 
to log system call timings over a duration of 10 seconds on 
both the 486 and the Pentium 4. LTT is commonly used 
to help profile both applications and the operating system 
to determine critical sections that need optimization. LTT 
can provide the cumulative time spent in each system call 
as well as the number of invocations during a tracked pe­
riod of time. This allows us to average the execution of 
system calls over thousands of calls to minimize variation 
due to caching, memory, and disk performance. By aver­
aging these 10 second runs across the multiple workloads 
found in Table 3 we also eliminate variation due to partic­
ular workload patterns. The Linux Trace Toolkit provides 
microsecond timing fidelity, thus system calls which take 
less than 1 microsecond can be reported as taking either 0 
microseconds or 1 microsecond depending where the call 
falls in a clock period. Averaging thousands of calls to 
such routines should result in a random distribution across 
a clock period but we currently have no way to measure if 
this is true, and thus can not guarantee if this is, or is not, 
occurring. All system call timings have been rounded to 
the nearest microsecond.
Table 4 shows the absolute time for common system calls, 
absolute speedup, and the speedup due to architectural im­
provement only, using our technology scaling factor T set 
at 11. When examining these figures we must be careful 
to examine the function of the system call before inter­
preting the results. System calls which are waiting on I/O 
such as poll, or are waiting indefinitely for a signal such 
as waitpid, should be disregarded because they depend on 
factors outside of OS performance. System calls such as 
execve, munmap, and newuname provide more accurate 
reflections of operating system performance independent 
of device speed. Because system calls can vary in exe­
cution so greatly we do not attempt to discern an average 
number for system call speedup at this time, instead pro­
viding a large number of calls to examine. It is clear how­
ever that most system calls in the operating system are not 
gaining the same benefit from architectural improvement, 
17.47, that user codes have received in the past 15 years.
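The Arch Speedup column of Table 4 can be reproduced from the Speedup column by dividing by T = 11. The sketch below assumes the table truncates the result to two decimal places; that is inferred from the printed values rather than stated in the text. The function names are ours.

```python
T = 11               # technology scaling factor
USER_CODE_A = 17.47  # architectural improvement seen by user codes


def arch_speedup(speedup, technology_factor=T):
    """Speedup attributable to architecture, truncated to two decimals."""
    # Integer arithmetic avoids floating point surprises when truncating.
    return (speedup * 100 // technology_factor) / 100


def lags_user_code(speedup):
    """True if a call gained less from architecture than user codes did."""
    return arch_speedup(speedup) < USER_CODE_A


# Speedup values taken from Table 4.
for name, speedup in [("brk", 43), ("fstat64", 183), ("read", 239)]:
    print(name, arch_speedup(speedup), lags_user_code(speedup))
# brk 3.9 True
# fstat64 16.63 True
# read 21.72 False
```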
3.4 Interrupt Handling
Interrupt handling is a regular repetitive task that all operating
systems perform on behalf of the user threads for
which incoming or outgoing data is destined. Using the
Linux Trace Toolkit we found that the number of interrupts
handled by the operating system during a 10 second
period on the Pentium 4 was within 3% of the number
handled on the 486 during the same 10 second period.
These timings were performed with both interactive and
computationally intense workloads. The average of these
two workloads is cited in Figure 2, which shows these external
interrupts are regular in their arrival. In many cases
these interrupts are not maskable, meaning they must be
dealt with immediately and often cause a context switch
on their arrival.
Under the Linux OS the external timer interval is by default
10ms; this provides the maximum possible idle time
for a processor running the operating system. Including
external device interrupts, the interval between interrupts
dropped to 6.2ms on average for both the 486 and the
Pentium 4. Each interrupt causes on average 1.2 context
switches and requires 3 microseconds to be handled. The
IRQ handling cost is negligible compared to the context
switching cost and can be disregarded. Thus the average
cost of handling 193 context switches per second at approximately
18000 cycles per context switch (a 6 microsecond context
switch time measured on a 3GHz P4) is 3.5 million
cycles per second, or only about 0.1% of the machine’s
cycles.
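The overhead estimate above follows directly from the quoted figures, as the following sketch shows. All default values come from the text; the function and parameter names are ours.

```python
def interrupt_overhead(interval_s=6.2e-3,
                       switches_per_interrupt=1.2,
                       cycles_per_switch=18_000,
                       clock_hz=3.0e9):
    """Return (context switches/s, cycles/s, fraction of machine cycles)."""
    interrupts_per_s = 1.0 / interval_s
    switches_per_s = interrupts_per_s * switches_per_interrupt
    cycles_per_s = switches_per_s * cycles_per_switch
    return switches_per_s, cycles_per_s, cycles_per_s / clock_hz


switches, cyc, fraction = interrupt_overhead()
print(f"{switches:.1f} context switches per second")  # about 193.5
print(f"{cyc / 1e6:.2f} million cycles per second")   # about 3.48
print(f"{fraction * 100:.2f}% of machine cycles")     # about 0.12%
```

The same figures give 18000 cycles / 3GHz = 6 microseconds per context switch.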
While the total number of cycles spent handling interrupts
is very low, the performance implications are actually
quite high. The regular nature of interrupt handling
shown in Figure 2 generates regular context switching
which in turn causes destructive interference in
microarchitectural structures such as caches and branch
prediction history tables. The required warm up of such
structures on every context switch has been shown to have
significant impact on both operating system and application
performance [20].
4 Proposed Architectural Support for Operating Systems
Hardware specialization via offloading of applications 
from the general purpose processor has occurred many 
times for applications such as graphics processing, net­
work processing, and I/O control. These specialized pro­
cessors have been successful whereas many others, such 
as TCP off-load, have not. We identify three crucial re­
quirements that an offloaded application must meet to jus­
tify the use of specialized hardware. Failure to meet all 
three of the criteria likely indicates that the potential per­
formance and power benefits gained from hardware spe­
cialization will not exceed the communication overhead 
that is also introduced. Based on these criteria we believe 
the operating system is a prime candidate for hardware 
support in a chip multiprocessor configuration.
4.1 Criteria for Hardware Specialization
[Figure 2: Distribution of Interrupt Handling Over 10 Seconds. X-axis: time in 200ms intervals.]
Constant Execution. Applications that contribute a significant
portion of cycles to the total workload tend to [...]
us that we should spend our resources speeding up applications
that contribute the largest amount to total system
performance. Graphics processors, network handling,
and I/O controllers are all examples of applications that
are constantly running and consume a non-trivial fraction
of a time-shared processor’s cycles. In Section 2, we
established that the operating system minimally contributes
five percent overhead to any workload, and that many
applications require at least 38% OS overhead to achieve
graphical interactivity. I/O intensive applications can have
more than half of their total instructions occur within the
operating system.
Inefficient Execution on General Purpose Processors.
Offloaded applications must stand to gain a decisive performance
advantage by being offloaded to a more specialized
processor. If no performance gain can be achieved
when offloading an application, the additional hardware
will typically consume more power to achieve the same
total system performance. Conversely, it may also be
worthwhile to off-load applications that can be executed
with acceptable performance while utilizing significantly
less power. In Section 3, we established that operating
system performance improvement is lagging behind user
code performance improvements on modern machines by
four-fold. In the last 15 years, technology improvements
have increased the absolute performance of the operating
system three times as much as architectural innovations.
Having established the inefficiency of operating system
execution on modern processors, we believe it will be possible
to provide hardware support that will simultaneously
save power and increase performance through architectural
improvements.
Benefit To Remaining Codes Offloading a specific ap­
plication to specialized hardware must provide a signifi­
cant benefit to the codes that will remain on the general 
purpose processor. We have shown that the operating sys­
tem executes frequently and regularly, with at most 6.2ms 
between invocations when handling only interrupts. Ev­
ery invocation of the operating system introduces cache 
and branch history entries into performance critical struc­
tures. Upon return to the user codes, these structures are 
often no longer at their steady state and require a warm 
up period before they can be fully utilized again. The in­
terference caused by the OS in these structures has been 
shown to be worse than the interference caused by other 
user-level threads [20]. As a result, we believe that 
eliminating intermittent operating system execution from the 
main processing unit will result in improved performance 
for the user portions of application execution even with 
user applications now causing more direct interference be­
tween each other.
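The warm-up effect described above can be illustrated with a toy model. The sketch below is not part of our measurement methodology: it assumes a small direct-mapped cache, a hypothetical user working set that loops over a fixed set of lines, and a hypothetical OS invocation that touches its own lines, which map onto the same sets. All names and sizes are illustrative assumptions.

```python
def run(accesses, cache, num_sets):
    """Replay line addresses through a direct-mapped cache; return the miss count."""
    misses = 0
    for line in accesses:
        idx = line % num_sets
        if cache[idx] != line:
            misses += 1
            cache[idx] = line
    return misses


def user_misses_per_quantum(os_lines_touched, num_sets=64, user_lines=64, quanta=10):
    """Average user-code misses per quantum when an OS burst precedes each quantum."""
    cache = [None] * num_sets
    # Hypothetical user working set that exactly fits the cache.
    user_loop = list(range(user_lines))
    # Hypothetical OS lines; they alias onto the same sets as user lines.
    os_burst = [1000 + i for i in range(os_lines_touched)]
    run(user_loop, cache, num_sets)               # initial warm-up, not counted
    total = 0
    for _ in range(quanta):
        run(os_burst, cache, num_sets)            # OS invocation evicts user lines
        total += run(user_loop, cache, num_sets)  # user code must refill them
    return total / quanta
```

In this model every line the OS touches costs the user code one refill miss after each invocation, while a processor that never runs the OS burst stays at its steady state.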
4.2 Proposed Architecture
We believe the operating system meets the key criteria 
that help identify exceptionally good candidates for additional 
hardware support. With the advent of chip multi-processors, 
adding more cores on a single die is becoming commonplace. 
Future chips will likely have significantly more than two cores, 
the standard today, on a single die. We propose a heterogeneous 
CMP such that one 
core will be customized to handle operating system execu­
tion while the remaining cores are left to execute general 
purpose codes. Other work [21, 22, 23] has explored the 
possibility of heterogeneous chip multi-processors but to 
our knowledge our work is the first to propose a heteroge­
neous CMP that targets a ubiquitous application (the OS) 
with known poor performance.
Our results in Sections 2 and 3 have shown that, scaled for 
technology, a classic 5 stage pipeline architecture such as 
that found on a 486 is surprisingly close in performance 
to a modern Pentium 4 when executing operating system 
codes. More startling is that the 486 architecture uses only
1.2 million transistors while the Pentium 4 uses well over 
200 million. We believe an architecture not significantly 
more complex than the 486 could be customized to exe­
cute operating system codes as fast or faster than modern 
general purpose processors at a cost of less than a few million 
transistors, an insignificant fraction of today’s transistor 
budgets. At similar performance we also believe this 
semi-custom processor will do significantly better on metrics which 
take energy expenditure into account, such as energy de­
lay product. The purpose of this work was to evaluate the 
need and potential benefits of an architecture of this na­
ture; detailed evaluations of an architecture based on our 
current observations will be part of future work.
4.3 Potential Performance Benefits
We believe a simple architecture will be able to execute 
the operating system more efficiently and there are several 
key architectural features that will improve overall sys­
tem throughput greatly. The operating system typically 
executes code segments that are significantly smaller than 
user codes. In Section 3, we showed that system calls 
and interrupt handling typically require tens of microsec­
onds to execute upon invocation. These short bursts of in­
structions do not allow for caches and branch predictors to 
warm up properly, reducing their effectiveness. Similarly, 
deep pipelines excel at throughput oriented tasks at the ex­
pense of latency. They are also costly when branch mis­
predictions occur, events that currently happen frequently 
in operating system codes [20].
Slight modifications to current chip multiprocessor archi­
tectures can potentially yield an even more significant per­
formance benefit. Currently, the operating system must 
examine its run queues to determine which application 
will be scheduled next. On uni-processors this is a sequen­
tial operation with the OS performing this calculation be­
tween every two user processes. Offloading the operating 
system to dedicated hardware would not only allow this 
function to be precomputed, but could allow the operating 
system to prefetch context information into a high speed 
context buffer causing reduced context switch times. Sim­
ilarly, with a priori knowledge of upcoming contexts, the 
operating system is in a special position to warm up mi- 
croarchitectural structures by prefetching instructions and 
data into the L2 that are likely to be used by the incoming 
application. We believe that the knowledge the operating 
system has of contexts can be heavily utilized to reduce 
these traditional context switching penalties in a variety 
of ways.
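A rough timing model makes the argument concrete. The sketch below is only an illustration under stated assumptions: the function names, the latencies passed in, and the notion of a fixed buffer swap cost are all hypothetical and are not measured values from this work. It assumes that on a uni-processor the scheduling decision and the context load both sit on the critical path, while an OSP can overlap both with the currently running user quantum.

```python
def switch_cost_uniprocessor(sched_us, load_us):
    """Scheduling and context load are both exposed between two user processes."""
    return sched_us + load_us


def switch_cost_with_osp(sched_us, load_us, quantum_us, swap_us):
    """Exposed switch cost when an OSP precomputes the next context.

    The OSP (hypothetically) picks the next task and prefetches its context
    into a context buffer while the current task is still running. Only work
    that does not fit inside the quantum stays exposed, plus the cost of
    swapping in the prefetched buffer.
    """
    exposed = max(0.0, (sched_us + load_us) - quantum_us)
    return exposed + swap_us
```

With a quantum much longer than the scheduling and load time, the exposed cost in this model collapses to the swap cost alone; with very short quanta the benefit shrinks accordingly.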
4.4 Potential Power Savings
Dynamic power dissipation has traditionally dominated 
the energy expended by a processor and techniques such 
as clock gating and frequency scaling are being utilized 
to help minimize the impact of this transistor switching. 
As process sizes shrink, static power begins dominating 
the total power consumption of microprocessors [24, 25]. 
Static power dissipation can be mitigated through power 
gating and voltage scaling, but these techniques require 
a significant number of cycles to switch between modes. 
These methods often cannot be utilized because the 
processor is not idle for sufficient intervals of time.
The Pentium 4 processor at 3.8GHz has thermal pack­
aging to support a steady state power dissipation of 115 
watts. As die temperature rises, the P4 can modulate 
its clock on and off to reduce dynamic power consumption 
by up to 82.5% [1]. Unfortunately, with static power 
estimated to be nearing half of total power consumption, 
using only the Pentium 4’s on-demand clock modulation, 
the processor still consumes 68 watts. Eight of these watts 
are due to dynamic power consumption, while 60 watts 
are static power loss that cannot be addressed by clock 
modulation.
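The figures quoted above can be tallied directly. The short sketch below uses only the wattages stated in the text (115, 60 and 8 watts); the function and key names are illustrative.

```python
TOTAL_W = 115.0            # steady-state dissipation quoted in the text
STATIC_W = 60.0            # static power loss quoted in the text
DYNAMIC_MODULATED_W = 8.0  # dynamic power remaining under clock modulation


def modulated_budget(total_w=TOTAL_W, static_w=STATIC_W,
                     dynamic_w=DYNAMIC_MODULATED_W):
    """Summarize what clock modulation alone leaves on the table."""
    remaining = static_w + dynamic_w
    return {
        "remaining_w": remaining,
        "saved_fraction": (total_w - remaining) / total_w,
        "static_share": static_w / remaining,
    }
```

Under these numbers clock modulation removes roughly 41% of the total budget, and close to 88% of what remains is static power, which is why a mechanism that addresses static dissipation is of interest.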
In Section 3, we identified interrupt handling as a peri­
odic task with an average interval of 6.8 milliseconds be­
tween invocations. On a uni-processor, this period is the 
longest possible time a processor can perform power gat­
ing and reduce static power dissipation. On idle machines, 
this interval becomes the limiting factor for potential static 
power savings because the machine must wake up on aver­
age every 6.8ms before returning to deep sleep mode. By 
offloading this interrupt handling to an operating system 
processor, which we expect to consume a trivial amount of 
power relative to modern processors, these high wattage 
cores can be powered down for much longer periods of 
time resulting in significant total power savings. Note that 
the power benefits above are in addition to the power ben­
efits of executing OS code on a simple shallow pipeline.
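The effect of the wake-up interval on power gating can be sketched with a simple duty-cycle model. This is an illustration under assumptions, not a measured result: the transition overhead and handler time passed in are hypothetical parameters, and the model ignores any break-even energy cost of entering the gated state.

```python
def gated_fraction(wake_interval_ms, transition_ms, handler_ms):
    """Fraction of each wake interval a core can spend power gated.

    wake_interval_ms: time between forced wake-ups of the core
    transition_ms:    hypothetical combined cost of entering and leaving sleep
    handler_ms:       hypothetical time spent servicing the wake-up
    """
    usable = wake_interval_ms - transition_ms - handler_ms
    return max(0.0, usable) / wake_interval_ms
```

For example, assuming a 1 ms transition cost and a 0.05 ms handler, a core woken every 6.8 ms could be gated for about 85% of the time, while a core that an OSP shields from interrupts and wakes only every 100 ms (a hypothetical interval) could be gated for about 99% of the time.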
5 Conclusions
We have shown that the operating system minimally poses 
a 3-5% overhead for all applications and much more for 
some. Applications which require user interaction utiliz­
ing X Windows can incur a much larger operating system 
overhead of 38%. Applications which have significant I/O 
requirements can have even larger operating system components. 
For applications in which the operating system 
contributes a significant portion of their total instructions, 
the performance of operating system codes is critical for 
achieving high application performance. The modeling of 
operating systems in architectural simulation is now critical 
if we wish to accurately predict performance gains in 
future generations of microprocessors.
In the last 15 years total system performance has improved 
by almost 200 times, 11 times from technology 
improvements, based on FO4 delay, and 19.7 times from 
architectural improvements. The operating system has received 
the same 11-fold improvement from technology but 
has seen only a 4-fold improvement from architectural improvements 
and executes nearly 5 times slower than application 
codes on modern hardware.
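The relationship between these factors can be checked with a few lines of arithmetic. The sketch below assumes that the technology and architectural gains compose multiplicatively, which is an assumption of this illustration; the three constants are the factors stated above and the names are illustrative.

```python
TECH = 11.0       # technology scaling factor, from the text
ARCH_USER = 19.7  # architectural gain for user code, from the text
ARCH_OS = 4.0     # architectural gain for OS code, from the text


def total_speedup(tech, arch):
    """Combined speedup, assuming the two factors multiply."""
    return tech * arch


user_total = total_speedup(TECH, ARCH_USER)
os_total = total_speedup(TECH, ARCH_OS)
# The technology factor cancels, leaving the ratio of architectural gains.
gap = user_total / os_total
```

Under this assumption user code comes out at roughly 217 times faster, on the order of the 200-fold figure, the operating system at 44 times faster, and the gap between them at about 4.9, matching the "nearly 5 times slower" observation.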
Based on these studies we propose the concept of a hetero­
geneous chip multiprocessor consisting of a semi-custom 
operating system processor (OSP) intelligently coupled 
with one or more traditional general purpose proces­
sors (GPP). This OSP is likely to have a significantly 
more shallow pipeline than current processors which will
maximize performance while also increasing energy effi­
ciency. Providing external interfaces to performance criti­
cal structures in the GPP such as branch history tables and 
caches would allow the OSP to actively pre-warm these 
structures for incoming contexts of which the OSP has a 
priori knowledge. Recurring tasks, such as interrupt han­
dling, which significantly under-utilize the GPP can now 
be offloaded to a power efficient OSP and allow more ag­
gressive power down of high wattage cores. We believe 
these benefits will come at an increased transistor budget 
of at most a few percent.
References
[1] Intel Corporation, “Intel Pentium 4 processors 570/571, 
560/561, 550/551, 540/541, 530/531 and 520/521 
supporting Hyper-Threading Technology,” pp. 76-81.
[2] D. Burger and T. M. Austin, “The SimpleScalar tool 
set, version 2.0,” SIGARCH Comput. Archit. News, 
vol. 25, no. 3, pp. 13-25, 1997.
[3] M. Rosenblum, E. Bugnion, S. Devine, and S. A. 
Herrod, “Using the SimOS machine simulator to 
study complex computer systems,” Modeling and 
Computer Simulation, vol. 7, no. 1, pp. 78-103, 
1997, citeseer.nj.nec.com/rosenblum97using.html.
[4] Virtutech, “Simics full system simulator,” January
2004, http://www.virtutech.com.
[5] L. Schaelicke, “L-RSIM: A simulation environment 
for I/O intensive workloads,” in Proceedings 
of the 3rd Annual IEEE Workshop on Workload 
Characterization 2000, Los Alamitos, CA 
(USA), 2000, pp. 83-89. [Online]. Available: 
citeseer.ist.psu.edu/schaelicke01architectural.html
[6] M. Vachharajani, N. Vachharajani, D. A. Penry, 
J. Blome, and D. I. August, “The Liberty simulation 
environment, version 1.0,” Performance Evaluation 
Review: Special Issue on Tools for Architecture Research, 
vol. 31, no. 4, March 2004.
[7] J. Redstone, S. J. Eggers, and H. M. Levy, “An analysis 
of operating system behavior on a simultaneous 
multithreaded architecture,” in Architectural Support 
for Programming Languages and Operating 
Systems, 2000, pp. 245-256. [Online]. Available: 
citeseer.nj.nec.com/redstone00analysis.html
[8] J. B. Chen and B. N. Bershad, “The impact 
of operating system structure on memory system 
performance,” in Symposium on Operating Systems 
Principles, 1993, pp. 120-133. [Online]. Available: 
citeseer.nj.nec.com/chen93impact.html
[9] A. Alameldeen, C. Mauer, M. Xu, P. Harper, 
M. Martin, D. Sorin, M. Hill, and D. Wood, “Evaluating 
Non-Deterministic Multi-Threaded Commercial 
Workloads,” in Proceedings of Computer Architecture 
Evaluation using Commercial Workloads 
(CAECW), February 2002.
[10] L. Barroso, K. Gharachorloo, A. Nowatzyk, and 
B. Verghese, “Impact of Chip-Level Integration on 
Performance of OLTP Workloads,” in Proceedings 
of HPCA-6, January 2000.
[11] M. Martin, D. Sorin, B. Beckmann, M. Marty, 
M. Xu, A. Alameldeen, K. Moore, M. Hill, and 
D. Wood, “Multifacet’s General Execution-Driven 
Multiprocessor Simulator (GEMS) Toolset,” Com­
puter Architecture News, 2005.
[12] BYTE Magazine, “BYTE Magazine’s 
BYTEmark Benchmark Program 
http://www.byte.com/bmark/bdoc.htm,” 1998.
[13] R. Coker, “Bonnie++ filesystem benchmark 
http://freshmeat.net/projects/bonnie/, 
http://www.coker.com.au/bonnie++/,” 2003.
[14] M. S. Hrishikesh, D. Burger, N. P. Jouppi, S. W. 
Keckler, K. I. Farkas, and P. Shivakumar, “The optimal 
logic depth per pipeline stage is 6 to 8 FO4 inverter 
delays,” in ISCA ’02: Proceedings of the 29th 
annual international symposium on Computer architecture. 
Washington, DC, USA: IEEE Computer 
Society, 2002, pp. 14-24.
[15] S. Palacharla, N. P. Jouppi, and J. E. Smith, 
“Complexity-effective superscalar processors,” in 
ISCA, 1997, pp. 206-218. [Online]. Available: 
citeseer.ist.psu.edu/palacharla97complexity 
effective.html
[16] T. Ungerer, B. Robič, and J. Šilc, “A survey of processors 
with explicit multithreading,” ACM Comput. 
Surv., vol. 35, no. 1, pp. 29-63, 2003.
[17] M. Evers, P.-Y. Chang, and Y. N. Patt, “Using hybrid 
branch predictors to improve branch prediction accuracy 
in the presence of context switches,” in Proceedings 
of the 23rd Annual International Symposium 
on Computer Architecture, May 1996, pp. 3-11.
[18] D. E. G. Bradford, “IBM developerWorks, runtime: 
Context switching, part 1, 
http://www-106.ibm.com/developerworks/linux/library/l-rt9/?t=gr,lnxw03=rtcshl,” 2004.
[19] K. Yaghmour and M. Dagenais, “Measuring and 
characterizing system behavior using kernel-level 
event logging,” in USENIX Annual Technical Conference, 
General Track, 2000, pp. 13-26.
[20] T. Li, L. John, A. Sivasubramaniam, N. Vijaykrishnan, 
and J. Rubio, “Understanding and 
improving operating system effects in control flow 
prediction,” Department of Electrical and Computer 
Engineering, University of Texas at Austin 
Technical Report, June 2002. [Online]. Available: 
citeseer.ist.psu.edu/li02understanding.html
[21] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, 
and D. M. Tullsen, “Single-ISA heterogeneous 
multi-core architectures: The potential for 
processor power reduction,” in MICRO, Proceedings 
of the 36th International Symposium on Microarchitecture, 
December 2003.
[22] S. Balakrishnan, R. Rajwar, M. Upton, and K. K. 
Lai, “The impact of performance asymmetry in 
emerging multicore architectures,” in ISCA, 2005, 
pp. 506-517.
[23] H. P. Hofstee, “Power efficient processor architecture 
and the Cell processor,” in HPCA, 2005, pp. 
258-262.
[24] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, 
and T. N. Vijaykumar, “An integrated 
circuit/architecture approach to reducing leakage 
in deep-submicron high-performance i-caches,” in 
HPCA, 2001, pp. 147-158. [Online]. Available: 
citeseer.ist.psu.edu/yang01integrated.html
[25] J. A. Butts and G. S. Sohi, “A static power 
model for architects,” in International Symposium 
on Microarchitecture, 2000, pp. 191-201. [Online]. 
Available: citeseer.ist.psu.edu/butts00static.html
