Sensitivity Analysis of Core Specialization Techniques by Kallurkar, Prathmesh & Sarangi, Smruti R.
Prathmesh Kallurkar ∗
Microarchitecture Research Lab
Intel Corporation
e-mail: prathmesh.kallurkar@intel.com
Smruti R. Sarangi
Department of Computer Science
Indian Institute of Technology Delhi
e-mail: srsarangi@cse.iitd.ac.in
Abstract
The instruction footprint of OS-intensive workloads such as web servers,
database servers, and file servers typically exceeds the size of the
instruction cache (32 KB). Consequently, such workloads incur many i-cache
misses, which drastically reduce their performance. Several papers
[6, 8, 5, 2, 3] have proposed improving the performance of such workloads
using core specialization. In this scheme, tasks with different instruction
footprints are executed on different cores. In this report, we study the
performance of five state-of-the-art core specialization techniques:
SelectiveOffload [6], FlexSC [8], DisAggregateOS [5], SLICC [2], and
SchedTask [3] for different system parameters. Our studies show that for a
suite of 8 popular OS-intensive workloads, SchedTask performs best for all
evaluated configurations.
1 Multi-programmed Workloads
We compare the impact of all core specialization techniques on a server
that is executing multiple OS-intensive applications. Table 1 shows the
constituent benchmarks and their workloads for each multi-programmed
workload, and Figure 1 shows the impact of different core specialization
techniques on the weighted instruction throughput of each multi-programmed
workload. We
∗ The author contributed to this work while at the Indian Institute of Technology Delhi.
Bag ID   Constituent benchmarks             Workload of individual benchmark
MPW-A    DSS, FileSrv                       1X
MPW-B    Apache, OLTP                       1X
MPW-C    Apache, DSS, FileSrv, Iscp         0.5X
MPW-D    Apache, DSS, Find, OLTP            0.5X
MPW-E    Find, FileSrv, Iscp, Oscp          0.5X
MPW-F    Apache, FileSrv, MailSrvIO, OLTP   0.5X
Table 1: Constituent benchmarks of multi-programmed workloads
[Bar chart. x-axis: MPW-A, MPW-B, MPW-C, MPW-D, MPW-E, MPW-F, gmean. y-axis: Change in inst. throughput (%). Series: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, SchedTask. Bars clipped below the axis are labeled -13 and -19.]
Figure 1: Impact of different techniques on the instruction throughput of a system executing multi-programmed workloads
start by allocating an equal number of cores to each benchmark and then let
the scheduling techniques decide the appropriate number of cores to execute
the constituent tasks of each multi-programmed workload. The mean
improvement in the weighted instruction throughput for these techniques is:
SelectiveOffload (21.48%), FlexSC (-2.26%), DisAggregateOS (9.47%),
SLICC (5.64%), and SchedTask (23.94%). The primary point to note from
Figure 1 is that the performance of SLICC is low for multi-programmed
workloads. This is an artifact of SLICC's thread decomposition policy, which
does not group common portions of OS execution across different
applications. FlexSC, DisAggregateOS, and SchedTask group system calls based
on their IDs. Hence, for these techniques, there is a high correlation
between the performance of a multi-programmed workload and that of its
constituent benchmarks.
arXiv:1708.03900v1 [cs.AR] 13 Aug 2017
iSize  Technique        Find       Iscp       Oscp       Apache     DSS        FileSrv    MailSrvIO  OLTP       geom. mean
16 KB  SelectiveOffload 1/10       1/21       1/12       1/31       3/5        0/6        1/0        4/11       1/12
16 KB  FlexSC           7/-48      6/-40      6/-50      -1/12      1/6        1/25       2/12       2/10       3/-14
16 KB  DisAggregateOS   2/0        1/14       1/10       2/20       3/6        1/16       3/0        4/9        2/9
16 KB  SLICC            1/4        1/24       1/12       1/1        2/5        1/15       2/0        2/11       1/8
16 KB  SchedTask        2/11       1/40       1/23       2/44       2/10       1/34       2/28       3/17       1/25
32 KB  SelectiveOffload 2/7        2/21       1/8        2/27       3/5        1/4        3/0        3/9        2/10
32 KB  FlexSC           10/-51     7/-44      6/-56      -1/7       2/6        2/29       2/12       1/4        3/-18
32 KB  DisAggregateOS   3/-2       2/16       1/4        4/20       3/6        2/20       5/4        3/6        3/9
32 KB  SLICC            4/3        2/28       1/7        3/9        1/5        1/20       3/2        2/13       2/11
32 KB  SchedTask        4/7        3/39       1/15       4/38       3/10       2/44       4/28       3/12       3/23
64 KB  SelectiveOffload 3/6        3/22       2/6        4/26       0/5        1/5        3/-1       2/8        2/9
64 KB  FlexSC           8/-52      6/-45      4/-57      1/8        0/6        1/23       2/12       1/4        3/-19
64 KB  DisAggregateOS   5/-1       4/16       2/3        8/22       0/6        1/27       4/5        3/5        3/10
64 KB  SLICC            5/4        3/33       2/7        8/21       0/6        1/26       2/5        3/19       3/15
64 KB  SchedTask        5/6        4/39       2/13       8/37       0/11       1/36       3/28       2/13       3/22
Each cell is iHit/Perf. iSize is the size of the i-cache. iHit and Perf are the changes (%) in the i-cache hit rate and the instruction throughput, respectively, relative to the baseline with the same i-cache size.
Table 2: Impact of the size of the instruction cache on the instruction cache hit rate and instruction throughput
2 Instruction Cache Size
Table 2 shows the impact of the i-cache size on the i-cache hit rate and the
instruction throughput achieved by all core specialization techniques. We
evaluate all techniques for the following three i-cache configurations:
4-way 16 KB, 4-way 32 KB, and 4-way 64 KB. A baseline system with a smaller
i-cache incurs more cache misses, and therefore the core specialization
techniques have more room to improve instruction throughput. Our proposed
technique, SchedTask, improves throughput by 25%, 23%, and 22% over the
baseline for a 16 KB, a 32 KB, and a 64 KB i-cache system, respectively.
This is 13, 12, and 7 percentage points, respectively, more than the best
competing state-of-the-art technique.
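The margins quoted above can be reproduced from the geom. mean Perf column of Table 2 by subtracting the best competing technique's gain from SchedTask's gain. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name margin_over_best_rival are illustrative choices, not part of the original evaluation.

```python
# Geometric-mean throughput change (%) per technique, copied from the
# "geom. mean" Perf column of Table 2.
GEOMEAN_PERF = {
    "16 KB": {"SelectiveOffload": 12, "FlexSC": -14, "DisAggregateOS": 9,
              "SLICC": 8, "SchedTask": 25},
    "32 KB": {"SelectiveOffload": 10, "FlexSC": -18, "DisAggregateOS": 9,
              "SLICC": 11, "SchedTask": 23},
    "64 KB": {"SelectiveOffload": 9, "FlexSC": -19, "DisAggregateOS": 10,
              "SLICC": 15, "SchedTask": 22},
}


def margin_over_best_rival(perf, technique="SchedTask"):
    """Return (best rival, gap in percentage points) for one configuration."""
    rivals = {name: gain for name, gain in perf.items() if name != technique}
    best = max(rivals, key=rivals.get)
    return best, perf[technique] - rivals[best]


for isize, perf in GEOMEAN_PERF.items():
    rival, gap = margin_over_best_rival(perf)
    print(f"{isize}: SchedTask leads {rival} by {gap} percentage points")
```

The gaps are differences between two percentage gains, so they are percentage points rather than relative improvements.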
3 Cache Configuration
Table 3 describes three cache configurations (Config1, Config2, and Config3)
and their impact on the instruction throughput of all techniques. Config1
and Config2 have a two-level cache hierarchy, whereas Config3 has a
three-level cache hierarchy. Since the performance benefit derived by a core
specialization technique is directly proportional to the i-cache miss
penalty, the performance of all techniques is the least for Config2 and the
most for Config1. Our proposed technique, SchedTask, improves throughput by
24%, 21%, and 23% over the baseline for a system with the Config1, Config2,
and Config3 cache configurations, respectively. This results in a 7, 6, and
12 percentage point enhancement in performance (respectively) over the best
existing techniques.
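To make the miss-penalty argument concrete, the sketch below estimates the expected cost of an i-cache miss for the three configurations from the latencies listed under Table 3. It is a simplified model and not the simulator used in this report: it assumes cache levels are probed serially so that latencies add, that the last cache level always hits, and that the private L2 of Config3 has a local hit rate of 0.6. That hit rate is a purely hypothetical value chosen for illustration.

```python
def expected_miss_penalty(latencies, hit_rates):
    """Expected cycles to service an i-cache miss.

    latencies[i] is the access latency of the i-th cache level below the
    i-cache and hit_rates[i] is its local hit rate. Levels are assumed to
    be probed one after another, so latencies accumulate.
    """
    penalty, p_reach = 0.0, 1.0
    for latency, hit_rate in zip(latencies, hit_rates):
        penalty += p_reach * latency   # pay this level's latency if reached
        p_reach *= 1.0 - hit_rate      # probability of going one level down
    return penalty


L2_HIT_RATE = 0.6  # hypothetical value, not taken from the report

CONFIGS = {
    "Config1": ([18], [1.0]),                 # shared L2, 18 cycles
    "Config2": ([8], [1.0]),                  # shared L2, 8 cycles
    "Config3": ([8, 18], [L2_HIT_RATE, 1.0]), # private L2, shared L3
}

for name, (latencies, hit_rates) in CONFIGS.items():
    print(name, expected_miss_penalty(latencies, hit_rates))
```

Under these assumptions Config2 has the smallest miss penalty and Config1 the largest, with Config3 in between; the position of Config3 depends on the assumed L2 hit rate.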
4 Number of Cores
Table 4 shows the impact of the number of cores on the instruction
throughput of different core specialization techniques. We evaluate all the
techniques on four systems with 8, 16, 24, and 32 cores. We do not consider
a system with fewer than 8 cores because such a system is not practical for
the OS-intensive server-class workloads that we consider. Our proposed
technique, SchedTask, improves throughput by 18%, 27%, 27%, and 23% over the
baseline for a system with 8, 16, 24, and 32 cores, respectively. This is
3, 9, 12, and 12 percentage points, respectively, more than the best
existing technique.
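The geom. mean column of Table 4 appears consistent with a geometric mean taken over per-benchmark throughput ratios (1 + change/100) rather than over the raw percentages, which also handles the negative entries. The sketch below works under that assumption; the report does not state the exact formula, and the function name geomean_change is illustrative.

```python
import math


def geomean_change(changes_pct):
    """Geometric mean of throughput ratios, returned as a percentage change."""
    ratios = [1.0 + change / 100.0 for change in changes_pct]
    log_mean = sum(math.log(ratio) for ratio in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


# Per-benchmark changes (%) from the 32-core rows of Table 4, in the order
# Find, Iscp, Oscp, Apache, DSS, FileSrv, MailSrvIO, OLTP.
sched_task_32 = [7, 39, 15, 38, 10, 44, 28, 12]
flexsc_32 = [-51, -44, -56, 7, 6, 29, 12, 4]

print(round(geomean_change(sched_task_32)))  # Table 4 reports 23
print(round(geomean_change(flexsc_32)))      # Table 4 reports -18
```

Because the table entries are rounded to whole percentages, recomputed means can differ from the published column by about one point.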
5 Instruction Prefetcher
Figure 2 shows the impact of core specialization techniques on the
instruction throughput when the baseline system employs a hardware
instruction prefetcher. We use the hardware-only mode (CGHC-2K+32K) of the
Call Graph Prefetcher (CGP) [1] as the instruction prefetcher. We use CGP
because its hardware overheads are not
Config   Technique        Find       Iscp       Oscp       Apache     DSS        FileSrv    MailSrvIO  OLTP       geom. mean
Config1  SelectiveOffload -1         18         14         18         9          13         17         12         12
Config1  FlexSC           -57        -51        -60        17         1          11         20         8          -21
Config1  DisAggregateOS   -7         9          10         0          2          16         25         9          7
Config1  SLICC            3          27         16         20         6          18         15         18         15
Config1  SchedTask        11         36         21         38         2          30         33         14         23
Config2  SelectiveOffload -2         16         14         16         10         10         13         10         11
Config2  FlexSC           -59        -53        -61        15         1          12         19         5          -23
Config2  DisAggregateOS   -9         1          0          18         2          10         21         8          6
Config2  SLICC            1          25         16         18         5          18         11         15         13
Config2  SchedTask        7          33         20         31         2          27         28         10         19
Config3  SelectiveOffload 7          21         8          27         5          4          0          9          10
Config3  FlexSC           -51        -44        -56        7          6          29         12         4          -18
Config3  DisAggregateOS   -2         16         4          20         6          20         4          6          9
Config3  SLICC            3          28         7          9          5          20         2          13         11
Config3  SchedTask        7          39         15         38         10         44         28         12         23
All values are the change in the instruction throughput (%) relative to the baseline system with the same cache configuration.
Config1 → Private caches: i-cache and d-cache (4-way, 32 KB, latency = 3 cycles). Shared cache: L2 cache (8-way, 8 MB, latency = 18 cycles).
Config2 → Private caches: i-cache and d-cache (4-way, 32 KB, latency = 3 cycles). Shared cache: L2 cache (8-way, 8 MB, latency = 8 cycles).
Config3 → Private caches: i-cache and d-cache (4-way, 32 KB, latency = 3 cycles), L2 cache (4-way, 256 KB, latency = 8 cycles). Shared cache: L3 cache (8-way, 8 MB, latency = 18 cycles).
Table 3: Impact of the cache configuration on the instruction throughput
#cores    Technique        Find       Iscp       Oscp       Apache     DSS        FileSrv    MailSrvIO  OLTP       geom. mean
8 cores   SelectiveOffload 14         22         17         48         5          2          -1         17         15
8 cores   FlexSC           -24        -26        -41        13         6          5          12         3          -8
8 cores   DisAggregateOS   -17        -14        -16        0          -10        -19        -28        -1         -14
8 cores   SLICC            6          -5         -13        -4         -3         -10        -11        -5         -6
8 cores   SchedTask        20         24         10         36         9          16         22         12         18
16 cores  SelectiveOffload 19         27         26         47         4          6          0          23         18
16 cores  FlexSC           -24        -24        -40        13         5          4          15         13         -7
16 cores  DisAggregateOS   -1         -10        -14        3          -5         -15        -23        4          -8
16 cores  SLICC            17         6          -2         3          3          -3         -4         8          3
16 cores  SchedTask        32         37         22         51         8          17         31         26         27
24 cores  SelectiveOffload 15         29         16         40         5          6          0          15         15
24 cores  FlexSC           -45        -35        -53        11         8          22         13         10         -13
24 cores  DisAggregateOS   -4         1          -1         6          0          2          -12        4          0
24 cores  SLICC            7          25         6          6          8          9          0          13         9
24 cores  SchedTask        15         47         23         51         13         27         28         18         27
32 cores  SelectiveOffload 7          21         8          27         5          4          0          9          10
32 cores  FlexSC           -51        -44        -56        7          6          29         12         4          -18
32 cores  DisAggregateOS   -2         16         4          20         6          20         4          6          9
32 cores  SLICC            3          28         7          9          5          20         2          13         11
32 cores  SchedTask        7          39         15         38         10         44         28         12         23
All values are the change in the instruction throughput (%) relative to the baseline system with the same number of cores.
Table 4: Impact of the number of cores on the instruction throughput
[Bar chart. x-axis: Find, Iscp, Oscp, Apache, DSS, FileSrv, MailSrvIO, OLTP, gmean. y-axis: Change in inst. throughput (%). Series: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, SchedTask. Bars clipped below the axis are labeled -51, -44, -56, and -20.]
Figure 2: Impact of the instruction prefetcher on the instruction throughput
very high and it has been shown to perform better than classical instruction
prefetchers such as the next-line prefetcher and the correlation-based
prefetcher [7]. We observe that CGP reduces the number of i-cache misses by
20-30% and thus improves the performance of a system without an instruction
prefetcher by around 4-5%¹. Since a baseline system with CGP incurs fewer
i-cache misses, the scheduling techniques gain less by improving the
instruction locality. The mean improvements in the instruction throughput of
the system after employing CGP are: SelectiveOffload (8.37%),
FlexSC (-20.93%), DisAggregateOS (8.57%), SLICC (4.28%), and
SchedTask (19.6%).
6 Trace Cache
Figure 3 shows the impact of core specialization techniques on the
instruction throughput when the baseline system employs a trace cache. We
use the trace cache implementation that was proposed in [4]. We observe that
since the instruction footprints of the considered workloads are very large
(> 250 KB), traces belonging to different SuperFunctions keep evicting each
other from the shared trace cache. Hence, the performance gains derived by
using core specialization techniques do not change
¹ The original paper [1] uses only a 2-level memory hierarchy and hence reports a larger performance improvement.
[Bar chart. x-axis: Find, Iscp, Oscp, Apache, DSS, FileSrv, MailSrvIO, OLTP, gmean. y-axis: Change in inst. throughput (%). Series: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, SchedTask. Bars clipped below the axis are labeled -53, -46, -57, and -20.]
Figure 3: Impact of the trace cache on the instruction throughput
much for a system employing a trace cache versus one that does not employ a
trace cache. For a system that employs a trace cache, the mean performance
gains derived by different techniques are: SelectiveOffload (7.2%),
FlexSC (-20.38%), DisAggregateOS (6.67%), SLICC (8.04%), and
SchedTask (20.6%).
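The size of the shift can be checked by comparing these means with the geometric means of a system without a trace cache. The sketch below assumes that the comparison point is the Config3 rows of Table 3 (identical to the 32-core rows of Table 4); the report does not state this explicitly, so that pairing is an assumption, and the helper name shifts is illustrative.

```python
# Geometric-mean throughput change (%) per technique.
# "no_trace_cache" is copied from the Config3 rows of Table 3 (assumed to be
# the comparison point); "trace_cache" is copied from the text above.
no_trace_cache = {"SelectiveOffload": 10, "FlexSC": -18, "DisAggregateOS": 9,
                  "SLICC": 11, "SchedTask": 23}
trace_cache = {"SelectiveOffload": 7.2, "FlexSC": -20.38,
               "DisAggregateOS": 6.67, "SLICC": 8.04, "SchedTask": 20.6}


def shifts(before, after):
    """Per-technique difference (after - before) in percentage points."""
    return {name: round(after[name] - before[name], 2) for name in before}


for name, delta in shifts(no_trace_cache, trace_cache).items():
    print(f"{name}: {delta:+.2f} percentage points")
```

Under that assumption every technique moves by less than 3 percentage points, and the ordering of SchedTask relative to the other techniques is unchanged.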
7 Conclusion
In this report, we studied the sensitivity of five state-of-the-art core
specialization techniques to multi-programmed workloads, i-cache sizes,
cache configurations, core counts, instruction prefetchers, and trace
caches. Our studies show that SchedTask [3] outperforms the other techniques
[6, 8, 5, 2] for all evaluated configurations. This is because SchedTask
employs a fine-grained task scheduler and a superior work stealing
algorithm.
Acknowledgment
We thank Omais Pandith and Himani Raina for providing us their Tejas model
of “Trace Based Instruction Caching”; it helped us evaluate the impact of
trace caches on different core specialization techniques.
References
[1] M. Annavaram, J. M. Patel, and E. S. Davidson. Call graph prefetching for database applications. ACM Transactions on Computer Systems, 2003.
[2] I. Atta, P. Tozun, A. Ailamaki, and A. Moshovos. SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads. In MICRO, 2012.
[3] P. Kallurkar and S. R. Sarangi. SchedTask: A Hardware-Assisted Task Scheduler. In MICRO, 2017.
[4] R. F. Krick, G. J. Hinton, M. D. Upton, D. J. Sager, and C. W. Lee. Trace based instruction caching, 2000. US Patent 6,018,786.
[5] M. Lee. Memory region: a system abstraction for managing the complex memory structures of multi-core platforms. PhD thesis, Georgia Institute of Technology, 2013.
[6] D. Nellans, R. Balasubramonian, and E. Brunvand. Interference Aware Cache Designs for Operating System Execution. University of Utah, Tech. Rep. UUCS-09-002, 2009.
[7] K. J. Nesbit and J. E. Smith. Data cache prefetching using a global history buffer. MICRO, 2005.
[8] L. Soares and M. Stumm. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In OSDI, 2010.
