The instruction footprint of OS-intensive workloads such as web servers, database servers, and file servers typically exceeds the size of the instruction cache (32 KB). Consequently, such workloads incur many i-cache misses, which drastically reduce their performance. Several papers [6, 8, 5, 2, 3] have proposed improving the performance of such workloads using core specialization. In this scheme, tasks with different instruction footprints are executed on different cores. In this report, we study the performance of five state-of-the-art core specialization techniques: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, and SchedTask.
Multi-programmed Workloads
We compare the impact of all core specialization techniques on a server that executes multiple OS-intensive applications. Table 1 lists the constituent benchmarks and their workloads for each multi-programmed workload, and Figure 1 shows the impact of the different core specialization techniques on the weighted instruction throughput of each multi-programmed workload.

* The author contributed to this work while at Indian Institute of Technology Delhi.

Instruction Cache Size
Table 2 shows the impact of the i-cache size on the i-cache hit rate and the instruction throughput achieved by all core specialization techniques. We evaluate all techniques for the following three i-cache configurations: 4-way 16 KB, 4-way 32 KB, and 4-way 64 KB. A baseline system with a smaller i-cache incurs more cache misses, so the core specialization techniques have more room to improve instruction throughput. Our proposed technique improves throughput by 25%, 23%, and 22% over the baseline for a 16 KB, a 32 KB, and a 64 KB i-cache system, respectively. This results in a performance improvement of 13%, 12%, and 7%, respectively, over the best state-of-the-art techniques.

Table 2: Impact of the size of the instruction cache on the instruction cache hit rate and instruction throughput. iSize is the size of the i-cache. iHit and Perf are the changes (%) in the i-cache hit rate and the instruction throughput, respectively, relative to the baseline with the same i-cache size.

Cache Configuration

Number of Cores
Table 4 shows the impact of the number of cores on the instruction throughput of the different core specialization techniques. We evaluate all techniques on four systems, with 8, 16, 24, and 32 cores. We do not consider a system with fewer than 8 cores because such a system is not practical for the OS-intensive server-class workloads that we consider.
Our proposed technique improves throughput by 18%, 27%, 27%, and 23% over the baseline for a system with 8, 16, 24, and 32 cores, respectively. These gains are 3, 9, 12, and 12 percentage points higher, respectively, than those of the best existing techniques.

Instruction Prefetcher
Figure 2 shows the impact of core specialization techniques on the instruction throughput when the baseline system employs a hardware instruction prefetcher. We use the hardware-only mode (CGHC-2K+32K) of the Call Graph Prefetcher (CGP) [1] as the instruction prefetcher. We use CGP because its hardware overheads are not very high and it has been shown to perform better than classical instruction prefetchers such as the next-line prefetcher and the correlation-based prefetcher [7]. We observe that CGP reduces the number of i-cache misses by 20-30% and thus improves the performance of a system without an instruction prefetcher by around 4-5% (footnote 1). Since a baseline system with CGP incurs fewer i-cache misses, the scheduling techniques gain less from improving the instruction locality. For a system that employs CGP, the mean improvements in instruction throughput are: SelectiveOffload (8.37%), FlexSC (-20.93%), DisAggregateOS (8.57%), SLICC (4.28%), and SchedTask (19.6%).

Figure 2: Impact of the instruction prefetcher on the instruction throughput (legend: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, SchedTask)
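To make the prefetcher comparison concrete, the following minimal sketch takes the mean gains reported for a CGP-equipped baseline and computes how far the best technique's gain is from the runner-up. The dictionary and function names are illustrative and not from the original work. The margin is a difference in percentage points of gain over the same baseline, not a relative speedup of one technique over another.

```python
# Mean throughput gains (%) over a baseline that uses the CGP prefetcher,
# copied from the figures quoted in the text.
CGP_GAINS = {
    "SelectiveOffload": 8.37,
    "FlexSC": -20.93,
    "DisAggregateOS": 8.57,
    "SLICC": 4.28,
    "SchedTask": 19.6,
}


def margin_over_runner_up(gains):
    """Return (best technique, runner-up, margin in percentage points)."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_gain), (second, second_gain) = ranked[0], ranked[1]
    return best, second, round(best_gain - second_gain, 2)


if __name__ == "__main__":
    best, second, margin = margin_over_runner_up(CGP_GAINS)
    print(f"{best} leads {second} by {margin} percentage points")
```

With the quoted numbers, the best technique is SchedTask and the runner-up is DisAggregateOS.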
Trace Cache
Figure 3 shows the impact of core specialization techniques on the instruction throughput when the baseline system employs a trace cache. We use the trace cache implementation that was proposed in [4]. We observe that since the instruction footprints of the considered workloads are very large (>250 KB), traces belonging to different SuperFunctions keep evicting each other from the shared trace cache. Hence, the performance gains derived by using core specialization techniques do not change much for a system employing a trace cache versus one that does not employ a trace cache. For a system that employs a trace cache, the mean performance gains derived by different techniques are: SelectiveOffload (7.2%), FlexSC (-20.38%), DisAggregateOS (6.67%), SLICC (8.04%), and SchedTask (20.6%).

Figure 3: Impact of the trace cache on the instruction throughput (legend: SelectiveOffload, FlexSC, DisAggregateOS, SLICC, SchedTask)

1 The original paper [1] uses a 2-level memory hierarchy only and hence it enhances performance more.
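As a rough illustration of how the trace-cache results relate to the prefetcher results, the sketch below subtracts the mean gains quoted for the CGP-equipped baseline from those quoted for the trace-cache baseline. All names are illustrative and not from the original work, and the two sets of gains are measured against different baselines, so the differences only indicate how each technique's reported gain shifts between the two configurations.

```python
# Mean throughput gains (%) quoted in the text for two baseline configurations.
CGP_GAINS = {
    "SelectiveOffload": 8.37,
    "FlexSC": -20.93,
    "DisAggregateOS": 8.57,
    "SLICC": 4.28,
    "SchedTask": 19.6,
}
TRACE_CACHE_GAINS = {
    "SelectiveOffload": 7.2,
    "FlexSC": -20.38,
    "DisAggregateOS": 6.67,
    "SLICC": 8.04,
    "SchedTask": 20.6,
}


def gain_shift(gains_a, gains_b):
    """Per-technique difference (gains_b - gains_a) in percentage points."""
    return {name: round(gains_b[name] - gains_a[name], 2) for name in gains_a}


if __name__ == "__main__":
    shifts = gain_shift(CGP_GAINS, TRACE_CACHE_GAINS)
    for name, shift in shifts.items():
        print(f"{name}: {shift:+.2f} points")
    # Technique whose reported gain differs most between the two setups.
    print(max(shifts, key=lambda name: abs(shifts[name])))
```

On the quoted numbers, SLICC shows the largest shift between the two configurations, while SchedTask remains the best technique in both.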
Conclusion
In this report, we studied the sensitivity of five state-of-the-art core specialization techniques to multi-programmed workloads, cache configurations, instruction prefetchers, and trace caches. Our studies show that SchedTask [3] outperforms the other techniques [6, 8, 5, 2] for all evaluated configurations. This is because SchedTask employs a fine-grained task scheduler and a superior work-stealing algorithm.
