Increased parallelism on a single processor is driving improvements in peak performance at both the node and system levels. However, achievable performance, in particular from production scientific applications, is not always directly proportional to the core count. Performance is often limited by constraints in the memory hierarchy and by node interconnectivity. Even on state-of-the-art processors, containing between four and eight cores, many applications cannot take full advantage of the compute performance of all cores. This trend is expected to strengthen on future processors as the core count per processor increases. In this work we characterize the use of spare-cores, cores that do not provide any improvement in application performance, on current multi-core processors. By using a pulse-width modulation method, we examine the possible performance profile of using a spare-core and quantify the situations in which its use will not impact application performance. We show that, for current AMD and Intel multi-core processors, spare-cores can be used for substantial computational tasks but can impact application performance when using shared caches or when significantly accessing main memory.
Introduction
Increased silicon integration is the driving force behind the rapid increase in the number of cores available on a single processor. State-of-the-art mainstream processors from AMD, IBM and Intel already boast between 8 and 12 cores in a single processor, and many more cores are foreseen in the future, such as the Intel 48-core processor [1]. However, the achievable performance, especially when considering scientific applications, is not always directly proportional to the core count. Many applications can only take advantage of a subset of the available cores before the achieved performance peaks [2]. This results in two classes of cores: those dedicated to application use, and those deemed spare-cores.
A significant factor for many applications is the performance of the memory sub-system, in particular the available main-memory bandwidth. This bandwidth is currently determined by the number of pins dedicated to I/O on an integrated circuit (IC), which in turn depends entirely on the IC manufacturing process. In today's IC processes, such as Very-Large Scale Integration (VLSI) and some Ultra-Large Scale Integrations (ULSI) like the System-on-a-chip (SoC), the number of pins is proportional to the IC perimeter. Unfortunately, the perimeter is not increasing significantly due to cost constraints. On the other hand, there are some ULSI proposals, such as the utilization of Through Silicon Vias (TSV) [3], that promise to substantially increase the available bandwidth between processor and memory. This is achieved by stacking ICs on top of each other, but the technique is still in experimental stages. In the short term we therefore expect the number of spare-cores on a processor to increase as the overall core count rises.
There have been many proposals for using spare-cores in support activities [4], [5], [6], [7], [8], [9], including increasing reliability or monitoring other cores' activities, but little has been done to characterize what spare-cores can actually do from a performance standpoint. Even though spare-cores may be available, they share resources common to all cores within a processor, typically including a level of shared cache, shared memory controllers (providing access to main memory), and shared network-interface controllers. In this work we characterize the performance profile of using a spare-core in terms of its impact on application performance.
A Pulse-Width-Modulation (PWM) approach is used to characterize the impact of using the spare-core. A single cycle consists of two phases, one active and one inactive, and is continuously repeated during application execution. A separate micro-benchmark incorporating the PWM approach is used for each of: compute, local-cache access, shared-cache access, main-memory access, and inter-node network access. Though other approaches could be used, PWM corresponds to a typical use of a spare-core: running a process that exhibits distinct phases, for instance a compute phase followed by a data-flush phase (to main memory) and a data-storage phase (to remote disk).
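The two-phase cycle described above can be sketched as a simple duty-cycle loop. This is only an illustrative sketch, not the micro-benchmarks used in this work: the names `pwm_phases`, `run_pwm` and `compute_kernel`, the `period_s` and `duty_cycle` parameters, and the choice to sleep during the inactive phase are all assumptions made for the example.

```python
import time


def pwm_phases(period_s, duty_cycle):
    """Split one PWM cycle into (active, inactive) durations in seconds.

    `period_s` and `duty_cycle` are hypothetical parameter names; the duty
    cycle is the fraction of each cycle spent in the active phase.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in [0, 1]")
    active = period_s * duty_cycle
    return active, period_s - active


def compute_kernel():
    """Hypothetical stand-in for one unit of compute-only spare-core work."""
    total = 0
    for i in range(1000):
        total += i * i
    return total


def run_pwm(kernel, period_s, duty_cycle, cycles):
    """Repeat `cycles` PWM cycles and return the number of kernel calls made.

    Active phase: call `kernel` repeatedly until the active time has elapsed.
    Inactive phase: sleep (an assumption; the core is simply left idle).
    """
    active_s, inactive_s = pwm_phases(period_s, duty_cycle)
    calls = 0
    for _ in range(cycles):
        deadline = time.perf_counter() + active_s
        while time.perf_counter() < deadline:
            kernel()
            calls += 1
        if inactive_s > 0:
            time.sleep(inactive_s)
    return calls


if __name__ == "__main__":
    # Sweep the duty cycle from idle to fully active on the spare core.
    for duty in (0.0, 0.25, 0.5, 1.0):
        n = run_pwm(compute_kernel, period_s=0.01, duty_cycle=duty, cycles=20)
        print(f"duty={duty:.2f} kernel calls={n}")
```

Swapping `compute_kernel` for a routine that streams through a buffer larger than the caches, or one that sends network messages, would give the corresponding memory or network variant. Sweeping the duty cycle while an application runs on the remaining cores shows at what level of spare-core activity the application begins to slow down.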
This work characterizes the impact of using spare-cores on two state-of-the-art processing nodes: a four-processor six-core node from AMD (Istanbul), and a four-processor eight-core node from Intel (Nehalem). Four applications, which are either compute-bound or memory-bound, are used in this analysis. The impact on performance of using a spare-core, as we will see, is application dependent; for instance, memory-intensive applications are significantly slowed by spare-cores using main memory. The contributions of this work are threefold: first, we present a methodology that uses PWM to represent the activities of spare-cores; second, we characterize what operations spare-cores can actually perform without significantly impacting application performance; and finally, we investigate and quantify the approach on current state-of-the-art processing nodes.
The rest of this paper is organized as follows. Section 2 describes our PWM approach and details the test-bed nodes used. Section 3 details the applications used and their achievable performance on the test-beds. Section 4 characterizes the impact on application performance of using spare-cores and discusses the situations in which their use may be appropriate. Related work on analyzing the performance impact of using spare-cores is summarized in Section 5. Conclusions from this work are given in Section 6.
