Scale-Out Processors

Abstract

Global-scale online services, such as Google’s Web search and Facebook’s social networking, run in large-scale datacenters. Due to their massive scale, these services are designed to scale out (or distribute) their respective loads and datasets across thousands of servers in datacenters. The growing demand for online services forced service providers to build networks of datacenters, which require an enormous capital outlay for infrastructure, hardware, and power consumption. Consequently, efficiency has become a major concern in the design and operation of such datacenters, with processor efficiency being of, utmost importance, due to the significant contribution of processors to the overall datacenter performance and cost. Scale-out workloads, which are behind today’s online services, serve independent requests, and have large instruction footprints and little data locality. As such, they benefit from processor designs that feature many cores and a modestly sized Last-Level Cache (LLC), a fast access path to the LLC, and high-bandwidth interfaces to memory. Existing server-class processors with large LLCs and a handful of aggressive out-of-order cores are inefficient in executing scale-out workloads. Moreover, the scaling trajectory for these processors leads to even lower efficiency in future technology nodes. This thesis presents a family of throughput-optimal processors, called Scale-Out Processors, for the efficient execution of scale-out workloads. A unique feature of Scale-Out Processors is that they consist of multiple stand-alone modules, called pods, wherein each module is a server running an operating system and a full software stack. To design a throughput-optimal processor, we developed a methodology based on performance density, defined as throughput per unit area, to quantify how effectively an architecture uses the silicon real estate. The proposed methodology derives a performance-density optimal processor building block (i.e., pod), which tightly couples a number of cores to a small LLC via a fast interconnect. Scale-Out Processors simply consist of multiple pods with no inter-pod connectivity or coherence. Moreover, they deliver the highest throughput in today’s technology and afford near-ideal scalability as process technology advances. We demonstrate that Scale-Out Processors improve datacenters’ efficiency by 4.4x-7.1x over datacenters designed using existing server-class processors

    Similar works