This paper summarizes the idea of Zorua, which was published in MICRO 2016 [88], and examines the work's significance and future potential. The application resource specification (a static specification of several parameters, such as the number of threads and the scratchpad memory usage per thread block) forms a critical component of modern GPU programming models. This specification determines the parallelism, and hence the performance, of the application during execution, because the corresponding on-chip hardware resources are allocated and managed based on it. This tight coupling between the software-provided resource specification and resource management in hardware leads to significant challenges in programming ease, portability, and performance. Zorua is a new resource virtualization framework that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer.
Motivation: Key Challenges in Modern GPUs
Modern Graphics Processing Units (GPUs) offer high performance and energy efficiency for many classes of applications by concurrently executing thousands of threads. In order to execute, each thread requires several major on-chip resources: (i) registers, (ii) scratchpad memory (if used in the program), and (iii) a thread slot in the thread scheduler that keeps all the bookkeeping information required for execution.
Today, these resources are statically allocated to threads based on several parameters: the number of threads per thread block, register usage per thread, and scratchpad usage per block. We refer to these static application parameters as the resource specification of the application. This resource specification forms a critical component of modern GPU programming models (e.g., CUDA [63], OpenCL [50]). The static allocation over a fixed set of hardware resources based on the software-specified resource specification creates a tight coupling between the program and the physical hardware resources. As a result of this tight coupling, for each application, there are only a few optimized resource specifications that maximize resource utilization. Picking a suboptimal specification leads to underutilization of resources and hence, very often, performance degradation. This leads to three key difficulties related to obtaining good performance on modern GPUs: programming ease, portability, and performance degradation.
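To make the effect of static allocation concrete, the following minimal sketch computes how many threads can reside on one streaming multiprocessor (SM) for a given resource specification. The capacities in `SMLimits` and the function name `resident_threads` are hypothetical and chosen only for illustration; they are not taken from [88] or from any specific GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM capacities (illustrative, not a real GPU)."""
    thread_slots: int = 2048
    registers: int = 65536
    scratchpad_bytes: int = 49152
    max_blocks: int = 32


def resident_threads(threads_per_block, regs_per_thread, scratch_per_block,
                     sm=SMLimits()):
    """Threads that fit on one SM when whole blocks are allocated statically."""
    limits = [
        sm.max_blocks,
        sm.thread_slots // threads_per_block,
        sm.registers // (threads_per_block * regs_per_thread),
    ]
    if scratch_per_block > 0:
        limits.append(sm.scratchpad_bytes // scratch_per_block)
    # The scarcest resource bounds the number of resident blocks.
    return min(limits) * threads_per_block


if __name__ == "__main__":
    for tpb in (512, 672, 704, 1024):
        print(tpb, resident_threads(tpb, regs_per_thread=32, scratch_per_block=0))
```

Under these assumed limits, raising the block size from 672 to 704 threads lowers the resident thread count from 2016 to 1408, because a third block no longer fits in the thread slots. A small change in the specification therefore removes a large amount of parallelism.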
Programming Ease. First, the burden falls upon the programmer to optimize the resource specification. For a naive programmer, this is a challenging task because, in addition to selecting a specification suited to an algorithm, the programmer needs to be aware of the details of the GPU architecture to fit the specification to the underlying hardware resources. This tuning is easy to get wrong because there are many highly suboptimal performance points in the specification space, and even a minor deviation from an optimized specification can lead to a drastic drop in performance due to lost parallelism. We refer to such drops as performance cliffs. Even a small change in one resource can result in a significant performance cliff, degrading performance by as much as 50%. Figure 1 depicts multiple sizable cliffs in an example application when different resource specifications are used [88] and the program is run on a real modern GPU, the NVIDIA GTX 745.

Portability. Second, different GPUs have varying quantities of each of the resources. Hence, an optimized specification on one GPU may be highly suboptimal on another. This lack of portability necessitates that the programmer re-tune the resource specification of the application for every new GPU generation. This problem is especially significant in virtualized environments, such as data centers, cloud computing, or compute clusters, where the same program may run on a wide range of GPU architectures. Figure 2 depicts the 69% performance loss when porting optimized code from the NVIDIA Kepler [65]/Maxwell [66] architectures to the NVIDIA Fermi [64] architecture.

Performance. Third, for a programmer who chooses to employ software optimization tools (e.g., auto-tuners [21, 24, 49, 74, 75, 79]) or manually tailor the program to fit the hardware, performance is still constrained by the fixed, static resource specification. It is well known [27, 42, 48, 62, 87, 97] that the on-chip resource requirements of a GPU application vary throughout execution. Since the program (even after auto-tuning) has to statically specify its worst-case resource requirements, severe dynamic underutilization of several GPU resources ensues [87], leading to suboptimal performance.
A Holistic Approach to Resource Virtualization
To address these three challenges at the same time, we propose Zorua, a new framework that decouples an application's resource specification from the available hardware resources by virtualizing all three major resources (i.e., scratchpad memory, register file, and thread slots) in a holistic manner. This virtualization provides the illusion of more resources to the GPU programmer and software than physically available, and enables the runtime system and the hardware to dynamically manage multiple resources in a manner that is transparent to the programmer.
Key Concepts
The virtualization strategy used by Zorua is built upon two key concepts. First, to mitigate performance cliffs when we do not have enough physical resources, we oversubscribe resources by a small amount at runtime, by leveraging their dynamic underutilization and maintaining a swap space (in main memory) for the extra resources required. Second, Zorua improves utilization by determining the runtime resource requirements of an application. It then allocates and deallocates resources dynamically, managing them (i) independently of each other to maximize each resource's utilization; and (ii) in a coordinated manner, to enable efficient execution of each thread with all its required resources available. Figure 3 depicts the high-level overview of the virtualization provided by Zorua. The virtual space refers to the illusion of the quantity of available resources. The physical space refers to the actual hardware resources (specific to the target GPU architecture), and the swap space refers to the resources that do not fit in the physical space and hence are spilled to other physical locations. For the register file and scratchpad memory, the swap space is mapped to the global memory space in the memory hierarchy. For threads, only those that are mapped to the physical space are available for scheduling and execution at any given time. If a thread is mapped to the swap space, its state (e.g., the PC) is saved in memory. Resources in the virtual space can be freely remapped between the physical and swap spaces to maintain the illusion of the virtual space resources.
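The relationship between the virtual, physical, and swap spaces can be sketched as a small mapping table. This is an illustrative model only: the class name `ResourceMappingTable`, its methods, and the policy of promoting the lowest-numbered swapped entry when a physical slot frees up are assumptions made for this sketch, not the mechanism described in [88].

```python
class ResourceMappingTable:
    """Toy model: each virtual resource lives in a physical slot or in swap."""

    def __init__(self, physical_slots):
        self._free = list(range(physical_slots))  # unused physical slots
        self._physical = {}                       # virtual id -> physical slot
        self._swapped = set()                     # virtual ids held in swap

    def allocate(self, virtual_id):
        """Place a virtual resource on-chip if possible, otherwise in swap."""
        if self._free:
            self._physical[virtual_id] = self._free.pop()
            return "physical"
        self._swapped.add(virtual_id)
        return "swap"

    def locate(self, virtual_id):
        """Report which space currently backs a virtual resource."""
        if virtual_id in self._physical:
            return "physical"
        if virtual_id in self._swapped:
            return "swap"
        raise KeyError(virtual_id)

    def release(self, virtual_id):
        """Free a resource; remap one swapped resource into the freed slot."""
        if virtual_id in self._swapped:
            self._swapped.remove(virtual_id)
            return
        slot = self._physical.pop(virtual_id)
        if self._swapped:
            promoted = min(self._swapped)  # assumed promotion policy
            self._swapped.remove(promoted)
            self._physical[promoted] = slot
        else:
            self._free.append(slot)


table = ResourceMappingTable(physical_slots=2)
placements = [table.allocate(v) for v in range(3)]
table.release(0)
```

With two physical slots, the third allocation lands in the swap space. Once virtual resource 0 is released, resource 2 is remapped to the freed physical slot, while the virtual identifiers seen by the program stay unchanged.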
Challenges in Virtualization
Unfortunately, oversubscription means that latency-critical resources, such as registers and scratchpad, may be swapped to memory at the time of access, resulting in high overheads in performance and energy. This leads to two critical challenges in designing a framework to enable virtualization. The first challenge is to effectively determine the extent of virtualization, i.e., by how much each resource appears to be larger than its real physical amount, such that we can minimize oversubscription while still reaping its benefits. This is difficult as the resource requirements continually vary during runtime. The second challenge is to minimize accesses to the swap space. This requires coordination in the virtualized management of multiple resources, so that enough of each resource is available on-chip at the same time when needed.
Design Ideas
To solve these challenges, Zorua employs two key ideas. First, we leverage the software (the compiler) to provide annotations with information regarding the future resource requirements of each phase of the application. This information enables the framework to make intelligent dynamic decisions ahead of time, with respect to both the extent of oversubscription and the allocation/deallocation of resources. Second, we use an adaptive runtime system to control the allocation of resources. This allows us to (i) dynamically alter the extent of oversubscription; and (ii) continuously coordinate the allocation of multiple on-chip resources and the mapping between their virtual and physical/swap spaces, depending on the varying runtime requirements of each thread. We briefly describe each design idea in turn.
2.3.1. Leveraging Software Annotations of Phase Characteristics. We observe that the runtime variation in resource requirements typically occurs at the granularity of phases of a few tens of instructions. This variation occurs because different parts of kernels perform different operations that require different resources. For example, loops that primarily load/store data from/to scratchpad memory tend to be less register heavy. Sections of code that perform specific computations (e.g., matrix transformation, graph manipulation) can either be register heavy or primarily operate out of scratchpad. Often, scratchpad memory is used for only short intervals [97], e.g., when data exchange between threads is required, such as for a reduction operation. Figure 4 depicts a few example phases from the N-Queens Solver (NQU) [18] kernel. NQU is a scratchpad-heavy application, but it does not use the scratchpad at all during the initial computation phase. During its second phase, it performs its primary computation out of the scratchpad, using as much as 4224B. During its last phase, the scratchpad is used only for reducing results, which requires only 384B. There is also significant variation in the maximum number of live registers in the different phases, as shown in Figure 4.

Figure 4: Example phases from N-Queens Solver (NQU). Reproduced from [88].
In order to capture both the resource requirements and their variation over time, we partition the program into a number of phases. A phase is a sequence of instructions whose resource requirements differ sufficiently from those of adjacent phases. Barrier or fence operations also indicate a change in requirements for a different reason: threads that are waiting at a barrier do not immediately require the thread slot that they are holding. We interpret barriers and fences as phase boundaries since they potentially alter the utilization of their thread slots. The compiler inserts special instructions called phase specifiers to mark the start of a new phase. Each phase specifier contains information regarding the resource requirements of the next phase. Phase changes are shown as ".phasechange" pragmas in Figure 4.
A phase forms the basic unit for resource allocation and deallocation, as well as for making oversubscription decisions. It offers a finer granularity than an entire thread to make such decisions. The phase specifiers provide information on the future resource usage of the thread at a phase boundary. This enables (i) preemptively controlling the extent of oversubscription at runtime, and (ii) dynamically allocating and deallocating resources at phase boundaries to maximize utilization of the physical resources.
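As a rough illustration of why phase-level information helps, the sketch below models the three NQU phases using the scratchpad figures quoted in the text (0B, 4224B, and 384B). The `PhaseSpecifier` class, the phase names, and the helper functions are hypothetical constructs for this example; the actual phase specifier encoding is defined in [88], and register requirements are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseSpecifier:
    """Hypothetical stand-in for a compiler-inserted phase specifier."""
    name: str
    scratchpad_bytes: int  # scratchpad needed by the phase that follows


# Scratchpad needs of the three NQU phases described in the text.
NQU_PHASES = [
    PhaseSpecifier("initial computation", 0),
    PhaseSpecifier("primary computation", 4224),
    PhaseSpecifier("result reduction", 384),
]


def static_requirement(phases):
    """A static specification must reserve the worst case for the whole run."""
    return max(p.scratchpad_bytes for p in phases)


def boundary_deltas(phases):
    """Bytes to allocate (+) or release (-) at the start of each phase."""
    deltas, held = [], 0
    for phase in phases:
        deltas.append(phase.scratchpad_bytes - held)
        held = phase.scratchpad_bytes
    return deltas


if __name__ == "__main__":
    print(static_requirement(NQU_PHASES))
    print(boundary_deltas(NQU_PHASES))
```

A static specification holds 4224B for the entire execution, whereas phase-level management allocates nothing during the first phase and can release 3840B at the boundary of the last phase.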
2.3.2. Control with an Adaptive Runtime System. Phase specifiers provide information to make oversubscription and allocation/deallocation decisions. However, we still need a way to make decisions on the extent of oversubscription and appropriately allocate resources at runtime. To this end, we use an adaptive runtime system, which we refer to as the coordinator. Figure 5 presents an overview of the coordinator.
The virtual space enables the illusion of a larger amount of each of the resources than what is physically available, to adapt to different application requirements. This illusion enables higher thread-level parallelism than what can be achieved with solely the fixed, physically available resources, by allowing more threads to execute concurrently. The size of the virtual space at a given time determines this parallelism, and those threads that are effectively executed in parallel are referred to as active threads. All active threads have thread slots allocated to them in the virtual space (and hence can be executed), but some of them may not be mapped to the physical space at any given time. As discussed previously, the resource requirements of each application continuously change during execution. To adapt to these runtime changes, the coordinator leverages information from the phase specifiers to make decisions on oversubscription. The coordinator makes these decisions at every phase boundary and thereby controls the size of the virtual space for each resource.
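The decision made at a phase boundary can be sketched for a single resource as follows. This is a simplified model under stated assumptions: the `Coordinator` class, the method `on_phase_boundary`, and the fixed `oversubscription_limit` are hypothetical, whereas the coordinator in [88] adapts the extent of oversubscription at runtime and coordinates several resources at once.

```python
from collections import deque


class Coordinator:
    """Toy single-resource coordinator with a fixed oversubscription cap."""

    def __init__(self, physical_capacity, oversubscription_limit):
        self.free_physical = physical_capacity
        self.swap_in_use = 0
        self.oversubscription_limit = oversubscription_limit  # assumed fixed
        self.pending = deque()  # threads waiting for resources

    def on_phase_boundary(self, thread_id, requested):
        """Grant (physical, swap) units for the next phase, or queue the thread."""
        from_physical = min(requested, self.free_physical)
        from_swap = requested - from_physical
        if self.swap_in_use + from_swap > self.oversubscription_limit:
            # Granting would oversubscribe too much: keep the thread pending.
            self.pending.append((thread_id, requested))
            return None
        self.free_physical -= from_physical
        self.swap_in_use += from_swap
        return from_physical, from_swap


coordinator = Coordinator(physical_capacity=100, oversubscription_limit=20)
first = coordinator.on_phase_boundary(thread_id=0, requested=90)
second = coordinator.on_phase_boundary(thread_id=1, requested=25)
third = coordinator.on_phase_boundary(thread_id=2, requested=10)
```

In this run, the first request fits entirely on-chip, the second spills 15 units into the swap space, and the third is queued because granting it would push swap usage beyond the assumed limit of 20 units.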
Zorua: An Overview
To address the challenges in virtualization by leveraging the above ideas, Zorua employs a software-hardware codesign that comprises three components: (i) The compiler annotates the program by adding special instructions (phase specifiers) to partition it into phases and to specify the resource needs of each phase of the application. (ii) The coordinator, a hardware-based adaptive runtime system, uses the compiler annotations to dynamically allocate/deallocate resources for each thread at phase boundaries. The coordinator plays the key role of continuously controlling the extent of the oversubscription at each phase boundary. (iii) Hardware virtualization support includes a mapping table for each resource to locate each virtual resource in either the physically available on-chip resources or the swap space in main memory, and the machinery to swap resources between the physical space and the swap space.
Zorua has two key hardware components: (i) the coordinator that contains queues to buffer the pending threads and control logic to make oversubscription and resource management decisions, and (ii) resource mapping tables to map each of the resources to their corresponding physical or swap spaces. Our MICRO 2016 paper [88] provides the detailed implementation of Zorua in Section 4. In particular, we describe several key issues, including how (1) Zorua determines the amount of oversubscription for each resource (Section 4.4 of [88]), (2) Zorua virtualizes each resource (Section 4.5 of [88]), and (3) the compiler identifies each phase (Section 4.6 of [88]).
Results
In this section, we evaluate the effectiveness of Zorua in improving programming ease, portability, and performance. Our detailed experimental methodology is described in Section 5 of our MICRO 2016 paper [88]. More results are provided in Section 6 of [88].
Effect on Performance Variation and Cliffs
We first examine how Zorua alleviates the high variation in performance by reducing the impact of resource specifications on resource utilization. Figure 6 summarizes the range in performance across a wide range of resource specifications (indicating an undesirable dependence on the specification), for the baseline architecture, WLM (which allocates resources at the finer granularity of a warp [91]), and Zorua for a representative set of applications, using a Tukey box plot [61]. The boxes in the box plot represent the range between the first quartile (25%) and the third quartile (75%). The whiskers extending from the boxes represent the maximum and minimum points of the distribution, or 1.5× the length of the box, whichever is smaller. Any points that lie more than 1.5× the box length beyond the box are considered to be outliers [61], and are plotted as individual points. The line in the middle of the box represents the median, while the "X" represents the average. We make two major observations from Figure 6. First, we find that Zorua significantly reduces the performance range across all evaluated resource specifications. Averaged across all of our applications, the worst resource specification for Baseline achieves 96.6% lower performance than the best performing resource specification. For WLM [91], this performance range reduces only slightly, to 88.3%. With Zorua, the performance range drops significantly, to 48.2%. We see drops in the performance range for all applications except SSSP. With SSSP, the range is already small to begin with (23.8% in Baseline), and Zorua exploits the dynamic underutilization, which improves performance but also adds a small amount of variation.
Second, while Zorua reduces the performance range, it also preserves or improves performance of the best performing points. As we examine in more detail in Section 3.2, the reduction in performance range occurs as a result of improved performance mainly at the lower end of the distribution.
To gain insight into how Zorua reduces the performance range and improves performance for the worst performing points, we analyze how it reduces performance cliffs. We study the tradeoff between resource specification and execution time for three representative applications: DCT (Figure 7a), MST (Figure 7b), and NQU (Figure 7c). For all three figures, we normalize execution time to the best execution time under Baseline. We make two observations from the figures. First, Zorua successfully mitigates the performance cliffs that occur in Baseline. For example, DCT and MST are both sensitive to the thread block size, as shown in Figures 7a and 7b, respectively. We have circled the locations at which cliffs exist in Baseline. Unlike Baseline, Zorua maintains more steady execution times across the number of threads per block, employing oversubscription to overcome the loss in parallelism due to insufficient on-chip resources. We see similar results across all of our applications.
Second, we observe that while WLM [91] can reduce some of the cliffs by mitigating the impact of large block sizes, many cliffs still exist under WLM (e.g., NQU in Figure 7c). This cliff in NQU occurs as a result of insufficient scratchpad memory, which cannot be handled by warp-level management. Similarly, the cliffs for MST (Figure 7b) also persist with WLM because MST has a lot of barrier operations, and the additional warps scheduled by WLM ultimately stall, waiting for other warps within the same block to acquire resources. We find that, with oversubscription, Zorua is able to smooth out those cliffs that WLM is unable to eliminate.
Effect on Performance
As Figure 6 shows, Zorua either retains or improves the best performing point for each application, compared to the Baseline. Zorua improves the best performing point for each application by 12.8% on average, and by as much as 27.8% (for DCT). This improvement comes from the improved parallelism obtained by exploiting the dynamic underutilization of resources, which exists even for optimized specifications. Applications such as SP and SLA have little dynamic underutilization, and hence do not show any performance improvement. NQU does have significant dynamic underutilization, but Zorua does not significantly improve the best performing point as the overhead of oversubscription outweighs the benefit, and Zorua dynamically chooses not to oversubscribe. We conclude that even for many specifications that are optimized to fit the underlying hardware resources, Zorua is able to further improve performance.
We also note that, in addition to reducing performance variation and improving performance for optimized points, Zorua improves performance by 25.2% on average for all resource specifications across all evaluated applications.
Effect on Portability
Performance cliffs often behave differently across different GPU architectures, and can significantly shift the best performing resource specification point. We study how Zorua can ease the burden of performance tuning if an application has already been tuned for one GPU model, and is later ported to another GPU. To understand this, we define a new metric, porting performance loss, that quantifies the performance impact of porting an application without re-tuning it. To calculate this, we first normalize the execution time of each specification point to the execution time of the best performing specification point. We then pick a source GPU architecture (i.e., the architecture that the application was tuned for) and a target GPU architecture (i.e., the architecture that the code will run on), and find the point-to-point drop in performance (when the code is executed on the target GPU) for all points whose performance on the source GPU comes within 5% of the performance at the best performing specification point. Figure 8 shows the maximum porting performance loss for each application, across any two pairings of our three simulated GPU architectures (NVIDIA Fermi, Kepler, and Maxwell). We find that Zorua greatly reduces the maximum porting performance loss that occurs under both Baseline and WLM for all but one of our applications. On average, the maximum porting performance loss is 52.7% for Baseline, 51.0% for WLM, and only 23.9% for Zorua.
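The metric can be sketched in a few lines. The sketch assumes one particular reading of the definition above: performance is taken as the best execution time divided by a point's execution time on the same GPU, and the loss of a point is one minus its normalized performance on the target GPU. The function name `max_porting_loss` and all execution times below are hypothetical; [88] gives the precise definition and the measured data.

```python
def max_porting_loss(source_times, target_times, margin=0.05):
    """Maximum porting performance loss over the points tuned for the source.

    Both arguments map a resource specification (here, threads per block)
    to an execution time on the respective GPU.
    """
    best_source = min(source_times.values())
    best_target = min(target_times.values())
    # Specifications within `margin` of the best performance on the source.
    tuned = [spec for spec, time in source_times.items()
             if best_source / time >= 1.0 - margin]
    # Loss of each tuned point relative to the best point on the target.
    return max(1.0 - best_target / target_times[spec] for spec in tuned)


# Hypothetical execution times (arbitrary units), keyed by threads per block.
source = {128: 1.00, 256: 1.03, 512: 1.40}
target = {128: 2.00, 256: 1.00, 512: 1.10}
loss = max_porting_loss(source, target)
```

Here the specifications with 128 and 256 threads per block both fall within the 5% margin on the source GPU. On the target GPU, the 128-thread point runs at half the best performance, so the maximum porting performance loss under this reading is 50%.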
Notably, Zorua delivers significant improvements in portability for applications that previously suffered greatly when ported to another GPU, such as DCT and MST. For both of these applications, the performance variation differs so much between GPU architectures that, despite tuning the application on the source GPU to be within 5% of the best achievable performance, their performance on the target GPU is often more than twice as slow as the best achievable performance on the target platform. Zorua significantly lowers this porting performance loss down to 28.1% for DCT and 36.1% for MST. We also observe that for BH, Zorua actually slightly increases the porting performance loss with respect to the Baseline. This is because for Baseline, there are only two points that perform within the 5% margin for our metric, whereas with Zorua, we have five points that fall in that range. Despite this, the increase in porting performance loss for BH is low, deviating only 7.0% from the best performance. We conclude that Zorua enhances portability of applications by reducing the impact of a change in the hardware resources for a given resource specification. For applications that have already been tuned on one platform, Zorua significantly lowers the penalty of not re-tuning for another platform, allowing programmers to save development time.
Related Work
To our knowledge, our MICRO 2016 paper [88] is the first work to propose a holistic framework to decouple a GPU application's resource specification from its physical on-chip resource allocation by virtualizing multiple on-chip resources. This enables the illusion of more resources than what physically exists to the programmer, while the hardware resources are managed at runtime by employing a swap space (in main memory), transparently to the programmer. We design a new hardware/software cooperative framework to effectively virtualize multiple on-chip GPU resources in a controlled and coordinated manner, thus enabling many benefits of virtualization in GPUs.
We briefly discuss prior work related to different aspects of our proposal: (i) virtualization of resources, (ii) improving programming ease and portability, and (iii) more efficient management of on-chip resources.
Virtualization of Resources. Virtualization [20, 22, 33, 41] is a concept designed to provide the illusion, to the software and programmer, of more resources than what truly exists in physical hardware. It has been applied to the management of hardware resources in many different contexts [5, 10, 20, 22, 33, 41, 67, 89], with virtual memory [11, 22, 26, 41] being one of the oldest forms of virtualization that is commonly used in high-performance processors today. Abstraction of hardware resources and use of a level of indirection in their management leads to many benefits, including improved utilization, programmability, portability, isolation, protection, sharing, and oversubscription.
In this work, we apply the general principle of virtualization to the management of multiple on-chip resources in modern GPUs. Virtualization of on-chip resources offers the opportunity to alleviate many different challenges in modern GPUs. However, in this context, effectively adding a level of indirection introduces new challenges, necessitating the design of a new virtualization strategy. There are two key challenges. First, we need to dynamically determine the extent of the virtualization to reach an effective tradeoff between improved parallelism due to oversubscription and the latency/capacity overheads of swap space usage. Second, we need to coordinate the virtualization of multiple latency-critical on-chip resources. To our knowledge, this is the first work to propose a holistic software-hardware cooperative approach to virtualizing multiple on-chip resources in a controlled and coordinated manner that addresses these challenges, enabling the different benefits provided by virtualization in modern GPUs.
Prior works propose to virtualize a specific on-chip resource for specific benefits, mostly in the CPU context. For example, in CPUs, the concept of virtualized registers was first used in the IBM 360 [5] and DEC PDP-10 [10] architectures to allow logical registers to be mapped to either fast yet expensive physical registers, or slow and cheap memory. More recent works [67, 93, 94] propose to virtualize registers to increase the effective register file size to much larger register counts. This increases the number of thread contexts that can be supported in a multi-threaded processor [67], or reduces register spills and fills [93, 94]. Other works propose to virtualize on-chip resources in CPUs (e.g., [15, 19, 25, 31, 99]). In GPUs, Jeon et al. [42] propose to virtualize the register file by dynamically allocating and deallocating physical registers to enable more parallelism with smaller, more power-efficient physical register files. Concurrent to this work, Yoon et al. [98] propose an approach to virtualize thread slots to increase thread-level parallelism. These works propose specific virtualization mechanisms for a single resource for specific benefits. None of these works provide a cohesive virtualization mechanism for multiple on-chip GPU resources in a controlled and coordinated manner, which forms a key contribution of our MICRO 2016 work.
Enhancing Programming Ease and Portability. There is a large body of work that aims to improve programmability and portability of modern GPU applications using software tools, such as auto-tuners [21, 24, 49, 74, 75, 79], optimizing compilers [17, 37, 47, 59, 95, 96], and high-level programming languages and runtimes [23, 35, 72, 85]. These tools tackle a multitude of optimization challenges, and have been demonstrated to be very effective in generating high-performance portable code. They can also be used to tune the resource specification. However, there are several shortcomings in these approaches. First, these tools often require profiling runs [17, 21, 75, 79, 95, 96] on the GPU to determine the best performing resource specifications. These runs have to be repeated for each new input set and GPU generation. Second, software-based approaches still require significant programmer effort to write code in a manner that can be exploited by these approaches to optimize the resource specifications. Third, selecting the best performing resource specifications statically using software tools is a challenging task in virtualized environments (e.g., cloud computing, data centers), where it is unclear which kernels may be run together on the same SM or where it is not known, a priori, which GPU generation the application may execute on. Finally, software tools assume a fixed amount of available resources. This leads to runtime underutilization due to static allocation of resources, which cannot be addressed by these tools.
In contrast, the programmability and portability benefits provided by Zorua require no programmer effort in optimizing resource specifications. Furthermore, these auto-tuners and compilers can be used in conjunction with Zorua to further improve performance.
Efficient Resource Management. Prior works aim to improve parallelism by increasing resource utilization using hardware-based [6, 7, 30, 42, 45, 46, 55, 57, 62, 71, 84, 86, 91, 97], software-based [32, 36, 53, 58, 68, 92, 97], and hardware-software cooperative [8, 9, 43, 44, 73, 81, 82, 87] approaches. Among these works, the closest to ours are [42, 98] (discussed earlier), [97], and [91]. These approaches propose efficient techniques to dynamically manage a single resource, and can be used along with Zorua to improve resource efficiency further. Yang et al. [97] aim to maximize utilization of the scratchpad with software techniques, and by dynamically allocating/deallocating scratchpad memory. Xiang et al. [91] propose to improve resource utilization by scheduling threads at the finer granularity of a warp rather than a thread block. This approach can help alleviate performance cliffs, but not in the presence of synchronization or scratchpad memory, nor does it address the dynamic underutilization within a thread during runtime. We quantitatively compare to this approach in Section 3 and demonstrate Zorua's benefits over it.
Other works leverage resource underutilization to improve energy efficiency [2, 27, 28, 29, 42] or perform other useful work [54, 87]. These works are complementary to Zorua.
Significance and Long-Term Impact
In this section, we describe the significance and long-term impact of our MICRO 2016 work, Zorua, by delineating its novelty, what it can enable in future systems, and new research directions that it triggers.
Novelty
• This is the first work that takes a holistic approach to decoupling a GPU application's resource specification from its physical on-chip resource allocation via the use of virtualization. We develop a comprehensive virtualization framework that provides controlled and coordinated virtualization of multiple on-chip resources to maximize the effectiveness of virtualization.
• Making GPUs easy to program is critical for their widespread use, and also to achieve the high performance promised by the massively parallel architecture. A key limiting factor in GPU programming today is the burden placed on the programmer in finding a hardware resource specification that achieves very high performance. This is the first work to ease that burden without compromising performance by virtualizing the major hardware resources programmers are required to manage today.
• Portability across GPU architectures is vital in environments such as cloud computing and data centers to achieve predictably good performance, irrespective of the GPU generation the application is executing on. This is the first work to tackle the portability challenges that arise from the programmer's management of the fixed on-chip resources with a holistic resource virtualization strategy.
What Zorua Can Enable in Future Systems
GPUs have emerged as the dominant massively parallel architecture, used as the platform of choice for a wide range of parallel applications from machine learning to scientific simulation. However, there are a number of key challenges that limit the adoption of GPUs across broader classes of applications and environments (e.g., data centers and cloud computing). Programmability and portability of GPU applications are two such challenges. But future GPUs will need to address several other challenges before truly becoming first-class compute engines. As we describe below, we believe that our work can help address some of these other challenges.
Multiprogramming in Virtualized Environments. Zorua lends itself to easily addressing two key challenges in enabling multiprogramming in virtualized environments today:
Fine-grained resource sharing across kernels: Zorua manages the different resources independently and at a fine granularity, using a dynamic runtime system. Hence, Zorua can be extended to support fine-grained sharing and partitioning of resources across multiple kernels to enable efficient multiprogramming in GPUs. Zorua enables better resource utilization in these multiprogrammed environments, while providing the ability to control the partitioning of resources at runtime to provide QoS, fairness, etc., by leveraging the hardware runtime system. Zorua can work synergistically with systems such as Mosaic [8] and MASK [9], which enable efficient memory virtualization techniques for GPUs, to enable true full-system multi-kernel execution.
Preemptive multitasking: Another key challenge in enabling true multiprogramming in GPUs is enabling rapid preemption of kernels [69, 83, 90]. Context switching on GPUs incurs a very high latency and overhead, as a result of the large amount of register file/scratchpad state that needs to be saved before a new kernel can be executed. Zorua enables fine-grained management and virtualization of on-chip resources. It can be naturally extended to enable quick preemption of a task via intelligent management of the swap space and the mapping tables. It can also work synergistically with CABA [87], a framework for assist warp execution in GPUs, to provide flexible and efficient support for multitasking and context switching.
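To make the preemption idea concrete, the following is a minimal software sketch of how a mapping table and a swap space could cooperate when a kernel is preempted. It is a toy model written under our own assumptions, not the hardware mechanism from the MICRO 2016 paper: the class `VirtualResource`, its method names, and the dictionary-based table are all hypothetical and exist only for illustration.

```python
class VirtualResource:
    """Toy model of one virtualized on-chip resource (e.g., registers).

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, physical_slots):
        self.free = list(range(physical_slots))  # unallocated physical slots
        self.table = {}     # (kernel, virtual_id) -> ("phys", slot) or ("swap", key)
        self.physical = {}  # physical slot -> stored value
        self.swap = {}      # (kernel, virtual_id) -> value held in swap space

    def allocate(self, kernel, virtual_id, value=0):
        """Place a virtual entry on chip if a slot is free, otherwise in swap."""
        key = (kernel, virtual_id)
        if self.free:
            slot = self.free.pop()
            self.physical[slot] = value
            self.table[key] = ("phys", slot)
        else:
            self.swap[key] = value
            self.table[key] = ("swap", key)

    def preempt(self, kernel):
        """Move every on-chip entry of `kernel` to swap; return slots freed."""
        freed = 0
        for key, (where, loc) in list(self.table.items()):
            if key[0] == kernel and where == "phys":
                self.swap[key] = self.physical.pop(loc)
                self.table[key] = ("swap", key)
                self.free.append(loc)
                freed += 1
        return freed

    def read(self, kernel, virtual_id):
        """Read an entry, bringing it back on chip first if it was swapped out."""
        key = (kernel, virtual_id)
        where, loc = self.table[key]
        if where == "swap":
            if not self.free:
                raise RuntimeError("no free physical slot; evict something first")
            loc = self.free.pop()
            self.physical[loc] = self.swap.pop(key)
            self.table[key] = ("phys", loc)
        return self.physical[loc]
```

In this sketch, preempting a kernel only rewrites table entries and copies the affected state to swap. Entries belonging to other kernels are untouched, and the preempted kernel's state is restored lazily on its next access.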
Support for Other Parallel Programming Paradigms. The fixed static resource allocation for each thread in modern GPU architectures requires statically dictating the parallelism for the program throughout its execution. Other forms of parallel execution that are dynamic (e.g., CILK [12]) require more flexible allocation of resources at runtime, and are hence more challenging to enable on GPUs. Examples of this include nested parallelism [56], where a kernel can dynamically spawn new kernels or thread blocks, and helper threads [87], which utilize idle resources at runtime to perform different optimizations or background tasks in parallel. Zorua makes it easy to enable these paradigms by providing on-demand dynamic allocation of resources.
Energy Efficiency, Scalability, and Reliability. On-chip resources are precious and critical for supporting massive parallelism. However, these resources cannot grow arbitrarily large, as GPUs continue to be area-limited and on-chip memory tends to be extremely power hungry and area intensive [2, 27, 28, 42, 73, 98]. We believe these trends will become increasingly important for the foreseeable future. Furthermore, complex thread schedulers that can select a thread for execution from an increasingly large thread pool are required. Zorua enables using smaller register files, smaller scratchpad memory, and less complex or fewer thread schedulers to save power and area while still retaining or improving parallelism. The indirection offered by Zorua, along with the dynamic management of resources, could also enable better reliability. The virtualization framework trivially allows portions of a resource that contain hard or soft faults to be remapped to other portions of the resource that do not contain faults, or to spare structures, thereby increasing the error tolerance of these resources.
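The remapping argument can be illustrated with a small sketch. The function below is a hypothetical example of our own (the name `build_remap` and its parameters are assumptions, not part of Zorua): given a set of faulty entries and a list of spare entries, it builds a logical-to-physical map that steers accesses away from faults. A logical entry with no fault-free backing is mapped to `None`, which models a reduction in usable capacity rather than a failure.

```python
def build_remap(num_entries, faulty, spares):
    """Map each logical entry to a fault-free physical entry.

    Hypothetical illustration: `faulty` holds indices of physical entries
    with known faults, and `spares` lists indices of spare physical entries.
    """
    faulty = set(faulty)
    # Spares that are themselves faulty cannot be used.
    spare_pool = [s for s in spares if s not in faulty]
    mapping = {}
    for logical in range(num_entries):
        if logical not in faulty:
            mapping[logical] = logical            # identity mapping when healthy
        elif spare_pool:
            mapping[logical] = spare_pool.pop(0)  # redirect to the next spare
        else:
            mapping[logical] = None               # no backing left: capacity shrinks
    return mapping
```

For example, `build_remap(4, {1, 3}, [4, 5])` redirects logical entries 1 and 3 to spares 4 and 5 and leaves entries 0 and 2 in place.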
New Research Directions Zorua Enables
Zorua opens up several new avenues for more research, which we briefly discuss here.
Flexible Programming Models for GPUs and Heterogeneous Systems. By providing a flexible but dynamically controlled view of the on-chip hardware resources, Zorua changes the abstraction of the on-chip resources that is offered to the programmer and software. This offers the opportunity to rethink resource management in GPUs from the ground up. One could envision more powerful resource allocation and better programmability with programming models that do not require static resource specification, leaving the compiler/runtime system and the underlying virtualized framework to completely handle all forms of on-chip resource allocation, unconstrained by the fixed physical resources in a specific GPU, entirely at runtime. This is especially significant in future systems that are likely to support a wide range of compute engines and accelerators, making it important to be able to write high-level code that can be partitioned easily, efficiently, and at a fine granularity across any set of accelerators, without statically tuning any code segment to run efficiently on the GPU.
Virtualization-Aware Compilation and Auto-Tuning. Zorua changes the contract between the hardware and software to provide a more powerful resource abstraction (in the software) that is flexible and dynamic, by pushing some more functionality to the hardware, which can more easily react to runtime resource requirements of the program. We can re-imagine compilers and auto-tuners to be more intelligent, leveraging this new abstraction, and hence the virtualization, to deliver more efficient and high-performing code optimizations that are not possible with the fixed and static abstractions of today. They could, for example, leverage the oversubscription and dynamic management that Zorua provides to tune the code to use resources more aggressively.
Support for System-Level Tasks on GPUs. As GPUs become increasingly general purpose, a key requirement is better integration with the CPU operating system, and with complex distributed software systems such as those employed for large-scale distributed machine learning [1, 39] or graph processing [3, 4, 60]. If GPUs are architected to be first-class compute engines, rather than the slave devices they are today, they can be programmed and utilized in the same manner as a modern CPU. This integration requires the GPU execution model to support system-level tasks such as interrupts and exceptions, and more generally to provide support for access to distributed file systems, disk I/O, or network communication. Support for these tasks and execution models requires dynamic provisioning of resources for execution of system-level code. Zorua provides a building block to enable this.
Applicability to General Resource Management in Accelerators. Zorua uses a program phase as the granularity for managing resources. This allows handling resources across phases dynamically, while leveraging static information regarding resource requirements from the software by inserting annotations at phase boundaries. Future work could potentially investigate the applicability of the same approach to manage resources and parallelism in other accelerators (e.g., processing-in-memory accelerators [3, 4, 13, 14, 34, 38, 40, 51, 52, 70, 77, 78, 80, 100] or direct-memory access engines [16, 55, 76]) that require efficient dynamic management of large amounts of particular critical resources.
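A rough sketch can show why phase granularity matters. The example below assumes a hypothetical program whose phases are annotated with their per-thread register needs; the `Phase` type, the function `threads_per_phase`, and all numbers are illustrative assumptions rather than data or interfaces from the paper. It compares how many threads fit in a fixed register pool when every thread is provisioned for the worst-case phase (a static specification) versus when provisioning follows the annotation of the current phase.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """Hypothetical phase annotation: a name and per-thread register need."""
    name: str
    registers: int  # registers needed per thread in this phase (must be > 0)


def threads_per_phase(phases, register_pool):
    """Return {phase name: (static_threads, phased_threads)}.

    static_threads: threads that fit if each thread reserves the worst-case
    register count for the whole program.
    phased_threads: threads that fit if each thread reserves only what the
    current phase's annotation asks for.
    """
    worst_case = max(p.registers for p in phases)
    static_threads = register_pool // worst_case
    return {
        p.name: (static_threads, register_pool // p.registers)
        for p in phases
    }


# Illustrative numbers only.
phases = [Phase("load", 8), Phase("compute", 32), Phase("store", 4)]
result = threads_per_phase(phases, register_pool=256)
```

With these made-up numbers, the static specification admits 8 threads in every phase, while per-phase provisioning admits 32, 8, and 64 threads in the load, compute, and store phases respectively. The same bookkeeping could in principle be applied to any accelerator resource whose demand varies across phases.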
Conclusion
We propose Zorua, a new framework that decouples the application resource specification from the allocation in the physical hardware resources (i.e., registers, scratchpad memory, and thread slots) in GPUs. Zorua encompasses a holistic virtualization strategy to effectively virtualize multiple latency-critical on-chip resources in a controlled and coordinated manner. We demonstrate that by providing the illusion of more resources than physically available, via dynamic management of resources and the judicious use of a swap space in main memory, Zorua enhances (i) programming ease (by reducing the performance penalty of suboptimal resource specification), (ii) portability (by reducing the impact of different hardware configurations), and (iii) performance for code with an optimized resource specification (by leveraging dynamic underutilization of resources). We conclude that Zorua is an effective, holistic virtualization framework for GPUs. We believe that the indirection provided by Zorua's virtualization mechanism makes it a generic framework that can address other challenges in modern GPUs. For example, Zorua can enable fine-grained resource sharing and partitioning among multiple kernels/applications, as well as low-latency preemption of GPU programs. We hope that future work explores these promising directions, building on the insights and the framework developed in our MICRO 2016 paper.
