Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel acceleration where the contexts of thousands of concurrent threads are resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. The design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale.
Introduction
With the rise of General-Purpose Graphics Processing Unit (GPGPU) computing and compute-heavy workloads like machine learning, compute accelerators have become a necessary component in both high-performance supercomputers and datacenter-scale systems. The first exascale machines are expected to heavily leverage the massively parallel compute capabilities of GPUs or other highly parallel accelerators [4]. As the software stack and programming model of GPUs and their peer accelerators continue to improve, there is every indication that this trend will continue. As a result, architects who wish to study the design of large-scale systems will need to evaluate the effect of their techniques using a GPU model. However, publicly available cycle-level simulators like GPGPU-Sim [2] focus on single-node performance. To truly study the problem at scale, a parallelizable, multi-node GPU simulator is necessary.
Figure 1.1 depicts the current CPU/GPU co-processor model. On the left is the common high-performance, discrete GPU configuration, where the CPU and GPU have separate memory spaces and are connected via either PCIe or a high-bandwidth link such as NVLink. The right shows the APU model, where the CPU and GPU share the same memory. Note that even in the discrete-memory case, modern memory translation units allow the CPU and GPU to share the same address space, although the memories themselves are separate.
In this report we detail a model that is capable of simulating both discrete and unified memory spaces by leveraging the MemHierarchy interface in SST [5]. Specifically, we describe our efforts to integrate the functional and streaming-multiprocessor core models from the open-source simulator GPGPU-Sim into SST.
Chapter 2 Scheduler Component
The first step in integrating GPGPU-Sim into SST is to handle the interaction with an SST CPU component. Since GPUs today function solely as co-processors, functionally executing GPU-enabled binaries requires the CPU to initialize the GPU and launch kernels of work onto it. In our model, the GPU is constructed out of two discrete SST components: a scheduler and an SM block [1]. When CUDA functions are called from the CPU component, they are intercepted and translated into messages (along with the associated parameters) that are sent over SST links to the GPU. Table 2.1 enumerates the CUDA API calls currently intercepted and sent to the GPU elements. These calls are enough to enable the execution of a number of CUDA SDK kernels and DoE proxy apps, as well as a collection of Kokkos unit tests. Table 2.2 lists the number of Kokkos unit tests that pass with our current implementation of SST-GPU, about 60%. There is ongoing work with the PTX parser to increase the number of running kernels.
Aside from the basic functional model provided by SST-GPU, an initial performance model has also been developed. Figure 2.1 details the overall architecture. A CPU component (Ariel in the initial implementation) is connected via SST links to two GPU components: the SMs, which implement the timing and functional model for the GPU cores, and a centralized kernel and CTA scheduler (GPUSched). When CUDA calls are intercepted from the CPU, messages are sent to both the SMs and the GPU scheduler. Messages related to memory copies and other information necessary to populate the GPU functional model are sent directly to the SMs element, since the functional model for executing the GPU kernels lives inside the SMs element. Calls related to enqueuing kernels for execution are sent to the GPU scheduler element, which coordinates the distribution of work across the SMs. As CTAs complete on the SMs, messages are sent back to the GPU scheduler element, which pushes new work to the SMs from enqueued kernels as needed. Memory copies from the CPU to the GPU address space are handled at a configurable page-size granularity, similar to how conventional CUDA unified memory handles the transfer of data from CPU to GPU memories.
The centralized GPU scheduler receives kernel launch commands from the CPU, then issues CTA launch commands to the SMs. The scheduler also receives notifications from the SMs when CTAs finish. The reception of kernel launch and CTA completion notifications is independent; therefore, we designed a separate handler for each type of message. Figure 2.2 shows the design of the centralized kernel and CTA scheduler. The kernel handler listens to calls from a CPU component and pushes kernel launch information onto the kernel queue when it receives kernel configure and launch commands. The SM map table contains CTA slots for each of the SMs; a slot is reserved when a CTA is launched and released when a message indicating that the CTA has finished is received from the SM. Scheduler clock ticks trigger CTA launches to the SMs when space is available and there is a pending kernel. On every tick, the scheduler issues CTA launch commands for the currently unfinished kernel if any CTA slot is available, or tries to fetch a new kernel launch from the kernel queue. The CTA handler also waits for the SMs to reply with CTA-finish messages, so that CTA slots in the SM map table can be freed.
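To make the dispatch behavior concrete, the following is a minimal C++ sketch of the tick loop described above. It is not the actual SST-GPU scheduler implementation: the names (GpuScheduler, KernelLaunch, launchCta, and so on) are illustrative, and the SST clock-handler registration, links, and event classes are omitted.

#include <cstdint>
#include <queue>
#include <vector>

// Minimal sketch of the GPUSched dispatch loop; names are illustrative and
// all SST link/event plumbing is omitted.
struct KernelLaunch {
    uint64_t kernelId;
    unsigned ctasRemaining;   // CTAs of this kernel not yet dispatched
};

class GpuScheduler {
public:
    GpuScheduler(unsigned numSms, unsigned slotsPerSm)
        : freeCtaSlots(numSms, slotsPerSm) {}

    // Kernel handler: called when the CPU component sends a kernel launch command.
    void onKernelLaunch(const KernelLaunch& k) { kernelQueue.push(k); }

    // CTA handler: called when an SM reports a finished CTA, freeing its slot.
    void onCtaComplete(unsigned smId) { ++freeCtaSlots[smId]; }

    // Invoked on every scheduler clock tick.
    void tick() {
        if (kernelQueue.empty()) return;                 // no pending kernel
        KernelLaunch& k = kernelQueue.front();
        for (unsigned sm = 0; sm < freeCtaSlots.size() && k.ctasRemaining > 0; ++sm) {
            if (freeCtaSlots[sm] == 0) continue;         // no CTA slot on this SM
            --freeCtaSlots[sm];                          // reserve a slot in the SM map table
            --k.ctasRemaining;
            launchCta(sm, k.kernelId);                   // would send a CTA launch over an SST link
        }
        if (k.ctasRemaining == 0) kernelQueue.pop();     // all CTAs dispatched; fetch next kernel later
    }

private:
    // Stand-in for sending a CTA launch command to an SM.
    void launchCta(unsigned smId, uint64_t kernelId) { (void)smId; (void)kernelId; }

    std::queue<KernelLaunch> kernelQueue;   // kernels waiting to be scheduled
    std::vector<unsigned> freeCtaSlots;     // free CTA slots per SM (the SM map table)
};

Keeping the kernel and CTA handlers separate mirrors the design above, in which kernel launch commands and CTA completion notifications arrive independently.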
Chapter 3 Streaming-Multiprocessor Component
To support the GPGPU-Sim functional model, a number of the simulator's overloaded CUDA Runtime API calls were updated. Functions that originally assumed the application and simulator shared the same address space now support them being decoupled. Initialization functions, such as cudaRegisterFatBinary, now take the path to the original application in order to obtain the PTX assembly of the CUDA kernels. Supporting the functional model of GPGPU-Sim also requires transferring values from the CPU application to the GPU memory system. This is solved by leveraging the inter-process communication tunnel framework from SST-Core, as shown in Figure 3.1. Chunks of memory are transferred from the CPU application to the GPU memory system at the granularity of a page (4 KiB). The transfer of pages is a blocking operation; therefore, all stores to the GPU memory system must complete before another page is transferred or another API call is processed.
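The following is a small C++ sketch of this blocking, page-granularity copy path. It is not the actual SST-GPU code: a flat byte array stands in for the GPU memory system, and the function names (transferPageBlocking, memcpyHostToDevice) are illustrative. In the real model each page is pushed through the SST-Core IPC tunnel and written to the GPU memory system before the next page or API call is handled.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

static constexpr size_t kPageSize = 4096;           // 4 KiB transfer granularity
static std::vector<uint8_t> gpuMemory(1 << 20);     // stand-in for the GPU memory system

// Stand-in for one blocking page transfer: all stores for this page must
// complete before the next page (or the next CUDA API call) is processed.
static void transferPageBlocking(uint64_t gpuAddr, const uint8_t* src, size_t len) {
    std::copy(src, src + len, gpuMemory.begin() + static_cast<ptrdiff_t>(gpuAddr));
}

// Copies 'bytes' bytes from the CPU application into GPU memory, one page at a time.
void memcpyHostToDevice(uint64_t gpuDst, const void* hostSrc, size_t bytes) {
    const uint8_t* src = static_cast<const uint8_t*>(hostSrc);
    for (size_t offset = 0; offset < bytes; offset += kPageSize) {
        const size_t chunk = std::min(kPageSize, bytes - offset);
        transferPageBlocking(gpuDst + offset, src + offset, chunk);
    }
}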
To model GPU performance, the memory system of the public GPGPU-Sim is completely removed. Instead, all accesses to GPU memory are sent through SST links to the MemHierarchy interface. As Figure 3.2 shows, a multi-level cache hierarchy is simulated with the shared L2 sliced across the memory partitions, each with its own memory controller. Several backend timing models have been configured and tested, including SimpleMem, SimpleDRAM, TimingDRAM, and CramSim [3]; CramSim will be used to model the HBM stacks in the more detailed performance models. We have created an initial model of a GPU system similar to an Nvidia Volta. The configuration for the GPU, CramSim, and network components is shown in Listing 3.1.
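As one illustration of how a shared L2 can be sliced across memory partitions, the short C++ sketch below interleaves cache lines across a configurable number of slices. The scheme, line size, and slice count are assumptions for exposition only; the actual routing is performed by the MemHierarchy components, and the Volta-like cache parameters are set in Listing 3.1.

#include <cstdint>

// Hypothetical steering of GPU memory accesses to L2 slices: consecutive
// cache lines map to different slices, spreading traffic across the
// memory partitions and their controllers.
struct L2SliceMap {
    uint64_t lineSizeBytes;   // cache line size
    unsigned numSlices;       // one L2 slice per memory partition / controller

    unsigned sliceForAddress(uint64_t addr) const {
        return static_cast<unsigned>((addr / lineSizeBytes) % numSlices);
    }
};

int main() {
    const L2SliceMap map{128, 8};                 // e.g. 128 B lines, 8 partitions
    const unsigned slice = map.sliceForAddress(0x10080);  // maps to slice 1
    (void)slice;                                  // request would be routed to that slice's L2
    return 0;
}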
Conclusion
This report has detailed the first phase of the SST-GPU project, in which an execution-driven functional and performance model of a GPU has been integrated into SST. Initial results demonstrate significant application coverage. The next phase of the project will focus on further disaggregating the GPU model to enable GPU performance simulation at scale in a multi-process MPI simulation.
