Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of unprecedented amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale.
INTRODUCTION
As the architectures of high-performance computing (HPC) systems evolve, there is a growing need to understand and quantify the performance and design benefits of emerging technologies. Complicating the design space, General-Purpose Graphics Processing Units (GPGPUs) and other compute accelerators, which are needed to handle the growing demands of compute-heavy workloads, have become a necessary component in both high-performance supercomputers and datacenter-scale systems. That the first exascale machines will leverage the massively parallel compute capabilities of GPUs [11, 10, 14] is indicative of the growing necessity of accelerator-based node architectures to obtain high compute throughput. As the software stack and programming model of GPUs and their peer accelerators continue to improve, there is every indication that this trend of accelerator integration will continue, leading to a diverse ecosystem of technologies. GPUs are likely to continue to play a role, either as discrete accelerators or integrated as part of an SoC. As a result, architects who wish to study the design of large-scale systems will need to evaluate system and software designs using a GPU model. However, all publicly available cycle-level simulators (e.g., GPGPU-Sim [3]) have to date focused on single-node performance. To study the problem at scale, and to permit larger workloads to be evaluated, a parallelizable, multi-node GPU simulator is necessary.
The Structural Simulation Toolkit (SST) [13] is a parallel discrete-event simulation framework that provides an infrastructure capable of modeling a variety of high-performance computing systems at many different scales. SST is currently used by a wide variety of government agencies and computer manufacturers to design and simulate HPC architectures. Supported by a Python and C++ code base with a large array of customization options, it offers the HPC community powerful, highly customizable tools to create and integrate models for evaluating current and future HPC node architectures and interconnect networks. What has been lacking up to this point is a method to integrate accelerators into a node model in SST. This report builds upon previous work [9], providing more details on our efforts to integrate an open-source GPGPU simulator, GPGPU-Sim, into SST. This integration effort will give SST users the ability to run GPGPU-based simulations using the Balar GPU component and will serve as a model for future accelerator integration studies.
GPGPU-SIM INTEGRATION WITH SST
The latest version of GPGPU-Sim is a (mostly) object-oriented design written in C++ with a sharp delineation between the functional and timing models of the system. A high-level overview of the organization is shown in Figure 2-1. The SIMT blocks can be thought of as NVIDIA-like Streaming Multiprocessors (SMs) or AMD-like Compute Units (CUs). The interconnect is currently a modified version of BookSim [12], which can model a variety of topologies, and the memory partition is a custom model. The remainder of this section describes the integration of GPGPU-Sim with SST.
SCHEDULER
The first step in integrating GPGPU-Sim into SST is to handle the interaction with an SST CPU component. Since GPUs today function solely as co-processors, functionally executing GPU-enabled binaries requires the CPU to initialize and launch kernels of work to the GPU. In our model, the GPU is constructed out of two discrete SST components: a scheduler and an SM block [2]. A block diagram of the model is shown in Figure 2-2. The CPU component (Ariel in the initial implementation) is connected via SST links to two GPU components: the SMs, which implement the timing and functional model for the GPU cores, and a centralized kernel and CTA scheduler (GPUSched). When CUDA calls are intercepted from the CPU, messages are sent to the SMs and to the GPU scheduler, depending on the type of call. Messages related to memory copies and other information necessary to populate the GPU functional model are sent directly to the SM elements, since the functional model for executing the GPU kernels lives inside the SM elements. Calls related to enqueuing kernels for execution, e.g. cudaConfigureCall and cudaLaunch, are sent to the GPU scheduler element, which coordinates the launching of CTAs on the SMs.
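The routing rule described above can be illustrated with a small sketch. It is not the Balar implementation: the function name `route_call` and the destination labels are hypothetical, and the only call names taken from the text are cudaConfigureCall and cudaLaunch. Every other call is assumed here to go to the SM elements.

```python
# Toy illustration only, not the Balar code: route an intercepted CUDA runtime
# call to one of two hypothetical destinations, following the split in the text.
SCHEDULER_CALLS = {"cudaConfigureCall", "cudaLaunch"}  # the two calls named in the text


def route_call(api_name: str) -> str:
    """Return 'scheduler' for kernel-enqueue calls and 'sm' for all other calls.

    Assumption: anything that is not a kernel-enqueue call (memory copies and
    other functional-model state) is delivered to the SM elements.
    """
    return "scheduler" if api_name in SCHEDULER_CALLS else "sm"


if __name__ == "__main__":
    for name in ("cudaMemcpy", "cudaConfigureCall", "cudaLaunch"):
        print(name, "->", route_call(name))
```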
Figure 2-2. SST Element Architecture for Kernel/CTA Scheduler and SMs Components
As CTAs complete on the SMs, messages are sent back to the GPU scheduler element, which pushes new work to the SMs from enqueued kernels as needed. Memory copies from the CPU to GPU address space are handled on a configurable page-size granularity, similar to how conventional CUDA unified memory handles the transfer of data from CPU to GPU memories.
The centralized GPU scheduler receives kernel launch commands from the CPU, then issues CTA launch commands to the SMs. The scheduler also receives notifications from the SMs when the CTAs finish. Kernel launch and CTA completion notifications arrive independently, so a separate handler was designed for each type of message. Figure 2-3 shows the design of the centralized kernel and CTA scheduler. The kernel handler listens to calls from a CPU component and pushes kernel launch information to the kernel queue when it receives kernel configure and launch commands. The SM map table contains CTA slots for each of the SMs; a slot is reserved when a CTA is launched and released when a message indicating that the CTA has finished is received from the SMs. The scheduler clock ticks trigger CTA launches to the SMs when space is available and there is a pending kernel. On every tick, the scheduler issues a CTA launch command for a currently unfinished kernel if any CTA slot is available, or otherwise tries to fetch a new kernel launch from the kernel queue. The CTA handler waits for the SMs to reply with CTA finish messages so that CTA slots in the SM map table may be freed.
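The interplay between the kernel queue, the SM map table, and the clock tick can be sketched as a toy model. This is a simplified illustration and not the GPUSched source: the class and method names are hypothetical, each SM is assumed to have a fixed number of CTA slots, and at most one CTA is assumed to be issued per tick.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    num_ctas: int
    issued: int = 0  # CTAs already sent to an SM


class ToyScheduler:
    """Hypothetical stand-in for the centralized kernel/CTA scheduler."""

    def __init__(self, num_sms: int, slots_per_sm: int):
        self.kernel_queue = deque()                 # pending kernel launches
        self.free_slots = [slots_per_sm] * num_sms  # SM map table: free CTA slots per SM
        self.current = None                         # kernel currently being issued

    def on_kernel_launch(self, kernel: Kernel) -> None:
        """Kernel handler: enqueue a launch received from the CPU."""
        self.kernel_queue.append(kernel)

    def on_cta_finish(self, sm_id: int) -> None:
        """CTA handler: release the slot of a CTA that finished on sm_id."""
        self.free_slots[sm_id] += 1

    def tick(self):
        """Issue at most one CTA; return (kernel name, CTA id, SM id) or None."""
        # Fetch the next kernel once the current one has no CTAs left to issue.
        while self.current is None or self.current.issued >= self.current.num_ctas:
            if not self.kernel_queue:
                return None
            self.current = self.kernel_queue.popleft()
        # Reserve the first free slot in the SM map table.
        for sm_id, free in enumerate(self.free_slots):
            if free > 0:
                self.free_slots[sm_id] -= 1
                cta_id = self.current.issued
                self.current.issued += 1
                return (self.current.name, cta_id, sm_id)
        return None  # every slot is occupied; retry on a later tick


if __name__ == "__main__":
    sched = ToyScheduler(num_sms=2, slots_per_sm=1)
    sched.on_kernel_launch(Kernel("k0", num_ctas=3))
    print(sched.tick(), sched.tick(), sched.tick())  # third tick stalls: no free slot
    sched.on_cta_finish(0)
    print(sched.tick())
```

Keeping the launch path (tick) separate from the two message handlers mirrors the independence of kernel launch and CTA completion messages noted in the text.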
STREAMING-MULTIPROCESSOR
To support the GPGPU-Sim functional model, a number of the simulator's overloaded CUDA Runtime API calls were updated. A number of functions that originally assumed the application and simulator were within the same address space now support them being decoupled. Initialization functions, such as __cudaRegisterFatBinary, now take paths to the original application to obtain the PTX assembly of CUDA kernels.
Figure 2-4. SST Link and IPCTunnels for Functional Model Support
Supporting the functional model of GPGPU-Sim also requires transferring values from the executing CPU application to the GPU memory system. This is solved by leveraging the inter-process communication tunnel framework from SST-Core, as shown in Figure 2-4. Chunks of memory are transferred from the CPU application to the GPU memory system at the granularity of a standard memory page (4 KiB). The transfer of pages is a blocking operation, so all stores to the GPU memory system must be completed before another page is transferred or another API call is processed.
To complete the integration with SST, everything except the SIMT units is replaced with SST components, as shown in Figure 2-5. The host CPU is swapped for an SST execution component, such as Ariel or Juno. The interconnect is swapped for an SST networking component, such as Shogun, Kingsley, or Merlin. The memory partition, caches, and ports are swapped for SST memHierarchy components: caches and backing store models such as TimingDRAM, SimpleMem, or CramSim. A detailed description of how to use the GPU component can be found in [9]. The remainder of this paper is dedicated to a discussion of performance.
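The page-granular, blocking transfer can be sketched as follows. This is an illustration under stated assumptions rather than the SST-Core tunnel code: the function names are hypothetical, the device memory is modeled as a plain bytearray, and chunks are measured from the base address without regard to page alignment.

```python
PAGE_SIZE = 4096  # 4 KiB, the page granularity stated in the text


def split_into_pages(base_addr: int, data: bytes, page_size: int = PAGE_SIZE):
    """Yield (address, chunk) pairs; each chunk is at most one page long."""
    for offset in range(0, len(data), page_size):
        yield base_addr + offset, data[offset:offset + page_size]


def blocking_copy(device_mem: bytearray, base_addr: int, data: bytes) -> int:
    """Copy data into device_mem one page at a time and return the page count.

    The loop models the blocking behavior: the stores for one page are
    finished before the next page is taken up.
    """
    pages = 0
    for addr, chunk in split_into_pages(base_addr, data):
        device_mem[addr:addr + len(chunk)] = chunk  # all stores for this page
        pages += 1
    return pages


if __name__ == "__main__":
    gpu_mem = bytearray(4 * PAGE_SIZE)
    n = blocking_copy(gpu_mem, PAGE_SIZE, bytes([7]) * 10000)
    print("pages transferred:", n)
```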
PERFORMANCE ANALYSIS
This chapter is devoted to an analysis of the functional and timing models in SST when using the GPU component as a part of a generalized compute node. Since GPUs in HPC function solely as co-processors, functionally executing GPU-enabled binaries requires the CPU to initialize and launch kernels of work to the GPU. Figure 3-1 shows the general simulation model used for the evaluation. The Command Link and Data Link serve as the transport mechanism for the CPU (host) to launch and coordinate work with the GPU (device). The other components in each model can be tailored to fit any host and device that one wishes to model. 
FUNCTIONAL TESTING
The functional correctness of the model was validated using the unit tests from the Kokkos Kernels suite [15]. The unit tests were compiled using the parameters in Table 3-1. The target node architecture was assumed to be an Intel Broadwell CPU attached to an NVIDIA Pascal GPU. This target architecture was chosen based on hardware availability, specifically Sandia's Doom cluster, which is based on the CTS-1 procurement. The SST model is derived from Figure 3-1 using the model parameters in Table 3-2 to represent an NVIDIA P100 SXM2 [1]. Tests that pass are marked in green. Of the remaining tests, all but the yellow tests fail in both SST-GPGPU and GPGPU-Sim.
The tests in pink fail because the PTX parser cannot locate a post-dominator. This can happen when the control flow is too complex for the parser and a path from the point of divergence to a convergence point for all of the threads cannot be found. There are plans to work with the Kokkos Kernels developers to find a solution. The tests in purple fail because of a bug in the parser: it creates an empty file where the PTX for the kernel should be, and the simulator crashes when it attempts to read from it. Neither the SST developers nor the GPGPU-Sim developers have been able to locate the problem. The tests in red fail because the problem size overflows the address space. Both teams are actively working on a fix for this. The two remaining tests, in yellow, run to completion and pass in GPGPU-Sim but have run for more than 10 days without completion in Balar. It is believed that they would complete successfully if given more run time.
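To make the post-dominator failure mode concrete, the sketch below computes post-dominator sets for a toy control-flow graph with the textbook iterative algorithm and then picks the immediate post-dominator of a branch as its reconvergence point. It is not the GPGPU-Sim parser code. The graph, the node names, and the assumption that threads reconverge at the immediate post-dominator of the divergent branch are illustrative only.

```python
def post_dominators(succ, exit_node):
    """Return {node: set of nodes that post-dominate it} for a CFG.

    succ maps every node (including exit_node) to its list of successors.
    """
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}  # start from the full set and shrink
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succ_sets = [pdom[s] for s in succ[n]]
            common = set.intersection(*succ_sets) if succ_sets else set()
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def immediate_post_dominator(pdom, node):
    """Return the closest strict post-dominator of node, or None if none exists."""
    strict = pdom[node] - {node}
    for d in strict:
        if pdom[d] == strict:  # d is post-dominated by all other candidates
            return d
    return None


if __name__ == "__main__":
    # Hypothetical if/else diamond: both paths meet again at "join".
    cfg = {
        "entry": ["branch"],
        "branch": ["then", "else"],
        "then": ["join"],
        "else": ["join"],
        "join": ["exit"],
        "exit": [],
    }
    pdom = post_dominators(cfg, "exit")
    print("reconvergence point of branch:", immediate_post_dominator(pdom, "branch"))
```

When no such node can be found for a divergent branch, there is no point at which all threads are known to meet again, which corresponds to the failure described for the pink tests.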
CORRELATION WITH VOLTA
A validation sweep was run using two kernels and a mini-app. The three applications were run using an SST model that approximates an NVIDIA V100 attached to a CPU. The simulation parameters are shown in Table 3-4. The overall kernel runtime was compared with the results of running the three applications through nvprof on Sandia's Waterman testbed, which is composed of IBM Power9 CPUs and NVIDIA Volta GPUs. Table 3-5 shows the total number of cycles that each application took on the SST-GPU model and on the native V100. Note that this counts only cycles in which a kernel was running and does not include host execution time. There are challenges in isolating the cause of the performance gaps. This is one of the largest, if not the largest, node simulations that has been run, with 139 unique components and 906 links (the statistics output contains nearly 20k unique entries). The complex model interactions and scale make it difficult to pinpoint where models are lacking in detail or are incorrect. Turning on debug output for even a small run can produce multi-terabyte output files. That being said, the authors do have some intuition into why there are gaps and how to close them.
Vector Addition
The vectorAdd application is from the CUDA SDK with error checking removed. It implements element-by-element vector addition using an array with 163840 elements.
vectorAdd contains a single kernel with a single invocation that, essentially, streams through memory performing integer operations. It was expected that this would have a higher correlation, but the large number of memory dependencies and memory operations makes the results highly dependent on the model for the backing store. A number of models were tried and flaws were found in all of them. With the exception of CramSim, all of the models are derived from simple DRAM models and are unable to accurately replicate the behavior of HBM. It is believed that there is an issue in the memory controller that CramSim uses and that, once this is solved, it will serve as a good model for HBM2. However, the timingDRAM model clearly provides enough detail for kernels that are not bottlenecked by memory bandwidth.
3.2.2. LU Decomposition
The lud application is from the Rodinia benchmark suite [4] [5] and implements the LU decomposition algorithm to solve a set of linear equations using a 256x256 element matrix.
The lud application from Rodinia contains 3 kernels with 46 total kernel launches. lud has the worst correlation. The perimeter and diagonal kernels occupy the majority of the compute time: diagonal has 16 invocations and consumes 63% of the time; perimeter has 15 invocations and consumes 22% of the time; internal has 15 invocations and consumes 14% of the time. perimeter and diagonal spend 50% and 80% of their time inactive, respectively, due to the number of divergences. Given that LULESH has a much greater diversity of instructions, including FP64, and given the previously reported issues determining control flow, it is unlikely that the problem lies in the ALU models and more likely that the issues stem from how the GPU model handles divergences or from complex issues exposed by the differences in using PTX versus SASS.
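The figures quoted for lud can be cross-checked with a few lines of arithmetic. The sketch below only recombines the numbers reported above. The weighted estimate rests on an assumption that is not made in the text: that a kernel's inactive fraction can be multiplied by its share of the runtime to approximate how much of the total kernel time is spent inactive. The dictionary layout and the function name are hypothetical.

```python
# Figures as reported in the text; the inactive fraction of 'internal' is not given.
LUD_KERNELS = {
    "diagonal": {"invocations": 16, "time_share": 0.63, "inactive": 0.80},
    "perimeter": {"invocations": 15, "time_share": 0.22, "inactive": 0.50},
    "internal": {"invocations": 15, "time_share": 0.14, "inactive": None},
}


def summarize(kernels):
    """Return (total launches, summed time share, weighted inactive share)."""
    launches = sum(k["invocations"] for k in kernels.values())
    share = sum(k["time_share"] for k in kernels.values())
    # Assumption: inactive fraction times runtime share approximates the
    # portion of total kernel time lost to inactivity in that kernel.
    inactive = sum(
        k["time_share"] * k["inactive"]
        for k in kernels.values()
        if k["inactive"] is not None
    )
    return launches, share, inactive


if __name__ == "__main__":
    launches, share, inactive = summarize(LUD_KERNELS)
    print(f"launches={launches}, time accounted for={share:.0%}, "
          f"estimated inactive share={inactive:.1%}")
```

Under that assumption, roughly 61% of the total kernel time falls into inactive periods of the diagonal and perimeter kernels, which is consistent with divergence handling being the first place to look.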
LULESH
LULESH is one of the most widely used mini-applications developed by the US Department of Energy. The code was originally developed by Lawrence Livermore National Laboratory to represent challenging hydrodynamics algorithms that are performed over unstructured meshes [7] [8] .
Such algorithms are common in many high-performance computing centers and are particularly prevalent within the NNSA laboratories. In the original LULESH specification, the authors state that such algorithms routinely count in the top ten application codes in terms of CPU hours utilized [7] .
The unstructured nature of LULESH presents challenges for the design of memory subsystems, not least because operands are gathered from a fairly limited locale, but sparsely. This makes efficient streaming and vectorization of the data operations difficult and places additional pressure on the memory subsystem (typically the L2 caches) to provide operands quickly.
For this experiment, the problem size was set to 22 with 50 iterations, leading to an application that contains 26 kernels with 1400 total invocations. The top three kernels, in terms of execution time, provided a good mix of operations, shown in Table 3-6. The diversity of operations in LULESH, compared to the other two applications, obscures the areas where the simulation is lacking, leading to higher correlation with the V100 target platform.
It's clear that a more detailed study is needed to isolate the weaknesses in the models. 
3.3. LULESH PERFORMANCE STUDY
A parameter sweep was performed using LULESH, described in Section 3.2.3. The device clock was swept over 500 MHz, 1312 MHz, and 1800 MHz. The memory clock was swept over 877 MHz, 1200 MHz, and 1600 MHz. Figure 3-2 shows the results, where lower runtime is better.
As expected, changing the frequency of the backing store has little effect on LULESH for this problem size because it is not memory bandwidth bound. The most improvement is seen at the low device clock frequency, but at this frequency the speedup is still small at 1.04x. However, increasing the frequency of the SMs does improve the performance noticeably. Going from 500MHz to 1312MHz shows a 2.5x speedup; going from 1312MHz to 1800MHz shows a further 1.3x speedup.
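One way to read these speedups is to compare them with the ratio of the device clock frequencies, which is the speedup a purely compute-bound kernel would see under ideal scaling. The sketch below does this with the numbers quoted above. The function names are hypothetical, and ideal linear scaling with clock frequency is an assumption used only as a reference point.

```python
def ideal_speedup(f_low_mhz: float, f_high_mhz: float) -> float:
    """Speedup expected if runtime scaled inversely with the device clock."""
    return f_high_mhz / f_low_mhz


def scaling_efficiency(f_low_mhz: float, f_high_mhz: float, observed: float) -> float:
    """Fraction of the ideal clock-ratio speedup that was actually observed."""
    return observed / ideal_speedup(f_low_mhz, f_high_mhz)


if __name__ == "__main__":
    # (low clock, high clock, speedup reported in the text)
    steps = [(500, 1312, 2.5), (1312, 1800, 1.3)]
    for low, high, observed in steps:
        print(f"{low} -> {high} MHz: ideal {ideal_speedup(low, high):.2f}x, "
              f"reported {observed}x, "
              f"efficiency {scaling_efficiency(low, high, observed):.0%}")
```

Both steps recover roughly 95% of the clock ratio, which fits the observation that LULESH at this problem size is bound by the SMs rather than by memory bandwidth.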
Although this was a small study, one can imagine being able to run a more complete parameter sweep over any of the Balar parameters. 
CONCLUSION
This report described the final integration of phase one of the SST-GPU project. Functional validation against the Kokkos Kernels unit tests shows that the GPU component can successfully run more than 60% of the tests with a path to reach a coverage of greater than 90%. Correlation with the Waterman V100 testbed is excellent, showing 4-22% error in the runtime for the applications considered. The next phase of the project will focus on further disaggregating the GPU to enable truly scaled GPU performance in a multi-process MPI simulation.
