Abstract-Heterogeneity is increasing at all levels of computing, most visibly with the rise of general-purpose computing on GPUs in everything from phones to supercomputers. More quietly, it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life at every level of computing, efficiently managing heterogeneous compute resources is becoming a critical task. The focus of my dissertation is developing methods and systems that allow software to adapt to the heterogeneous hardware it finds at runtime. The goal is to make the complex functions of heterogeneous computing autonomic, handling load balancing, memory coherence, and other performance-critical factors in the runtime. The investigation began by studying heterogeneity caused by system topology and resource contention in MPI applications. Since then, the focus has shifted to work-sharing across CPU and GPU resources for Accelerated OpenMP, and to automatically managing the hardware capability imbalances between these resources. Moving forward, I propose to produce a system that extends both previous approaches to offer work-sharing, topology-aware affinity management, and novel automated memory transformations that reduce communication and increase memory access efficiency.
I. INTRODUCTION
Over the past several years, the average computer has shifted from a serial machine with a single processor into a highly complex parallel system. The rise in intra-node complexity is an unavoidable consequence of the ever-growing trend toward parallelism as the main driver of performance in new processor and system designs. While we classically expect our supercomputers and high-performance workstations to have complex architectures, heterogeneity is now increasing throughout the computing ecosystem. Devices from supercomputer nodes to cell phones, and from CAD workstations to smart watches, now carry multiple CPU cores. Beyond the expansion to parallelism, computational coprocessors with separate instruction sets, architectures, and programming models are now being employed for general-purpose computation. Coprocessors power the fastest ranked supercomputer in the world as of this writing. They also accelerate operating system functions of the laptop on which this document was written and the smartphone sitting next to it.
Heterogeneity has become a fact of life in modern computing, and it only serves to complicate the already daunting task of bringing parallel computing into the mainstream. A modern application must employ parallelism in order to attain reasonable performance. Even then, it must adapt to a myriad of possible hardware targets and further to the shifting capabilities of each piece of hardware at runtime.
Automatically and transparently adapting applications to system resources is now critically important. Just as human beings rely on their autonomic nervous system to handle the intricate complexities of keeping their bodies running, programmers rely on programming models, schedulers, and middleware to manage the complexities of their hardware. The issue today is that our tools have not kept pace with the hardware. They no longer offer the level of performance portability or the depth of capabilities needed to exploit these systems for the performance and efficiency benefits heterogeneous designs can offer.
My dissertation seeks to address aspects of this problem by creating runtime systems and interfaces to automate the process of adapting applications to exploit heterogeneous architectures. The first major arc focuses on process affinity and mapping MPI applications to physically homogeneous, but effectively heterogeneous, hardware. By integrating with the MPICH2 MPI distribution, the Systems Mapping Manager (SyMMer) [6] library detects processor capability imbalances and automatically remaps processes for improved average performance. While SyMMer addresses issues of processor locality, it cannot address issues of fundamental load imbalance, nor can it address systems with physically heterogeneous processing elements. To address these issues, working inside the MPI runtime is no longer sufficient. Rather, we target a cross-device work-sharing construct for Accelerated OpenMP, and evaluate the capabilities with a library we call the Core Task-Size Adapting Runtime (CoreTSAR) [8]. Each of these addresses one aspect of automatically adapting applications to heterogeneous systems, and combining their strengths is planned as the culmination of my dissertation.
II. THE SYSTEMS MAPPING MANAGER (SYMMER)
My interest in this topic began with a physically homogeneous system that did not act like one. Repeated runs of a simple network bandwidth test would yield widely varied results, as depicted in Figure 1, with the slowest being as much as 30% slower than the fastest. After some investigation, it was determined that the network interrupts were being processed by the core with the lowest bandwidth. In contrast, the best performing core was the other core on the same die, with those on other dies in the middle. This phenomenon was caused by interrupts from the 10-Gigabit Ethernet network interface overloading the interrupted core, keeping it from making progress on any other work, while increasing the network cache locality for the other core on the die thanks to a shared L2 cache.
Over the following months, we developed the Systems Mapping Manager (SyMMer) runtime as part of MPICH2. It was designed to detect imbalances caused by a high-speed network interface, as well as natural imbalances in the compute phases of the application being run. Having determined those parameters, SyMMer would remap ranks across cores in a node to achieve consistently high performance. In addition, it would automatically resolve certain issues caused by imbalance, even problems which could only be detected across multiple nodes in a distributed system. The overall effect was highly positive for our test benchmarks, including GROMACS, LAMMPS, and the MPI implementation of the FFTW library. As shown in Figure 2, our evaluation shows that SyMMer can save as much as 50% of the total communication time in a LAMMPS run, and improves the execution time of GROMACS and FFTW by 10-15% and 3-5% respectively. These results were originally published at SC08 [6].
III. TASK SIZE ADAPTING RUNTIME: CORETSAR
Physically heterogeneous systems present a different kind of challenge. Rather than hiding the heterogeneity from the user, they require a user to explicitly implement their algorithm for each type of device in order to use it. In the best case, as with programming models like Accelerated OpenMP, a user can annotate a single source implementation to target multiple device types. Even in Accelerated OpenMP, however, there is no mechanism to work-share a parallel region across all available resources, CPUs and GPUs alike, as users are used to doing on homogeneous multicore systems.
In fact, current programming models offer neither work-sharing across CPUs and GPUs nor work-sharing across multiple GPUs. This is a result of the fact that they do not cross address space boundaries. Users can certainly employ all of the hardware in a system with existing models, but they must divide their work and data manually in order to do so. Some systems, such as StarPU and OmpSs, offer automatic load balancing of user-specified discrete tasks, but not work-sharing in the style of OpenMP. If such work-sharing is desired, users must implement it themselves. This can be a very daunting task, and generally results in static scheduling if it is done at all.
[Figure 2 panels: FFTW (lower is better); (C) GROMACS (higher is better).]
The ultimate goal is to enable OpenMP programmers to experiment with coscheduling and combinations of CPUs and GPUs without having to re-create the work necessary to split, replicate, and load balance their computation. This requires the compiler and runtime system to (1) split Accelerated OpenMP loop regions across compute devices, (2) manage the distribution of inputs and outputs while preserving the semantics of the original region, and (3) preserve the correctness of the computational code in the region. Most importantly, the solution must offer performance gains while remaining easy to use. Our solution is to offer a runtime system, and to propose an extension of Accelerated OpenMP or OpenACC, that automatically distributes and load balances parallel loop regions across the resources in a system.
In order to fill this void, I have created a prototype runtime, known as the Splitter library in [8] and as CoreTSAR in my continued work in [7], which is compatible with any CUDA/C code or Accelerated OpenMP implementation that emits CUDA-compatible code. This implementation operates on the principle that each iteration in an accelerated loop construct may be treated as a separate, though related, task and scheduled on GPUs or CPUs at will, assuming its inputs and outputs are available.
In order to support this, I have developed a new syntax for specifying the input and output of regions. Rather than specifying a rectangular region of memory to be copied in or out, it specifies the association between loop iterations and data elements. In this fashion, the data required by a range of iterations can be computed by the runtime, and transferred to and from any memory space in the system at will.
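As a rough illustration of this idea, the sketch below maps a range of loop iterations to the slice of a flat input array those iterations touch. It is not the syntax described above: the names `data_range_for`, `subset_for_device`, and `elems_per_iter`, and the assumption that iteration i touches one contiguous block of elements, are hypothetical and introduced only for this example.

```python
def data_range_for(iter_start, iter_end, elems_per_iter):
    """Return the half-open element range [lo, hi) needed by iterations
    [iter_start, iter_end).

    Assumes (hypothetically) that iteration i touches exactly the elements
    i * elems_per_iter up to (i + 1) * elems_per_iter - 1.
    """
    if iter_start > iter_end:
        raise ValueError("iter_start must not exceed iter_end")
    return iter_start * elems_per_iter, iter_end * elems_per_iter


def subset_for_device(data, iter_start, iter_end, elems_per_iter):
    """Copy only the elements a device needs for its share of the iterations."""
    lo, hi = data_range_for(iter_start, iter_end, elems_per_iter)
    return data[lo:hi]


# A 6-iteration loop over the rows of a 6x3 matrix stored as a flat list.
matrix = list(range(18))
gpu_part = subset_for_device(matrix, 0, 4, 3)  # rows for iterations 0..3
cpu_part = subset_for_device(matrix, 4, 6, 3)  # rows for iterations 4..5
```

Because the runtime knows the association rather than a fixed rectangle, any sub-range of iterations can be handed to any device together with exactly the data it needs.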
Given the capability to arbitrarily subset the range of tasks and distribute them across multiple devices, not only work-sharing but also load balancing and automated coscheduling become possible. In OpenMP, as in many other scheduling systems, there are two basic scheduler types available: the static schedule and the dynamic schedule. The OpenMP static schedule assigns an even amount of work, or at least an even number of iterations, to each thread in the team executing the region. This is an extremely simple and efficient way to distribute work, incurring almost no overhead and only a single synchronization point, but it does not adjust to either heterogeneous cores or heterogeneous iterations. The dynamic schedules, including guided, divide work into a number of "chunks," each of which is assigned to a thread when it runs out of work. These can effectively be thought of as work queues, where all threads contend for more work whenever they complete an allotment. The benefit of this approach is that load is effectively balanced with heterogeneous hardware, iterations, or any other condition that might occur. The downside is that the smaller the chunk size, the higher the synchronization overhead incurred in assigning work.
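The trade-off between the two schedule types can be sketched with a small simulation. This is not OpenMP code: `static_schedule` and `dynamic_schedule` are hypothetical helpers, and the model assumes each thread has a constant per-iteration cost and that chunk assignment itself is free, so the number of assignments stands in for synchronization overhead.

```python
import heapq


def static_schedule(n_iters, n_threads):
    """Even split: one half-open (start, end) range per thread."""
    base, extra = divmod(n_iters, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def dynamic_schedule(n_iters, chunk, secs_per_iter):
    """Simulate a work queue: whichever thread goes idle first takes the
    next chunk. secs_per_iter[t] is the assumed constant cost of one
    iteration on thread t.

    Returns (iterations completed per thread, number of chunk assignments).
    """
    done = [0] * len(secs_per_iter)
    idle_at = [(0.0, t) for t in range(len(secs_per_iter))]
    heapq.heapify(idle_at)
    next_iter, assignments = 0, 0
    while next_iter < n_iters:
        now, t = heapq.heappop(idle_at)        # earliest idle thread
        size = min(chunk, n_iters - next_iter)
        next_iter += size
        done[t] += size
        assignments += 1                        # one synchronization point
        heapq.heappush(idle_at, (now + size * secs_per_iter[t], t))
    return done, assignments


even = static_schedule(10, 3)
# Thread 1 is three times slower than thread 0.
balanced, syncs = dynamic_schedule(120, 10, [1.0, 3.0])
_, syncs_small_chunks = dynamic_schedule(120, 5, [1.0, 3.0])
```

In this model the dynamic schedule shifts work toward the faster thread, while halving the chunk size doubles the number of assignments, which is the overhead the paragraph above describes.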
On CPU systems, synchronization overhead for chunk schedulers is relatively minimal, requiring only a few atomic operations and no memory movement. In heterogeneous systems, however, at least those where some of the processors do not share cache-coherent memory, the scheduling and memory movement overhead can be quite high. The static approach has significantly lower overhead, but does not provide load balancing. Our solution to this is to generate a static split based on the capabilities of each device. We use an integer linear optimization program to compute the split of iterations, depicted in Figure 3. It is a conceptually simple program, minimizing the predicted difference in runtimes between all devices. If all devices run in the same amount of time, the automatic barrier at the end of the region will not cause any one device to idle.
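To make the goal of the split concrete, the sketch below uses a simpler stand-in for the linear program: when each device has a constant time per iteration p_j, assigning iterations in proportion to 1/p_j makes the predicted runtimes i_j * p_j approximately equal. This closed-form proportional split is a swapped-in technique for illustration, not the integer linear program of Figure 3; the function name and the largest-remainder rounding are assumptions.

```python
def proportional_split(total_iters, time_per_iter):
    """Split total_iters across devices so predicted runtimes are nearly equal.

    time_per_iter[j] plays the role of p_j (time per iteration on device j);
    the returned list plays the role of i_j. Illustrative stand-in only.
    """
    if any(p <= 0 for p in time_per_iter):
        raise ValueError("time per iteration must be positive")
    rates = [1.0 / p for p in time_per_iter]    # iterations per unit time
    total_rate = sum(rates)
    ideal = [total_iters * r / total_rate for r in rates]
    counts = [int(x) for x in ideal]            # round down first
    leftover = total_iters - sum(counts)
    # Hand remaining iterations to the largest fractional remainders.
    by_remainder = sorted(range(len(ideal)),
                          key=lambda j: ideal[j] - counts[j],
                          reverse=True)
    for j in by_remainder[:leftover]:
        counts[j] += 1
    return counts


# A device four times faster receives four times the iterations.
split = proportional_split(1000, [1.0, 4.0])
predicted = [n * p for n, p in zip(split, [1.0, 4.0])]
```

With equal predicted runtimes, no device waits at the barrier that ends the region, which is the property the linear program is described as optimizing.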
Using the linear program as a base, CoreTSAR contains four scheduling options: the static, adaptive, split, and quick schedulers. The CoreTSAR static scheduler runs the linear program once, using baseline inputs for device performance, and uses the resulting split for the rest of the application run. Adaptive is equivalent to static the first time it encounters the region; on each subsequent entrance into the region, the split is updated based on the performance of the previous runs. Split and quick each subdivide the iterations in the region to generate more scheduling points, allowing them to adapt faster but at the cost of higher overhead. They differ in that split divides the region into a number of equal parts, ten by default. Quick, on the other hand, divides it into only two parts, allowing the user to select the size of the first, and then switches to the adaptive schedule for all subsequent runs through the region.
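The difference between the static and adaptive options can be sketched as follows. This is a minimal model, not CoreTSAR's implementation: the class `AdaptiveSplit` and its methods are hypothetical, and a proportional split stands in for the linear program. Never calling `record` corresponds to the static behaviour; calling it after every pass corresponds to the adaptive one.

```python
class AdaptiveSplit:
    """Keeps a time-per-iteration estimate per device and re-splits on demand.

    Hypothetical sketch: a proportional split replaces the linear program.
    """

    def __init__(self, baseline_time_per_iter):
        # Baseline device performance, used for the first pass.
        self.p = list(baseline_time_per_iter)

    def split(self, total_iters):
        """Assign iterations in proportion to each device's estimated speed."""
        rates = [1.0 / p for p in self.p]
        total_rate = sum(rates)
        ideal = [total_iters * r / total_rate for r in rates]
        counts = [int(x) for x in ideal]
        leftover = total_iters - sum(counts)
        order = sorted(range(len(ideal)),
                       key=lambda j: ideal[j] - counts[j],
                       reverse=True)
        for j in order[:leftover]:
            counts[j] += 1
        return counts

    def record(self, counts, elapsed):
        """Update estimates from the measured time of the previous pass."""
        for j, (n, t) in enumerate(zip(counts, elapsed)):
            if n > 0:
                self.p[j] = t / n


sched = AdaptiveSplit([1.0, 1.0])      # baseline: devices look identical
first = sched.split(100)
# Suppose device 1 actually needs 4 time units per iteration.
sched.record(first, [first[0] * 1.0, first[1] * 4.0])
second = sched.split(100)
```

The split and quick options described above would call `split` and `record` several times within a single pass over sub-ranges of the region, trading extra scheduling points for faster adaptation.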
To evaluate the prototype, I implemented the necessary modifications in five benchmarks: GEM, an n-body molecular modeling application; cg, the conjugate gradient benchmark from NPB; kmeans, the classic feature clustering algorithm; Helmholtz, a discrete finite difference code implementing . . .
[Figure 3, Equation (1). Symbols: I = total iterations available; i_j = iterations for compute unit j; f_j = fraction of iterations for compute unit j; p_j = recent time/iteration for compute unit j; n = number of compute devices.]
. . . Figure 4. As the results show, using CoreTSAR to employ multiple GPUs, or multiple GPUs and CPUs, is almost always the right choice. Given the code required, the only numbers achievable at that complexity with the original system are the CPU and GPU values in the leftmost column. All the rest use CoreTSAR's facilities for, in some cases, a 4× or greater speedup and, even in GPU-averse cases, never worse than a 15% slowdown with an adaptive schedule.
IV. RELATED WORK
As my work sits at the intersection of task scheduling, workload and data partitioning, and heterogeneous programming models, there is a significant body of related work; this section discusses some of the most closely related. With the proliferation of GPUs and other computational accelerators, several programming models and task schedulers have been proposed specifically for these environments. The most mature of these are StarPU [2], [1] and OmpSs [4], [?]. As a class, these schedulers are designed to run general directed acyclic graphs of discrete tasks. While these can effectively target a complex heterogeneous system in an adaptive way, they address a different aspect of heterogeneous scheduling. They schedule asynchronous tasks across resources as the resources become idle, and attempt to preserve locality, relying on a large number of discrete tasks for performance and load balance. CoreTSAR addresses work-sharing parallel loops, or other data-parallel constructs, and partitions the work into a minimal subset of executable tasks. CoreTSAR creates tasks to fit the computational needs of devices at the time, reducing the dependence on user-splitting of work and reducing the overhead of managing fine-grained tasks.
Others have, however, investigated versions of the problem we tackle. Ayguadé et al. question the need for a schedule clause in OpenMP [3], instead proposing a history-based adaptive approach to balancing a parallel loop across cores. Their results are quite positive, and while they do not universally beat the existing OpenMP schedules, the general design has influenced the heterogeneous load-balancing techniques used in CoreTSAR. Qilin [5] presents a novel heterogeneous programming API that supports adaptive scheduling between CPUs and a GPU in the form of a C/C++ template library, much like NVIDIA's Thrust, which operates on special array structures. While the scheduler in Qilin uses an adaptive approach similar to that used in our previous work on HTS [8], supporting one GPU, it is calculated in a training pass and simply reused in later runs.
V. ONGOING WORK AND CONCLUSIONS
In my work to date, I have shown that heterogeneous scheduling can improve the performance and programmability of software by automatically adapting the application to the system's capabilities. Each of the previous systems targeted a particular issue and approach. SyMMer improves the locality of an application by remapping MPI processes. CoreTSAR improves balance by altering the work assigned to each resource. These are complementary goals, which should be addressed together. Unfortunately, SyMMer, operating on MPI processes (one per core), and CoreTSAR, operating in OpenMP, are incompatible.
The concepts underlying them are not incompatible, however. My ongoing work focuses on creating what I currently call AffinityTSAR. It will be capable of exploiting the memory hierarchy of heterogeneous and NUMA-based systems, as well as a given device's natural affinity for memory ordering, in novel ways. Our proposed system will explore the potential benefits of automatically replicating, re-associating, and reshaping memory regions. For example, it could automatically transpose memory on load into a device, pack data subsets so as not to waste padding to preserve offsets, or automatically migrate or cache memory regions across memory nodes. The simplest of these, software caching onto remote memory nodes, has been found in preliminary tests to increase overall performance of a simple matrix multiplication workload by up to 50% in a system with only two memory nodes.
