Because of technology and capital cost limitations, supercomputer systems are becoming increasingly complex. These systems provide the expected compute capability at the cost of deeper memory hierarchies, heterogeneous compute elements, and heterogeneous memories. Users of these systems need to determine the mapping of MPI tasks, OpenMP/POSIX threads, and OpenMP/CUDA kernels to the underlying hardware resources. Not only can this be challenging, but when executing the same application on a different system, the mapping will likely have to change to attain reasonable performance. This work presents a memory-centric algorithm to map a parallel hybrid application to the underlying hardware resources transparently, efficiently, and portably from an application's point of view. There are two fundamental aspects of this algorithm. First, unlike existing mappings, its primary design point is the memory system. Compute elements are selected based on the identified memory components and not vice versa. Second, it embodies a global awareness of hybrid programming abstractions as well as heterogeneous devices.
INTRODUCTION
Because of technology and capital cost limitations, supercomputer systems are becoming increasingly complex. These complex systems provide the expected compute capability at the cost of deeper memory hierarchies, heterogeneous compute elements, and heterogeneous memories. For example, the Lawrence Livermore National Laboratory 2018 CORAL machine [5] will include latency-optimized cores (IBM Power9 processors), throughput-optimized cores (NVIDIA Volta GPUs), multiple memory domains (GPU memory and host memory), as well as several types of memory (volatile stacked and non-volatile memory). Leveraging these resources efficiently will be a significant challenge for applications and necessitates solutions at all levels of the software stack from the operating system and resource manager to programming abstractions such as RAJA [9], Kokkos [4], and OpenMP 4+ [1].
A predecessor of the CORAL 2018 system includes two Power8+ processors and four Pascal GPUs per compute node. Each processor consists of 10 cores, each with eight hardware threads. This results in one node having 160 CPU processing units and four GPUs. A pair of GPUs is connected to each processor through the NVLINK interconnect. A user of this system needs to determine the mapping of MPI tasks, OpenMP/POSIX threads, and OpenMP/CUDA kernels to the hardware resources. Not only can this be challenging, but when executing the same application on a different system (e.g., a dual-socket Intel Xeon-based commodity machine), the mapping will likely have to change to attain reasonable performance.
This work introduces an algorithm to map a parallel application to the underlying hardware resources transparently, efficiently, and portably from an application's point of view. Unlike existing mappings or affinity approaches, the primary concern of the proposed algorithm is the memory system. Compute elements are chosen based on the identified memory components and not vice versa.
LIMITATIONS OF EXISTING AFFINITY APPROACHES
There are several mechanisms to enforce a mapping of processes and threads to hardware resources. Examples include OpenMP and pthread affinity, the portable hardware locality (hwloc) package [2, 6, 7], and utilities such as numactl and taskset. These mechanisms can be classified into two types. The first uses high-level APIs, which require application modifications and can be applied to the code with fine granularity. The second involves setting environment variables or launching the application under the control of a different program, and is applied at a coarse granularity (the whole program). Affinity approaches use these mechanisms to implement mapping policies. Affinity policies, or affinity approaches, are in charge of the actual mapping of processes and threads to the underlying hardware resources [2, 3, 8, 10, 11]. These policies vary along several dimensions. The first is the programming abstraction, which is mostly OpenMP threads or MPI processes. The second is the static or dynamic nature of the application program, i.e., whether the number of workers (processes or threads) changes during the life of the program. The third is the optimization focus, such as improving inter-process communication or enabling faster synchronization among threads. In the rest of this section, I examine important limitations of existing affinity approaches.
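As a minimal illustration of the first type of mechanism (an API call made from inside the program), the following Python sketch restricts the calling process to a single CPU and then restores the original mask. It relies on os.sched_setaffinity and os.sched_getaffinity, which are available on Linux only; the helper name pin_current_process is hypothetical and introduced just for this example.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the affinity mask reported by the OS after the call.
    Linux only: os.sched_setaffinity is not available elsewhere.
    """
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    first = min(allowed)
    print(pin_current_process({first}))  # now confined to one CPU
    os.sched_setaffinity(0, allowed)     # restore the original mask
```

The second type of mechanism achieves a comparable effect without touching the program, for example by launching it under taskset or numactl.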
Lack of awareness of programming abstractions interoperation
While the mechanisms for binding processes and threads to hardware resources exist, the policies to achieve an efficient application mapping are significantly limited. A key limitation is that the affinity policy of one programming abstraction is unaware of the policy of another. For example, OpenMP provides policies to spread the threads among resources or keep them close to a master thread. At the same time, MPI implements affinity policies at the process (task) level, including assigning processes on a per-core or per-socket basis. It is up to the application developer to combine these policies in a congruent and efficient way to avoid a negative effect on application performance.
The memory system is not the primary consideration
The primary consideration of many affinity policies is the compute resources. The OpenMP spread, close, and master policies, for example, are based on places, whose predefined values include threads, cores, and sockets, i.e., compute abstractions. This is also the case for the MPI core and socket policies. Many applications, however, have sufficient compute resources, while the critical performance limiter is in the memory hierarchy. For HPC, in particular, memory latency and memory bandwidth are key performance considerations in current and emerging architectures.
Lack of integrated accelerator or device support
While there are device- and accelerator-specific affinity mechanisms, there is no integration of these into a single arbiter that would coordinate with the traditional processor affinity policies. Again, it is up to the user to employ these mechanisms properly for efficient application execution. This case is similar to the lack of awareness of interoperation between programming abstractions above, but applied to heterogeneous devices on a node.
MPIBIND
mpibind is an algorithm that provides a mapping of processes and threads to hardware resources including traditional CPUs and heterogeneous devices such as GPU (Graphics Processing Unit) and NIC (Network Interface Controller) devices. The fundamental aspects of mpibind that enable efficient mappings over existing policies include a focus on the memory hierarchy as a primary design consideration and a global awareness of hybrid programming abstractions as well as heterogeneous devices. At a high level, mpibind assigns processes and threads (from all workers in a hybrid program) to memory elements rather than compute elements. For example, a process or thread may be assigned a NUMA domain, a shared L2 cache, or a private L1 cache. Compute resources and accelerators are then assigned based on their proximity or locality to the assigned memory element. The detailed algorithm follows below. Given a hybrid parallel (Processes+Threads+Kernels) application A with N processes, map each worker (e.g., task or thread) to hardware resources as described below. A common instance of A is an MPI+OpenMP+CUDA program.
The algorithm
(1) Determine the machine hardware topology, i.e., the hardware components and their interconnections.
(2) Create a tree $G$ of the memory system hierarchy, attaching compute resources, e.g., cores and GPUs, to their corresponding memory vertices (similar to hwloc's topology). Figure 1 shows an example. Assume $\mathrm{height}(G) = h$, vertical level $k$ of the tree is identified with $l_k$, and $p$ is the smallest processing element type, e.g., a hardware thread.
(3) Identify whether the input application uses processes-and-threads or processes-only. If A is threaded and the number of threads per process is specified, save this value into nthreads.
(4) Calculate the number of workers,
If A is threaded and nthreads was not given,
(5) Starting from the top of the tree ($l_0$) and traversing downwards, determine the smallest $k$ such that,
(6) For each vertex in $l_k$, traverse down the tree and select a set of processing units (PU) at the leaves. This results in the mapping $m' : \mathrm{vertices}(l_k) \to PU$.
(7) Using a 1:1 mapping between the workers and the vertices in level $l_k$, use $m'$ to bind each worker to the corresponding processing units. This results in the following mapping:
(8) For each worker, bound to processing units $\{p_i\}$, assign devices that are local or physically closer to $\{p_i\}$. This step, for example, determines the GPU and NIC to use in a multiprocess, multi-GPU, multi-NIC system. NVM devices can be assigned similarly.
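The steps above can be illustrated with a small Python sketch on a toy topology. It is not the mpibind implementation. It assumes that the number of workers is N × nthreads when nthreads is given and N otherwise, that the selected level is the first one with at least as many vertices as workers, that every leaf processing unit below a vertex is selected, and that a worker receives all GPUs attached to its nearest ancestor that has any. All names (Vertex, toy_node, mpibind_sketch) and the toy node layout (two NUMA domains, each with two GPUs and two shared L2 caches) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    """A memory vertex; leaf vertices carry processing units (PUs)."""
    name: str
    children: list = field(default_factory=list)
    pus: list = field(default_factory=list)    # PUs attached to a leaf
    gpus: list = field(default_factory=list)   # devices local to this vertex
    parent: object = field(default=None, repr=False)


def add(parent, child):
    child.parent = parent
    parent.children.append(child)
    return child


def toy_node():
    """Hypothetical node: 2 NUMA x 2 L2 x 2 L1, two PUs per L1."""
    root, pu = Vertex("machine"), 0
    for n in range(2):
        gpus = [f"gpu{2 * n}", f"gpu{2 * n + 1}"]
        numa = add(root, Vertex(f"numa{n}", gpus=gpus))
        for c in range(2):
            l2 = add(numa, Vertex(f"l2-{n}.{c}"))
            for _ in range(2):
                add(l2, Vertex(f"l1-{pu // 2}", pus=[pu, pu + 1]))
                pu += 2
    return root


def levels(root):
    """Return the tree levels l_0, l_1, ... in breadth-first order."""
    out, current = [], [root]
    while current:
        out.append(current)
        current = [c for v in current for c in v.children]
    return out


def pus_below(v):
    """Collect the PUs at the leaves below vertex v (the mapping m')."""
    if not v.children:
        return list(v.pus)
    return [p for c in v.children for p in pus_below(c)]


def local_gpus(v):
    """Walk up from v to the nearest vertex that has GPUs attached."""
    while v is not None and not v.gpus:
        v = v.parent
    return list(v.gpus) if v is not None else []


def mpibind_sketch(root, nprocs, nthreads=None):
    # Assumption: workers = N * nthreads if nthreads is given, else N.
    nworkers = nprocs * nthreads if nthreads else nprocs
    lv = levels(root)
    # Assumption: first level with enough vertices, else the deepest one.
    level = next((l for l in lv if len(l) >= nworkers), lv[-1])
    mapping = {}
    for w in range(nworkers):
        v = level[w % len(level)]  # wraps only if workers exceed vertices
        mapping[w] = (pus_below(v), local_gpus(v))
    return mapping


if __name__ == "__main__":
    for worker, (pus, gpus) in mpibind_sketch(toy_node(), nprocs=4).items():
        print(worker, pus, gpus)
```

With four processes on this toy node, each worker lands on one shared L2 cache, receives the four PUs beneath it, and is paired with the two GPUs of its NUMA domain.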
Implementing mpibind
There are three aspects involved in an implementation of mpibind. First, the topology of the machine: hardware components and their interconnections. These components may include NUMA memory, caches, sockets, cores, hardware threads, GPUs, and NVMs. This topology information can be retrieved using the hwloc package and other device-specific utilities such as nvidia-smi. Second, constructs to bind processes and threads. One can employ Linux utilities such as taskset or numactl to bind processes and OpenMP constructs to bind threads. Third, device-specific constructs to assign a device to a given process or thread.
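To make the second and third aspects concrete, the following Python sketch turns one worker's assigned processing units and GPUs into a launch specification. This is one possible realization under stated assumptions, not the implementation described here: it assumes the process is bound with the taskset utility, threads are bound through the standard OpenMP environment variables OMP_NUM_THREADS, OMP_PLACES, and OMP_PROC_BIND, and GPUs are selected with CUDA_VISIBLE_DEVICES. The function name launch_spec and the ./app command are hypothetical.

```python
def launch_spec(pus, gpus, nthreads=None, command=("./app",)):
    """Build (env, argv) to launch one process bound to the given resources.

    pus:      CPU ids (as numbered by the OS) assigned to this process
    gpus:     CUDA device indices assigned to this process
    nthreads: OpenMP threads per process, if the application is threaded
    """
    env = {}
    if gpus:
        # Device selection: expose only the assigned GPUs to the process.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    if nthreads:
        # Thread binding: one explicit OpenMP place per thread.
        env["OMP_NUM_THREADS"] = str(nthreads)
        env["OMP_PLACES"] = ",".join("{%d}" % p for p in pus[:nthreads])
        env["OMP_PROC_BIND"] = "true"
    # Process binding: run the command under taskset with a CPU list.
    cpu_list = ",".join(str(p) for p in pus)
    argv = ["taskset", "-c", cpu_list, *command]
    return env, argv


if __name__ == "__main__":
    env, argv = launch_spec([0, 1, 2, 3], [0, 1], nthreads=2)
    print(env)
    print(" ".join(argv))
```

The returned environment and argument list could be passed to subprocess.run by a launcher, with the environment merged into os.environ.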
ADVANTAGES AND LIMITATIONS
The goal of mpibind is to help application developers map their hybrid codes to heterogeneous architectures relieving them from having detailed knowledge of the architecture. Furthermore, no affinity considerations are needed when moving from one architecture to another. At the same time, the algorithm may not provide optimal performance for all applications but is expected to perform well on HPC applications without any user intervention.
Application transparency and portability
mpibind requires no application changes and it relieves the user from knowing low-level details of the hardware and its topology. The only information required is the number of processes per node. For hybrid codes, providing the number of threads per process is optional. If mpibind is implemented by the resource manager, no changes would be needed to launch the application. Finally, the mpibind algorithm is hardware agnostic. Architecture-specific details are abstracted in the topology discovery through libraries such as hwloc.
Efficiency and performance reproducibility
mpibind is designed to provide as much cache and memory as possible to each application worker. While this design point may work well for many applications, it may not be optimal for all applications. In addition, binding all workers results in increased performance reproducibility by limiting migrations by the OS and allowing better placement of system services.
CURRENT STATUS AND FUTURE WORK
The key strengths of mpibind are a primary focus on the memory hierarchy and an awareness of hybrid programming abstractions as well as heterogeneous devices. mpibind is being implemented on a number of systems including Intel's Xeon-based processors with and without GPUs and IBM's Power architecture with multiple GPUs and NICs. Ongoing work includes a cross-architecture empirical evaluation to quantitatively assess the impact of mpibind on parallel scientific applications with respect to performance and performance reproducibility.
