Abstract: Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and software-based network throttling.

INTRODUCTION
On today's high-performance computing (HPC) systems, standard communication mechanisms are synchronous (i.e., two-sided). This requires active participation and resource expenditures by both the source and destination nodes. On emerging exascale-class systems, synchronous communication is expected to become a prohibitive bottleneck [2]. Analytics, distributed services, and applications must take advantage of asynchronous techniques to effectively utilize future systems. As a result, it is crucial that we understand the behavior of asynchronous communication and its impact on the memory subsystem.
Remote direct memory access (RDMA), also called one-sided communication, is the underlying mechanism that facilitates asynchronous communication. RDMA allows a node's local memory to be read from or written to by a remote node without involvement of the target operating system or CPU. Such out-of-band communication incurs minimal direct overhead on the target machine. There are many attractive use cases for RDMA, such as overlapping application communication and computation phases, in-memory asynchronous checkpointing and in-situ analytics. Consider, for example, uncoordinated checkpoints that are staged in the memory of remote nodes before being moved to stable storage, as in SCR [23]. Memory traffic from a remote node writing a checkpoint will contend with the memory transactions of a local application. This contention between local operations and out-of-band network operations can cause significant disruptions to the memory subsystem. We refer to this phenomenon as network-induced memory contention (NiMC).
Further exacerbating the situation, many-core technologies will be a fundamental part of the exascale system design solution. However, a greater number of hardware threads means a greater demand for shared resources such as the network, main memory and cache. As shown in Fig. 1, while off-chip bandwidth has historically been increasing, per-core memory bandwidth has not been increasing as rapidly. Coupling these trends with the increased interest in one-sided communication, it becomes critical that we understand the potential application performance impact of NiMC. While researchers have identified NiMC in previous work [14], our work explores the topic in much greater detail, using machine learning to characterize and predict the impact of NiMC.
In this work, we studied the application performance impact of NiMC on a variety of hardware architectures, explored how we might use machine learning to predict its impact, and evaluated three candidate solutions. Results show that NiMC can dramatically reduce system performance at scale (increasing run time by 3X at scales of 8,192 processes). To address this challenge, an HPC system must choose between the software-based solutions (bandwidth throttling or core reservation). We show that sophisticated techniques from machine learning and commonly available hardware counters can predict the impact of NiMC. This prediction facilitates the selection of the most effective mitigation strategy.
The authors are with the Sandia National Laboratories, Albuquerque, NM 87123. E-mail: tgroves@lbl.gov, regrant@sandia.gov, agonzales@cs.unm.edu, dorian.arnold@emory.edu.
The specific contributions of this paper are: a quantification of the performance impact of NiMC for a range of systems and applications; a characterization of the system and application attributes that exacerbate NiMC, examining differences and similarities of these attributes across workloads; an evaluation of the efficacy of machine-learning to predict the impact of NiMC using commonly available performance counters. The remainder of this work is divided into five sections: Understanding NiMC (Section 2), Characterizing NiMC (Section 3), Detecting and Predicting NiMC (Section 4), Mitigating NiMC (Section 5) and Conclusions (Section 6).
In Section 2 we provide a basic background of RDMA and discuss existing work on memory contention. We then outline our methodology and characterize NiMC for eight different systems and seven workloads, showing widespread impact with potential for dramatic performance degradation at scale (Section 3). Following this characterization, we explore how we might (1) detect NiMC and (2) predict its impact using statistical techniques of machine learning with random forests (Section 4). This section includes its own sections of motivation, background and methodology specific to the use of machine learning. In Section 5 we demonstrate the effectiveness of three proposed solutions at scale with tests conducted on two large systems comprising a total of 160,000 core hours. Throughout these investigations, we present evidence that the slowdowns attributed to NiMC are not due to contention for compute or network resources. These results further motivate the need for techniques that can predict NiMC's impact so that we can dynamically choose the most effective solution. Lastly, we highlight the lessons learned and the overall impact of this work (Section 6).
UNDERSTANDING NIMC
Background
Network-Induced Memory Contention
Remote memory operations can induce memory contention in two ways: (1) RNICs (RDMA-enabled network interface controllers) producing memory traffic in offload networks; and (2) CPU-to-memory traffic when the CPU is used for onload network processing. Both of these paths are illustrated in Fig. 2. For onloaded RDMA, not all traffic necessarily flows through CPU cores before being placed in memory. However, programming the DMA engines on the RNIC requires CPU intervention and causes some data to be transferred from the RNIC to the core to facilitate DMA requests. While there is an ongoing debate [10] on the merits of onloaded versus offloaded networking, our work reveals that NiMC should be an important consideration in this debate.
As Fig. 1 illustrates, InfiniBand network bandwidth is now much greater than per-core memory bandwidth (similar trends apply to other varieties of HPC networks). For future exascale systems, this gap is expected to grow [20], compounding the memory contention caused by RDMA operations. Trends away from traditional BSP programming models toward finer-grained, asynchronous models that admit higher levels of concurrency will also lead to greater demands on the memory subsystem and the network. Lastly, other exascale requirements, for example application resilience, in-situ data analysis and uncertainty quantification, will generate additional local and remote memory traffic associated with activities only indirectly related to the application.
Fig. 1. A 10 year history of network and memory bandwidths. While total bandwidth has been increasing, per-core network and memory bandwidth have not significantly increased. Even with the recent development of stacked memory technologies, current per-core memory bandwidth sits near 2012 levels. The result is increased contention for the shared network and memory resources.
Fig. 2. An illustration of the data transfer path for NiMC. Note that the path for a fully offloaded networking approach does not involve the CPU, while the onloaded networking approach requires some CPU intervention to set up the data transfer.
RDMA traffic that causes NiMC has the potential to also impact application performance through congestion on the network fabric [32] (rather than memory). Therefore, we use methods of introducing NiMC that are realistic usage scenarios, but also ones that do not create additional contention on the network. Specifically, (1) the examination of NiMC impact on single node jobs (Section 3.3) does not utilize the network for application communication; and (2) in Section 5.2 we show that the observed slowdowns at scale are caused by NiMC on the node, rather than contention on the fabric.
Related Work
Memory contention has been documented on many platforms and tools have been developed to help understand its impacts [11]. Concerns about the ability of memory technologies to keep up with the bandwidth requirements of an increasing number of cores have been expressed [30]. In response, code developers heavily optimize their code for cache use [15], [18], [27]. Additional investigations examined the causes of memory contention [36] on modern multi-core systems, exploring the effects of core count and problem size.
Concerns over memory subsystem performance at extreme scale, particularly with the expected growth in core counts, have prompted investigations into memory bandwidth reductions [33] and contention [7]. Tiwari et al. [33] proposed a model for studying the anticipated reduction in per-core bandwidths expected in the exascale time frame by varying the memory frequency on a single node of the Gordon supercomputer at SDSC. While their model was motivated by a desire to explore the per-core memory bandwidth reduction for future extreme-scale systems, it was evaluated on a single compute node. As a result, the model does not account for any source of memory traffic from the network.
Casas and Bronevetsky's work [7] is the closest to this work in terms of memory contention studies. Like the Memory Bandit tool [11], they seek to create "memory bandwidth interference" to observe the impact on application performance. Unlike the Memory Bandit [11] work, Casas and Bronevetsky can purposefully perturb different levels of the cache hierarchy while introducing threads that create memory traffic unrelated to the executing application. They present a methodology for introducing main memory traffic with minimal cache impact for studying off-chip memory contention. Their observed worst-case application slowdown of 20-35 percent is in line with observations from single node tests in this work. This is expected, as their method of introducing interference for the application created main memory contention. However, the main difference between their work and ours is the source of the memory contention. While Casas and Bronevetsky introduce the contention from cores, they do not account for RDMA traffic.
While NiMC has been observed in previous studies [14], we provide a richer characterization of the subject.
Furthermore, this is the first work to consider how to predict the impact of memory contention due to asynchronous communication.
CHARACTERIZING NIMC
Methodology
In this section we outline our methodology for characterizing NiMC, detecting its presence, and predicting its impact on application performance.
The Workloads
Throughout this study we use a variety of workloads to evaluate the impact of NiMC. In the text below, we include brief descriptions of each.
The STREAM memory benchmark [21] performs a small set of memory benchmarking kernels (copy, sum, scale, triad) that perform a small number of reads, arithmetic operations and a write back to memory. We used these operations to measure the sustainable memory bandwidth and corresponding computation rates by working with data sets significantly larger than the available cache. We used the STREAM triad test, $a(i) = b(i) + q \times c(i)$, which is the most representative of a typical workload. In later sections we specify either STREAM-DRAM (the standard) or STREAM-cache (a modified version that uses smaller arrays designed to fit in last level cache).
CNS [8] is a "simple stencil-based proxy-app for computing the hyperbolic component of a time-explicit advance for the Compressible Navier-Stokes equations using 8th order finite differences and a 3rd order, low-storage TVD RK algorithm in time" [12]. CNS is intended to mimic the stencil operations of more realistic combustion applications; it does not mimic a typical problem found in combustion applications.
HPCCG [16] is an unstructured implicit finite element application, which calculates the conjugate gradient for a 3D chimney domain, running on an arbitrary number of processes. HPCCG creates a 27-point finite difference matrix, for which each MPI rank is designated a user-defined sub-block. This mini-app is generally considered to be memory-bandwidth bound, using from 25 to 75 percent of the total system memory. HPCCG is designed to provide excellent weak scaling.
LAMMPS [29], the Large-scale Atomic/Molecular Massively Parallel Simulator, is a molecular dynamics code modeling particles in different states. LAMMPS provides excellent weak scaling with the majority of communication occurring among nearest neighbors. In this work, we used a benchmark problem/data set to model the melt of a 3D Lennard-Jones system, using a weak scaled problem of a similar size to studies published by the authors of LAMMPS (32,000 atoms per core) [35].
LULESH [19] represents shock hydrodynamics code solving a simple Sedov blast problem, illustrating the behavior of such solvers in ALE3D. This proxy-app distributes the spatial domain onto a set of volumetric elements defined by a mesh, where each intersection of mesh lines represents a node. Within the LULESH proxy-app, there are a variety of kernels, of which some subset are memory bound. One constraint of LULESH is that it must run with a cubic number of MPI Ranks. Therefore, for our experiments we extracted additional parallelism by leveraging OpenMP (OMP) on any unused cores.
SNAP [38] models the performance of a modern discrete ordinates neutral particle transport application. This proxy-app does not employ any real physics in its calculations. Instead, SNAP produces the computational workload, memory requirements, and communication patterns that match the Los Alamos National Laboratory application, PARTISN. To distribute larger problem sizes, SNAP spatially decomposes its 3D mesh and maps it onto a 2D domain of MPI ranks. MPI ranks send and receive data following wave propagation, which limits the weak scaling of the proxy-app.
XSBench [34] is a proxy-app which represents the most significant kernel (85 percent of runtime) in a robust nuclear reactor core Monte Carlo particle (neutron) transport simulation. This variety of simulation can have significant data usage requirements and the proxy-app is considered to be memory-intensive. XSBench focuses on modeling intranode performance characteristics of OpenMC and is not intended to be run at scale, as communication is limited to a single reduction at the end of a run.
The Platforms
Our study used nine different platforms from the Sandia National Laboratories, University of New Mexico and the Texas Advanced Computing Center, consisting of a variety of architectures. We provide a concise description in Table 1.
For a subset of the machines (Westmere, Lisbon, and Piledriver-1600/1866), we performed our experiments with varied memory frequencies, allowing us to see the impact available memory bandwidth has on NiMC. Westmere and Lisbon required BIOS option changes, whereas the Piledriver systems are separate nodes with different memory modules.
All of the systems, other than Haswell-X2 and Sandy Bridge-X2-FDR-offload, utilize an InfiniBand QDR network with a maximum bandwidth of 32 Gbit/s. These two outliers use a QDR and FDR offload NIC, respectively. In Table 1, the network column signifies whether the NIC is an onload (on) or offload (off) NIC. The observed bandwidth of these systems varies with the physical network topology and the degree of contention. Our general observation is that larger production clusters (e.g., Sandy Bridge-X2-FDR-offload) tend to have a larger variability in observed bandwidth, since network resources are shared among multiple jobs which may compete for bandwidth. However, network fabric interference is outside the scope of our work. We refer the reader to [3] for further discussion of performance degradation due to nearby jobs.
Characterizing NiMC
Our approach to measuring the impact of NiMC is straightforward: for each application and hardware configuration under test, we injected remote memory operations into the compute node(s) and measured the resulting application perturbation by comparing application performance with and without RDMA traffic.
First, we used a memory-bandwidth benchmark to establish a baseline for application perturbation. These experiments were executed on multiple architectures to observe the NiMC impact for different hardware configurations. Our subsequent experiments then helped us to assess NiMC impact for real applications in single node and distributed contexts.
To inject RDMA operations we used ib_write_bw from the OpenFabrics Enterprise Distribution (OFED) Performance Tests [25] to generate remote network data streams. The OFED Performance Tests are a set of tests that use InfiniBand's user-level verbs API to measure IB performance. The ib_write_bw test uses RDMA writes to perform a series of operations between two connected nodes. As previously described, compared to traditional send/recv tests, the benefit of one-sided tests is that message delivery and synchronization are decoupled: after memory registration, writes do not significantly involve the CPU on the target node. We chose write operations because issuing read operations from the nodes running the applications could perturb the tests by using compute cores to create and issue the read requests. Write requests do not have this issue, as the source node bears the burden of creating and issuing the network requests. An illustration of the flow of traffic to an individual node is presented in Fig. 2.
Architectural Characterization
In our first NiMC experiments, we used the memorybound, synthetic benchmark, STREAM (Section 3.1.1). These experiments were used to assess NiMC impact for worst-case memory intensive applications and to evaluate what architectural features may affect the impact of NiMC.
STREAM used one OMP thread per core to saturate the available memory bandwidth. A first-touch memory allocation policy was used to optimize memory utilization within the NUMA hierarchy. Each experiment comprised 10 STREAM executions with 300 iterations of the triad kernel per execution, with each iteration taking a few milliseconds of walltime. The average sustained bandwidth was then calculated based on the average time to complete a triad operation.
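The measurement described above can be sketched in a few lines. This is a minimal illustration only: the function names `triad` and `sustained_bandwidth` are hypothetical, the array size and iteration count are arbitrary, and a pure-Python loop is far too slow to saturate a memory system, so the reported number shows how the figure is derived from the average triad time and should not be read as a real bandwidth measurement. It assumes 8-byte elements and counts three array accesses (two reads, one write) per element.

```python
import time

def triad(a, b, c, q):
    """STREAM triad kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

def sustained_bandwidth(n, iterations, bytes_per_element=8):
    """Return (average seconds per triad, bytes per second).

    Assumption: each triad touches three arrays of n elements
    (reads b and c, writes a), as in the usual STREAM accounting.
    """
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    q = 3.0
    start = time.perf_counter()
    for _ in range(iterations):
        triad(a, b, c, q)
    elapsed = time.perf_counter() - start
    avg_time = elapsed / iterations
    bytes_moved = 3 * bytes_per_element * n
    return avg_time, bytes_moved / avg_time

# Arbitrary illustrative sizes, not the configuration used in the study.
avg_time, bandwidth = sustained_bandwidth(n=100_000, iterations=5)
print(f"avg triad time: {avg_time:.6f} s, bandwidth: {bandwidth / 1e6:.1f} MB/s")
```

Running the same function with and without competing traffic and comparing the two bandwidth values gives the relative degradation.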
To measure the impact that NiMC has on memory bandwidth, we repeated the same set of experiments; this time, our origin node continuously injects 64 KiB of data (using ib_write_bw as described in Section 3.1) into a buffer allocated on the remote target node which was running the STREAM benchmark.
The results, shown in Table 2, illustrate varied NiMC behavior that is dependent on the underlying architecture. The performance degradation ranged from 0 to 60 percent. On all systems other than Xeon-Phi, Sandy Bridge-X2-FDR-offload and Haswell-X2, STREAM experienced significant (greater than 20 percent) memory bandwidth degradation due to NiMC. The Piledriver-1600 system and Sandy Bridge-X2-onload stand out by exhibiting a 60 and 51 percent performance degradation, respectively. The additional 9 percentage points of degradation on Piledriver-1600 are expected due to a higher network bandwidth to memory bandwidth ratio compared to the Sandy Bridge system. We observe that all three of the systems with variable memory frequency see a decreased impact from NiMC as available memory bandwidth increases.
Of greater interest, our Sandy Bridge systems demonstrated how NiMC might impact onload versus offload networks differently. Both systems utilize Sandy Bridge CPUs, but Sandy Bridge-X2-FDR-offload utilizes Mellanox offload network cards, whereas Sandy Bridge-X2-onload uses QLogic onload NICs. There are stark differences between the STREAM Triad results of these two machines. While the offload system sees no performance degradation, we see a 51 percent performance penalty to the onload system's Triad performance. We observe that the onload-NIC systems (Piledriver-1600/1866 and Sandy Bridge-X2-onload) are the most impacted by competing RDMA flows.
Examining the four offload NIC systems in isolation, we see a bi-modal impact of NiMC. When the CPU utilizes close to the theoretical memory bandwidth (as is the case for Westmere and Lisbon), competing RDMA traffic can degrade the Triad performance by 16-25 percent. Westmere and Lisbon show effective memory bandwidth of 98 and 93 percent of the theoretical memory bandwidth, respectively. This compares to Sandy Bridge and Haswell, which only achieve 74 and 85 percent effective memory bandwidth, respectively. The lower effective memory bandwidth in Sandy Bridge and Haswell leaves additional headroom for the RDMA traffic, so that we see no impact of NiMC. However, as chip designers look to increase the effective utilization of the memory subsystem in the future, we cannot rule out the resurgence of NiMC in offload systems.
For the Intel Xeon Phi (KNC) system, we observed a small (4 percent) memory bandwidth decrease. This is due partially to the Phi's unique memory architecture (as compared to traditional CPUs). For instance, all other systems under test had a single memory controller. The Phi system has eight controllers that control 16 channels of GDDR5 memory. Each core has 64 KB of L1 cache and a 512 KB fully coherent L2 cache, which are connected over a bi-directional ring interconnect. Additionally, the Phi runs its own operating system, requiring a dedicated core for OS services. By leaving this reserved core open, we may be leaving additional memory-bandwidth headroom into which the RDMA traffic can fit.
In summary, we saw that NiMC impacts a range of architectures spanning multiple vendors and hardware generations. Of importance, the NIC architecture (onload versus offload) appears to play a significant role determining the impact of NiMC, as every system utilizing an onload NIC saw significant degradation. Additionally, as one might expect, the results suggest that increased available memory bandwidth reduces NiMC interference, while decreased available memory bandwidth increases NiMC interference.
Workload Characterization
While STREAM illustrated NiMC effects for worst-case memory-bound applications, STREAM is not necessarily reflective of typical HPC applications. By mimicking the operations of a variety of important scientific problems, the DOE proxy applications (described in Section 3.1.1) are more accurate HPC workload representations.
We ran our workloads on single nodes to study NiMC effects in isolation from interference on the network fabric. The applications use Open MPI v1.8 for inter-process communication. Additionally, LULESH and XSBench use OMP for increased on-node parallelism. For these latter hybrid applications, we used the highest performing combination of MPI processes and OMP threads. 1 Once again, we measured application execution times with and without injected RDMA traffic. Reported results are the averages of 10 runs with error bars displaying the standard deviation. For the experiments in Sections 3.3 and 3.4, we used the Sandy Bridge-X2-onload system.
Table 3 shows the application time-to-solution results in the absence and presence of NiMC, and Fig. 3 shows the application performance slowdown due to NiMC. These results illustrate several interesting behaviors. First, five out of six proxy-apps observed significant performance penalties; CNS exhibited almost no performance degradation. Second, three proxy-apps exhibited a performance degradation within 30 percent of that seen in STREAM. This was unexpected because we selected STREAM as a worst-case indicator of NiMC, due to its intensive memory usage. Furthermore, our pre-experiment hypothesis was that memory-intensive proxy-apps, like HPCCG, would experience the most interference among the proxy-apps; however, HPCCG exhibited the second least relative interference. These results suggested that additional sources of contention, beyond the memory-channel bandwidth, may be influencing performance. Accordingly, our next set of experiments was aimed at uncovering these additional contention effects.
It can be hard to decipher precisely what is happening in a system with regard to NiMC. Often, to maintain competitive advantages, hardware vendors do not publicly share the details of components such as memory controllers or network drivers. While it can be difficult to determine NiMC root causes, we glean insights by profiling application activity as measured by available hardware counters. We use OpenSpeedShop (OSS) [31], which can provide counter collection and analysis for large parallel applications. Using OSS, we found that for all but CNS (the sole application that did not exhibit a NiMC-based performance degradation), with RDMA traffic, NiMC impacted one core more than the other cores.
To understand why this process was performing much slower under RDMA activity, we used OSS to measure the performance of the L1, L2 and L3 caches as well as the total number of stalled cycles in the presence and absence of RDMA traffic. With the exception of CNS, we saw increases to L1 cache misses and, to a lesser extent, increased L2 cache misses. Additionally, there was a significant increase in the number of stalled cycles in the presence of RDMA traffic. The relationship between cache misses and stalled cycles is consistent with Fig. 3 in that proxy-apps that have more reasonable cache efficiency experience the worst NiMC interference. At the same time, applications with poor cache utilization are less affected by additional cache misses since their cache efficiency is low to begin with. These results also help to explain why only one application process experienced significant performance degradation: the largest differences in cache behavior were observed at cache levels not shared by other cores.
Among the proxy-apps, CNS is an outlier in that it experiences almost no interference from NiMC. CNS demonstrates that the slowdown seen in other runs is not primarily due to CPU (core) perturbation, as CNS receives identical amounts of RDMA traffic and services it in the same manner as the rest of the benchmarks, but observes only a 1.6 percent slowdown from RDMA traffic processing overhead. This is because in CNS, each MPI process utilizes an extremely small amount of memory (approximately 4 MB). Such a small working set leaves space for both the one-sided RDMA and the proxy-app to effectively utilize cache. As a result, there is only a minor increase in the number of idle cycles as we add interference from the network.
As Fig. 2 illustrates, when a DMA transfer is serviced by an onload NIC, some amount of data is distributed throughout the cache hierarchy. This data takes up a larger proportion of space in L1 cache than L2 and L3, which is why we would see a greater impact on the performance at lower levels of the cache. When sending data synchronously, it makes sense that the application would want that information in cache so that the CPU may service communication events faster. However, when this data is sent asynchronously, we do not know when the application will require the written data (if it does at all). In the asynchronous case, loading the data into cache can be benign (as seen with CNS) or create significant bottlenecks (as seen in the other proxy-apps).
Our reasoning above presumes some relationships between the slowdown seen with RDMA traffic, activity in cache and CPU stalled cycles. To determine the strength of these relationships, we performed a correlation analysis between runtime, stalled cycles and cache related counters. We used Pearson's R for correlation, which measures linear correlation between two variables. The analysis results (Table 4) show that without additional RDMA traffic, the application runtimes are very strongly correlated to L1/L2/L3 misses but are not correlated with stalled cycles. Stalled cycles for the non-RDMA case do not meet required levels of significance to assert that a relationship exists. For the RDMA traffic case in Table 4, a very strong correlation exists between runtime and stalled cycles with a 95 percent certainty. We see that there is a very strong correlation between the stalled cycle count and the cache misses, particularly L3 misses, showing that the stalled cycle increase is almost certainly due to misses throughout the cache hierarchy and requests to main memory. This correlation, coupled with the large rise in stalled cycles that occurs when introducing RDMA traffic, leads us to conclude that the observed increase in runtime is due to time spent waiting for the memory subsystem.
Though cache pollution from RDMA is correlated with an increased number of stalled cycles, it is not necessarily the only factor contributing to NiMC. Other contributing factors can include: the policy and scheduling of the memory controller(s), such as open-page row-buffer management; the degree of concurrent operations the memory controller(s) can handle; the number of memory channels; and how these memory channels are written to, for example, ganged or unganged. To develop a general analytical model for the impact of NiMC, these factors would need to be considered.
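As a reminder of what the correlation analysis computes, the following is a minimal sketch of Pearson's R between two measurement series. The function name `pearson_r` and all sample values are hypothetical and are not taken from Table 4; the sketch also omits the significance test that accompanies the reported correlations.

```python
import math

def pearson_r(xs, ys):
    """Pearson's R: covariance of xs and ys divided by the product of
    their standard deviations. Raises if the series are mismatched or
    either series is constant (the coefficient is undefined then)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-application measurements (illustrative only).
runtime_s = [10.0, 12.0, 15.0, 19.0, 24.0]
stalled_cycles = [1.0e9, 1.3e9, 1.7e9, 2.2e9, 2.9e9]
l3_misses = [4.0e6, 5.5e6, 6.0e6, 9.0e6, 11.0e6]

print("runtime vs stalled cycles:", pearson_r(runtime_s, stalled_cycles))
print("stalled cycles vs L3 misses:", pearson_r(stalled_cycles, l3_misses))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate none. Whether a given value is significant depends on the sample size, which is why the text also reports a significance level.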
Multi-Node Characterization
From Section 3.3 we gained a better understanding of the underlying causes of application interference in the context of a single, isolated node targeted with a high volume of RDMA traffic. Our multi-node characterization examines NiMC in applications at scale with realistic RDMA traffic volumes.
As with our single node experiments, we use the Sandy Bridge-X2-onload cluster, executing a series of weak scaling LAMMPS experiments. We select LAMMPS for the large scale runs for several reasons. First, of all our workloads, LAMMPS is the only real application: it is neither a proxy app nor a simplified benchmark. Second, LAMMPS scales very well for the size of system under study. Finally and most importantly, LAMMPS is widely regarded as an application that is resistant to external interference [24]. Therefore, LAMMPS represents a good challenge when testing for a performance degradation due to external perturbations like RDMA traffic.
As in our previous experiments, in addition to the target nodes running our application, we reserve an additional set of origin nodes to push RDMA traffic to the target nodes. However, unlike in our previous experiments, we limit concurrent writes to a small subset of the total nodes. We also limit the duration of each write operation. We use a hypothetical uncoordinated, in-memory checkpointing protocol 2 as the motivation of our RDMA traffic pattern. Since there is no known optimal checkpoint interval for uncoordinated checkpointing, we use Daly's estimate [9] to derive optimal coordinated checkpoint intervals. We use Daly's estimate to compute the average number of processes simultaneously taking a checkpoint using a five year mean time to interrupt (MTTI), and use this number as the number of concurrent writers, shown in Table 5. We compute RDMA write duration by optimistically assuming that all checkpoints take 46,000 message iterations (equivalent to ≈1 s for 4X-QDR IB and ≈0.5 s for 4X-FDR IB). Each data point in Fig. 4 represents the minimum runtime of 5 runs. We chose the minimum because (1) the minimum is the hardest metric to overcome when showing the existence of NiMC at scale; and (2) it shows the impact of NiMC rather than the impact of contention on the network or I/O subsystem from other jobs that are outside of our control.
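To make the derivation of the concurrent-writer count more concrete, here is a rough sketch under stated assumptions. It uses the first-order form of Daly's estimate, tau = sqrt(2 * delta * M) - delta for delta < 2M (and tau = M otherwise), where delta is the checkpoint cost and M the mean time to interrupt. The duty-cycle model in `expected_concurrent_writers` (each process spends delta out of every tau + delta seconds writing, independently of the others) is an assumption made for illustration; the text does not spell out how Table 5 was computed, and the input values below are hypothetical and not intended to reproduce that table.

```python
import math

def daly_first_order_interval(checkpoint_cost_s, mtti_s):
    """First-order Daly estimate of the optimal checkpoint interval.

    tau = sqrt(2 * delta * M) - delta when delta < 2M, else tau = M.
    """
    if checkpoint_cost_s >= 2.0 * mtti_s:
        return mtti_s
    return math.sqrt(2.0 * checkpoint_cost_s * mtti_s) - checkpoint_cost_s

def expected_concurrent_writers(num_processes, checkpoint_cost_s, interval_s):
    """Assumed model: with uncoordinated checkpoints, each process is
    writing for a fraction delta / (tau + delta) of the time, so the
    expected number of simultaneous writers is N times that fraction."""
    duty_cycle = checkpoint_cost_s / (interval_s + checkpoint_cost_s)
    return num_processes * duty_cycle

# Hypothetical inputs chosen only to exercise the formulas.
delta_s = 1.0          # checkpoint (RDMA write) duration in seconds
mtti_s = 3600.0        # assumed system-level MTTI in seconds
processes = 8192

tau = daly_first_order_interval(delta_s, mtti_s)
writers = expected_concurrent_writers(processes, delta_s, tau)
print(f"interval: {tau:.1f} s, expected concurrent writers: {writers:.1f}")
```

Under this model, a shorter MTTI or a longer checkpoint duration shortens the interval and raises the fraction of processes writing at any instant.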
The results in Fig. 4 show that as we increase the number of application processes the impact of NiMC grows substantially. Despite the fact we decrease write duration to a single second, scaling up the number of application processes greatly amplifies the magnitude of interference. This is similar to phenomena seen in the research of OS noise, where scale amplifies the magnitude of the overall perturbance [17] , [28] . Even with a constant 0.2 percent of total processes as simultaneous RDMA writers, time-to-solution nearly doubles at scale due to NiMC. Though we have increased the pressure on the network by introducing additional RDMA writes, we show in Section 5.2 that this additional traffic is not responsible for the increase to runtimes.
DETECTING AND PREDICTING NIMC
Motivation
With evidence for NiMC's impact in the previous section, it is crucial that we are able to detect NiMC and predict the resulting performance degradation, so that we can deploy an effective mitigation strategy. One of the challenges of NiMC is that it is not easily detected, since it originates from RDMA communication. In previous sections we explicitly controlled the amount of RDMA traffic being injected on a target node. However, in a production environment the target node may not know how much RDMA traffic is being injected into its memory. And while the target NIC does have counters that can expose the total number of bytes received, it does not specify whether the data originated as an RDMA request. Furthermore, some of the uncore 3 counters that might provide some insight are often unavailable. For this reason, we need to explore alternative ways of detecting when NiMC is occurring and predicting its impact on application performance. Our approach to addressing these challenges applies machine learning (random forests) to a set of easily accessible performance counters. Specifically, we are interested in whether we can use random forests and simple performance counters to: 1) detect the presence of NiMC on onload NICs; 2) predict the volume of RDMA traffic; and 3) predict the impact on application CPU time. Beyond these three objectives, we also evaluated the relationship between each application and the 18 performance counters. We looked for any counters that were universally important across all applications when answering these questions. In several instances, the counters of highest importance (such as instruction cache misses) surprised us, which shows the value of an unbiased approach such as random forests, since as system experts we might have allowed our intuition to lead us to selecting features that were not as rich in information.
2. Contemporary approaches for in-memory checkpointing use a coordinated protocol in which all processes take a checkpoint simultaneously. However, for next generation systems there is a concern that coordination at massive scale can become prohibitively costly.
3. Closely related to the core, but not directly part of it, e.g., QPI or memory controller.
Random Forests
Random forests [6] are an ensemble method of tree predictors, such that a collection of classifying trees with randomly selected feature-vectors vote to select the most popular class. Building a forest (i.e., ensemble) rather than developing a single tree improves the robustness of the predictions. The benefit of random forests is that overfitting is not a concern, given a large enough number of trees in the forest [6] . Random forests provide internal estimates of the generalization error, classifier strength and dependence with out-of-bag estimates [5] . This internal method of validation is integrated into the algorithm, as trees are trained on a subset of the input and are then validated against the remaining data. In this study we are interested in measuring the importance of individual features (i.e., counters) in a set. Feature importance is reported by off-the-shelf random forest packages (Python scikit-learn [26] ). Furthermore, once a random forest has been built using the training data, further classification can be done efficiently in real-time.
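The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not our experimental pipeline: the three feature columns are hypothetical stand-ins for performance counters, and only the first is constructed to depend on the class label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 400

# Synthetic labels: 1 = RDMA traffic present, 0 = no RDMA traffic.
labels = rng.integers(0, 2, size=n_samples)

# Hypothetical counters. Only the first one shifts when the label is 1.
informative = rng.normal(loc=2.5e7, scale=2.0e5, size=n_samples) + labels * 1.0e7
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)

X = np.column_stack([informative, noise_a, noise_b])
names = ["counter_informative", "counter_noise_a", "counter_noise_b"]

# oob_score=True asks the forest to score each sample using only the trees
# whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, labels)

oob = forest.oob_score_
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
```

Here `oob` plays the role of the out-of-bag estimate and `ranked` lists the features from most to least important.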
Of particular relevance to our work, in [4] , Bhatele et al. used machine learning to identify sources of network congestion and their success inspired us to employ machine learning techniques in this work, though we are exploring different phenomena. In their work, they found that the ExtraTreesRegressor/Classifier package outperformed RandomForestRegressor/Classifier. We explored both packages and found only marginal differences in average cross validation scores, though our tests showed that for our data the RandomForest class outperforms the ExtraTrees class.
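A comparison of this kind can be set up as in the sketch below, which assumes scikit-learn and uses a synthetic dataset from `make_classification` in place of our counter data. The fold count and dataset parameters are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a feature set; not the counter data from this study.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Average accuracy over 5 cross-validation folds for each ensemble.
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in models.items()}
```

Which class scores higher depends on the data, so the comparison should be repeated on the counters of interest rather than inferred from a synthetic set.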
Methodology
In this section, we apply sophisticated techniques for prediction and analysis to the Sandy Bridge-X2-onload system and a selection of workloads. These techniques further characterize NiMC and answer the questions set forth in the beginning of this section.
We have increased the number of runs to 6,400 for each feature set and reduced the number of workloads in this section to a targeted subset, specifically: STREAM-DRAM, STREAM-cache, CNS, LAMMPS, and HPCCG. Each of these workloads was chosen to be representative of particular characteristics. STREAM-DRAM and its variant STREAM-cache were chosen because they are synthetic benchmarks that aggressively push the memory system and are easily reasoned about. The behavior of STREAM-cache is identical to STREAM-DRAM with the exception that we ensure that the data matrices fit inside the L3 cache, whereas STREAM-DRAM uses matrices of at least four times the size of last level cache. CNS was selected as a control, since it experiences almost no impact from NiMC. We chose LAMMPS since it was the most impacted and is a full application. HPCCG falls between the synthetic benchmarks and full applications as a proxy app having high memory utilization, but not as much as STREAM.
In Section 3.3, we used a limited number of runs to perform correlation analysis between a small set of counters. In this section we have expanded the counters from four to 17 (seen in Table 6 ). Alongside each feature name, Table 6 includes a description of the counter. We record each feature with respect to each process, which creates 6,400 samples per feature set (400 in the case of STREAM since it uses OpenMP). These features do not cover all the hardware counters PAPI provides, but they do represent the supported preset events on the SandyBridge-X2-onload system. For the system evaluated, we were only able to collect a subset of performance counters for any given run (exactly six non-derived counters), where a non-derived counter may be L2_DCM and a derived feature could be (L2_DCM/L2_TCA). This restriction is due to hardware constraints of the CPU, which determine the overall number of reportable counters, as well as incompatibilities among multiple counters. For this reason we divided the experiments up into three feature sets, such that each set represents a different selection of counters.
All of our experiments using random forests are performed with single-node runs of the application or benchmark, with a secondary node that writes RDMA traffic into the application node. We avoid multi-node application runs to eliminate outside sources of contention and noise that are known to impact system runtimes. This allows us to isolate the effects of NiMC and provides a clearer picture of what features are most indicative of a true NiMC perturbance. Each experiment runs 200 times for each feature set, with and without added RDMA traffic.
When building a random forest, the forest must contain enough trees to develop accurate predictions. We used 100 estimators 4 (trees in the forest), with separate runs for each feature set. That is, we ran the regression for each feature set in isolation. We use out of bag (OOB) samples to estimate the generalization error. The OOB score is the average error of observations from a sampled subset of observations. This can also be explained as the ratio of correct predictions over the total number of trials, where each observation is evaluated by trees not trained on that particular observation. This score is represented as a number between 0 and 1, with a larger number representing higher accuracy. If we wanted to combine feature sets from multiple runs so that the trees were built considering all 17 features, this would be possible by substituting in the average value of a feature for runs where it was not recorded. However, increasing the number of features increases the runtime of the learning algorithms, in that a greater number of trees must be constructed for accurate results. Additionally, given each feature set in isolation, our results were accurate enough (demonstrated by the OOB scores) that using averages was unnecessary.
Predicting the Presence of NiMC
In our first use of random forests, we evaluate whether the machine learning correctly classifies the RDMA traffic for a binary classification (RDMA/No RDMA). Specifically, we remove the IB_BW feature from each feature set, then create a binary classification, where any bandwidth greater than zero is labeled as RDMA and otherwise No RDMA. This binary vector is used for our target values.
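The construction of the target vector can be sketched as below. The record layout (one dictionary of counter values per process) and the numeric values are hypothetical and serve only to show the labeling rule: IB_BW is removed from the inputs and any bandwidth greater than zero is labeled as RDMA.

```python
def make_binary_dataset(samples, bw_key="IB_BW"):
    """Split per-process samples into input features and a binary target.

    The bandwidth feature is removed from the inputs and turned into the
    label: 1 (RDMA) if bandwidth > 0, otherwise 0 (No RDMA).
    """
    features, targets = [], []
    for sample in samples:
        targets.append(1 if sample[bw_key] > 0 else 0)
        features.append({k: v for k, v in sample.items() if k != bw_key})
    return features, targets


# Hypothetical records; the counter values are placeholders, not measurements.
samples = [
    {"L1_ICM": 2.5e7, "L1_DCM": 9.1e9, "IB_BW": 0.0},
    {"L1_ICM": 4.1e7, "L1_DCM": 9.2e9, "IB_BW": 2900.0},
]
features, targets = make_binary_dataset(samples)
```

Removing IB_BW from the inputs matters because the label is derived from it; leaving it in would let the classifier read the answer directly.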
As displayed in Table 7 , the OOB score for each feature set and application was excellent with the exception of CNS. CNS has an OOB score of 0.74, which is not particularly meaningful, since a random guess (a coin flip) would obtain an OOB score of 0.5. Therefore, while there are some features of value in the CNS results, they are not particularly reliable in their predictive power.
After determining the OOB score, we examine the importance of each feature in our predictions. Random forests have a measure of feature importance that is calculated differently depending on whether the forest is targeting classification or regression. In the case of classification this is commonly calculated by adding up the decrease in the Gini impurity criterion [6] for each individual variable over all trees in the forest. In the case of regression, minimizing the mean squared error is commonly used as the impurity measure to calculate importance. In this section, we use the importance measure to identify features that provide predictive power. Because of the volume of feature importances calculated, we have placed them in the appendices, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2017.2773483. Once we identify the most important feature in each set, we generate histograms (Figs. 5, 6, 7, 8, and 9) that show the per-process counter output for runs with and without added RDMA traffic. These figures provide a detailed picture of how the impact of NiMC is distributed unevenly across processes.
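To make the classification importance measure concrete, the sketch below computes the Gini impurity of a node and the impurity decrease produced by a single split. It is a simplified illustration: a library implementation additionally weights each node's decrease by the fraction of training samples reaching that node and sums the result per feature over all trees. The labels are toy values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))


parent = ["RDMA", "RDMA", "No RDMA", "No RDMA"]

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, ["RDMA", "RDMA"], ["No RDMA", "No RDMA"])

# A split that leaves both children as mixed as the parent removes none.
useless = impurity_decrease(parent, ["RDMA", "No RDMA"], ["RDMA", "No RDMA"])
```

A feature that repeatedly yields splits like `perfect` accumulates a high importance, while one that yields splits like `useless` accumulates almost none.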
Analysis of STREAM-DRAM
Examining the results of feature set 1, L1 instruction cache miss count was a particularly important feature. If we look at the histograms of runs with and without NiMC, we see stark differences in the distribution and values. Without NiMC we see in Fig. 5a that the vast majority of runs (95 percent) have between 2.5 × 10^7 and 2.6 × 10^7 misses. With added RDMA writes, Fig. 5a shows that the number of misses goes up substantially (in some cases nearly doubling) from 3.0 × 10^7 to 4.3 × 10^7 misses. Interestingly, though the number of L1 data cache misses is several orders of magnitude greater, the feature provides little information, since it does not change significantly with the addition of NiMC. This is largely due to the fact that STREAM is designed to marginalize the cache by using arrays several times bigger than could fit into last level cache. Because the overall data cache miss rate is so high to begin with, the added misses due to NiMC contain relatively little information when compared to instruction cache misses. From the third feature set, L3_ICA was the most important feature. Comparing the median runs with and without RDMA, in the presence of NiMC there is a 12 percent increase to L3_ICA.
Analysis of STREAM-CACHE
As a consequence of using smaller matrix sizes, L1_DCM becomes an important feature in the machine learning. While we can clearly distinguish the two distributions in Fig. 6a , in reality the difference between the two distributions is less than 1 percent of the L1_DCM. Even though STREAM-cache has poor cache utilization and a large number of L1_DCM, this shows that as we increase the application's cache efficiency, we begin to see a shift in the importance from instruction cache to data cache.
Analysis of HPCCG and LAMMPS
In contrast with the STREAM runs, data cache misses and total cache accesses are important features for HPCCG and LAMMPS. This makes sense given that the rate of data cache accesses drops dramatically in HPCCG and LAMMPS compared to STREAM. Looking closer at the results of HPCCG and LAMMPS in Figs. 7a-9c, there are some interesting results. Specifically, for the vast majority of processes, data cache misses and total cache accesses decrease as we add competing RDMA traffic. It is important to remember that these measurements are the sum over a run and not a rate of misses or accesses over time. In other words, we see a decrease in total cache misses and accesses as the CPU time increases for these two applications.
This is an example of how machine learning can point us in directions that we might not explore otherwise. Once we know where to look, human intelligence and expert knowledge can provide a deeper understanding of the environment. The reason for the observed decrease in L1_misses can be explained by a slowdown which causes a decrease to the rate that operations are issued. In other words, the L1 cache access per unit time is decreasing, which allows for better hit-to-miss ratios in the L1 cache. Better hit to miss ratios at the L1 create less pressure on L2 and L3 caches, resulting in both fewer accesses and fewer misses at higher levels in the cache. Looking at Figs. 7a-9c, we see a small number of outliers in both misses and CPU time. By manually examining the individual runs, we confirmed these outliers are often just a single process of a run, which creates an intra-run imbalance. At points of synchronization and communication faster processes are forced to wait for the completion of the slower process. This waiting period provides the faster processes an opportunity to satisfy any outstanding requests and clear the pipeline. When the slowest process reaches the synchronization point the majority of processes are able to take advantage of the cleared pipelines and achieve better cache efficiency until they fill up.
The next logical question is why this behavior is not observed in STREAM. We are unable to answer this with absolute certainty, but one idea is that this is because STREAM is designed to miss cache. Therefore, any benefit of clearing the processor's pipeline at a synchronization point is negated as the empty pipeline rapidly fills up waiting on outstanding cache misses. After accounting for CPU time, we found that STREAM-DRAM and STREAM-cache see an 8X increase to L1_DCM per unit time, compared to HPCCG.
Outcomes
For each of the applications other than CNS, the classifier had an out-of-bag score of 0.995 or higher, meaning that if NiMC is present in a system with onload NICs, we can accurately predict the presence of NiMC using any of the three feature sets evaluated 99.5 out of 100 times.
Additionally, we saw a breadth of features within each set were used to predict the presence of NiMC. This exemplifies how machine learning and expert knowledge may be leveraged in future systems, since the best set of features appears to vary, dependent on the behavior of the application.
Predicting Volume of RDMA Traffic
Next, we wanted to determine whether we could predict the amount of RDMA traffic with more detail than just a binary classification. To do so, we used random forest regression rather than random forest classifiers on the same data as before, calculating both the coefficient of determination (R²) and OOB scores. While R² suggested some success, this was misleading and contradicted low OOB scores. The reason for this discrepancy is that, given a predicted value of ŷ and a mean value of ȳ, R² is calculated as

$$R^2 = 1 - \frac{u}{v}, \qquad u = \sum_i (y_i - \hat{y}_i)^2, \qquad v = \sum_i (y_i - \bar{y})^2.$$

Because the distribution of IB_BW is often bi-modal, the use of the average in v is particularly poor, creating an inflated score that is not really meaningful. The OOB score does not suffer from this and for the purposes of our work is more informative.
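The inflation can be reproduced with a small worked example. The sketch below uses the standard definition of the coefficient of determination, R² = 1 - u/v, with u the sum of squared prediction errors and v the sum of squared deviations from the mean. The bandwidth values are synthetic: half of the samples have no RDMA traffic, and the predictor only guesses the center of each mode.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - u / v."""
    mean_y = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared prediction error
    v = sum((t - mean_y) ** 2 for t in y_true)             # squared deviation from the mean
    return 1.0 - u / v


# Synthetic bi-modal bandwidths (MBps): one mode at zero, one spread around 2750.
y_true = [0.0, 0.0, 0.0, 0.0, 2000.0, 2500.0, 3000.0, 3500.0]

# A predictor that identifies the mode but has no skill within the RDMA mode.
y_pred = [0.0, 0.0, 0.0, 0.0, 2750.0, 2750.0, 2750.0, 2750.0]

overall = r_squared(y_true, y_pred)

# Restricting attention to the RDMA mode exposes the lack of predictive detail.
within_mode = r_squared(y_true[4:], y_pred[4:])
```

Because the global mean sits between the two modes, v is dominated by the gap between them, so `overall` is high even though `within_mode` shows that the volume of traffic is not being predicted at all.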
Unfortunately, the results indicated that none of the feature sets were informative enough to precisely predict the volume of RDMA traffic in the presence of NiMC. Part of the reason may be that the throughput of the RDMA traffic injected onto the target node is also dependent on contention in the network and may not be fully represented by the features recorded. Because SandyBridge-X2-onload is a shared system, the distribution of throughput varied across the runs depending on the competing workloads. While our preset counters did not provide the necessary information, this doesn't rule out the other performance counters that might be explored in future work. These might include PAPI native events, which are more numerous than the PAPI preset events, but are accessed via the low-level PAPI interface.
Predicting Runtime Impact of NiMC
The goal of this section is to predict the CPU time of each process for the set of applications and benchmarks with and without NiMC. Again we utilized random forest regression. Our target vector is the CPU time recorded by OpenSpeedShop, while the training input samples are comprised of the features in Table 6 . Examining the results, we found that each feature set was able to provide quite accurate predictions of CPU time (Table 8 ). The OOB prediction across the different feature sets and applications ranged from 0.966 to 0.997, with the worst (although still good) score being for CNS using feature set 3. This is likely due to the fact that CNS has a small working set of memory and a relatively small number of L3 misses compared to other applications. Furthermore, the results suggested that each feature set was capable of accurate predictions of the CPU time.
While the OOB scores were quite high, the IB_BW feature dominated in importance, providing little information about the significance of the other hardware counters. Furthermore, we only have knowledge of the volume of RDMA traffic injected because we introduced it in a controlled experiment. In a more realistic approach this information would need to be shared with the remote (application) node, incurring a significant delay due to latency. In response, we removed the IB_BW feature from the training set and ran the experiments again. What we found is that even when the IB_BW feature was removed, we maintained very good predictions (Table 8 ). The largest decreases to the OOB score were around -0.03 (e.g., 0.997 to 0.967). This suggests that while the IB_BW feature is valuable, it provides information that can be gained through the other performance hardware counters (features) to accurately predict CPU time.
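The ablation described above can be sketched as follows, assuming scikit-learn. The data are synthetic and constructed so that one hypothetical counter is correlated with the injected bandwidth; the coefficients and noise levels are arbitrary and do not reflect our measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic bandwidth: roughly half of the samples have no RDMA traffic.
ib_bw = rng.uniform(0.0, 3000.0, size=n_samples) * rng.integers(0, 2, size=n_samples)

# Hypothetical counter that rises with RDMA bandwidth, plus an unrelated one.
counter = 2.5e7 + 5.0e3 * ib_bw + rng.normal(scale=2.0e5, size=n_samples)
unrelated = rng.normal(size=n_samples)

# Synthetic CPU time that degrades with RDMA bandwidth.
cpu_time = 100.0 + 0.02 * ib_bw + rng.normal(scale=0.5, size=n_samples)


def oob_r2(features):
    """Fit a 100-tree regression forest and return its out-of-bag score."""
    forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    forest.fit(features, cpu_time)
    return forest.oob_score_


oob_with = oob_r2(np.column_stack([ib_bw, counter, unrelated]))
oob_without = oob_r2(np.column_stack([counter, unrelated]))
```

When another counter carries most of the information in IB_BW, as it does by construction here, dropping IB_BW costs little in out-of-bag score.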
In the following paragraphs we use the rankings of feature importance as a tool to further analyze the results and try to understand why a particular performance counter was ranked higher than others. For brevity, we focus on the features that were important in predicting CPU time and that were not mentioned in the previous section.
Analysis of STREAM-DRAM
While many of the features that were important in predicting the presence of NiMC are important in predicting the CPU time, there are some differences. This is expected given that NiMC is only one of many components that determines the overall runtime of an application. From the second feature set, the derived fraction of L2_DCM over L2_TCA is an important feature in predicting CPU time that was valued much less in detecting the presence of NiMC. In general, it is not surprising that a derived metric of L2_efficiency is valuable in determining the performance of STREAM. The histogram for this feature (Fig. 9a) shows that in the presence of NiMC, the STREAM-DRAM distribution of L2 cache efficiency changes from a distribution that looks normal to a distribution that is much harder to characterize. Specifically the feature shows that some worst case runs with NiMC achieve a ratio of just over 0.60 compared to 0.56 without RDMA traffic.
Analysis of STREAM-CACHE
When examining the results for STREAM-CACHE, we found the features that were important in Section 4.4 grew in importance, though the ordering of features did not change. Specifically, when predicting CPU time we found that the random forest assigned much greater importance to L1_DCM, L2_ICH, and L3_ICA (more than double in the case of L3_ICA) than when predicting NiMC.
Analysis of CNS
CNS is different from the other applications because the added RDMA traffic has negligible impact on it. In fact, IB_BW is rated as the least important feature in determining CPU time. This matches our intuition since we are using CNS primarily as a control-case in our experiments. For these reasons, the rankings of importance with respect to predicting CPU time of CNS provide little insight about NiMC.
Analysis of HPCCG
When comparing the feature importance of HPCCG to predict NiMC versus predicting CPU time, there is a change in the rankings of feature set 2. We see that L2_ICH becomes a very important feature in predicting CPU time, whereas it was the 5th most important feature when detecting NiMC. Looking at the histogram in Fig. 9b , we see the familiar distribution where a small number of processes fall well outside the median. In the worst case, some processes saw up to a 7X increase in the number of L2_ICH. One possible reason for this could be additional instructions issued by the drivers for the onload NIC.
Analysis of LAMMPS
When reviewing the results from LAMMPS, the most important features from Sets 1, 2 and 3 are L1_DCM, L2_ICH, and L3_DCA, respectively. Within feature sets 1 and 2, the importance becomes heavily weighted towards a single feature. Of the applications we ran, with the exception of CNS, LAMMPS has the best cache utilization, such that if there is an increase to L2 misses that becomes an L3 access, it is more noticeable than it would be in STREAM or HPCCG.
Outcomes
Several conclusions can be drawn from these experiments. First, if we exclude IB_BW, no single feature is universally important. For example, hardware events like L1_DCM were very important in determining the CPU time of STREAM-cache, HPCCG and LAMMPS, but less so when used to predict the CPU time of STREAM-DRAM and CNS. There tends to be some overlap in importance among similar applications and machine learning facilitates an understanding that complements expert knowledge. We did see that IB_BW was a valuable feature, however extracting it from the system is not straightforward. It is possible that it could be gleaned from PCI-e counters however these are not readily available. We believe it would be a great benefit to the community if hardware vendors would make these sorts of counters easily accessible. Another alternative is that the NICs or hardware drivers could report this information, but this requires additional development effort from the vendors. A third possibility is that whatever service is pushing data into the remote system could report it to the relevant applications, however this may negate much of the benefit of one-sided communication.
Second, for some applications a majority of processes saw decreases in cache misses as a result of running slower. Because of the black-box design of most hardware, we can only speculate on the reasons behind this. We can generalize that this behavior is limited to applications that have better cache utilization than the benchmarks which are designed to induce cache misses. Furthermore, we saw that the processes that see a reduction in cache misses spend a significant amount of time waiting on processes delayed by NiMC.
Third, it is apparent that some processes are disproportionately impacted by NiMC. Due to the synchronization points and barriers of today's BSP-style programs these stragglers delay all other processes. In other words, the fastest processes are limited by the slowest. This is further magnified as the number of processes increases. It would seem that the problem of NiMC is largely an artifact of BSP-style programs and that we might expect it to become less prevalent in future asynchronous applications. However, this is not entirely true. A closer examination of our results shows that even the fastest CPU times of processes for the applications are slowed down significantly. Part of this is a result of shared hardware resources between the cores, particularly last level cache. While it is likely that NiMC will not impact an asynchronous program as much as current BSP programs, we expect it to remain an issue in future systems.
From a high-level viewpoint, our experience with machine learning shows that it is an effective technique that, once trained, will allow us to respond to NiMC in a timely manner. Additionally, this work shows the value of machine learning in HPC, as we have successfully used these techniques to develop a deeper understanding of the relationship between performance counters, applications and NiMC. In several instances, the features of highest importance surprised us, which shows the value of a principled approach such as random forests, since as system experts we might have allowed our intuition to lead us to selecting features that were not as rich in information.
Random forests allow us to predict the impact NiMC has on an application. If vendors and systems developers do not provide a mechanism exposing the amount of RDMA traffic in a system, future systems will need to leverage statistical techniques to infer the impact indirectly. Whether explicitly provided or inferred, the information regarding RDMA will be crucial to enact the best possible solution. In the next section we evaluate the effectiveness of three potential hardware and software based mitigation strategies.
MITIGATING NIMC
We evaluated three approaches for reducing the impact of NiMC: (1) offload hardware, (2) core reservation, and (3) software based network throttling. All of these techniques have been applied in other areas of research [1] , [10] , [13] , [22] but this work is the first to evaluate their effectiveness in mitigating NiMC. For the results in Figs. 10a and 10c we present the best (minimum) baseline LAMMPS runtimes and the median runtimes for NiMC. In our results, the difference between the minimum and the median runs was negligible. As in Section 3.4, the volume of RDMA traffic is calculated using Daly's estimate (Table 5) .
Offload NICs as a Solution
In Section 3.2, we showed that NiMC did not negatively impact the performance of the most recent offload systems for the benchmark STREAM. In this section, we show that offload NICs continue to provide a solution to NiMC at scale for real applications. In Fig. 10a , we show results of LAMMPS on the Sandy Bridge-X2-FDR-offload system, weak scaling up to 8,192 processes. Comparing the results of No RDMA and Daly, it is evident that NiMC does not have any observable impact on offload system performance at scales of up to 8,192 processes. From these results, we conclude that offload NICs provide a solution for modern systems that would otherwise experience NiMC. The main drawback to offload NICs is their greater monetary cost compared to their onload equivalents. However, this is not a guarantee that future systems will be unaffected by NiMC as the disparity between network and memory bandwidth shrinks. Even though none of the most recent offload systems were impacted by NiMC, on slightly older systems (Westmere and Lisbon) we observed a 16-25 percent decrease to STREAM Triad performance due to NiMC. We believe the future viability of an offload solution is dependent on how fully CPUs utilize memory bandwidth and on future network bandwidth increases.
Core Reservation
Dedicating a core to service communication is another possible solution to mitigate NiMC. This reduces the memory throughput of the CPUs, and sets aside separate cache resources for handling network data. The downside to core reservation is that, in the absence of RDMA communication, the reserved core's computational power is wasted. To test the effectiveness of core reservation, we repeated the STREAM and LAMMPS tests described in Sections 3.2 and 3.4 while reserving one core per node to process communication. 5 Additionally, we evaluated core reservation for STREAM-cache, which uses array sizes designed to fit entirely in last level cache (LLC). The modified STREAM allows us to evaluate LLC performance with respect to NiMC. Both of these tests can be seen in Figs. 10b and 10c . Fig. 10b shows how STREAM-CACHE and STREAM-DRAM performance is impacted as we increase the volume of RDMA writes. The two flat lines represent runs where a single core was reserved and the downward sloping lines show runs utilizing all 16 cores. Interestingly, there is an intersection near 400-500 MBps on the x-axis, where a core reservation strategy begins to provide a performance benefit. As the RDMA bandwidth increases towards 3000 MBps we see performance gains by setting aside a core. This suggests that a dynamic strategy for reserving a core to service the network may be an attractive approach for future systems. This decision could be made on a live system by using a random forest to predict the impact of NiMC and determine whether core reservation is necessary.
In Fig. 10c , we repeat the experiments of Section 3.4 on Sandy Bridge-X2-onload; however, we only utilize N-1 cores per node for the application. These results show that core reservation continues to be an effective strategy to prevent NiMC at scale, independent of the volume of RDMA traffic. These results clearly demonstrate that contention on the network is not a factor in the increase to runtime seen in Section 3.4, Fig. 4 . This can be observed in Fig. 10c , where the cases with and without RDMA traffic have only a 0.1 percent difference in runtime, despite contention on the fabric that would be present in the RDMA results. This allows for a quantification of the induced network contention due to the RDMA streams, which is much less than the observed contention due to NiMC. Of minor note, there is a 10 percent increase to the runtime of the runs that utilize 480 processes compared to the runs of 60 and 960 processes. After discussion with LAMMPS developers it was determined to be the result of increased communication due to a less efficient domain decomposition for 15 cores per node. Overall, comparing 16-core to 15-core runs of LAMMPS without RDMA traffic, the results suggest that core reservation incurs a runtime increase of 6.4 percent. While a 6.4 percent increase to runtime is costly, it costs significantly less than the 330 percent increase without core reservation (Fig. 4) .
Software-Based Solutions
There are several methods for throttling or shaping the traffic sent over the network. One such method is to artificially throttle the throughput of the network so that the same total volume of traffic is sent over a longer time period. The practicality of throttling depends in part on the significance of the network data to the application. Traffic throttling increases the time to deliver the data, but makes more memory bandwidth available to the CPU. If the network data is part of the application's critical path and the application is executing faster (due to the increase in available memory bandwidth), throttling may leave the application stalled. In Fig. 10b we plot the impact that varying network speeds have on STREAM DRAM and LLC performance. This experiment reduced RDMA throughput by decreasing message rate while leaving message size constant. Additional experiments varying message size (rather than rate) induced similar behavior and suggest that total RDMA throughput is the most significant factor in determining NiMC's impact. Specifically, increasing message rate without increasing throughput does not result in additional performance degradation for the systems evaluated. Least-squares linear regression of the results in Fig. 10b shows that every bit per second (bps) of RDMA bandwidth added reduces DRAM bandwidth by 14 bps. For LLC, this trade-off is even more expensive: 1 bps of RDMA bandwidth reduces LLC bandwidth by 22 bps. On modern HPC systems, where memory performance is a highly valued commodity, the large disparity of this trade-off makes network throttling less appealing as a solution compared to hardware offloading. For the system evaluated, software-based throttling is only effective if network traffic can be reduced to under 500 MBps. If an RDMA service requires more than 500 MBps of network bandwidth, core reservation becomes a better solution.
Fig. 10. Fig. 10a highlights the performance of offload cards at scale for the application LAMMPS. Fig. 10b shows the relationship between RDMA traffic and available memory bandwidth in DRAM and LLC using N and N-1 cores. Every bit per second of RDMA traffic reduces DRAM and LLC bandwidth by 14 and 22 bits per second, respectively. Fig. 10c demonstrates a core-reservation solution at scale for the application LAMMPS.
5. Core 0, which is where the QIB driver is bound
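To make the Fig. 10b regression result concrete, the following sketch applies the reported slopes (14 bps of DRAM bandwidth and 22 bps of LLC bandwidth lost per 1 bps of RDMA traffic) as a simple linear model. It assumes the linear relationship holds across the throughput range of interest, which is reported only for the evaluated system; the function and constant names are illustrative.

```python
# Slopes reported for the evaluated system: memory bandwidth lost per unit
# of RDMA throughput. Other platforms would need their own measurements.
DRAM_LOSS_PER_RDMA = 14.0
LLC_LOSS_PER_RDMA = 22.0


def memory_bandwidth_lost(rdma_throughput: float) -> dict:
    """Estimate memory bandwidth lost to NiMC under a linear model.

    rdma_throughput may use any bandwidth unit; results use the same unit.
    """
    if rdma_throughput < 0:
        raise ValueError("throughput must be non-negative")
    return {
        "dram": DRAM_LOSS_PER_RDMA * rdma_throughput,
        "llc": LLC_LOSS_PER_RDMA * rdma_throughput,
    }


def max_rdma_for_budget(loss_budget: float,
                        slope: float = DRAM_LOSS_PER_RDMA) -> float:
    """Largest RDMA throughput whose estimated loss stays within loss_budget."""
    if loss_budget < 0 or slope <= 0:
        raise ValueError("loss_budget must be >= 0 and slope must be > 0")
    return loss_budget / slope
```

For example, `memory_bandwidth_lost(100.0)` estimates a loss of 1400 units of DRAM bandwidth and 2200 units of LLC bandwidth, which illustrates why even modest RDMA rates are costly under this model.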
Summary of Solutions
Our results show that a hardware offload solution is ideal for modern systems, though (as Westmere and Lisbon demonstrate) future effectiveness depends on trends in CPU memory bandwidth utilization and network speeds. However, many systems utilize lower-cost onload NICs. For these systems there are two approaches to mitigating NiMC. The first, throttling the network, is limited in its effectiveness: there is a crossover point where the amount of network traffic becomes large enough that core reservation becomes a better solution. This crossover point will vary from system to system but was 500 MBps for our evaluated platforms. The second, core reservation, allows for unthrottled RDMA bandwidth but at a base-level increase to runtime of 6.4 percent. Though we provide general guidelines, determining the best solution for each system requires knowledge of the application, services, and underlying hardware. As we have demonstrated, machine learning allows us to predict the impact of NiMC on application runtime. These predictions enable applications or the runtime system to select the best strategy in a dynamic solution space.
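The guidelines in this summary can be expressed as a small decision rule. This is a sketch under stated assumptions: the 500 MBps crossover and the 6.4 percent penalty are the values reported for the evaluated platforms and would need to be re-measured elsewhere, and the enum, function, and parameter names are hypothetical.

```python
from enum import Enum


class Mitigation(Enum):
    HARDWARE_OFFLOAD = "hardware offload"
    NETWORK_THROTTLING = "network throttling"
    CORE_RESERVATION = "core reservation"


# Values reported for the evaluated platforms; other systems will differ.
CROSSOVER_MBPS = 500.0
CORE_RESERVATION_PENALTY = 0.064  # base-level runtime increase


def choose_mitigation(required_rdma_mbps: float,
                      has_offload_nic: bool,
                      crossover_mbps: float = CROSSOVER_MBPS) -> Mitigation:
    """Pick a mitigation strategy from the summary's general guidelines."""
    if required_rdma_mbps < 0:
        raise ValueError("required_rdma_mbps must be non-negative")
    if has_offload_nic:
        # Offload is described as ideal on modern systems.
        return Mitigation.HARDWARE_OFFLOAD
    if required_rdma_mbps < crossover_mbps:
        # Throttling is only effective below the crossover point.
        return Mitigation.NETWORK_THROTTLING
    # At or above the crossover, accept the fixed core-reservation penalty.
    return Mitigation.CORE_RESERVATION
```

A runtime system could replace the fixed threshold with a per-application prediction of NiMC's impact, but that extension is not shown here.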
CONCLUSIONS
In this work, we discussed the concept of NiMC and demonstrated its impact for a variety of current HPC system architectures. We showed that NiMC is a concern for both onloaded and offloaded networking hardware, with the onloaded hardware observing the largest performance impact. For all but one of our applications, we observed significant performance impact on modern onload systems due to NiMC, and we ruled out CPU contention and network contention as significant contributors to the observed impact. We explored several causes of NiMC; these results will inform future system design, both regarding the existence of NiMC and the potential methods to reduce its performance impact. Using real applications at large scale, we showed that NiMC can lead to significant performance degradation even in applications like LAMMPS that have previously demonstrated performance robustness in the presence of other types of system noise.
Our work examined how to detect NiMC and predict its impact on workloads by using random forests. This groundwork allows us to dynamically select the most effective mitigation strategy based on the performance degradation NiMC introduces. In this research we found that no single counter was universally important to all workloads, validating the use of more sophisticated machine learning methods. We demonstrated how random forests could effectively (and without expert bias) determine which counters were important to predicting NiMC's impact on a particular application. In addition, this study offers evidence to system software and application developers that NiMC should be taken into account when developing software that may be co-located with other applications or services that consume network bandwidth, even for small durations of time.
While this work only examined InfiniBand-based networks, such networks are relevant for HPC, as evidenced by the recent $325 million CORAL procurement [37], a 150 peta-FLOP IB-based system. Lastly, we evaluated three strategies to mitigate NiMC, namely offload NICs, core reservation, and network throttling. Our results suggest that offload NICs provide the best, albeit most expensive, solution to NiMC on modern systems, provided there is sufficient headroom between theoretical and observed CPU memory bandwidth. For clusters that utilize an onload NIC, setting aside a CPU core to service the network is a viable solution, but incurs a runtime penalty proportionate to the power of the core (6.4 percent in our study). The current disparity in the way memory bandwidth is provisioned between RDMA and the CPU makes the third solution (network throttling) attractive only if the required RDMA bandwidth is below specified thresholds (Fig. 10b).
Taylor Liles Groves received the BS degree from Texas State University, in 2009 and the MS and PhD degrees in computer science from the University of New Mexico, in 2012 and 2017, respectively. He spent three years in Sandia Laboratories as a graduate student researcher and is now an architecture and performance engineer in Berkeley Labs-NERSC. His research interests include networks, monitoring, modeling, and evaluation. He is a member of the IEEE and the ACM.
Ryan E. Grant received the BSc, MScE, and PhD degrees in computer engineering from Queens University. He is a senior member of Technical Staff in Sandia National Laboratories in Albuquerque, New Mexico and an assistant research professor with the University of New Mexico, Albuquerque, New Mexico. His research interests include high performance networking, HPC, and power-aware computing. He has authored more than 40 publications. He is a member of the IEEE and the ACM.
Aaron Gonzales received the BS degree in psychology and the MS degree in computer science from the University of New Mexico, in 2010 and 2016, respectively. He is a data scientist with Twitter. His research interests include deep learning for investigating sequences of unstructured or semi-structured data.
Dorian Arnold received the BS degree in mathematics and computer science from Regis University, Denver, Colorado, the MS degree in computer science from the University of Tennessee, and the PhD degree in computer sciences from the University of Wisconsin. He is an associate professor in the Department of Mathematics and Computer Science, Emory University. His research interests include high-performance computing and large scale distributed systems focusing on scalable middleware, run-time data analysis, fault-tolerance, and HPC tools. He has authored more than 50 peer-reviewed research articles and papers. He is a senior member of the IEEE.
