The increasing complexity of HPC systems has introduced new sources of variability, which can contribute to signi cant di erences in run-to-run performance of applications. With components at various levels of the system contributing variability, application developers and system users are now faced with the di cult task of running and tuning their applications in an environment where runto-run performance measurements can vary by as much as a factor of two to three. In this study, we classify, quantify, and present ways to mitigate the sources of run-to-run variability on Cray XC systems with Intel Xeon Phi processors and a dragon y interconnect. We further demonstrate that the code-tuning performance observed in a variability-mitigating environment correlates with the performance observed in production running conditions.
INTRODUCTION
On many HPC systems, the performance of benchmarks and applications is subject to signi cant variation when run in the typical production mode [16] . This variability can make HPC systems less productive and performance measurements less reliable in a variety of ways [11] . Systems and applications are often evaluated in terms of their performance characteristics under optimal conditions, and in the presence of variability, systems are less productive than their generally best-case performance numbers would suggest, as any variability would represent a reduction in performance from the optimal value [27] .
The ability to measure and report accurate and meaningful performance data is critical to the assessment and improvement of HPC systems, and can become signi cantly more complicated in environments with variability. Performance data can require multiple executions to collect and cannot be meaningfully reported as a single value but must be expressed as a data distribution. Since opportunities to run at large scale are limited, collecting large data samples can become impractical. Similarly, application tuning becomes more complex, requiring multiple executions to gather statistically signi cant results and determine whether performance changes are the result of code modi cations or the system performance uctuations [20, 29] . Auto-tuning, an already expensive process, can also become impractically expensive [5] . Further complications brought on by variability include hiding the performance impact of system software changes, complicating the assessment of system health via benchmarks. Variability also makes it di cult to predict job duration and cost, which impacts user productivity and can also complicate job scheduling [3] .
This work will examine the nature of performance variability on Cray XC40 systems that utilize the second-generation Intel Xeon Phi (KNL) processor. Presented are results that quantify the degree of variability that is present on these systems and that identify the various sources that contribute to the cumulative variability observed. Quanti cation of system baseline variability is important because reported changes in performance can only be evaluated within the context of the underlying level of system variability. It is anticipated that a wide variety of benchmark and application performance results will be reported on these systems, and these data can serve to help design and interpret performance studies.
Variability on KNL XC40 systems was studied using a variety of micro-benchmarks, mini-apps, and applications including the Nekbone and MiniFE mini-apps and the LAMMPS and MILC applications. Observed performance variations were over 70% for some applications and over 2-3X for the benchmarks. Because of the variety of causes of variability in the system, di erent applications are found to be sensitive to di erent sources of variability to di erent degrees. Several sources of variability have been identi ed, some unique to the Xeon Phi, including variability at the core, tile, node, and network levels. While IO is a known source of variability, it has distinct characteristics and is not tightly coupled to the computeintensive components of an application which are the focus of this study, and therefore will not be included in this work. In addition to identifying sources of variability and quantifying their magnitude, we have identi ed several mitigation strategies that may be employed to reduce variability when assessing the impact of code changes.
In Section 4.1, we examine the lowest level core and OS noise issues. Section 4.2 considers variability originating at the level of the two core KNL tiles. Section 4.3 examines the causes of variability at the node level as the problem size increases into MCDRAM (multi-channel DRAM) or DRAM. In Section 5, we move away from intra-node variability and examine inter-node variability at the network level. Finally, in Section 6, the entire combination of variability sources and their e ects are examined using a set of application codes running up to hundreds of nodes.
RELATED WORK
Parallel application scalability can be signi cantly impacted by the node-level performance variability. Several researchers have studied the e ects of OS services on performance variability [2, 7] . On multi-core architectures, one approach to concentrate the OS services onto a speci c core is through core specialization [21] , which e ectively isolates the rest of the cores from OS activity. We demonstrate that OS noise can still be signi cant, especially for micro benchmarking studies; however, we show that its e ects on the real applications can be mitigated e ectively.
Another potential source of variability is the dynamic frequency scaling e ects. Performance variation caused by dynamic overclocking on top supercomputing platforms was demonstrated by Bilge et al. [1] . They noticed performance degradation caused by frequency variation on math kernels and HPC applications. Contention for the shared resources on a node by the co-running codes can lead to performance variability [22] . The shared cache contention can in uence the variability between co-located threads on a multi-core processor. As we show in this paper, even the co-located processes can be in uenced by cache contention on KNL owing to its tilebased architecture, where two cores on a tile share the same L2 cache.
At the system scale, researchers have identi ed the network contention as the prime contributor for the variability. Some have used MPI collective benchmarks to demonstrate this; Skinner et al. [24] noticed a 2-3X increase in MPI_Allreduce latency due to network contention from co-located jobs. Hoe er et al. used simulation to study the e ect of OS noise on communication operations and concluded that it strongly in uences the collective operations [12] . Bhatele et al. [4] and Yang et al. [28] have observed that shared network resource contention is the major contributing factor for variability on dragon y networks. They have also highlighted that the e ects of node placement on the variability were not dominant enough compared to network congestion e ects. Jain et al. [15] proposed fabric isolation schemes to address the inter-job e ects in the dragon y networks and have quanti ed the bene ts through simulation.
Tools have been developed to quantify the extent of inter-job e ects in HPC systems. For instance, the tool mentioned in [8] measures the e ects of inter-job contention by continually running the benchmarks and provides insights using network performance counter data. Several researchers have analyzed the variability on speci c systems using real applications. Van Straalen et al. [26] identi ed the heap management algorithm implementation on Cray XT as the source of variability observed on AMR applications. A reduced noise kernel was proposed for the Cray XT5 platform and its e ect on applications was demonstrated [19] . León et al. [17] categorized the CORAL application benchmarks based on their sensitivity to the network bandwidth and run-to-run variability. Heirman et al. [10] proposed methods to accurately analyze the performance of applications on di erent architectures considering the variability aspects. In this paper, we not only quantify the extent of variability at all the levels of the Cray XC system hardware, we also present ways to mitigate some of the variability sources.
EXPERIMENTAL SETUP
In this section, we describe the high-level aspects of our experimental design and the machines used in this study. This overview will describe the broad strategy of the experimental design; more speci c details for an experiment are provided in the relevant section.
Experimental Design and Measurements
The investigation began with measuring the performance variability at the lowest level of granularity by running microbenchmarks on a single core with a small number of instructions per iteration. As we captured the variability imparted at this level of the hardware, we applied mitigations to reduce or eliminate the variability and then proceeded to the next-higher level of hardware. At any given stage, it is critical to run each test repeatedly to establish a statistically valid baseline and review the distribution of results.
Machines Used
The majority of the experiments used the Theta supercomputer installed at the Argonne Leadership Computing Facility (ALCF). Theta is a Cray XC40 with 3624 Intel KNL 7230 compute nodes. The compute nodes are connected with the Cray Aries high-speed network in the Cray dragon y topology. The Intel KNL [25] 7230 has 64 CPU cores with 16 GiB of MCDRAM and 192 GiB of DDR4 2400. The KNL is built based on "tiles" where a tile contains 2 cores and a shared 1 MiB L2 cache. Each core has its own 32 KiB L1 instruction and L1 data cache. A core has out-of-order execution and supports four hardware threads with two AVX512 vector units. The 7230 has a base frequency 1.3 GHz, and AVX frequency of 1.1 GHz when the instruction stream contains approximately more than 40% of AVX instructions. The Intel Turbo feature can dynamically scale the frequency up to 1.5 GHz for a single tile or 1.4 GHz for several tiles. Individual cores cannot scale their frequency up or down. The KNL MCDRAM can be con gured on boot into one of several combinations of memory and cluster modes [25] . The two memory modes we focus on in this paper are the cache and at modes, and we use the quadrant cluster mode. The cache memory mode con gures the MCDRAM as a last level direct-mapped cache of the DRAM. The at memory mode con gures the MCDRAM as a separately addressable memory space with its NUMA characteristic as "far" such that the OS will not allocate into it, leaving the MCDRAM for the application. The quadrant cluster mode de nes how the tag directory map of the DRAM and MCDRAM is handled. We selected quadrant mode for the best performance versus ease-of-use tradeo .
The Aries interconnect is a three-level dragon y topology network. The rst two levels (rank-1 and rank-2) are copper based with 10.5 GB/s bi-directional bandwidth per link, and the rank-3 level is optical with 9.38 GB/s per link. Each node has a peak injection rate of 16 GB/s.
In addition to evaluating the tests on Theta, we leveraged two systems from the National Energy Research Scienti c Computing Center (NERSC), Cori and Edison. Cori is a Cray XC40 system similar to Theta but much larger consisting of 2004 Intel Haswell E5-2698 v3 compute nodes and 9688 Intel KNL 7250 compute nodes. The Cori KNL nodes are 68-core parts with an increased clock rate of 100 MHz over the Theta KNL nodes. In this study, we use only the KNL nodes. Cori also uses the Aries high-speed interconnect similar to that described above for Theta. NERSC's Edison machine is a Cray XC30 system. Edison uses Intel Ivy Bridge processors that vary signi cantly from a KNL, but we do not describe them in detail here. We use Edison only to con rm the network-based experiments. Edison uses the same high-speed Aries interconnect with a 3-level dragon y topology. The Aries networks on Edison and Cori are larger than that of Theta; hence they are prone to potentially more frequent occurrences of high network congestion rate.
INTRA-NODE VARIABILITY
Within this section, various sources of intra-node variability will be examined and characterized using a set of benchmarks. For the core level studies, we run all experiments on a given node looking at only the run-to-run variation rather than variation between di erent nodes. However, we have repeated the experiments on multiple nodes to validate the statistical signi cance of the observations. All test cases in this section are performed on Theta. Where possible, mitigation strategies are recommended to reduce the overall variability resulting from a given source. These strategies are applied as the test cases are built up to help isolate new sources of variability.
We used the following microbenchmarks: Sel sh [2] , FWQ [12] , HACCmk [6] , STREAM [18] , and Naïve Matrix Multiplication (NMM). NMM is a simple double-precision square matrix multiplication algorithm.
Core-Level Variability
At the core level, a well-known and primary cause of variability can be the activity from the OS [2] . The Sel sh [2] benchmark is used to estimate OS noise by running an idle timed loop and measuring time detour from the expected threshold. Figure 1 shows the OS noise (detour) of up to 250 µs on a randomly picked core running the benchmark. This level of noise can have a signi cant impact on kernels when examining performance at small time scales. The Cray system provides a feature named core specialization, which a nitizes the OS activities to speci c cores/hardware threads. The user can specify the number of resources to reserve, and the OS threads will be bound starting with the highest core number. Figure 2 shows 10X reduction of noise to about 25 µs when core specialization (reserving one core for OS) is used. This noise can be observed within an application kernel as well. Here, we use a naïve matrix multiplication (NMM) with no optimizations and the number of elements sized to t entirely in the L1 cache. In Figure 3 , we show the e ects of the OS noise on this kernel. The application processes running on core 0 and core 1 are executing much more slowly than the other processes owing to contention with the OS as discussed above. The results of the same experiment with core specialization are shown in Figure 4 . By con ning the OS threads to the last core, the absolute performance di erence between the processes is greatly reduced. Prior studies [1, 23] showed performance variability occurring due to turbo mode on Xeons, so we examined it as a possible cause of variability on our systems. Our experiments on the KNL node did not show any material di erence from turbo mode for a single node. Variability from frequency scaling is about 0.5% with inconclusive trends; variability can go up or down by 0.5% when turbo is enabled. However, turbo mode can improve performance by 1% to 7% for the benchmarks we evaluated. The impact of core-level variability on applications has been evaluated using three codes. HACCmk [6] is the major compute kernel from the HACC code, FWQ [12] is a synthetic compute benchmark that repeatedly performs a xed amount of work, and again we use our matrix multiply test. Each code was con gured to use data from only the L1 cache and run for di ering lengths of time ranging from 1 millisecond to 1 second, and the range of variability was observed. As Figure 5 shows, the observed variability between the runs decreases as the run duration increases, with core-level variabilty approaching zero as the run time approaches 1 second. This reduction in variabilty can be explained by an averaging out of the core-level variability over su ciently long intervals.
Some variability is still observed at the core level, and the exact cause continues to be investigated. Architectural features of the KNL many-core node can potentially induce core-to-core variability.
As an example, core proximity to the memory controller may result in asymmetric memory latency or on-die mesh bandwidth. In multinode systems, variability between cores across di erent nodes could exist due to manufacturing di erences, although frequency binning during chip production can signi cantly reduce variability from this source.
Recommendation: Use OS features such as core specialization that reduce the impact of the OS on the application. The impact of core-level variability e ects can be minimized by using su ciently large run times.
Tile-Level Variability
The KNL features a tile architecture with two cores sharing the L2 cache. The impact of the shared L2 cache is illustrated using the weak-scaled MPI version of HACCmk. The variability when a serial version of HACCmk is run on both cores on the tiles is shown in Figure 6 . (The cores from rst and last tiles are excluded for this experiment in order to avoid the OS noise e ects possible on these tiles even with core specialization.) The maximum of the run-to-run variability range across all the 60 cores is 5.3%, and the maximum overall core-to-core variability range is 12.4%. A bimodal runtime distribution can be seen in the gure. This bimodal time distribution correlated well with the L2 cache miss count distribution shown in Figure 7 . Around 2X variability in the L2 cache miss count is observed. The e ects of variability for this kernel can be mitigated when only one core per tile is used. As shown in Figure 8 , with one core per tile, the overall core-to-core variability range is only 2%.
This experiment was repeated across several nodes to validate the statistical signi cance of the observations made here. The average variability observed across these nodes is 4.97% when using one core/tile and 14% when using two cores/tile. NMM is used to study the variability as a function of memory footprint size. Similar to the earlier experiment using HACCmk, NMM is run on 60 cores on a node with di erent matrix sizes such that the working set size varies from 20% to 200% of the L2 cache capacity. Figure 9 shows the percentage of maximum to minimum core-to-core variability when two cores and one core per tile are used. Starting at about 80% of L2 capacity, cache con icts resulted in higher variability with two processes/tile. Variability for memory footprint in the [80% of L2 -200% of L2] range is subject to application address distribution and access patterns when two cores/tile are used. Lower variability is observed when only one core per tile is used.
As the memory footprint approaches L2 cache size, cache contention (as indicated by the L2 cache miss count) between the two cores resulted in the performance variability. As of now, we do not understand the root cause for the cache-sharing unfairness. 1 However, using one core per tile mitigates this contention and greatly reduces the overall variability. 1 The tile-level variability is not observed on other KNL clusters and also it is not observed on Theta after sofwatre updates, hence, we reason out that the variability could be an artifact of the software and not inherent in the hardware. 
Memory Mode Variability
One of the unique features of the KNL processor is the two types of memories it has: DRAM and MCDRAM. As mentioned in Section 3.2, the KNL can be con gured in di erent modes. In this section, the di erences between the cache and at mode in terms of the variability are characterized. The peak STREAM Triad bandwidth for DRAM is 90 GiB/s, and MCDRAM is over 480 GiB/s.
In at mode, the MCDRAM is a separately addressable memory and has the properties very similar to those of DRAM except for higher available bandwidth. When MCDRAM is in cache mode, it is a direct mapped cache. Multiple memory locations from DRAM can map to a single location in the MCDRAM cache. The mapping uses the physical addresses, and not the application-level logical addresses. Even when an address is contiguous in the logical address space, the physical addresses that the OS assigns to the application memory are not required to be contiguous. This treatment can cause con icts in the MCDRAM cache when a portion of the MC-DRAM cache is used, thus e ectively resulting in reduced memory bandwidth. This reduction can vary from run to run, as OS page allocation may change from run to run. We characterize the e ect of this by using the STREAM Triad benchmark.
The benchmark is run on multiple nodes repeatedly for 10 times, and the variability in the measured bandwidth between nodes is captured. A working set size of 7.5 GB, which is less than half of the available MCDRAM memory is used. The Triad bandwidth measured on 50 di erent nodes in at MCDRAM-only mode is around 485 GB/s, and it is consistent across the nodes with less than 1% node-to-node variability. Figure 10 shows the bandwidth for the same working set on 50 nodes in MCDRAM cache mode. The peak bandwidth achieved in the cache mode is just around 345 GB/s. However, several nodes achieve lower than this peak, with a speci c node achieving just around 170 GB/s. The node-to-node variability in the e ective bandwidth measured is around 2X. This kind of variability can make even a perfectly balanced code run appear imbalanced, which means time lost to hardware ine ciencies that are potentially unknown and cannot be controlled by the user.
Node performance monitoring event counters can be used to understand the reasons for the high node-to-node variability in the cache mode. The number of reads and writes for both MCDRAM and DRAM can be counted using these events [13] . Additionally, in the cache mode, MCDRAM hits and misses when MCDRAM is in either a clean or dirty state can also be measured. The summation of MCD_Misse and MCD_Missm (MCDRAM misses when MCDRAM is clean and dirty, respectively) represent the number of writes from L2 to MCDRAM. Figure 11 shows the uncore counter data corresponding to one run of the STREAM benchmark run in the cache mode. The node that has a lower bandwidth in Figure 10 has a correspondingly higher MCDRAM miss ratio (a higher number of MCDRAM writes) in Figure 11 , thus indicating the potential con icts incurred in the MCDRAM cache. The untile counter data for the Flat MCDRAM-only mode has the same miss ratio across all of the nodes as shown in Figure 12 .
The above observations are for a speci c working set size (7.5GB); we then veri ed cache mode variability for other sizes that t within MCDRAM. Figure 13 shows the node-to-node variability across 50 nodes with MCDRAM as a cache for di erent working set sizes that are smaller than the MCDRAM capacity. Bandwidth on some of the nodes is low with all the working set sizes. Since the workaround to address the page con ict issue in cache mode is unknown; therefore, running in the at quad mode is the recommended mitigation. Cray's Zone sort [14] algorithm, enabled by default on Theta and Cori, sorts the MCDRAM free page list with each job invocation. The expectation is that this sort would help address the MCDRAM page con ict issues; however, the state of the free page list could potentially be di erent from node to node depending on the prior application that used the MCDRAM on that node. We demonstrate the e ects of memory mode variability on the MiniFE [6] application. MiniFE is a memory bandwidth-intensive application; we study its performance variability in the cache mode. We run MiniFE with di erent problem sizes and report the performance in GFlops. It is run on 128 single-node jobs in cache quad mode on Cori, and the mean performance across the 128 jobs is noted. Figure 14 shows the mean performance along with the standard deviation (as error bars). The variability is high when the problem size ts within the MCDRAM (16 GiB), and low when the problem size ts only in the DRAM. While Zone sort helps in reducing the variability range across the 128 jobs, it does not help reduce the variability entirely. MiniFE running in node MCDRAM at memory mode showed no variability. Recommendation: Shared node-level resources can have a sign ciant impact on variability and need to be characterized to understand the performance. Memory bandwidthintensive applications could be run in at mode for repeatable performance.
SYSTEM-LEVEL VARIABILITY STUDY
The scale of the interconnection networks is increasing along with scale of the supercomputers. While the op-rate is expected to grow, the challenge for application performance is the nominal network performance growth. Higher-radix, lower-diameter network topologies such as dragon y are widely used in the recent HPC systems. As the network scale increases, the network diameter increases and the bisection bandwidth relative to the number of nodes decreases. Resource managers typically attempt to maximize job throughput and system utilization. Consequently, multiple application jobs running on the system share the network resources such as routers and links, and this resource sharing can potentially lead to great performance variability from run to run. We rst quantify the extent of this variability using MPI collective benchmarks. While designing the experiments in this section, the recommendations described earlier for reducing the core-and node-level variability are adhered to. The MPI collective benchmarks are run on di erent node sizes on Theta using the at memory mode. The benchmarking job is run on several days, which potentially results in a randomized job placement and job mix. We have run the benchmark job in an isolated environment where no other job is running on the system. We refer to this as the ideal case. Figure 15 shows the latency for the MPI_Allreduce with a message size of 8 MB (1M doubles) on 128 processes (using 1 process per node) on Theta. We use only 1 process per node in order to avoid any source of variability coming from the node level. Within each job, the collective call is repeatedly run 100 times; each box in the gure represents the variability across the 100 runs. The run-to-run variability for di erent days is shown with a corresponding box. The bottom and top of the box are the 25 th and 75 th percentile respectively, and the band near the middle of the box is the 50 th percentile, or the median. The middle horizantal line shows the median of medians, and the other four horizontal lines show +/-5% and +/-10% di erence from the median of medians. The run in the isolated system has the lowest latency of around 15 ms, and the benchmark has good repeatability irrespective of the node placement used (not shown in the gure). The median latency across all the days (median of medians) is 5% higher than this level. The median latency on two speci c days is around 10% and 50% greater, respectively, than the overall median latency. For some days, while the median latency is well within 5% from the overall median, there are a few outliers that are around 50% higher than the overall median. As only one process per node was used in this experiment, the variability observed here is attributable solely to the network congestion.
MPI Collective Benchmarks
We now experiment with a con guration that has higher nodelevel parallelism. The MPI_Allreduce variability for 2K processes using 8 processes each from 256 nodes with 64 MB (8M doubles) message size is shown in Figure 16 . The MPI processes within a node are mapped such that no two MPI processes are mapped onto the cores on the same tile to avoid the e ects of L2 contention as discussed in Section 4.2. While the run-to-run variability on most of the days is small, there are a few cases where it is as high as 15%. The higher latency can be attributed to the higher inter-job contention during the job run. Avg. Latency (us) 24 5 We now show the latency of MPI_Allreduce with di erent message sizes on a 256-node allocation on Edison where the benchmark job is repeated 10 times. Figure 17 shows variability across these 10 di erent jobs. Edison also uses the dragon y topology, and we wish to quantify here the e ects of inter-job contention on Edison. For the smaller message sizes, which are latency sensitive, the jobto-job variability is as high as 7X (the slanted numbers within the plot represent the percentage of maximum to minimum variability). This study on Edison brings forth the point that the variability is inherent for the Cray's dragon y implementation and is independent of the node architecture.
We then studied the routing mode and reordering e ects on the variability of MPI_Allreduce latency. The routing mode decides the routing policy for the message on the dragon y topology. The choice of routing mode used is user con gurable. The job scheduler provides two di erent heuristic-based reordering methods for remapping MPI processes onto nodes. Figure 18 show the variability for MPI_Allreduce with a message of 256 Double precision data under di erent routing modes and reordering methods. The variability across the runs within a job and the variability across the 4 di erent jobs is shown in the gure. While the routing modes such as MIN-HASH and IN-ORDER, which are static routing modes, enable relatively low average latency compared to other modes, the overall variability is still signi cant. However, using the static routing modes can help mitigate the variability that occurs due to adaptive routing. At the moment, for mitigating the network variability, we do not yet have a recommended strategy other than running in an isolated system.
APPLICATIONS
Now that we have quanti ed the variability at the node level and the system level using the benchmarks as described in the earlier sections, we study how applications are a ected by variability sources and how we can perform performance optimization in the presence of this variability. We study the applications MILC [6] , Nekbone [6] , and LAMMPS [6] , which have di erent computation and communication characteristics, and analyze how those applications are impacted by the variability sources identi ed in the earlier sections.
The applications and input con gurations have been selected to obviate the need for any signi cant I/O, and we measure only the computational portion of the application. First, we study the application running under a production con guration to obtain a baseline pro le for the application. We will then run the application under a "reduced variability" con guration with both the initial code and the tuned code. We strive to demonstrate that with a low-variability environment, performance studies do not require a signi cant number of iterations to obtain a statistically signi cant average performance. We demonstrate that the performance gain obtained in the low-variability environment will translate into the normal environment by taking large samples for both con gurations and showing that the average performance has improved.
MILC
MILC is a suite of applications for simulating quantum chromodynamics (QCD). The speci c application used is the su3_rhmd_hisq with a lattice size of 96 3 ×48 using 8192 MPI processes on 128 nodes. This con guration results in 24 4 lattice sites per node, which results in a memory usage that ts within the 16 GiB of MCDRAM. Table 1 shows the performance pro ling when run in the production con guration under the MCDRAM cache mode. The performance di erence between a slow run and a fast run is 371.8 seconds or approximately 19% of the total runtime. From the pro le, we see that the majority of that di erence, about 273 seconds, is from MPI. Although we do not have a precise mitigation for variability induced by the application beyond running it in isolation, instead MILC is run in at mode, which happens to have a small number of nodes (500) reserved on Theta in sequential order. Running under this con guration provided the minimum variabilty. The su3_rhmd_hisq application provides a detailed set of output for the various solver steps, which can be processed by a script to generate a performance result in MFLOP/s. The FLOPs in this case are as de ned by the MILC application developers. We run three tests which exhibit less that 1% variability and achieve a performance of 44,839.56 MFLOP/s. A simple optimization for MILC is to apply an optimized rank-to-node mapping which can be accomplished with Cray's grid_order tool. Running the MILC application with this modi ed MPI rank mapping achieves 54,821.88 MFLOP/s resulting in 22.2% performance gain. Figure 19 shows the results for running this con guration under cache mode both with and without the job placement optimization. A total of 20 tests were run for each base case and optimized case, where we nd the one-sample t-test for both sets is 2.2e-16. The mean performance for the base case is 36,450.22 MFLOP/s, and the mean performance of the optimized case is 44,954.79 MFLOP/s, for a similar improvement of 23.3%. As can be seen in Figure 19 , the ranges for the absolute performance overlap. If only a single random sample was used, a user could be led to believe that the original con guration had better performance than the optimized version. 
Nekbone
The Nekbone application is a mini-app for the CFD code, NEK5000. Nekbone solves the 3D Poisson equations using an iterative conjugate gradient method. Nekbone has a number of ne-grained timers which measure the three major subcomponents: memory bandwidth-bound dot product type (DAXPY), compute-bound matrix multiplication type (MXM) compute components, and the collective communication (COMM) component. We analyze here how these components are a ected by di erent sources of variability. Nekbone is set up with a ploynomial order of 16 and 512 elements per process, and 63 processes on a single node of Theta are used. With this input con guration, Nekbone is memory bandwidth bound. We repeatedly run the application 10 times within a node, and the job is run on 70 nodes. Figure 20 provides the time breakdown for the three di erent kernel components of Nekbone within one run in each of 70 jobs under the production environment. The variability observed in the total time correlates closely with the DAXPY type subcomponent. The DAXPY type subcomponent is a memory bandwidth-sensitive computation, and hence it is a ected by the cache mode e ects in a manner similar to what is presented with the STREAM benchmark in Section 4.3. Whereas the range of variability with the STREAM benchmark is around 100%, the variability with Nekbone is measured to be around 21.95%. Given that the primary component that dominated variability is the DAXPY kernel, we can run this test under at mode to reduce the variability. Figure 21 provides the time breakdown for Nekbone within one run on 70 nodes in at mode. A reasonably low variability level of 3.57% is observed across the jobs on di erent nodes. To further qualify our ndings detailed above, we can examine the variability of the total solve time within the node and across the nodes. Figure 22 shows the run-to-run variability range (de ned here as the di erence between maximum and minimum time) across 10 runs within a job and the variability from node to node across the 70 di erent nodes that are in at-quad mode. The maximum run-to-run variability observed was 2.68%, and the node-to-node variability was 4.91%. The same application when run on cachequad nodes leads to higher variability ranges with a maximum run-to-run variability of 18.81% and a node-to-node variability of 23.34% as shown in Figure 23 .
Nekbone performance can be optimized by applying a modi cation that improves the performance of small matrix multiplication. Intel has contributed work on the libxsmm project [9] for accelerating small matrix multiplication on the KNL processor. We build Nekbone with the libxsmm library, replacing the Nekbone native matrix multiply kernels. This substitution provides an improvement of 20.7% over the base case when running under the reduced varibility environment. Figure 24 demonstrates the results that Node-to-node range = 4.9 % Max. Run-to-run range = 2.7 % Figure 22 : Node-to-node variability for Nekbone: 512 elements, 16 polynomial order on nodes of Theta in at-quad mode Figure 23 : Node-to-node variability for Nekbone: 512 elements, 16 polynomial order on nodes of Theta in cache-quad mode are achieved when running in the production environment (cache mode) for the base and optimized cases. We see that the mean performance improvement is 18.8% and that the 95% con dence interval for improvement is between 18.5% and 19.0%. Again, in Figure 24 , if any two random samples were selected, the application could show a performance change of between 2% and 35%.
LAMMPS
LAMMPS is a classical molecular dynamics code; it runs on single processors or in parallel using message-passing techniques and a spatial-decomposition of the simulation domain. We run the LAMMPS using the RHODO benchmark on a single KNL node on Theta. Table 2 reports the variability from run to run with di erent run con gurations, with the default con guration using all 64 cores with 2-way multi-threading. The variability ranges for this case in cache mode and at mode are 14% and 13.2%, respectively. Using just one thread per core reduced the variability to 4.9%, thus indicating that using two threads per core potentially leads to L2 cache con icts due to preferential L2 treatment between the two cores on a tile. Using core specialization and only 63 cores further reduced the variability to 2.9%.
With a reduced variability environment speci ed, we can evaluate the performance of using the native versus optimized math libraries. The base case is run under the low-variability environment with a computational loop requiring 1247.3 seconds. LAMMPS is rebuilt linking with the Intel MKL and run under the low-variability Figure 25 shows the results for running the base and optimized cases under the production environment. The base case results in a mean performance of 1025.9 seconds, and the mean performance of the optimized case is 499.2 seconds for a speedup of 2.06x. The speedup achieved for the 95% con dence interval is between 2.00x and 2.10x. The performance run in the low-variability environment accurately re ects the performance gain under the production environment. Selecting random samples from the production environment runs yields a speedup from between 1.80x to 2.31x.
CONCLUSIONS
In this paper, we have characterized and quanti ed the di erent sources of variability on KNL-based Cray XC systems, and mitigation approaches were outlined wherever possible. We have identi ed the prominent sources of potential variability at the core, tile, node, and system levels. The variability observed at the core level may be primarily attributed to OS operations. This variability is most prominent when measurement intervals are short, on the order of milliseconds or less, and approaches zero for measurement intervals longer than a second. The impact on application performance, however, can still be signi cant as the over-burdened rst core that runs OS processes can make synchronization calls expensive because of the resulting performance imbalance. Latencybound MPI collective operations can be a ected by this noise. Core specialization is observed to help mitigate this noise and reduce the overall impact on the applications. Another source of variability has been observed to occur at the tile level. The KNL uses a tile-based architecture with two cores on a tile sharing the L2 cache. The sharing of the caches by threads/processes can lead to cache contention with potential preferential treatment occurring between the two cores. This result manifests as a bimodal e ect where performance di erences occur between cores and produce around 20% core-to-core variability. While ine cient, using only one tile per core will eliminate variability from this source.
Another source of variability was found to occur at the node level when the KNL was utilized in cache mode. The direct-mapped MCDRAM cache was observed to contribute around 2X variability to the STREAM benchmark and around 20% variability to the Nekbone application. Performance variability was found to be reduced to less than 1% for bandwidth-intensive benchmarks, such as STREAM, when run in at mode. An improved Zone sort algorithm and the use of huge pages are the other potential mitigations that remain to be explored.
The source of variability at the system level is the well-known issue of inter-job contention that occurs due to a shared network resource on the dragon y topology. The level of variability on the Cray XC40 system was quanti ed using MPI benchmarks and real applications. For MPI collectives, performance was observed to vary by as high as 7X due to network contention and for the MILC application, a performance variability of around 70% from run to run was observed and attributable to variability in the communication performance.
Variability at any of the identi ed levels directly a ects the performance of an application. The extent of the impact depends on the application characteristics such as being compute-bound, memorybound, or communication-bound. Hence, it is necessary to take variation into account when quantifying the performance of the applications. Experiments should be designed and reported with su cient sample sizes to obtain statistically signi cant observations. The mitigations noted in this work can reduce the number of samples required and simplify the process of performance tuning.
Finally, given the characterization of the variability, performance tuning analysis can be carried out under a low-variability environment, which accurately predicts the expected performance improvement under the production environment without the need for gathering large sample sets to validate the statistical improvement.
A ARTIFACT DESCRIPTION: RUN-TO-RUN VARIABILITY ON XEON PHI BASED CRAY XC SYSTEMS A.1 Abstract
This artifact contains the source code, input datasets and the build instructions that can be used to reproduce the variability experiments described and presented in the paper.
A.2 Description
A.2.1 Check-list (artifact meta information).
• A.2.4 So ware dependencies. All benchmarks were built and run on Argonne's Theta using Cray's cc compiler and other default parameters. Theta uses the Cobalt job scheduler. All software dependencies speci c to each benchmark will be noted in a README le within each subdirectory.
A.2.5 Datasets. Several benchmarks use random nominal values; otherwise datasets are provided with each benchmark.
A.3 Installation
To access the code, clone the repository by running git clone https://xgitlab.cels.anl.gov/variability/sc17 Each benchmark used in the paper has a seperate directory which will contain source code, build scripts, run scripts, and output parsing scripts (if required).
• -MILC -NEKBONE -LAMMPS As these experiments are primarily designed for Theta, the build commands provided give examples that should be adapted if they were to be used on other systems.
A.4 Experiment Work ow
Each benchmark will have a "./run.sh" script which details how to run the application on Argonne's Theta system. Additionally a README le will explain the various options of the run script and output.
For example: cd ./corevariability/matrix_multiplication make clean; make all ./run.sh
A.5 Evaluation and Expected Results
For each benchmark, the metric used to calculate variability is the maximum to minimum di erence between the runtime across several runs or several cores. For example: The Sel sh benchmark is used to identify OS noise, by running the sel sh benchmark on each core and using MPI to aggregate the results across all the used cores.
Each application will have a script to extract the data from the log les created during the run. This data will automatically be created in the respective folder.
