Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments.
INTRODUCTION
Recent advances in the theory of Neural Networks (NNs), new computer hardware such as Graphics Processing Units (GPUs), the availability of training data, and ease of access through the cloud have allowed Deep Learning (DL) to be increasingly adopted as part of business-critical processes in health care, autonomous vehicles, natural language processing, and the Internet of Things. Consequently, many online platforms that offer image-processing and speech-recognition systems backed by trained DL NNs are emerging to deliver business-critical services, such as IBM Watson [23], Microsoft Project Oxford [31], Amazon Machine Learning [2], and Google Prediction API [16].
Training DL NNs is a computationally intensive process. An image-processing application, for instance, might demand the analysis of millions of pixels in one of the many layers of the NN, which takes several hours to days of computation [10]. A promising approach to improve the efficiency of the training process, in both processing time and power consumption, is to use one or more GPUs. Computing the NNs on multiple GPUs further reduces training times, makes it possible to handle larger amounts of data, and increases the accuracy of the trained models. Hence, using multiple GPUs has become common practice for DL applications [10, 18]. Although training on multiple GPUs delivers many advantages, it presents new challenges in workload management and scheduling for obtaining optimal performance. The performance depends on both the connectivity of GPUs and CPUs in the physical topology and the communication pattern of the application's tasks.
To illustrate this issue, consider Figure 1, which shows the connectivity topology between the GPUs and CPUs for two representative DL cognitive systems. In these systems, multiple link technologies such as PCI-e and NVLink connect GPUs to each other and GPUs to host CPUs. NVLink offers better bandwidth and lower power consumption than PCI-e. In the figure, the IBM Power8 system consists of four GPUs and two CPUs, with two GPUs per CPU socket. The two GPUs on each socket are connected with dual-lane NVLink to achieve up to 40GB/s unidirectional bandwidth, and each of the GPUs is also linked to the socket with two lanes of NVLink. The two CPUs are connected via the system bus. The NVIDIA DGX-1 has 8 GPUs connected to two CPU sockets. The GPUs are connected over a hybrid cube-mesh topology: the 12 edges of the cube are connected via single-lane NVLink, and the diagonals of two of the six faces are also connected via NVLink. Each of the GPUs is also connected to a PCI-e switch, so it can communicate with a GPU that is not connected to it via NVLink and with the CPU as well.
In these systems, communication can take place directly between devices, in the so-called Peer-to-Peer (P2P) model, or it must be routed through the main memory of the processors containing the bus controllers. For example, in the case of the DGX-1, the communication between GPU1 and GPU5 goes over the PCI-e switches and the system bus (such as the QuickPath Interconnect, QPI). As a result of these complex connectivity topologies between different GPUs, the application performance depends on which GPUs are allocated for computation and how those GPUs are connected to each other (via PCI-e or NVLink).
Additionally, this challenge becomes acute in shared systems, like cloud computing, where multiple applications from different users share the GPUs on the system. At this time, it is uncommon to share a single GPU between two applications, so sharing here means that different applications get different sets of GPUs. Jobs in this environment have varied GPU requirements: some need a single GPU, some need GPUs with NVLink, others need multiple GPUs but have minimal communication requirements, etc. In such environments, the cloud scheduler should be able to take the communication requirements of the workloads, consider the topology of the system, consider existing applications and their GPU and link utilization, and provision GPUs for the new workload that meet the workload requirements. This enables users to get access to the necessary resources without worrying about the detailed topology of the underlying hardware. Major cloud providers such as IBM, Amazon, Google, Microsoft, and others provide multi-GPU systems as a service today via virtual machines, and most of them have systems with GPU topologies similar to those described in Figure 1, so job scheduling and resource management become critical when running multi-GPU applications on a shared system. Thus, those systems require the same placement functionality proposed in this work to fully exploit the capabilities of modern cognitive systems. Furthermore, both cloud and HPC systems can benefit from a GPU topology-aware scheduler.
In this paper, we present an algorithm with two new scheduling policies for placing GPU workloads in modern multi-GPU systems. The algorithm is built on a new graph mapping algorithm that considers the job's performance objectives and the system topology. Applications can express their performance objectives as Service Level Objectives (SLOs) that are later translated into abstract Utility Functions. The proposed algorithm minimizes the communication cost, reduces system resource contention, and increases system utilization.
The major contributions of this paper are:
• Performance characterization of placement strategies and interference from co-scheduled jobs on a modern Power8 system equipped with NVLink. The results show that using pack instead of spread for a job with high GPU communication gives a speedup of ≈1.30x. Additionally, the results indicate that co-scheduling jobs with high GPU communication instead of running them solo can lead to a slowdown of ≈30% (Section 3).
• A topology-aware placement algorithm that places jobs based on their utility, with a best-effort approach to preventing SLO violations. Two scheduling policies are defined: TOPO-AWARE-P, which allows postponing the placement of unsatisfied jobs, and TOPO-AWARE, which always places jobs when resources are available (Section 4).
• A prototype evaluation of the proposed algorithm showing the performance improvements that topology-aware scheduling confers for DL workloads using multiple GPUs. The results show a speedup of up to ≈1.30x in the cumulative execution time and no SLO violations compared to greedy approaches (Section 5).
• A trace-driven simulation to analyze the topology-aware placement algorithm on a large-scale cluster. The results show that the proposed algorithm outperforms the greedy algorithms in the execution time, with no or fewer SLO violations (Section 5).
Section 6 discusses the state of the art and related work, and Section 7 presents the summary, conclusions, and future work.
DEEP LEARNING WORKLOADS
This section presents DL frameworks and the characteristics that are relevant for topology-aware scheduling in multi-GPU executions. With the increasing popularity of DL methods, several deep learning software frameworks have been proposed to enable efficient development and implementation of DL applications. The list of available frameworks includes, but is not limited to, Caffe, Theano, Torch, TensorFlow, DeepLearning4J, deepmat, Eblearn, Neon, and PyLearn [3]. While each framework develops different algorithms and tries to optimize various aspects of training, they share similar GPU communication algorithms [42]. This work focuses on one of the most popular frameworks at this time, Caffe, but our results are equally applicable to other frameworks. Various NN models are implemented for Caffe, including AlexNet, CaffeRef (based on AlexNet), and GoogLeNet. We use those models to evaluate the efficacy of the topology-aware scheduling algorithm presented in this paper.

DL frameworks have two main approaches to dividing the workload when using multiple GPUs: data-parallelization and model-parallelization. In data-parallelization, the data is partitioned and spread across different GPUs. In model-parallelization, the NN model is partitioned, and different GPUs work on different parts of the model; for example, each GPU holds different layers of a multi-layer NN. Although model-based parallelism is expected to be more communication intensive, it is still uncommon in cloud deployments, and therefore we focused all experiments on data-parallelization. We expect topology-aware scheduling to be even more critical for model-parallelization workloads because of their higher communication requirements.
Additionally, a key parameter that plays a significant role in the communication is the batch size. It determines how many samples per GPU the NN will analyze in each training step, and directly impacts the amount of communication and computation in each step. The lower the batch size is, the noisier the training signal is going to be; the higher it is, the longer it will take to compute the stochastic gradient descent. Noise is an important component for solving nonlinear problems. Hence, using small batch sizes is a recent trend for training DL NNs. The batch size also determines the level of parallelism the NN can reach, since it partitions the dataset [6].
The next section presents an evaluation of the impact of different placement strategies on execution time with three different NNs (AlexNet, CaffeRef, and GoogLeNet), each with four different batch sizes (tiny, small, medium, big).
EVALUATING THE IMPACT OF PLACEMENT STRATEGIES
In this section, we evaluate two general-purpose workload placement strategies: pack and spread. Later, in Section 4, we combine them into the utility function used in our proposed algorithm. The main sources of performance perturbation for multi-GPU applications are how the allocated GPUs are connected, i.e. the topology, and how much of the shared bus bandwidth other applications are utilizing. To illustrate this, Figure 2 shows different workload placement strategies that can be defined on a single machine with a hardware topology composed of two sockets and two GPUs per socket (the same topology shown in Figure 1 for the Power8 system). The GPUs within the same socket are located at a "shorter" distance (from a topology perspective) than the GPUs located across sockets. Besides, GPUs on the same socket can communicate over the higher-bandwidth, lower-latency links (e.g., NVLink) instead of going over the PCI-e and QPI links used to communicate across CPU sockets. Therefore, the first workload placement strategy is pack, which systematically favors minimizing the distance between GPUs to prioritize the performance of GPU-to-GPU communication. The second workload placement strategy is spread, which attempts to allocate GPUs from different sockets and prioritizes the performance of CPU-to-GPU communication. Spread promotes better resource utilization and minimizes fragmentation.
Another factor that impacts the performance of either the pack or the spread placement scheme is the interference introduced by other applications sharing the system resources. For this reason, placement algorithms should consider not only the static topology of the system but also the runtime utilization metrics of currently executing applications when making scheduling decisions.
Next, we describe the testing platform and evaluate the impact of the placement strategies used to allocate GPUs for the DL applications outlined in Section 2.
Testing Platform and Configuration
All experiments are conducted on an IBM Power8 System S822LC, code-named "Minsky", shown in Figure 1. The server has 2 sockets with 8 cores per socket running at 3.32 GHz, and two NVIDIA P100 GPUs per socket. Each GPU has 3584 processor cores at boost clocks from 1328 MHz to 1480 MHz, and 16 GB of memory. Each socket is connected to 256 GB of DRAM. The intra-socket CPU-to-GPU and GPU-to-GPU connections use dual NVLinks, which are based on NVIDIA's new High-Speed Signaling interconnect (NVHS). A single link supports up to 20GB/s of unidirectional bandwidth between endpoints. A high-level illustration of the hardware topology is pictured in Figure 1 and Figure 2.
For the software stack, this machine is configured with Red Hat Enterprise Linux Server release 7.3 (Maipo), kernel version 3.10.0-514.el7.ppc64le, Caffe version v0.15.14-nv-ppc compiled with NCCL 1.2.3, CUDA 8.0, and CUDA driver 375.39. All Caffe workloads are configured with a set of images from the dataset used in the 2014 ImageNet Large Scale Visual Recognition Challenge (one of the most well-known datasets for image classification, publicly available on the ImageNet competition website).
All experiments were repeated five times. For each experiment, the maximum number of iterations is 4000, except when generating the GPU profile, where only 40 iterations are run. The number of iterations is reduced because profiling consumes a lot of memory, and a large profile does not fit in the GPU memory. The tool used to profile the application was NVIDIA nvprof. For all workloads, the NN training batch sizes range from 1 up to 128.

Figure 4 shows the relative speedup achieved when allocating GPUs within the same socket (pack) compared with allocating them across sockets (spread). When the speedup is higher than 1, the application performs better with the pack strategy. The performance depends on both the workload type and the batch size. When AlexNet is configured with batch size 1 or 2, it has a speedup of up to ≈1.30x, but for batch sizes larger than 16, pack and spread perform equally. GoogLeNet behaves differently from the other NNs, with little or no impact, as detailed next.
Pack versus Spread
To better explain the cause of the performance delivered by the strategies, the application breakdown is presented in Figure 3. The analysis shows the percentage of the whole execution time spent in computation and in communication. The results indicate that larger batch sizes significantly increase computation time, while communication time becomes less significant overall.
Taking AlexNet, for instance, when configured with tiny batch sizes, the computation time is ≈1s for 40 iterations; with big batch sizes, this time increases to ≈66s. The communication time, in contrast, remains ≈2s for all batch sizes. Although a bigger batch size increases the amount of data exchanged between the GPUs, the NN also spends much more time computing on the GPU in each batch step. Hence, communication becomes less frequent with bigger batch sizes. On the other hand, smaller batch sizes require many more steps to process the whole dataset and therefore require more frequent communication. This behavior can be verified with the NVLink bandwidth usage in Figure 5. The communication frequency directly impacts the usage of the NVLink bandwidth. The NN configured with a small batch size reaches a higher NVLink bandwidth usage of ≈40GB/s, while the NN with a bigger batch size barely reaches ≈6GB/s, as shown in Figure 5 (the NVLink bandwidth calculation is described later in Section 5.1).
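The relation between batch size and communication frequency can be illustrated with a little arithmetic. The sketch below is only an illustration: it assumes data-parallel training with one gradient exchange per training step, and the dataset size is a hypothetical value chosen to make the numbers easy to read; the function name is not part of Caffe.

```python
import math


def steps_per_epoch(num_samples: int, batch_per_gpu: int, num_gpus: int) -> int:
    """Training steps needed to process every sample once.

    Assumption: each step consumes batch_per_gpu samples on every GPU and
    ends with one GPU-to-GPU gradient exchange.
    """
    return math.ceil(num_samples / (batch_per_gpu * num_gpus))


# Hypothetical dataset size, used only for illustration.
NUM_SAMPLES = 1_000_000

for batch in (1, 2, 16, 128):
    steps = steps_per_epoch(NUM_SAMPLES, batch, num_gpus=2)
    print(f"batch={batch:>3} per GPU -> {steps} synchronizations per epoch")
```

Under these assumptions, going from a batch size of 1 to 128 cuts the number of synchronizations per pass over the dataset by roughly two orders of magnitude, which is consistent with communication becoming less frequent for big batches.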
GoogLeNet is the least intuitive case. Since this NN contains sizable neural network layers, and the intensity of communication typically depends on the amount of information exchanged between layers, GoogLeNet would be expected to perform more communication than the other NNs. Nonetheless, GoogLeNet performs less communication because of its Inception Modules, which reduce the output of the NN layers by applying filtering and clustering techniques.
We have also executed the same experiments on a Power8 machine equipped with a PCI-e Gen3 bus instead of NVLink, and with NVIDIA K80 GPUs instead of P100. Due to space limitations, we do not include additional figures in this paper, but summarize the results as follows. The impact of the pack strategy is similar between NVLink-based and PCI-e-based machines, except for larger batch sizes, where the difference starts to be evident. For instance, for AlexNet with a batch size of one, the speedup is ≈1.27x with NVLink and ≈1.24x with PCI-e. For a batch size of two, the speedup drops from ≈1.30x with NVLink to ≈1.21x with PCI-e. For a batch size of eight, the speedup decreases from ≈1.20x to only ≈1.1x.
In conclusion, while the impact of the topology on GPU communication performance is still significant in the PCI-e-based machine, improvements in the placement decisions for DL workloads are even more necessary in NVLink-based machines.
Jobs in a Co-Scheduled Environment
A typical approach to increase resource utilization in a data center is co-scheduling workloads on the same machine. While it confers cost benefits, it comes with an inherent performance impact. Although the GPUs are not shared in this work (jobs have private access to GPUs), collocated applications share the bus interconnections, among other resources. Therefore, the goal of this experiment is to evaluate the performance impact of the pack and spread strategies in a co-scheduled environment. Unlike the previous experiment, this experiment shows application interference.
We have performed an experiment that collocated two jobs on the same machine. Each job is an AlexNet NN requesting two GPUs, with varying batch sizes. The results are shown in Figure 6, where 0 represents no slowdown from co-scheduling two jobs on the same machine and a value higher than 0 is the slowdown percentage. Note that a job with high GPU communication is more sensitive to interference than a job with lower communication.
As analyzed in the previous experiment (Section 3.2) and shown in Figure 5, the batch size plays the main role in defining the amount of communication and the job's performance sensitivity. For that reason, when co-scheduling two jobs with a tiny batch, the slowdown suffered is higher, up to ≈30%. But when collocating two jobs with a big batch, the performance interference is very small or nonexistent. This is because a job with a big batch is not sensitive to perturbations in the bandwidth, since it requires low bandwidth. Nevertheless, a job with a big batch can cause performance interference, since it still consumes bandwidth. For instance, in Figure 6, if the first job has a big batch and the second a tiny batch, the slowdown is ≈24%, or ≈21% if the second has a small batch.
These results demonstrate the need for a scheduling algorithm that is aware of performance interference in order to provide Quality of Service (QoS) for jobs.
TOPOLOGY-AWARE SCHEDULING ALGORITHM
To overcome the problems discussed in the earlier section, we propose a topology-aware scheduling algorithm that makes decisions based on the workload's communication, the possible interference from currently running workloads, and the overall resource allocation of the system. The algorithm's core is a graph mapping mechanism: one graph represents the job's tasks and their communication requirements, and the other graph represents the physical GPU topology. The mapping algorithm produces the GPU allocation that satisfies the communication requirements of jobs while minimizing resource interference and fragmentation.
Topology Representation
4.1.1 Job graph. This graph represents the communication requirements of tasks (i.e. GPUs). Vertexes represent GPUs and edges represent communication. Each edge has an associated weight denoting the communication volume, given by the average GPU-to-GPU bandwidth usage. During the mapping process, this weight is normalized by the total available bandwidth in the physical machine, where a value equal to 0 represents no communication and higher than 0 accounts for the communication level.
4.1.2 Physical system topology graph. This graph represents the GPU topology based on the underlying hardware of a machine or a set of machines connected by a network. An example of how different physical GPU topologies are modeled is illustrated in Figure 7, which shows the graph of Figure 1's topology. The physical graph can be understood as composed of multiple levels, where the first level is the network. Just after this level, there is the machine level, represented by the vertexes M{X}, where X is the machine ID. The next level is the socket level, represented as S{Y}, where Y is the socket ID. Other levels can exist between the socket and the GPU, such as levels representing multiple PCI-e or NVLink switches. The last level represents the GPUs.
A GPU vertex can be directly connected to the socket vertex, to an intermediate vertex, and/or directly to other GPUs, which represents a direct NVLink connection between the GPUs. Consequently, some GPUs will have multiple paths to communicate. The path distance is given by the sum of the weights of the edges of the path. Since the weights are defined qualitatively, a higher level must have a larger weight to represent longer distances. For example, in Figure 7 each level right after the GPU level has weight 1, whilst at higher levels, such as the socket level, the edges have weight 20. Since the distances are qualitative, there are no constraints on how the weights are defined, except that higher levels will have larger weights.
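The path distance on such a graph can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vertex names and the exact edge layout are assumptions modeled on the two-socket, four-GPU Power8 machine, with the example weights 1 (GPU level) and 20 (socket level) taken from the text. The distance between two GPUs is the weight of the shortest path, computed here with Dijkstra's algorithm.

```python
import heapq

INF = float("inf")


def build_power8_graph():
    """Hypothetical two-socket, four-GPU topology with qualitative weights."""
    edges = [
        # GPU level: direct NVLink between GPUs and from each GPU to its socket
        ("GPU0", "GPU1", 1), ("GPU0", "S0", 1), ("GPU1", "S0", 1),
        ("GPU2", "GPU3", 1), ("GPU2", "S1", 1), ("GPU3", "S1", 1),
        # Socket level: sockets attach to the machine vertex with a larger weight
        ("S0", "M0", 20), ("S1", "M0", 20),
    ]
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def path_distance(graph, src, dst):
    """Sum of edge weights along the shortest path from src to dst."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return INF


graph = build_power8_graph()
print(path_distance(graph, "GPU0", "GPU1"))  # GPUs on the same socket
print(path_distance(graph, "GPU0", "GPU2"))  # GPUs on different sockets
```

With these assumed weights, two GPUs on the same socket are at distance 1, while GPUs on different sockets are at distance 1 + 20 + 20 + 1 = 42, which captures the "shorter" versus "longer" relation without claiming any physical unit.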
Job Profile
The profile includes not only the job's communication graph but also a performance model defining the level of interference that collocated jobs will suffer and cause. This model is created from experimentation using historical data. Two types of experiments can be defined. The first approach is injecting artificial load, using micro-benchmarks, onto the shared resources and measuring the interference, i.e. the impact on the run time of other collocated jobs. While this first approach can be highly accurate, analyzing all possible combinations might be very costly. The second one is performing a combinatorial collocation of a set of known applications. Also, performance prediction for unknown jobs using the models from known applications can enlarge the range of the analysis. The previous workload executions can feed a prediction model, such as a decision tree [14, 37] or statistical clustering [8, 22, 28]. Because of the cloud's high variability, our model does not need to be optimal; high-quality decisions will be accurate enough.
Objective Function and Constraints
Our objective function focuses on minimizing the tasks' communication cost ($t_{cc}$), external resource interference ($I_b$), and resource fragmentation ($\omega_d$). Formally, it can be defined as follows:

$$\min \left( \alpha_{cc} \frac{t_{cc}}{t_w} + \alpha_b \frac{I_b}{I_w} + \alpha_d \frac{\omega_d}{\omega_w} \right) \qquad (1)$$

where $\alpha_{cc} + \alpha_b + \alpha_d = 1$. All parameters $t_{cc}$, $I_b$ and $\omega_d$ are normalized against the corresponding worst case $t_w$, $I_w$ and $\omega_w$ (i.e., the scenario with the lowest bandwidth, the highest interference, and the highest fragmentation). For the minimum $t_{cc}$, we allocate GPUs as close as possible once all constraints are met. For the minimum $I_b$, we allocate GPUs with the lowest possible amount of bus sharing. For the minimum $\omega_d$, we map GPUs from the most fragmented domains to increase the cluster utilization. The constraints that we define in this paper are the resource capacity, as the number of GPUs, and the memory bandwidth. Formally, all possible solutions must meet the inequality constraints defined as $t_{gpu} \le p_{gpu}$ and $t_{bw} \le p_{bw}$, where $t_x$ and $p_x$ denote the resource requirement of a given application and the available capacity of a given node for the resource type $x$, respectively. Other constraints can be added for scenarios different from the ones we show in our experiments.
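A minimal sketch of how such an objective could be evaluated is given below. It assumes the objective is a weighted sum of the three terms, each normalized by its worst case, with weights that sum to 1; the candidate values, worst cases, and weights are hypothetical numbers used only for illustration.

```python
import math


def objective(t_cc, i_b, w_d, worst, alphas):
    """Weighted sum of normalized communication cost, interference and fragmentation.

    Assumption: worst = (t_w, i_w, w_w) holds the worst-case value of each term,
    and alphas = (a_cc, a_b, a_d) are weights that must sum to 1.
    """
    a_cc, a_b, a_d = alphas
    if not math.isclose(a_cc + a_b + a_d, 1.0):
        raise ValueError("weights must sum to 1")
    t_w, i_w, w_w = worst
    return a_cc * t_cc / t_w + a_b * i_b / i_w + a_d * w_d / w_w


# Hypothetical candidate allocations: (t_cc, I_b, omega_d)
candidates = {"pack": (1, 0.10, 0.5), "spread": (42, 0.05, 0.0)}
worst = (42, 0.30, 1.0)
alphas = (0.6, 0.3, 0.1)

best = min(candidates, key=lambda name: objective(*candidates[name], worst, alphas))
print(best)
```

Because every term is divided by its worst case, each contribution lies in [0, 1], and the weights express how much a job cares about communication relative to interference and fragmentation.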
Placement Algorithm
First, we define the premise and limitations. The algorithm behaves as a greedy algorithm, since the assignment of a task to a physical GPU is never reconsidered. Hence, we perform a best-effort approach to find the optimal solution. The algorithm preferentially places as many tasks as possible for a job in the same node. If a job wants to get all its tasks spread across different nodes instead, it needs to define anti-collocation policies for its tasks, and in response, they will be placed on different nodes. Also, if a job does not support multi-node execution, it must be defined with a single-node constraint in the profile. If a job cannot be placed, its placement is postponed to the next iteration of the scheduler. To avoid starvation and enforce fairness as much as possible, the job waiting queue is sorted by the job's arrival time. Thus, the oldest jobs have priority to be placed.
We define two scheduling policies for the proposed algorithm. One policy, referred to as TOPO-AWARE-P, allows out-of-order execution of jobs and postpones the placement of a job whose utility is lower than a threshold defined in the job's profile. The other policy is TOPO-AWARE, where jobs are placed as soon as they arrive, without consideration for future jobs.
The placement process is formally defined as a function ψ() that takes the job's graph A and the physical topology P, as ψ(A, P), and transforms them into the list of GPUs allocated to the job. Here |A| is the number of requested GPUs, |P| is the number of available GPUs, and the number of GPUs allocated to the job is at most |P|.
Algorithm 1 outlines the placement process. It is a loop-based approach in which each iteration attempts to place jobs while there are jobs in the waiting queue Q and available resources. Otherwise, the scheduler sleeps until a job has finished or a time interval has expired. During each iteration, the scheduler takes a job from Q and filters the available nodes, eliminating the ones that do not satisfy the constraints (e.g. resource types, anti-affinity, etc.), creating the graph P. Then, the function DRB() is called to traverse the physical graph P and define the GPU allocation. After that, if the utility of the solution s does not satisfy the job's requirements and the policy allows postponement, the job is added back to the waiting queue at the end of the iteration; otherwise, the placement is enforced.
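The control flow described above can be sketched as one scheduler iteration. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the `Job` fields, the `propose` callback (which stands in for node filtering plus DRB()), and the toy utility values are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int            # number of requested GPUs
    min_utility: float   # hypothetical threshold taken from the job's profile


def schedule_iteration(queue, free_gpus, propose, allow_postpone):
    """Try to place queued jobs, oldest first, while resources remain.

    propose(job, free_gpus) returns (gpu_list, utility) or None if no fit.
    allow_postpone=True mimics TOPO-AWARE-P; False mimics TOPO-AWARE.
    """
    placed, postponed = {}, deque()
    while queue and free_gpus:
        job = queue.popleft()
        solution = propose(job, free_gpus)
        if solution is None:
            postponed.append(job)          # cannot be placed in this iteration
            continue
        gpus, utility = solution
        if allow_postpone and utility < job.min_utility:
            postponed.append(job)          # wait for a better allocation
            continue
        free_gpus.difference_update(gpus)  # enforce the placement
        placed[job.name] = gpus
    # Postponed jobs go back to the head of the queue, keeping arrival order.
    queue.extendleft(reversed(postponed))
    return placed


def toy_propose(job, free_gpus):
    """Stand-in for DRB(): lowest free GPU IDs, two GPUs per socket."""
    if job.gpus > len(free_gpus):
        return None
    chosen = sorted(free_gpus)[: job.gpus]
    same_socket = len({g // 2 for g in chosen}) == 1
    return chosen, (1.0 if same_socket else 0.5)


queue = deque([Job("J1", 1, 0.5), Job("J2", 2, 1.0), Job("J3", 1, 0.5)])
free = {0, 1, 2, 3}
print(schedule_iteration(queue, free, toy_propose, allow_postpone=True))
print([job.name for job in queue])
```

In this toy run, J2 is postponed because the only two-GPU allocation available after J1 crosses sockets; once J3 is placed, GPUs 2 and 3 remain free on one socket, so a second iteration can satisfy J2. With `allow_postpone=False`, J2 would be placed immediately on the cross-socket pair.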
The function DRB(), outlined in Algorithm 2, is based on the Hierarchical Static Mapping Dual Recursive Bi-partitioning algorithm proposed by [12] and implemented by [34]. Its asymptotic complexity is defined as $\Theta(|E_A| \cdot \log_2(|V_P|))$ [35], where in our case $|E_A|$ is the number of edges of the job's graph and $|V_P|$ is the number of vertices of the physical graph.

Algorithm 2 Recursive Bi-Partitioning Mapping based on [12]
    function DRB(A, P, C)
        if (|A| == 0) then
            return nil // This partition is not a candidate
        if (|P| == 1) then
            return (P, A) // Map job's task to physical GPU
        (P0, P1) = physicalGraphBiPartition(P)
        (A0, C0, A1, C1) = jobGraphBiPartition(A, P0, P1, C)
        ...

Algorithm 3 jobGraphBiPartition() (partial)
        ...
        (P0.t_cc, P1.t_cc) ← getCommCost(task, P0, P1, C)
        (P0.I_b, P1.I_b) ← getInter(task, P0, P1, A.profile)
        ...
        if (U(task, P0) ≥ U(task, P1)) and (constraints) then
        ...
        return A0, P0.t_cc, A1, P1.t_cc
More specifically, during each recursive iteration of DRB() two other functions are called: physicalGraphBiPartition(), to bi-partition the physical graph P, and jobGraphBiPartition(), to bi-partition the job's graph A. The recursion stops when A = ∅, returning ∅, or when P has only one element, returning the mapping pair ($A_i$, $P_i$), where $i \in \{0, 1\}$ indexes the partitions. The C parameter is an array that contains the communication cost of all GPUs, including the ones outside the sub-partition P. C is used to calculate the communication cost between sub-partitions.
Similarly to the implementation of DRB() in [34], the physical graph bi-partition is performed with the well-known Fiduccia-Mattheyses algorithm [15], which minimizes the cut-sets in linear time. However, unlike [34], we account not only for the communication cost but also for the job's preference, using a utility function to bi-partition the job's graph, as shown in the function jobGraphBiPartition() outlined in Algorithm 3.
Algorithm 3 creates two sub-partitions $A_0$ and $A_1$, where each partition can hold part or all of the job's tasks. Since the tasks in $A_0$ will be placed in $P_0$ and those in $A_1$ in $P_1$, the function evaluates, for each task, which sub-partition of P provides higher utility. Then, if $P_0$ gives at least as much utility as $P_1$ and has enough available resources, the task is added to $A_0$. Otherwise, the task is added to $A_1$.
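The per-task decision can be sketched as below, following the comparison U(task, P0) ≥ U(task, P1) that appears in Algorithm 3. This is only an illustration under assumptions: partitions are represented as dictionaries with a hypothetical `free` GPU count, the utility function is supplied by the caller, and the only constraint checked is GPU capacity in P0.

```python
def job_graph_bipartition(tasks, p0, p1, utility):
    """Split tasks into (a0, a1): a0 goes to partition p0, a1 to partition p1.

    Assumptions: p0["free"] is the number of available GPUs in p0, and
    utility(task, partition) returns a value where higher is better.
    """
    a0, a1 = [], []
    for task in tasks:
        p0_has_room = len(a0) < p0["free"]
        if utility(task, p0) >= utility(task, p1) and p0_has_room:
            a0.append(task)
        else:
            # As in the text, every remaining task falls through to a1.
            a1.append(task)
    return a0, a1


# Toy utility: a fixed, hypothetical score per partition.
def toy_utility(task, partition):
    return partition["score"]


p0 = {"free": 2, "score": 0.9}
p1 = {"free": 2, "score": 0.4}
print(job_graph_bipartition(["t0", "t1", "t2"], p0, p1, toy_utility))
```

Here the first two tasks go to the higher-utility partition until it runs out of GPUs, and the third task spills into the other partition. A fuller version would recompute the utility after each assignment, since the communication cost of a task depends on where its peers were placed.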
For each task, Algorithm 3 evaluates each sub-partition by calculating the communication cost t, the workload interference I, and the resource fragmentation ω, using the functions getCommCost(), getInter(), and getFragmentation(), respectively. Then, with those parameters, the job's utility is calculated using the utility function U, which can be defined as the convex function in Equation 2.
Next, we describe how the U parameters are calculated. The communication cost (t) is defined as the sum of the shortest paths p between all pairs of GPUs within the solution.
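A minimal sketch of this sum is shown below. It assumes that each unordered pair of GPUs in the solution is counted once and that a `distance` function returning the shortest-path length is available; the toy distance (1 within a socket, 42 across sockets, two GPUs per socket) is a hypothetical stand-in for the physical graph.

```python
from itertools import combinations


def communication_cost(gpus, distance):
    """Sum of shortest-path distances over all unordered GPU pairs."""
    return sum(distance(a, b) for a, b in combinations(gpus, 2))


def toy_distance(a, b):
    """Hypothetical distances: 1 within a socket, 42 across sockets."""
    return 1 if a // 2 == b // 2 else 42


print(communication_cost([0, 1], toy_distance))        # pack: same socket
print(communication_cost([0, 2], toy_distance))        # spread: across sockets
print(communication_cost([0, 1, 2, 3], toy_distance))  # all four GPUs
```

A single-GPU solution has no pairs, so its communication cost is 0, and a packed two-GPU allocation scores far lower than a spread one under these assumed weights.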
The level of interference (I) is measured using the job's profile. As described in Section 4.2, the profile is composed of the completion time of the job running solo and running with other jobs (or with artificial loads). Therefore, the algorithm measures the average slowdown that the job suffers and the average slowdown that it causes in the currently running jobs, and the interference is the average of these slowdowns.
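One way to turn such a profile into an interference value is sketched below. The exact formula used by the scheduler is not reproduced here; this sketch assumes that slowdown is the relative increase in completion time over the solo run and that suffered and caused slowdowns are averaged with equal weight. The timings are made-up illustration values, not measurements.

```python
def slowdown(solo_time, coscheduled_time):
    """Relative slowdown of a co-scheduled run versus the solo run (0 = none)."""
    return max(0.0, (coscheduled_time - solo_time) / solo_time)


def average_interference(suffered, caused):
    """Average slowdown over (solo, co-scheduled) completion-time pairs.

    suffered: pairs for the new job next to each running job.
    caused:   pairs for each running job next to the new job.
    """
    values = [slowdown(solo, co) for solo, co in suffered + caused]
    return sum(values) / len(values) if values else 0.0


# Hypothetical completion times in seconds.
suffered = [(100.0, 130.0)]   # the new job slows down by 30%
caused = [(200.0, 220.0)]     # a running job slows down by 10%
print(average_interference(suffered, caused))
```

With these illustrative numbers the average interference is about 0.2, i.e. a 20% mean slowdown across the jobs involved.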
System fragmentation (ω) is the average fragmentation of all sockets.
TOPOLOGY-AWARE SCHEDULER EVALUATION
In this section, we present both a prototype implementation and a trace-driven simulation to evaluate the proposed topology-aware scheduling algorithm. The prototype evaluation was performed on a single machine with the characteristics described in Section 3.1. The simulation evaluates the algorithm on a large-scale cluster. While the focus of this work is on learning workloads, any workload can be submitted to the prototype. Also, there is no need to change how applications are implemented in order to use the scheduler. In the future, we plan to test the proposed algorithm in a cluster manager framework like Kubernetes [17] or Mesos [21], similar to the enhancements described in the related work [45].
Prototype Implementation
We implemented the prototype for the scheduler using C and Python. The program continuously loads JSON files containing the necessary information about the submitted jobs. To place a job, the system creates the job's manifest, filling it with the information received from the JSON file, and uses that information to determine the placement of the job. If the algorithm decides to place the job, it enforces the decision of running the job on the given machine. Until the job finishes, the system keeps track of the execution of the job while collecting statistics, including the ending time.
For the placement, the system captures various performance metrics. The DRAM memory bandwidth is calculated using the Power8 performance counters described in [1] , which are accessed using the library Perfmon2 [36] . To calculate the NVLink bandwidth (which is shown in most of the experiments), we access the NVIDIA CUDA driver API using the command nvidia-smi nvlink -i $gpu_id that returns the transmitted bytes from each link. Then, the algorithm calculates the NVLink bandwidth usage of CPU-to-GPU or GPU-to-GPU communication based on their link connections.
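The bandwidth calculation from cumulative link counters can be sketched as follows. This is an illustration under assumptions: the counters are taken to be already parsed into per-link cumulative byte counts (the parsing of the tool's output is not shown and its format is not assumed), and the link names and sample values are hypothetical.

```python
def link_bandwidth_gbs(bytes_before, bytes_after, interval_s):
    """Average bandwidth of one link over a sampling interval, in GB/s."""
    return (bytes_after - bytes_before) / interval_s / 1e9


def pair_bandwidth_gbs(before, after, links, interval_s):
    """Aggregate bandwidth over the links that connect two endpoints.

    before/after: dicts mapping a link name to its cumulative transmitted bytes.
    links: the links that connect the pair (e.g. the two links of a dual NVLink).
    """
    return sum(
        link_bandwidth_gbs(before[link], after[link], interval_s) for link in links
    )


# Hypothetical samples taken one second apart for a dual-link GPU pair.
before = {"link0": 0, "link1": 0}
after = {"link0": 20_000_000_000, "link1": 20_000_000_000}
print(pair_bandwidth_gbs(before, after, ["link0", "link1"], interval_s=1.0))
```

Summing over the links that join a pair is what distinguishes GPU-to-GPU from CPU-to-GPU usage: the same counters are grouped differently depending on which endpoints each link connects.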
To discover the topology during system startup, the system executes the nvidia-smi topo --matrix command to create a matrix of GPUs, and the command numactl --hardware to include socket distance and CPU locality in the model. To enforce the decisions, before executing any application, the system first defines the order of the GPU IDs by exporting the parameter CUDA_DEVICE_ORDER=PCI_BUS_ID, and then, for each application, it exposes only the GPU list specified by the scheduler decision using the parameter CUDA_VISIBLE_DEVICES=$gpu_list. To prevent performance variability related to NUMA remote memory access, applications that use only GPUs in the same socket are bound to that socket using the command numactl.
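The enforcement step can be sketched in a few lines. This is a minimal illustration, not the prototype's code: the helper names are hypothetical, and the numactl flags shown (`--cpunodebind`, `--membind`) are one plausible way to bind a process to a socket, since the text does not state which flags were used.

```python
import os
import subprocess


def build_env(gpu_list, base_env=None):
    """Environment that exposes only the GPUs chosen by the scheduler."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # stable, bus-ordered GPU IDs
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_list)
    return env


def numa_prefix(socket_id):
    """Assumed numactl invocation binding CPU and memory to one socket."""
    return ["numactl", f"--cpunodebind={socket_id}", f"--membind={socket_id}"]


def launch(command, gpu_list, socket_id=None):
    """Start the application; bind to a socket only when one is given."""
    prefix = numa_prefix(socket_id) if socket_id is not None else []
    return subprocess.Popen(prefix + list(command), env=build_env(gpu_list))


env = build_env([0, 1], base_env={})
print(env["CUDA_DEVICE_ORDER"], env["CUDA_VISIBLE_DEVICES"])
```

Passing the variables through the child's environment, rather than exporting them globally, keeps the GPU lists of concurrently running jobs independent of each other.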
To feed the performance prediction model, the application profiles are experimentally generated, defining the optimal resource allocation (best-performing) and some possible sub-optimal resource allocations (worst-performing) for both solo (when the job runs alone with no other jobs) and co-scheduled modes, as previously shown in Section 3. The profile then contains the 95th percentile of the execution time from five executions of each workload within different scenarios. A simple but effective performance prediction approach is then performed using the profiles, characterizing the workload slowdowns for various configurations; we plan to extend it with more robust statistical techniques in the future. Since the Caffe framework is based on a data-parallelism model, all GPUs perform similar work and therefore have a similar amount of communication with each other. Therefore, we define in the workload graph all GPUs communicating with each other with the same weight. However, for different batch sizes, different weights are used, ranging from 4 to 1, where 4 represents the smallest batch size and 1 the largest one.
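A small sketch of the two ingredients above: the 95th-percentile profile entry and the uniformly weighted workload graph. The nearest-rank percentile definition, the intermediate weights (3 and 2), and the sample execution times are assumptions; only the end points (4 for the smallest batch, 1 for the largest) are stated in the text.

```python
import math
from itertools import combinations


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical mapping from batch size to edge weight (4 = most communication).
BATCH_WEIGHT = {"tiny": 4, "small": 3, "medium": 2, "big": 1}


def workload_graph(num_gpus, batch_size):
    """Complete graph over the job's GPUs, every edge carrying the same weight."""
    w = BATCH_WEIGHT[batch_size]
    return {(a, b): w for a, b in combinations(range(num_gpus), 2)}


runs = [118.2, 120.5, 119.1, 121.0, 119.8]  # made-up execution times in seconds
p95 = percentile_nearest_rank(runs, 95)
graph = workload_graph(4, "tiny")
```

With only five samples, the nearest-rank 95th percentile coincides with the slowest run, so the profile is effectively conservative under this definition.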
Prototype Evaluation
We implement two well-known greedy approaches: First Come First Served (FCFS) with a FIFO queue, and Best Fit (BF) performing bin packing (i.e., allocating first the GPUs from highly used domains), and compare them to our proposed placement algorithm with the two scheduling policies: TOPO-AWARE and TOPO-AWARE-P. Finally, we evaluate the prototype in a cloud environment, where jobs have varied GPU requirements: some need a single GPU, some need more than two GPUs, some require P2P to be fully satisfied, and others need multiple GPUs but have minimal communication requirements. Additionally, as in a cloud environment, the jobs concurrently share any machine's resources.
Table 1: A=AlexNet, C=CaffeRef, G=GoogLeNet
Figure 9: [Simulation] Behavioral description of the simulation performing a similar experiment to that shown for the prototype in Figure 8. (Panel title: JOB'S QOS + WAITING TIME.)
5.2.1 Description of the experiment. Our first experiment is a simple, easy-to-verify scenario, with five jobs dynamically sharing the machine described in Section 3.1. The workload configurations are summarized in Table 1. Job arrival times follow a Poisson distribution configured with λ = 10 (i.e., the arrival of ten jobs per minute), except for Job 0, which arrives at time t = 0.51s to introduce the initial load in the system. We set equal weights (0.33) to the parameters of the utility function in Equation 2 to provide equal consideration for communication cost, resource interference, and fragmentation. Small batch sizes are a reliable example of NN workloads that require high GPU communication (especially for NNs using model parallelism). Hence, we conduct this experiment using small batch sizes.
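The arrival process can be reproduced with exponential inter-arrival times, which is the standard way to sample a Poisson process. The sketch below is an illustration under assumptions: the function name and seed are hypothetical, and later arrivals are measured from Job 0's arrival, which the text does not specify.

```python
import random


def arrival_times(num_jobs, rate_per_min=10.0, first_arrival_s=0.51, seed=0):
    """Arrival times in seconds; the first job arrives at a fixed time."""
    rng = random.Random(seed)
    rate_per_s = rate_per_min / 60.0  # ten jobs per minute
    times = [first_arrival_s]
    for _ in range(num_jobs - 1):
        # Exponential gaps between arrivals yield a Poisson arrival process.
        times.append(times[-1] + rng.expovariate(rate_per_s))
    return times


times = arrival_times(5)
```

At this rate the mean gap between consecutive jobs is six seconds, so five jobs arrive within roughly half a minute on average.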
Prototype experimental results.
The results are shown in Figure 8. In the beginning, only Job 0 is being placed. Since it requires only one GPU and there is no other job to cause interference, any placement decision fully satisfies its requirements. At the 15th second, Job 1 arrives and the profile indicates that it suffers interference from Job 0. Thus, the overall system utility will be lower if Job 0 and Job 1 are collocated in the same CPU socket. TOPO-AWARE-P prevents the undesirable collocation; it places Job 1 on a different socket than Job 0. When Job 3 arrives, it cannot be placed since it requires more GPUs than are available. So Job 3 is only placed after Job 0 has finished, at the ≈70th second. However, at this point resource availability is non-uniform: the available GPUs are in different sockets.
Here is where TOPO-AWARE-P differs from the other approaches. If Job 3 receives the two free GPUs, one from each of the sockets, the result is cross-socket communication over the CPU bus and lower performance. For this reason, TOPO-AWARE-P delays the job placement until it can allocate co-located GPUs, that is, when these GPUs become available. Any job with a utility lower than a threshold defined in the job's profile will have its placement postponed to the next scheduler iteration. As a result, TOPO-AWARE-P performs better in execution time than the others, as shown in Figure 8 (d) vs Figures 8 (a)-(c). For example, Job 3 had a completion time of ≈120s in the scenario with TOPO-AWARE-P (Figure 8 (d)), and ≈240s with the other algorithms. Note that the performance improvement is mainly related to enabling P2P over the NVLink interface for Job 3. Only TOPO-AWARE-P provides P2P for jobs, as shown in Figure 8 (d); in all the other scenarios the GPU communication is routed through the processor's memory, which leads to higher latency and lower bandwidth because of additional memory copies and potential contention on the shared bus.
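The postponement rule can be sketched as a threshold check on the best available placement. This is a simplified illustration, not the scheduler's implementation: the function name, the utility values, and the GPU numbering are hypothetical, and the utility itself would come from the utility function in Equation 2.

```python
def place_or_postpone(candidates, threshold):
    """Return the best placement, or None to postpone the job.

    candidates: dict mapping a placement (tuple of GPU IDs) to its utility.
    threshold: minimum acceptable utility taken from the job's profile.
    """
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None


# Hypothetical utilities: GPUs 1 and 2 sit on different sockets (no P2P).
now = {(1, 2): 0.4}
# One iteration later, two GPUs on the same socket have become free.
later = {(1, 2): 0.4, (0, 1): 0.9}

decision_now = place_or_postpone(now, threshold=0.5)
decision_later = place_or_postpone(later, threshold=0.5)
```

Returning `None` leaves the job in the queue, so it is reconsidered in the next scheduler iteration.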
The quality of the placement is highlighted in Figure 8 (e) and (f). Both figures show the job's slowdown compared to the ideal scenario, where the job has the fastest execution time. Also, both figures sort the jobs from worst to best-performing. While Figure 8 (e) focuses on showing the job slowdown strictly related to the placement decision, Figure 8 (f) shows the slowdown also considering the waiting time in the scheduler's queue. The results indicate that TOPO-AWARE-P is the most efficient algorithm. For instance, with TOPO-AWARE-P, jobs 1, 3, and 4 have no slowdown compared to the best-performing scenario, while these same jobs suffer ≈50% slowdown when the other algorithms are making placement decisions, as shown in Figure 8 (e).
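The two slowdown views can be written as one formula with an optional waiting-time term. The exact definition used for the figures is not given in the text, so the relative form below is an assumption, and the times are made up for illustration.

```python
def slowdown(exec_time_s, ideal_time_s, wait_time_s=0.0):
    """Relative slowdown versus the ideal run; 0.0 means no slowdown."""
    return (exec_time_s + wait_time_s) / ideal_time_s - 1.0


# Placement-only view (as in the first figure): execution time alone.
placement_only = slowdown(180.0, 120.0)
# Queue-aware view (as in the second figure): waiting time is added.
with_waiting = slowdown(180.0, 120.0, wait_time_s=60.0)
```

Under this definition, a job that runs 180s against an ideal 120s has a 50% slowdown, and the same job with an extra 60s in the queue has a 100% slowdown.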
Intuitively, delaying jobs gives the impression that the queue waiting time might end up being longer. However, the results show that TOPO-AWARE-P has a lower waiting time for some jobs than the other algorithms, as shown in Figure 8 (f). This happens because better knowledge of the requirements enables the scheduler to prevent performance interference, so some jobs execute faster, opening space to place other jobs sooner. This can also be seen in the cumulative execution time of the algorithms. BF finishes in ≈461.7s, FCFS in ≈456.2s, TOPO-AWARE in ≈454.2s, and TOPO-AWARE-P in ≈356.9s. Hence, TOPO-AWARE-P affords a speedup of ≈1.30x, ≈1.28x, and ≈1.27x over BF, FCFS, and TOPO-AWARE, respectively.
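The speedups follow directly from the cumulative execution times: each baseline time is divided by the TOPO-AWARE-P time. The snippet below simply reproduces that arithmetic from the figures quoted above; the variable names are illustrative.

```python
# Cumulative execution times in seconds, as reported above.
cumulative_s = {
    "BF": 461.7,
    "FCFS": 456.2,
    "TOPO-AWARE": 454.2,
    "TOPO-AWARE-P": 356.9,
}

baseline = cumulative_s["TOPO-AWARE-P"]
speedups = {
    name: t / baseline
    for name, t in cumulative_s.items()
    if name != "TOPO-AWARE-P"
}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

The resulting ratios are close to the reported speedups, with the largest gain over BF and the smallest over TOPO-AWARE.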
Trace-Driven Simulation
Based on the logs from the prototype described in Section 5, we developed a trace-driven simulation to evaluate the scheduling algorithm in large shared clusters. In this section, we first describe the main characteristics and configuration of the simulation. Second, we validate the simulation and perform experiments with a larger number of jobs and machines.
To evaluate the scalability, the proposed algorithm was executed to handle trace-driven simulated data at different scales of the system. The traces are generated by performing multiple experiments on the previously described prototype. Afterward, the trace files are parsed and transformed into a format compatible with the simulator, creating application and resource usage profiles. For generating the workloads, a Poisson distribution with arrival rate λ = 10 is used. To create the job's configuration, we used a Binomial distribution generating integer values between 0 and 3 to define the batch size, where 0=tiny, 1=small, 2=medium, and 3=big, and a Binomial distribution generating integer values between 0 and 2 to determine the NN type, where 0=AlexNet, 1=CaffeRef, and 2=GoogLeNet. Additionally, all simulated machines are homogeneous and follow the hardware topology described in Section 3.1. All the jobs can run on the machines when there are enough resources.
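The job configuration sampling can be sketched with two Binomial draws. The success probability is not stated in the text, so `p=0.5` is an assumption, as are the function names and the seed; a Binomial with n trials naturally yields integers from 0 to n.

```python
import random

BATCH_SIZES = ["tiny", "small", "medium", "big"]    # indices 0..3
NETWORKS = ["AlexNet", "CaffeRef", "GoogLeNet"]     # indices 0..2


def binomial(rng, n, p):
    """Number of successes in n Bernoulli(p) trials, an integer in [0, n]."""
    return sum(rng.random() < p for _ in range(n))


def generate_jobs(num_jobs, p=0.5, seed=0):
    """Sample a batch size and an NN type for each simulated job."""
    rng = random.Random(seed)
    jobs = []
    for job_id in range(num_jobs):
        jobs.append({
            "id": job_id,
            "batch_size": BATCH_SIZES[binomial(rng, 3, p)],
            "network": NETWORKS[binomial(rng, 2, p)],
        })
    return jobs


jobs = generate_jobs(100)
```

With a symmetric `p`, the middle values (small and medium batches, CaffeRef) are drawn more often than the extremes, which is a property of the Binomial distribution rather than a tuning choice made here.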
Validation of The Simulation
We validate the reliability of the simulation system by comparing it with the same scenario as in the prototype experiments in Section 5. The simulation results are shown in Figure 9. The algorithms behave very similarly in both the prototype and the simulation, despite some expected small differences, which are acceptable when considering the standard deviations.
Large-Scale Cluster Simulation and Results
To verify the behavior of the proposed algorithm in a large-scale environment, we use the trace-driven simulation in two different scenarios as follows.
5.5.1 Scenario 1: 100 jobs and 5 machines. We start the first experiment with few machines and jobs. The results in Figure 10 (a) show that the TOPO-AWARE-P policy performs slightly better than the others; it does not violate the jobs' SLOs. The other strategies introduce similar slowdowns in general, except FCFS, which adds slowdown to more jobs.
The performance difference between the placement strategies is more evident when analyzing the waiting time of jobs in the scheduling queue, as illustrated in Figure 10 (b). Both TOPO-AWARE and TOPO-AWARE-P clearly outperform the greedy algorithms.
The lower performance of the greedy algorithms is explained by the fact that a sub-optimal placement decision can also limit the possible placements of other jobs. If a machine is left with only one GPU and the waiting jobs require more GPUs, the jobs must wait to be placed until enough resources become available. Although the difference is less pronounced, TOPO-AWARE-P also performs better than TOPO-AWARE. TOPO-AWARE still presents slowdown in some jobs, whereas TOPO-AWARE-P does not, since it allows out-of-order execution of jobs. TOPO-AWARE-P results in better performance because it does not schedule jobs to resources that do not fully satisfy their QoS.
Figure 11 shows that the FCFS algorithm has the worst performance, followed by BF. In summary, the new algorithm significantly and consistently outperforms the greedy algorithms in achieving the least slowdown and in minimizing the waiting time. The new algorithm's ability to achieve this is mainly due to its utility-based heuristics and the strategy that does not place jobs when the placement is not efficient from a communication perspective.
5.5.3 Overhead. The average time that the algorithms spend when evaluating the placement decision in Scenario 2 is ≈3s for TOPO-AWARE and TOPO-AWARE-P, while for FCFS and BF it is ≈0.45s and ≈0.44s, respectively. Although the proposed algorithm has higher overhead, 3 seconds on average is fast enough for scheduling learning workloads on a cluster with high demands.
The proposed algorithm has a higher execution time than the greedy ones mainly because it requires more computation to provide a better decision. Note that in the worst case, our proposed algorithm will evaluate $\Theta(|V_P|) \cdot \Theta(|E_A| \cdot \log_2(|V_P|))$, where the first $\Theta$ represents the host filtering phase and the second represents the phase to make the placement decision. Here, $|E_A|$ is the number of edges in the job's graph and $|V_P|$ is the number of vertices in the physical graph. The greedy algorithms have an asymptotic complexity of $\Theta(|E_A| + |V_P|)$, since every machine will be explored in the worst case.
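Multiplying the two phases gives a single bound, and a small numeric instance shows the gap to the greedy algorithms. The values $|V_P| = 16$ and $|E_A| = 6$ are chosen purely for illustration (a job with four fully connected GPUs has 6 edges), and the counts ignore the constant factors hidden by $\Theta$.

```latex
\Theta(|V_P|) \cdot \Theta\big(|E_A| \log_2 |V_P|\big)
  = \Theta\big(|V_P| \, |E_A| \log_2 |V_P|\big)
\\[4pt]
|V_P| = 16,\ |E_A| = 6:\qquad
  16 \cdot 6 \cdot \log_2 16 = 16 \cdot 6 \cdot 4 = 384
  \quad\text{versus}\quad
  |E_A| + |V_P| = 6 + 16 = 22
```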
RELATED WORK
Communication cost. Kindratenko et al. [26] proposed a CUDA wrapper that works in sync with the Torque batch system. The wrapper overrides some CUDA device management API calls to expose GPUs to users, taking into account the GPU distance to make a best effort at minimizing the communication cost. Faraji et al. [13] propose a topology-aware GPU selection scheme to assign GPU devices to MPI processes based on the GPU-to-GPU communication pattern and the physical characteristics of a multi-GPU machine. With profile information from the MPI application, it allocates GPUs by performing a graph mapping algorithm using the SCOTCH library. While those efforts effectively minimize the communication cost, they do not consider the potential performance interference from co-scheduled jobs. In this paper, different from the above related work, we further analyze and mitigate performance problems, and leverage P2P communication for multi-GPU learning workloads in a co-scheduled environment.
Workload Collocation. Several papers investigate the performance of co-running CPU-based workloads [41], [20], [24], [27], [11] and GPU-based workloads [25], [44]. In addition, several papers propose scheduling algorithms that avoid problematic collocation within the same machine [33], [9], [32], [7], or that make a best effort to minimize CPU resource interference by performing low-level resource partitioning [38], [4], [19], [29]. While those papers describe the performance bottlenecks of CPU-only applications and/or make a best effort at mitigating workload interference, they neither directly show the performance constraints of mixing multiple GPU-based learning workloads, nor do they propose a GPU-topology-aware scheduling algorithm.
Mapping Algorithm. Several researchers have proposed heuristics for graph mapping, such as graph contraction [5], graph embedding [39], [43], [30], [40], and the recursive bi-partitioning algorithm [12] that has been implemented in the software package SCOTCH [34]. While those methods have proved to be effective, most of them are tied to contiguous, static allocation approaches, which leads to resource fragmentation, and they focus only on minimizing the communication cost without considering other characteristics, such as resource-sharing interference. In contrast, our work considers a utility function during the mapping phase, which captures the application's preferences in different scenarios and therefore prevents SLO violations.
CONCLUSIONS
Multi-GPU applications are becoming popular because they can deliver performance improvements and increased energy efficiency. But at the same time, they present new challenges as they usually require inter-GPU communications. Such communications can take place directly between devices (with P2P) or may need to be routed through the processors' main memory, depending on the system topology and the resource allocations for the existing jobs.
In this paper, we presented a new topology-aware placement algorithm for scheduling workloads in modern multi-GPU systems. The foundation of this approach is based on the use of a new graph mapping algorithm built from application objectives and the system topology. Applications can express their performance objectives as SLOs that are later translated into abstract utility functions to drive the placement decisions. The algorithm has been validated through the construction of a real prototype on top of an IBM Power8 system enabled with 4 NVIDIA Tesla P100 cards, as well as through large-scale simulations.
Our experiments show that our algorithm effectively reduces the communication cost while preventing interference related to resource contention, mainly for the scheduling policy that allows postponing the placement of unsatisfied jobs. In particular, with this policy, the performance impact of minimizing the GPU communication cost and avoiding interference is reflected in a speedup of up to ≈1.30x in the cumulative execution time, and no SLO violations. Finally, a trace-driven simulation of a large-scale cluster reveals that, compared with greedy approaches, our algorithm produces solutions that satisfy more jobs, minimize the SLO violations, and improve the jobs' execution time even in a heavily loaded scenario.
In the future, we plan to extend this work to transparently scale learning applications to multiple disaggregated GPUs across the cluster and test the implementation of our algorithm in popular resource management systems such as Kubernetes and Mesos.
ACKNOWLEDGMENTS
We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.

