Non-clairvoyant Scheduling of Coflows
The coflow scheduling problem is considered: given an input/output switch
in which each port has a fixed capacity, find a scheduling algorithm that
minimizes the weighted sum of coflow completion times while respecting the port
capacities. Each flow of a coflow has a demand on its input and output port, and
a coflow's completion time is the finishing time of its last flow.
The objective of this paper is to present theoretical guarantees on
approximating the sum of coflow completion times in the non-clairvoyant setting,
where on a coflow's arrival only the number of its flows and their input-output
ports are revealed, while the critical demand volume of each flow on its
respective input-output port is unknown. The main result of this paper is to
show that the proposed BlindFlow algorithm is 8p-approximate, where p is
the largest number of input-output port pairs that a coflow uses. This result
holds even in the online case, where coflows arrive over time and the scheduler
has to use only causal information. Simulations reveal that the experimental
performance of BlindFlow is far better than the theoretical guarantee.

Comment: To appear in Proc. WiOpt 2020.
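As a concrete illustration of the setting (not the paper's BlindFlow algorithm), the following toy discrete-time simulation computes coflow completion times under a naive fair-share rule. Like any non-clairvoyant scheduler, it never reads the flow volumes when setting rates; the volumes are used only to detect completion. All names here are hypothetical.

```python
# Toy non-clairvoyant coflow schedule: each port splits its capacity
# equally among its active flows; a flow's rate is the minimum of its
# shares at its input and output port. Flow volumes are used only to
# detect completion, never by the rate-allocation rule.

from collections import defaultdict

def simulate(coflows, capacity=1.0, dt=0.01):
    """coflows: list of coflows, each a list of [in_port, out_port, volume]."""
    remaining = [[list(f) for f in c] for c in coflows]
    completion = [None] * len(coflows)
    t = 0.0
    while any(c is None for c in completion):
        t += dt
        # Count active flows on every input and output port.
        load = defaultdict(int)
        for c in remaining:
            for i, o, v in c:
                if v > 0:
                    load[('in', i)] += 1
                    load[('out', o)] += 1
        # Advance each active flow by its fair-share rate for dt seconds.
        for k, c in enumerate(remaining):
            for f in c:
                i, o, v = f
                if v > 0:
                    rate = min(capacity / load[('in', i)],
                               capacity / load[('out', o)])
                    f[2] = max(0.0, v - rate * dt)
            # A coflow completes when its last flow finishes.
            if completion[k] is None and all(f[2] == 0 for f in c):
                completion[k] = t
    return completion
```

For example, `simulate([[[0, 0, 1.0]], [[0, 1, 1.0], [1, 1, 1.0]]])` has two coflows contending on input port 0, so every flow runs at rate 1/2 and both coflows finish at roughly t = 2.0.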
Efficient Resource Management for Deep Learning Clusters
Deep Learning (DL) is rapidly gaining popularity in various domains, such as computer vision and speech recognition. To meet the increasing demand, large clusters have been built to develop DL models (i.e., for data preparation and model training). DL jobs have unique features, ranging from their hardware requirements to their execution patterns. However, the resource management techniques applied in existing DL clusters have not yet been adapted to these new features, which leads to resource inefficiency and hurts the performance of DL jobs.
We observed three major challenges brought by DL jobs. First, data preparation jobs, which prepare training datasets from large volumes of raw data, are memory intensive. DL clusters often over-allocate memory to these jobs to protect their performance, which causes memory underutilization in DL clusters. Second, the execution time of a DL training job is often unknown before the job completes. Without this information, existing cluster schedulers are unable to minimize the average Job Completion Time (JCT) of these jobs. Third, model aggregations in Distributed Deep Learning (DDL) training are often assigned a fixed group of CPUs. However, a large portion of those CPUs is wasted because the bursty model aggregations cannot saturate them all the time.
In this thesis, we propose a suite of techniques to eliminate the mismatches between DL jobs and resource management in DL clusters. First, we bring the idea of memory disaggregation to DL clusters to improve their memory utilization. The unused memory in data preparation jobs is exposed as remote memory to other machines that are running out of local memory. Second, we design a two-dimensional attained-service-based scheduler to optimize the average JCT of DL training jobs. This scheduler takes the temporal and spatial characteristics of DL training jobs into consideration and can efficiently schedule them without knowing their execution times. Third, we define a shared model-aggregation service to reduce the CPU cost of DDL training. Using this service, model aggregations from different DDL training jobs are carefully packed together and use the same group of CPUs in a time-sharing manner. With these techniques, we demonstrate that large improvements in resource efficiency and job performance can be obtained when the cluster's resource management matches the features of DL jobs.

PHD
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/169955/1/jcgu_1.pd