Towards providing reliable job completion time predictions using PCS
In this paper we build a case for providing job completion time predictions
to cloud users, similar to the delivery date of a package or arrival time of a
booked ride. Our analysis reveals that providing predictability can come at the
expense of performance and fairness. Existing cloud scheduling systems optimize
for extreme points in the trade-off space, making them either extremely
unpredictable or impractical.
To address this challenge, we present PCS, a new scheduling framework that
aims to provide predictability while balancing other traditional objectives.
The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a
suitable configuration of different WFQ parameters (e.g., class weights) that
meets specific goals for predictability. It uses a simulation-aided search
strategy to efficiently discover WFQ configurations that lie on the Pareto
front of the trade-off space between these objectives. We implement and
evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a
small-scale GPU testbed and larger-scale simulations, shows that PCS can
provide accurate completion time estimates while marginally compromising on
performance and fairness.
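The PCS abstract above builds on Weighted-Fair-Queueing with per-class weights. The following is a minimal textbook-style sketch of WFQ ordering, not the PCS implementation: the function name `wfq_order`, the job tuples, and the class names are hypothetical, and all jobs are assumed to be queued at time 0. Each job gets a virtual finish tag equal to its class's previous tag plus size divided by class weight, and jobs are served in increasing tag order.

```python
import heapq
from collections import defaultdict


def wfq_order(jobs, weights):
    """Return job names in WFQ service order.

    jobs: list of (name, class_name, size) tuples, all assumed queued at time 0.
    weights: dict mapping class_name to a positive weight (hypothetical values).
    """
    last_finish = defaultdict(float)  # latest virtual finish tag per class
    heap = []
    for seq, (name, cls, size) in enumerate(jobs):
        # A larger weight shrinks the tag, so the class is served sooner.
        finish = last_finish[cls] + size / weights[cls]
        last_finish[cls] = finish
        # seq breaks ties in submission order.
        heapq.heappush(heap, (finish, seq, name))
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


jobs = [("a1", "short", 2), ("a2", "short", 2), ("b1", "long", 4), ("b2", "long", 4)]
print(wfq_order(jobs, {"short": 1.0, "long": 1.0}))  # ['a1', 'a2', 'b1', 'b2']
print(wfq_order(jobs, {"short": 1.0, "long": 4.0}))  # ['b1', 'a1', 'b2', 'a2']
```

Changing only the class weights reorders the same set of jobs, which is the kind of knob a configuration search over WFQ parameters would tune.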
vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training
As large language models (LLMs) become widespread in various application
domains, a critical challenge the AI community is facing is how to train these
large AI models in a cost-effective manner. Existing LLM training plans
typically employ a heuristic-based parallel training strategy that rests on
empirical observations rather than on a thorough examination of the
search space of LLM parallelization. This limitation causes existing systems
to leave significant performance on the table, wasting millions of dollars'
worth of training cost. This paper presents our profiling-driven simulator
called vTrain, providing AI practitioners a fast yet accurate software
framework to determine an efficient and cost-effective LLM training system
configuration. We demonstrate vTrain's practicality through several case
studies, e.g., effectively evaluating optimal training parallelization
strategies that balance training time and its associated training cost,
efficient multi-tenant GPU cluster schedulers targeting multiple LLM training
jobs, and determining a compute-optimal LLM model architecture given a fixed
compute budget.
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Training deep neural networks (DNNs) is a major workload in datacenters
today, resulting in a tremendously fast growth of energy consumption. It is
important to reduce the energy consumption while completing the DL training
jobs early in data centers. In this paper, we propose PowerFlow, a GPU cluster
scheduler that reduces the average Job Completion Time (JCT) under an energy
budget. We first present performance models for DL training jobs to predict the
throughput and energy consumption performance with different configurations.
Based on the performance models, PowerFlow dynamically allocates GPUs and
adjusts the GPU-level or job-level configurations of DL training jobs.
PowerFlow applies network packing and buddy allocation to job placement, thus
avoiding extra energy consumed by cluster fragmentation. Evaluation results
show that under the same energy consumption, PowerFlow improves the average JCT
by up to 1.57-3.39x compared to competitive baselines.
Network Contention-Aware Cluster Scheduling with Reinforcement Learning
With continuous advances in deep learning, distributed training is becoming
common in GPU clusters. Specifically, for emerging workloads with diverse
amounts, ratios, and patterns of communication, we observe that network
contention can significantly degrade training throughput. However, widely used
scheduling policies often face limitations as they are agnostic to network
contention between jobs. In this paper, we present a new approach to mitigate
network contention in GPU clusters using reinforcement learning. We formulate
GPU cluster scheduling as a reinforcement learning problem and opt to learn a
network contention-aware scheduling policy that efficiently captures contention
sensitivities and dynamically adapts scheduling decisions through continuous
evaluation and improvement. We show that compared to widely used scheduling
policies, our approach reduces average job completion time by up to 18.2% and
effectively cuts the tail job completion time by up to 20.7% while allowing a
preferable trade-off between average job completion time and resource
utilization.
MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning
GPU technology has been improving at an expedited pace in terms of size and
performance, empowering HPC and AI/ML researchers to advance the scientific
discovery process. However, this also leads to inefficient resource usage, as
most GPU workloads, including complicated AI/ML models, are not able to utilize
the GPU resources to their fullest extent -- encouraging support for GPU
multi-tenancy. We propose MISO, a technique to exploit the Multi-Instance GPU
(MIG) capability on the latest NVIDIA datacenter GPUs (e.g., A100, H100) to
dynamically partition GPU resources among co-located jobs. MISO's key insight
is to use the lightweight, more flexible Multi-Process Service (MPS) capability
to predict the best MIG partition allocation for different jobs, without
incurring the overhead of implementing them during exploration. Due to its
ability to utilize GPU resources more efficiently, MISO achieves 49% and 16%
lower average job completion time than the unpartitioned and optimal static GPU
partition schemes, respectively.
POLCA: Power Oversubscription in LLM Cloud Providers
Recent innovation in large language models (LLMs) and their myriad use-cases
have rapidly driven up the compute capacity demand for datacenter GPUs. Several
cloud providers and other enterprises have made substantial plans of growth in
their datacenters to support these new workloads. One of the key bottleneck
resources in datacenters is power, and given the increasing model sizes of
LLMs, they are becoming increasingly power intensive. In this paper, we show
that there is a significant opportunity to oversubscribe power in LLM clusters.
Power oversubscription improves the power efficiency of these datacenters,
allowing more deployable servers per datacenter, and reduces the deployment
time, since building new datacenters is slow.
We extensively characterize the power consumption patterns of a variety of
LLMs and their configurations. We identify the differences between the
inference and training power consumption patterns. Based on our analysis of
these LLMs, we claim that the average and peak power utilization in LLM
clusters for inference should not be very high. Our deductions align with the
data from production LLM clusters, revealing that inference workloads offer
substantial headroom for power oversubscription. However, the stringent set of
telemetry and controls that GPUs offer in a virtualized environment makes it
challenging to have a reliable and robust power oversubscription mechanism.
We propose POLCA, our framework for power oversubscription that is robust,
reliable, and readily deployable for GPU clusters. Using open-source models to
replicate the power patterns observed in production, we simulate POLCA and
demonstrate that we can deploy 30% more servers in the same GPU cluster for
inference, with minimal performance loss.
Towards GPU Utilization Prediction for Cloud Deep Learning
Understanding the GPU utilization of Deep Learning (DL) workloads is important for enhancing resource efficiency and cost-benefit decision making for DL frameworks in the cloud. Current approaches to determine DL workload GPU utilization rely on online profiling within isolated GPU devices and must be performed for every unique DL workload submission, resulting in resource under-utilization and reduced service availability. In this paper, we propose a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling. We demonstrate that it is possible to predict DL workload GPU utilization by extracting information from its model computation graph. Our experiments show that the prediction engine achieves an RMSLE of 0.154 and can be exploited by DL schedulers to achieve up to 61.5% improvement in GPU cluster utilization.
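The abstract above reports accuracy as RMSLE (root mean squared logarithmic error). As a reminder of what that metric measures, here is a minimal sketch using the standard definition, the square root of the mean of (log(1 + predicted) - log(1 + actual))^2. The function name `rmsle` and the sample utilization values are illustrative assumptions, not data from the paper.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error over two equal-length sequences.

    Inputs are assumed to be non-negative (e.g. GPU utilization as a fraction).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) == log(1 + x)
        for p, a in zip(predicted, actual)
    )
    return math.sqrt(total / len(predicted))


# Made-up utilization fractions, for illustration only.
predicted_util = [0.40, 0.75, 0.90]
actual_util = [0.50, 0.70, 0.95]
print(round(rmsle(predicted_util, actual_util), 4))
```

Because the error is computed on log-scaled values, it penalizes relative rather than absolute differences, which suits a quantity that ranges from near-idle to fully utilized.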