8 research outputs found
MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems
Training and deploying large machine learning (ML) models is time-consuming
and requires significant distributed computing infrastructures. Based on
real-world large model training on datacenter-scale infrastructures, we show
14~32% of all GPU hours are spent on communication with no overlapping
computation. To minimize the outstanding communication latency, in this work,
we develop an agile performance modeling framework to guide parallelization and
hardware-software co-design strategies. Using a suite of real-world large ML
models on state-of-the-art GPU training hardware, we demonstrate 2.24x and
5.27x throughput improvement potential for pre-training and inference
scenarios, respectively.
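The abstract does not give the model itself, so the following is only a
back-of-the-envelope sketch, in Python with made-up numbers and a hypothetical
helper name, of how non-overlapped collectives translate into a share of step
time:

```python
# Hypothetical sketch of exposed (non-overlapped) communication time for one
# training step. All names and numbers are illustrative assumptions, not
# values or code from the paper.

def exposed_comm_fraction(compute_s, comm_s, overlap_fraction):
    """Share of step time spent in communication that cannot be hidden.

    compute_s        -- pure compute time per step (seconds)
    comm_s           -- total communication time per step (seconds)
    overlap_fraction -- fraction of comm_s that can run concurrently with compute
    """
    hidden = min(comm_s * overlap_fraction, compute_s)
    exposed = comm_s - hidden
    step_time = compute_s + exposed
    return exposed / step_time

if __name__ == "__main__":
    # Example: 100 ms of compute, 40 ms of collectives, 60% overlappable.
    # Prints roughly 13.8%, i.e. the low end of the range quoted above.
    print(f"{exposed_comm_fraction(0.100, 0.040, 0.6):.1%} of the step is exposed communication")
```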
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Mixture-of-Experts (MoE) models have gained popularity in achieving
state-of-the-art performance in a wide range of tasks in computer vision and
natural language processing. They effectively expand the model capacity while
incurring a minimal increase in computation cost during training. However,
deploying such models for inference is difficult due to their large size and
complex communication pattern. In this work, we provide a characterization of
two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT),
and identify their sources of inefficiency at deployment. We propose three
optimization techniques to mitigate these inefficiencies, namely (1)
Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing. We show
that dynamic gating improves maximum throughput by 6.21-11.23x for LM,
5.75-10.98x for MT Encoder, and 2.58-5.71x for MT Decoder. It also
reduces memory usage by up to 1.36x for LM and up to 1.1x for MT.
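To make the dynamic-gating idea concrete, here is a minimal PyTorch sketch of
dynamic top-1 gating: tokens are grouped by their assigned expert, and each
expert processes only the tokens actually routed to it rather than a fixed,
zero-padded per-expert capacity. The module structure and shapes are
illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DynamicTop1MoE(nn.Module):
    """Illustrative MoE layer: per-expert batches are sized dynamically."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)   # (tokens, n_experts)
        weight, expert_id = scores.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            idx = (expert_id == e).nonzero(as_tuple=True)[0]
            if idx.numel() == 0:                # skip experts with no tokens
                continue
            out[idx] = weight[idx, None] * expert(x[idx])
        return out

# Usage: route 16 tokens of width 32 through 4 experts.
moe = DynamicTop1MoE(d_model=32, n_experts=4)
print(moe(torch.randn(16, 32)).shape)  # torch.Size([16, 32])
```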
We further propose Expert Buffering, a new caching mechanism that only keeps
hot, active experts in GPU memory while buffering the rest in CPU memory. This
reduces static memory allocation by up to 1.47x. We finally propose a
load balancing methodology that provides additional scalability to the
workload.
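A rough sketch of an Expert Buffering-style cache, assuming an LRU eviction
policy (the paper's actual policy and interfaces may differ): only a bounded
number of recently used experts stays resident on the GPU, while the rest are
kept in CPU memory and copied in on demand:

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class ExpertBuffer:
    """Keep at most `capacity` experts resident on the accelerator (LRU)."""
    def __init__(self, experts, capacity, device):
        self.experts = experts          # list of nn.Module, initially on CPU
        self.capacity = capacity
        self.device = device
        self.resident = OrderedDict()   # expert_id -> module on `device`

    def get(self, expert_id):
        if expert_id in self.resident:              # hit: mark most recently used
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:     # miss: evict the LRU expert
            _, old = self.resident.popitem(last=False)
            old.to("cpu")               # a real system would keep a pinned CPU copy
        expert = self.experts[expert_id].to(self.device)
        self.resident[expert_id] = expert
        return expert

# Usage: 8 small experts, at most 2 resident on the accelerator at a time.
device = "cuda" if torch.cuda.is_available() else "cpu"
buf = ExpertBuffer([nn.Linear(32, 32) for _ in range(8)], capacity=2, device=device)
y = buf.get(3)(torch.randn(4, 32).to(device))
```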
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry. (NeurIPS 2023 Datasets and Benchmarks Track)
Estimating GPU Speedups for Programs Without Writing a Single Line of GPU Code
Heterogeneous processing using GPUs is here to stay and today spans
mobile devices, laptops, and supercomputers. Although modern
software development frameworks like OpenCL and CUDA serve as
high-productivity environments, software development for GPUs is
time-consuming. First, much work needs to be done to restructure software
and data organization to match the GPU's many-threaded programming
model. Second, code optimization is quite time-consuming, and
performance analysis tools require significant expertise to use
effectively. Third, until the final optimized code has been derived,
it is almost impossible today to know what performance advantage
will be provided by porting a code to a GPU. This paper focuses on
this last question and seeks to develop an automated ``performance
prediction'' tool that can provide an accurate estimate of GPU speedup
when given a piece of CPU code, prior to developing the GPU
code.
Our paper is built on two insights: i) ultimately, the speedup
a GPU delivers for a piece of code depends on fundamental,
microarchitecture-independent program properties such as available
parallelism, branching behavior, etc.; ii) by examining a vast array
of previously implemented GPU codes along with their CPU
counterparts, we can use machine learning to learn the correlation
between program properties and GPU speedup. In this paper, we use
linear regression, specifically a technique inspired by
regularized regression, to build a model for speedup prediction for GPUs.
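As a concrete stand-in for that modeling step (the exact technique is only
described as inspired by regularized regression), the sketch below fits a
plain Lasso model mapping microarchitecture-independent program features to
measured GPU speedup; the feature names and data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Rows: previously ported codes; columns: CPU-measurable program properties
# (here hypothetically: available parallelism, branchiness, memory intensity).
X = rng.uniform(size=(40, 3))
speedup = 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(scale=2.0, size=40)

# Regularized linear model: standardize features, then Lasso regression.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, speedup)

# Predict speedup for a new, not-yet-ported CPU code from its profile alone.
new_code = np.array([[0.8, 0.1, 0.5]])
print(f"predicted GPU speedup: {model.predict(new_code)[0]:.1f}x")
```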
When applied to never-before-seen test data selected randomly
from the Rodinia, Parboil, Lonestar, and Parsec benchmark suites
(speedup range of 5.9X to 276X), our tool makes accurate predictions
with an average weighted error of 32%. Our technique is also
robust: the errors remain similar across other ``unseen'' GPU
platforms we test on. Essentially, we deliver an automated tool that
programmers can use to estimate potential GPU speedup before writing
any GPU code.