8 research outputs found

    MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

    Full text link
    Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large model training on datacenter-scale infrastructure, we show that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this exposed communication latency, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate 2.24x and 5.27x throughput improvement potential for pre-training and inference scenarios, respectively.
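    As a rough, back-of-the-envelope illustration of why exposed communication matters (this is not the paper's modeling framework), the sketch below bounds the throughput gain from perfectly overlapping the reported 14-32% of non-overlapped communication time; the function name and the 1/(1-f) bound are illustrative assumptions.

    # Rough bound, not MAD Max itself: if a fraction f of GPU hours is
    # exposed communication, hiding all of it behind computation can speed
    # up end-to-end training by at most 1 / (1 - f).
    def overlap_speedup_bound(exposed_comm_fraction: float) -> float:
        assert 0.0 <= exposed_comm_fraction < 1.0
        return 1.0 / (1.0 - exposed_comm_fraction)

    for f in (0.14, 0.32):  # range reported in the abstract
        print(f"exposed comm {f:.0%}: at most {overlap_speedup_bound(f):.2f}x from overlap alone")

    Overlap alone caps out at roughly 1.16-1.47x here, which is consistent with the abstract attributing its larger 2.24x and 5.27x figures to parallelization and hardware-software co-design rather than to overlap by itself.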

    Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

    Full text link
    Mixture-of-Experts (MoE) models have gained popularity for achieving state-of-the-art performance in a wide range of tasks in computer vision and natural language processing. They effectively expand model capacity while incurring a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication pattern. In this work, we characterize two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies: (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23× for LM, 5.75-10.98× for the MT encoder, and 2.58-5.71× for the MT decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT. We further propose Expert Buffering, a new caching mechanism that keeps only hot, active experts in GPU memory while buffering the rest in CPU memory. This reduces static memory allocation by up to 1.47×. Finally, we propose a load balancing methodology that provides additional scalability to the workload.
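    A minimal sketch of the expert-buffering idea described above (hot experts resident on the GPU, cold experts kept in CPU memory and moved on demand), assuming PyTorch-style modules and a simple LRU policy; the class, its methods, and the eviction policy are illustrative, not the paper's implementation.

    from collections import OrderedDict
    # Assumes each expert is a torch.nn.Module; any framework with an
    # explicit .to(device) transfer would work the same way.
    class ExpertBuffer:
        def __init__(self, experts, capacity, device="cuda"):
            self.experts = experts          # expert_id -> module, initially on CPU
            self.capacity = capacity        # max experts resident on the GPU at once
            self.device = device
            self.resident = OrderedDict()   # expert_id -> module currently on GPU

        def get(self, expert_id):
            if expert_id in self.resident:              # hot expert: reuse the GPU copy
                self.resident.move_to_end(expert_id)
            else:                                       # cold expert: evict LRU, then load
                if len(self.resident) >= self.capacity:
                    evicted_id, evicted = self.resident.popitem(last=False)
                    self.experts[evicted_id] = evicted.to("cpu")
                self.resident[expert_id] = self.experts[expert_id].to(self.device)
            return self.resident[expert_id]

    Capping capacity below the total number of experts is what yields the static-memory savings the abstract reports; the paper's actual eviction policy may differ from the LRU used in this sketch.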

    DataPerf: Benchmarks for Data-Centric AI Development

    Full text link
    Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry. Comment: NeurIPS 2023 Datasets and Benchmarks Track

    Estimating GPU Speedups for Programs Without Writing a Single Line of GPU Code

    No full text
    Heterogeneous processing using GPUs is here to stay and today spans mobile devices, laptops, and supercomputers. Although modern software development frameworks like OpenCL and CUDA serve as a high-productivity environment, software development for GPUs is time-consuming. First, much work needs to be done to restructure software and data organization to match the GPU's many-threaded programming model. Second, code optimization is quite time-consuming, and performance analysis tools require significant expertise to use effectively. Third, until the final optimized code has been derived, it is almost impossible today to know what performance advantage will be provided by porting a code to a GPU. This paper focuses on this last question and seeks to develop an automated "performance prediction" tool that can provide an accurate estimate of GPU speedup when given a piece of CPU code, prior to developing the GPU code. Our paper is built on two insights: i) ultimately, the speedup a piece of code achieves on a GPU depends on fundamental microarchitecture-independent program properties such as available parallelism, branching behavior, etc.; ii) by examining a vast array of previously implemented GPU codes along with their CPU counterparts, we can use machine learning to learn the correlation between program properties and GPU speedup. In this paper, we use linear regression, specifically a technique inspired by regularized regression, to build a model for speedup prediction for GPUs. When applied to never-before-seen test data selected randomly from the Rodinia, Parboil, Lonestar, and Parsec benchmark suites (speedup range of 5.9X to 276X), our tool makes accurate predictions with an average weighted error of 32%. Our technique is also robust: the errors remain similar across other "unseen" GPU platforms we test on. Essentially, we deliver an automated tool that programmers can use to estimate potential GPU speedup before writing any GPU code.
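    A minimal sketch of the overall recipe the abstract describes, mapping microarchitecture-independent program features to measured GPU speedups with a regularized linear model; scikit-learn's Lasso is used here as a stand-in for the paper's regression technique, and the feature names, log-speedup target, and numbers are placeholders.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Each row holds features of one already-ported benchmark, e.g.
    # [available_parallelism, branch_divergence, memory_intensity] (placeholder names).
    X_train = np.array([[0.90, 0.05, 0.30],
                        [0.40, 0.40, 0.70],
                        [0.75, 0.10, 0.50]])
    # Fit on log-speedup (a modeling choice for this sketch) measured from real GPU ports.
    y_train = np.log([150.0, 6.0, 40.0])

    model = Lasso(alpha=0.1).fit(X_train, y_train)   # regularization guards against overfitting

    x_new = [[0.80, 0.08, 0.45]]                     # features extracted from CPU-only code
    print(f"predicted GPU speedup: {np.exp(model.predict(x_new))[0]:.1f}x")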

    Time and the Value of Data

    No full text