30 research outputs found
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training
A new neural network architecture called Mixture-of-Experts (MoE) has been
proposed recently that increases the parameters of a neural network (the base
model) by adding sparsely activated expert blocks, without changing the total
number of floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to 128
expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high-quality MoE models requires
us to scale the size of the base models, along with the number of expert
blocks. In this work, we propose a novel, three-dimensional, hybrid parallel
algorithm that combines tensor, expert, and data parallelism to enable the
training of MoE models with 4-8x larger base models than the current
state-of-the-art -- DeepSpeed-MoE. We propose memory optimizations in the
optimizer step, and communication optimizations that eliminate redundant
movement of data. Removing these redundancies provides a speedup of nearly 21%.
When training a 40 billion parameter MoE model (6.7 billion base model with 16
experts) on 128 V100 GPUs, our optimizations raise achieved half-precision
throughput from 20% to 27% of peak.
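The three-dimensional hybrid layout described above can be illustrated with a small sketch that maps a flat set of GPU ranks onto tensor-, expert-, and data-parallel coordinates. This is not the paper's implementation; the axis ordering and group sizes are illustrative assumptions:

```python
def build_3d_groups(world_size, tensor_size, expert_size):
    """Partition GPU ranks into a (tensor, expert, data) 3D grid.

    Illustrative sketch: tensor dimension varies fastest, data slowest.
    Returns per-rank coordinates and the tensor-parallel groups (ranks
    that share the same expert and data coordinates).
    """
    assert world_size % (tensor_size * expert_size) == 0
    coords = {}
    for rank in range(world_size):
        t = rank % tensor_size                    # fastest-varying axis
        e = (rank // tensor_size) % expert_size   # expert-parallel axis
        d = rank // (tensor_size * expert_size)   # slowest: data-parallel
        coords[rank] = (t, e, d)
    # A tensor-parallel group holds all ranks sharing (expert, data) coords.
    tensor_groups = {}
    for rank, (t, e, d) in coords.items():
        tensor_groups.setdefault((e, d), []).append(rank)
    return coords, tensor_groups
```

With 8 GPUs, tensor size 2, and expert size 2, this yields a 2x2x2 grid with four tensor-parallel pairs; in a real system each group would back a separate communicator.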
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Most existing multi-modal models are hampered by their inability to handle
interleaved image-and-text inputs in multi-image, multi-round dialogues, and
they face substantial constraints on training resources and data accessibility,
limiting their adaptability and scalability across varied interaction
settings. To address this, we present the DeepSpeed-VisualChat
framework, designed to optimize Large Language Models (LLMs) by incorporating
multi-modal capabilities, with a focus on enhancing the proficiency of Large
Vision and Language Models in handling interleaved inputs. Our framework is
notable for (1) its open-source support for multi-round and multi-image
dialogues, (2) introducing an innovative multi-modal causal attention
mechanism, and (3) utilizing data blending techniques on existing datasets to
ensure seamless interactions in multi-round, multi-image conversations.
Compared to existing frameworks, DeepSpeed-VisualChat shows superior
scalability, supporting language models of up to 70B parameters, representing a
significant advancement in multi-modal language models and setting a solid
foundation for future explorations.
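One plausible reading of a multi-modal causal attention mechanism of this kind can be sketched as a mask in which text tokens attend causally to all earlier tokens while image tokens attend only within their own image. This is a sketch under assumptions, not DeepSpeed-VisualChat's actual implementation:

```python
import numpy as np

def mmca_mask(token_types):
    """Build a boolean (L, L) attention mask from a token-type sequence.

    token_types: list of 'txt' or an image id such as 'img0', 'img1'.
    mask[i, j] is True when token i may attend to token j.
    Assumed rule (illustrative): text tokens attend causally to every
    earlier token; image tokens attend only to tokens of the same image.
    """
    L = len(token_types)
    mask = np.zeros((L, L), dtype=bool)
    for i, ti in enumerate(token_types):
        for j in range(i + 1):            # causal: positions <= i only
            if ti == 'txt':
                mask[i, j] = True         # text sees all prior context
            else:
                mask[i, j] = token_types[j] == ti  # stay within own image
    return mask
```

Keeping each image's tokens from attending across images is one way such a mask could help interleaved multi-image dialogue, since image features then stay grounded to their own image.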
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large
language models on massive GPU clusters due to its ease of use, efficiency,
and good scalability. However, when training on low-bandwidth clusters, or at
scales that force the per-GPU batch size to be small, ZeRO's effective
throughput is limited by the high communication volume from gathering weights
in the forward and backward passes and from averaging gradients. This paper
introduces three communication volume reduction techniques, which we
collectively refer to as ZeRO++, targeting each of the communication
collectives in ZeRO. The first is a block-quantization-based all-gather. The
second is data remapping that trades off communication for additional memory.
The third is a novel all-to-all-based quantized gradient averaging paradigm
that replaces the reduce-scatter collective and preserves accuracy despite
communicating low-precision data. Collectively, ZeRO++ reduces the
communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at
384 GPUs.
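The first of these techniques, quantizing parameters block by block before the all-gather, can be sketched as follows. This is a minimal illustration, not ZeRO++'s actual kernels; the block size and int8 format are assumptions:

```python
import numpy as np

def block_quantize(x, block=4):
    """Quantize a float32 vector to int8 with one scale per block.

    Sending int8 values plus per-block scales instead of float32 shrinks
    the all-gather volume roughly 4x, at the cost of bounded error.
    """
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0               # avoid dividing by zero blocks
    q = np.round(blocks / scale).astype(np.int8)
    return q, scale.astype(np.float32)

def block_dequantize(q, scale):
    """Recover an approximate float32 vector after communication."""
    return (q.astype(np.float32) * scale).reshape(-1)
```

Because each block carries its own scale, the quantization error is bounded by half a quantization step per block, which is what lets low-precision communication preserve accuracy.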
LSTM-Sharp: An Adaptable, Energy-Efficient Hardware Accelerator for Long Short-Term Memory
The effectiveness of LSTM neural networks for popular tasks such as Automatic
Speech Recognition has fostered an increasing interest in LSTM inference
acceleration. Due to the recurrent nature and data dependencies of LSTM
computations, designing a customized architecture specifically tailored to its
computation pattern is crucial for efficiency. Since LSTMs are used for a
variety of tasks, generalizing this efficiency to diverse configurations, i.e.,
adaptiveness, is another key feature of these accelerators. In this work, we
first show the problem of low resource-utilization and adaptiveness for the
state-of-the-art LSTM implementations on GPU, FPGA and ASIC architectures. To
solve these issues, we propose an intelligent tile-based dispatching mechanism
that efficiently handles the data dependencies and increases the adaptiveness
of LSTM computation. To do so, we propose LSTM-Sharp as a hardware accelerator,
which pipelines LSTM computation using an effective scheduling scheme to hide
most of the dependent serialization. Furthermore, LSTM-Sharp employs dynamic
reconfigurable architecture to adapt to the model's characteristics. LSTM-Sharp
achieves 1.5x, 2.86x, and 82x speedups on average over the state-of-the-art
ASIC, FPGA, and GPU implementations respectively, for different LSTM models and
resource budgets. Furthermore, LSTM-Sharp delivers significant energy
reduction relative to previous solutions, thanks to its low power dissipation
(383 GFLOPS/Watt).
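The recurrent data dependencies that make LSTM inference hard to parallelize are visible in a plain reference implementation of one time step. This is a sketch with fused gate weights for illustration, not LSTM-Sharp's hardware design:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with fused gate weights.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias.
    The dependence of (h_t, c_t) on (h_{t-1}, c_{t-1}) is the serialization
    that an accelerator must schedule around: U @ h_prev cannot start
    until the previous step's h is ready.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    z = W @ x + U @ h_prev + b            # pre-activations for all 4 gates
    i, f, g, o = np.split(z, 4)           # input, forget, cell, output
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                # c_t depends on c_{t-1}
    h = o * np.tanh(c)                    # h_t feeds the next step
    return h, c
```

Only the `W @ x` term is independent of the previous step, which is why scheduling schemes that overlap input-weight work with the recurrent chain can hide much of the serialization.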
Flexible hardware acceleration for instruction-grain program monitoring
Instruction-grain program monitoring tools, which check and analyze executing programs at the granularity of individual instructions, are invaluable for quickly detecting bugs and security attacks and then limiting their damage (via containment and/or recovery). Unfortunately, their fine-grain nature implies very high monitoring overheads for software-only tools, which are typically based on dynamic binary instrumentation. Previous hardware proposals either focus on mechanisms that target specific bugs or address only the cost of binary instrumentation. In this paper, we propose a flexible hardware solution for accelerating a wide range of instruction-grain monitoring tools. By examining a number of diverse tools (for memory checking, security tracking, and data race detection), we identify three significant common sources of overheads and then propose three novel hardware techniques for addressing them: Inheritance Tracking, Idempotent Filters, and Metadata-TLBs. Together, these constitute a general-purpose hardware acceleration framework. Experimental results show our framework reduces overheads by 2-3X over the previous state-of-the-art, while supporting the needed flexibility. © 2008 IEEE
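A tiny software model makes concrete the per-instruction shadow-state bookkeeping such tools perform and that hardware of this kind aims to accelerate. All names here are hypothetical, and this simple taint tracker is not the paper's framework:

```python
class TaintTracker:
    """Toy model of instruction-grain metadata tracking (taint analysis).

    Every executed instruction triggers a metadata update on a shadow
    state keyed by register/address; this per-instruction bookkeeping is
    what makes software-only monitoring so expensive.
    """

    def __init__(self):
        self.shadow = {}                  # location -> tainted flag

    def load_input(self, dst, tainted=True):
        self.shadow[dst] = tainted        # e.g. data read from the network

    def mov(self, dst, src):
        self.shadow[dst] = self.shadow.get(src, False)  # metadata copy

    def binop(self, dst, a, b):
        # Metadata join: the result is tainted if either operand is.
        self.shadow[dst] = self.shadow.get(a, False) or self.shadow.get(b, False)

    def check_jump_target(self, reg):
        # Security check: refuse control transfer through tainted data.
        if self.shadow.get(reg, False):
            raise RuntimeError("tainted jump target: possible attack")
```

Because every `mov` and `binop` implies an extra shadow lookup and update, even this toy version roughly doubles the work per instruction, which is the overhead the proposed hardware techniques target.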
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial
intelligence, from summarization and coding to translation, matching or even
surpassing human performance. However, the current landscape lacks an
accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement
Learning with Human Feedback) training pipeline for these powerful models,
particularly when training at the scale of billions of parameters. This paper
introduces DeepSpeed-Chat, a novel system that democratizes RLHF training,
making it accessible to the AI community. DeepSpeed-Chat offers three key
capabilities: an easy-to-use training and inference experience for ChatGPT-like
models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from
InstructGPT, and a robust DeepSpeed-RLHF system that combines various
optimizations for training and inference in a unified way. The system delivers
unparalleled efficiency and scalability, enabling training of models with
hundreds of billions of parameters in record time and at a fraction of the
cost. With this development, DeepSpeed-Chat paves the way for broader access to
advanced RLHF training, even for data scientists with limited resources,
thereby fostering innovation and further development in the field of AI.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.