16 research outputs found

    ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

    Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at a scale that forces the batch size per GPU to be small, ZeRO's effective throughput is limited by the high communication volume from gathering weights in the forward and backward passes and from averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. The first is block-quantization-based all-gather. The second is data remapping that trades off communication for more memory. The third is a novel all-to-all-based quantized gradient averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale. Comment: 12 pages
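
    As an illustration of the block-quantization idea named in the abstract above, the following is a minimal sketch, not the ZeRO++ implementation. It assumes a symmetric 8-bit scheme with one scale per block; the function names `quantize_blocks` and `dequantize_blocks`, the block size, and the sample weights are all hypothetical. The point it shows is that a per-block scale keeps small-magnitude values accurate even when another block contains large values.

    ```python
    from typing import List, Tuple

    Block = Tuple[float, List[int]]  # (scale, integer codes) for one block


    def quantize_blocks(values: List[float], block_size: int = 4, bits: int = 8) -> List[Block]:
        """Quantize values block by block, each block with its own scale.

        Assumption: symmetric quantization to signed integers in [-qmax, qmax].
        """
        qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
        blocks: List[Block] = []
        for start in range(0, len(values), block_size):
            block = values[start:start + block_size]
            max_abs = max(abs(v) for v in block)
            # Avoid a zero scale when the whole block is zero.
            scale = max_abs / qmax if max_abs > 0 else 1.0
            codes = [round(v / scale) for v in block]
            blocks.append((scale, codes))
        return blocks


    def dequantize_blocks(blocks: List[Block]) -> List[float]:
        """Reconstruct approximate values from (scale, codes) pairs."""
        return [scale * code for scale, codes in blocks for code in codes]


    # Hypothetical weights: one block of small values, one block of large values.
    weights = [0.01, -0.02, 0.015, 0.005, 8.0, -4.0, 2.0, 1.0]
    packed = quantize_blocks(weights, block_size=4)
    restored = dequantize_blocks(packed)
    ```

    With a single global scale (about 8/127 here), the small values in the first block would all round to zero; with per-block scales, each value is recovered to within half of its own block's scale.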

    DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

    Most existing multi-modal models cannot adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, and they face substantial constraints in training resource allocation and data accessibility, which limits their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.

    DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

    ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI. Comment: 14 pages, 7 figures

    Scalable and Efficient Machine Learning as a Service

    Driven by the sustained advances of machine learning and its application to multiple domains ranging from image recognition and text prediction to translation and autonomous driving, the past few years have witnessed a surging demand for Machine-Learning-as-a-Service (MLaaS). MLaaS is an emerging computing paradigm that facilitates machine learning model design, model training, and inference serving, and provides optimized executions of machine learning tasks in an automated, scalable, and efficient manner. This dissertation proposes three novel approaches for MLaaS, namely SimiGrad, DistQuant, and RRL, to improve the scale and efficiency of MLaaS training and inference, respectively. For MLaaS training, we propose SimiGrad, a fine-grained adaptive batching approach for large scale training using gradient similarity measurement. Large scale training requires massive parallelism to finish the training within a reasonable amount of time. To support massive parallelism, large batch training is the key enabler but often at the cost of generalization performance. We propose a fully automated and lightweight adaptive batching methodology to enable fine-grained batch size adaption (e.g., at a mini-batch level) that can achieve state-of-the-art performance with record breaking batch sizes. The core component of our method is a lightweight yet efficient representation of the critical gradient noise information. We open-source the proposed methodology, and extensive evaluations on popular benchmarks (e.g., CIFAR10, ImageNet, and BERT-Large) demonstrate that the proposed methodology outperforms state-of-the-art methodologies using adaptive batching approaches or hand-tuned static strategies in both performance and batch size. Particularly, we achieve a new state-of-the-art batch size of 78K in BERT-Large pretraining with a SQuAD score of 90.69 compared to 90.58 reported in previous state-of-the-art with 59K batch size. Another key challenge for MLaaS training is the communication cost, which limits how much the training can scale. Quantization is a popular method for reducing communication cost, yet it imposes non-trivial encoding and decoding overheads and may lead to degraded model performance. Our key observation is that model weights are partitioned and cached in GPU memory in common distributed training methods such as model and pipeline parallelism. If quantization can be performed on the partitioned weights in parallel while cached in GPU memory, the quantization speed can be significantly improved and we can further reduce the communication overhead for weights gathering. To this end, we propose DistQuant, a distributed quantization scheme for compressing partitioned weights during distributed training. DistQuant preserves model performance by canceling out the noise introduced by quantization and is transparent to training pipelines. We both theoretically and empirically show that DistQuant can achieve much higher precision than state-of-the-art quantization approaches. Evaluation on large-scale models including BERT and GPT2 indicates that DistQuant reduces the communication cost of MLaaS training by half without compromising model performance. For MLaaS serving, we propose RRL, a swift machine learning model serving system powered by a region-based reinforcement learning approach. To meet latency Service-Level-Objective (SLO), judicious parallelization at both request and operation levels is utterly important. However, existing ML systems (e.g., Tensorflow) and cloud ML serving platforms (e.g., SageMaker) are SLO-agnostic and rely on users to manually configure the parallelism. To provide low latency MLaaS serving, we propose a swift machine learning serving scheduling framework with a novel Region-based Reinforcement Learning (RRL) approach. RRL can efficiently identify the optimal parallelism configuration under different workloads by estimating performance of similar configurations with that of the known ones. We both theoretically and experimentally show that the RRL approach can outperform state-of-the-art approaches by finding near-optimal solutions over 8 times faster while reducing inference latency up to 79.0% and reducing SLO violation up to 49.9%.

    Nemo: An Open-Source Transformer-Supercharged Benchmark for Fine-Grained Wildfire Smoke Detection

    Deep-learning (DL)-based object detection algorithms can greatly benefit the community at large in fighting fires, advancing climate intelligence, and reducing health complications caused by hazardous smoke particles. Existing DL-based techniques, which are mostly based on convolutional networks, have proven to be effective in wildfire detection. However, there is still room for improvement. First, existing methods tend to have some commercial aspects, with limited publicly available data and models. In addition, studies aiming at the detection of wildfires at the incipient stage are rare. Smoke columns at this stage tend to be small, shallow, and often far from view, with low visibility. This makes finding and labeling enough data to train an efficient deep learning model very challenging. Finally, the inherent locality of convolution operators limits their ability to model long-range correlations between objects in an image. Recently, encoder–decoder transformers have emerged as interesting solutions beyond natural language processing to help capture global dependencies via self- and inter-attention mechanisms. We propose Nemo: a set of evolving, free, and open-source datasets, processed in standard COCO format, and wildfire smoke and fine-grained smoke density detectors, for use by the research community. We adapt Facebook’s DEtection TRansformer (DETR) to wildfire detection, which results in a much simpler technique, where the detection does not rely on convolution filters and anchors. Nemo is the first open-source benchmark for wildfire smoke density detection and Transformer-based wildfire smoke detection tailored to the early incipient stage. Two popular object detection algorithms (Faster R-CNN and RetinaNet) are used as alternatives and baselines for extensive evaluation. Our results confirm the superior performance of the transformer-based method in wildfire smoke detection across different object sizes. Moreover, we tested our model with 95 video sequences of wildfire starts from the public HPWREN database. Our model detected 97.9% of the fires in the incipient stage and 80% within 5 min from the start. On average, our model detected wildfire smoke within 3.6 min from the start, outperforming the baselines.

    When Bioelectrochemical Systems Meet Forward Osmosis: Accomplishing Wastewater Treatment and Reuse through Synergy

    Bioelectrochemical systems (BES) and forward osmosis (FO) are two emerging technologies with great potential for energy-efficient water/wastewater treatment. BES takes advantage of microbial interaction with a solid electron acceptor/donor to accomplish bioenergy recovery from organic compounds, and FO can extract high-quality water driven by osmotic pressure. Through their strong synergy, the two technologies may complement each other and collaboratively address the water-energy nexus. FO can assist BES with achieving water recovery (for future reuse), enhancing electricity generation, and supplying energy for accomplishing the cathode reactions, while BES may help FO with degrading organic contaminants, providing sustainable draw solute, and stabilizing water flux. This work reviews recent developments that focus on the synergy between BES and FO, analyzes the advantages of each combination, and provides perspectives for future research. The findings encourage further investigation and development for efficient coordination between BES and FO towards an integrated system for wastewater treatment and reuse.

    Life Cycle Assessment of Fuel Ethanol Production from Food Waste in Consideration of By-Product Utilization

    In this paper, a life cycle assessment was used to evaluate fuel ethanol production from food waste with a capacity of 20 tons/day. The energy and pollution emissions during the whole process were recorded and compared by the method of electricity conversion to standard coal. Different indicators, such as GWP (global warming potential), ODP (ozone depletion potential), AP (acidification potential), EP (eutrophication potential), POCP (photochemical oxidation potential), and DUST (dust), were used to perform an environmental impact analysis with and without by-product utilization. The results show that the indicator sequence under the weighted factor sequence was AP > DUST > GWP > ODP > EP > POCP. The consideration of by-products decreased the values of GWP, AP, and DUST significantly; EP declined slightly; ODP and POCP increased; and the overall energy output was negative. The consideration of by-product utilization was determined to be environmentally friendly.

    Constrained Optimal Control for Nonlinear Multi-Input Safety-Critical Systems with Time-Varying Safety Constraints

    In this paper, we investigate the constrained optimal control problem of nonlinear multi-input safety-critical systems with uncertain disturbances and time-varying safety constraints. By utilizing a barrier function transformation, together with a new disturbance-related term and a smooth safety boundary function, a nominal system-dependent multi-input barrier transformation architecture is developed to deal with the time-varying safety constraints and uncertain disturbances. Based on the obtained transformation system, the coupled Hamilton–Jacobi–Bellman (HJB) function is established to obtain the constrained Nash equilibrium solution. In addition, because it is difficult to solve the HJB function directly, a single critic neural network (NN) is constructed for each control input to approximate its optimal performance index function. It is proved theoretically that, under the influence of uncertain disturbances and time-varying safety constraints, the system states and neural network parameters can be uniformly ultimately bounded (UUB) by the proposed neural network approximation method. Finally, the effectiveness of the proposed method is verified by two nonlinear simulation examples.

    Effects of Different Injection Strategies on Combustion and Emission Characteristics of Diesel Engine Fueled with Dual Fuel

    In this work, an effective numerical simulation method was developed and used to analyze the effects of natural gas (NG) mixing ratio and of pilot-main injection, main-post injection, and pilot-main-post injection strategies on the combustion and emission characteristics of a diesel engine fueled with dual fuel. Firstly, the one-dimensional calculation model and three-dimensional CFD model of the engine were established by AVL-BOOST and AVL-Fire, respectively. In addition, a simplified chemical kinetics mechanism was adopted, which could accurately calculate the combustion and emission characteristics of the engine. The results show that the cylinder pressure and heat release rate decrease with the increase of the NG mixing ratio and the NOx emission is reduced. When the NG mixing ratio is 50%, the NOx and CO emissions are reduced by 47% and 45%, respectively. When the SODI3 is 24 °CA ATDC, the NOx emission is reduced by 29.6%. In addition, with suitable pilot-main injection and pilot-main-post injection strategies, the combustion in the cylinder can be improved and the trade-off relationship between NOx and soot can be relaxed. Thus, the proper main-post injection strategy can improve the combustion and emission characteristics, especially the reduction in the NOx and CO emissions.