95 research outputs found

    CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training

    Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and empirically outperforms baselines, including distributed RR, on a variety of benchmark tasks.
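    For intuition, the sketch below illustrates the gradient-balancing idea described in the abstract: per-example gradients recorded in the previous epoch are centered, then greedily assigned to the front or back of the next epoch's permutation so that running prefix sums stay small. This is a hypothetical minimal sketch, not the authors' implementation; the names (grab_reorder, stale_grads) are illustrative only.

```python
# Hypothetical sketch of a GraB-style reordering pass (not the reference code).
import numpy as np

def grab_reorder(stale_grads: np.ndarray) -> np.ndarray:
    """stale_grads: (n_examples, dim) per-example gradients from the last epoch.
    Returns a permutation of example indices for the next epoch."""
    n = len(stale_grads)
    centered = stale_grads - stale_grads.mean(axis=0)  # remove the mean gradient
    running = np.zeros(stale_grads.shape[1])
    front, back = [], []
    for i in range(n):
        g = centered[i]
        # Greedy signed herding: choose the side that keeps the prefix sum smaller.
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(i)   # "+1" examples go to the front of the order
        else:
            running -= g
            back.append(i)    # "-1" examples go to the back, in reverse
    return np.array(front + back[::-1])

# Usage: order = grab_reorder(grads_from_last_epoch); train on examples[order].
```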

    OCCL: a Deadlock-free Library for GPU Collective Communication

    Various distributed deep neural network (DNN) training technologies lead to increasingly complicated use of collective communications on GPU. The deadlock-prone collectives on GPU force researchers to guarantee that collectives are enqueued in a consistent order on each GPU to prevent deadlocks. In complex distributed DNN training scenarios, manual hardcoding is the only practical way to prevent deadlocks, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPU that supports dynamic decentralized preemption and gang-scheduling of collectives. Leveraging the preemption opportunity of collectives on GPU, OCCL dynamically preempts collectives in a decentralized way via its deadlock-free collective execution framework and allows dynamic decentralized gang-scheduling via a stickiness adjustment scheme. With the help of OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate it with the distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives compared to the state-of-the-art NCCL. When used in distributed DNN training, OCCL improves peak training throughput by up to 78% compared to statically sequenced NCCL, while introducing overheads of less than 6.5% across various distributed DNN training approaches.
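    To illustrate the ordering constraint the abstract refers to, here is a hypothetical sketch using stock PyTorch over NCCL (this is not OCCL code): if different ranks enqueue the same collectives in different orders, the mismatched calls can block each other indefinitely, which is exactly the manual discipline OCCL aims to remove.

```python
# Hypothetical illustration of the NCCL ordering hazard (not OCCL code).
import torch
import torch.distributed as dist

def hazardous_step(rank: int, a: torch.Tensor, b: torch.Tensor) -> None:
    if rank == 0:
        dist.all_reduce(a)   # rank 0 reduces `a` first, then `b`
        dist.all_reduce(b)
    else:
        dist.all_reduce(b)   # other ranks use the opposite order -> possible deadlock
        dist.all_reduce(a)

def safe_step(a: torch.Tensor, b: torch.Tensor) -> None:
    # Manually hardcoded global order, identical on every rank.
    dist.all_reduce(a)
    dist.all_reduce(b)
```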

    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet, a Transformer for machine translation, and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning) and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer). Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021.
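    As a rough sketch of the subgroup weight exchange described above (assumed names; the actual WAGMA-SGD uses wait-avoiding group averaging, which this simplified fixed-group version omits): each rank periodically averages its parameters only within a small subgroup via a group allreduce, instead of a global allreduce across all ranks.

```python
# Simplified, hypothetical sketch of group model averaging (not the WAGMA-SGD code).
import torch
import torch.distributed as dist

def make_groups(world_size: int, group_size: int):
    """Partition ranks into fixed subgroups of `group_size` ranks each.
    Must be called identically on every rank."""
    return [dist.new_group(list(range(s, s + group_size)))
            for s in range(0, world_size, group_size)]

def group_average(model: torch.nn.Module, group, group_size: int) -> None:
    """Allreduce-average parameters inside one subgroup (a group allreduce).
    Each rank passes the group it belongs to."""
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
        p.data /= group_size
```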

    Spartan Daily, September 23, 1986

    Volume 87, Issue 18

    Ongoing data reduction, theoretical studies

    A nonspecific review of theory, correlative data analysis, and supporting research and technology is presented. Title pages in some of the following areas are included: (1) magnetosphere boundary observations; (2) Venus ionosphere and solar wind interaction; (3) the ISEE-C plasma wave investigation; and (4) solar system plasmas.

    The joint US/UK 1990 epoch world magnetic model

    A detailed summary of the data used, analyses performed, modeling techniques employed, and results obtained in the course of the 1990 Epoch World Magnetic Modeling effort is given. The use and limitations of the GEOMAG algorithm are also presented. Charts and tables related to the 1990 World Magnetic Model (WMM-90) for the Earth's main field and secular variation in Mercator and polar stereographic projections are presented, along with useful tables of several magnetic field components and their secular variation on a 5-degree worldwide grid.
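    For reference, models in the World Magnetic Model series, including WMM-90, represent the main field as the gradient of a scalar potential expanded in spherical harmonics, with secular variation handled by linear time derivatives of the Gauss coefficients. The sketch below uses the standard notation (a: reference radius; g_n^m, h_n^m: Gauss coefficients; P_n^m: associated Legendre functions) rather than anything specific to this report.

```latex
% Main-field potential in the standard WMM form (truncated at degree/order N)
V(r,\theta,\lambda,t) = a \sum_{n=1}^{N} \left(\frac{a}{r}\right)^{n+1}
  \sum_{m=0}^{n} \left[ g_n^m(t)\cos(m\lambda) + h_n^m(t)\sin(m\lambda) \right]
  P_n^m(\cos\theta), \qquad \mathbf{B} = -\nabla V
% Linear secular variation about the model epoch t_0 (1990.0 for WMM-90)
g_n^m(t) = g_n^m(t_0) + \dot{g}_n^m\,(t - t_0), \qquad
h_n^m(t) = h_n^m(t_0) + \dot{h}_n^m\,(t - t_0)
```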

    TorchRL: A data-driven decision-making library for PyTorch

    Striking a balance between integration and modularity is crucial for a machine learning library to be versatile and user-friendly, especially in handling decision and control tasks that involve large development teams and complex, real-world data and environments. To address this issue, we propose TorchRL, a generalistic control library for PyTorch that provides well-integrated yet standalone components. With a versatile and robust primitive design, TorchRL facilitates streamlined algorithm development across the many branches of Reinforcement Learning (RL) and control. We introduce a new PyTorch primitive, TensorDict, a flexible data carrier that empowers the integration of the library's components while preserving their modularity. Hence, replay buffers, datasets, distributed data collectors, environments, transforms, and objectives can be effortlessly used in isolation or combined. We provide a detailed description of the building blocks, supporting code examples, and an extensive overview of the library across domains and tasks. Finally, we show comparative benchmarks to demonstrate its computational efficiency. TorchRL fosters long-term support and is publicly available on GitHub for greater reproducibility and collaboration within the research community. The code is open-sourced at https://github.com/pytorch/rl.
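    A minimal usage sketch of the TensorDict primitive mentioned above, based on the publicly documented tensordict package (exact API details may vary across versions): a dict-like container with a shared batch dimension, so indexing and device moves apply to every entry at once.

```python
# Minimal TensorDict sketch (tensordict package; API may differ across versions).
import torch
from tensordict import TensorDict

batch = TensorDict(
    {"observation": torch.randn(4, 3), "action": torch.randn(4, 2)},
    batch_size=[4],
)
first_two = batch[:2]        # batch indexing applies to every entry at once
on_device = batch.to("cpu")  # likewise for device moves ("cuda" works the same way)
print(first_two["observation"].shape)  # torch.Size([2, 3])
```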

    The Use of Teams Games Tournament (TGT) to Develop Students’ Reading Skill at the First Grade of SMAN 4 Bone

    The instrument of this research was a reading test. The test results indicated a significant difference between students' post-test scores in the experimental and controlled classes: the total mean post-test score in the experimental class (72.02) was greater than in the controlled class (61.62). From the t-test, the researcher found that the computed t value for the post-test was greater than the t-table value (5.94 > 2.000).
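    The reported comparison reduces to the usual decision rule for a t-test. Below is a minimal sketch using only the values quoted in the abstract; raw scores, standard deviations, and sample sizes are not reported there, so nothing else is assumed.

```python
# Decision rule for the reported t-test, using only values quoted in the abstract.
t_computed = 5.94   # post-test t value reported in the study
t_critical = 2.000  # t-table (critical) value used by the researcher
significant = t_computed > t_critical
print(f"Reject H0 (no difference between classes): {significant}")  # True
```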