95 research outputs found
CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training
Recent research on online Gradient Balancing (GraB) has revealed that there
exist permutation-based example orderings that are guaranteed to outperform
random reshuffling (RR). Whereas RR arbitrarily permutes training examples,
GraB leverages stale gradients from prior epochs to order examples -- achieving
a provably faster convergence rate than RR. However, GraB is limited by design:
While it demonstrates an impressive ability to scale up training on centralized
data, it does not naturally extend to modern distributed ML workloads. We
therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights
from prior work on kernel thinning to translate the benefits of provably faster
permutation-based example ordering to distributed settings. With negligible
overhead, CD-GraB exhibits a linear speedup in convergence rate over
centralized GraB and outperforms baselines empirically, including distributed
RR, on a variety of benchmark tasks.
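As a rough illustration of the ordering idea (a minimal sketch of gradient-balancing reordering, not CD-GraB itself; the greedy front/back construction and the NumPy interface here are assumptions, and the distributed coordination is omitted), one epoch's stale per-example gradients can be used to build the next epoch's permutation:

    import numpy as np

    def balance_order(stale_grads):
        # stale_grads: (n, d) array of per-example gradients from the previous epoch.
        # Greedily place each example at the front or back of the new order so that
        # the running sum of centered gradients stays small (herding-style balancing).
        centered = stale_grads - stale_grads.mean(axis=0)
        running = np.zeros(stale_grads.shape[1])
        front, back = [], []
        for i, g in enumerate(centered):
            if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
                running += g
                front.append(i)
            else:
                running -= g
                back.append(i)
        return front + back[::-1]  # example order to use in the next epoch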
OCCL: a Deadlock-free Library for GPU Collective Communication
Various distributed deep neural network (DNN) training technologies lead to
increasingly complicated use of collective communications on GPU. The
deadlock-prone collectives on GPU force researchers to guarantee that
collectives are enqueued in a consistent order on each GPU to prevent
deadlocks. In complex distributed DNN training scenarios, manually hardcoding the
launch order is the only practical way to prevent deadlocks, which poses significant
challenges to the development of artificial intelligence. This paper presents
OCCL, which is, to the best of our knowledge, the first deadlock-free
collective communication library for GPU supporting dynamic decentralized
preemption and gang-scheduling for collectives. Leveraging the preemption
opportunity of collectives on GPU, OCCL dynamically preempts collectives in a
decentralized way via the deadlock-free collective execution framework and
allows dynamic decentralized gang-scheduling via the stickiness adjustment
scheme. With the help of OCCL, researchers no longer have to struggle to get
all GPUs to launch collectives in a consistent order to prevent deadlocks. We
implement OCCL with several optimizations and integrate OCCL with a distributed
deep learning framework OneFlow. Experimental results demonstrate that OCCL
achieves comparable or better latency and bandwidth for collectives compared to
NCCL, the state-of-the-art. When used in distributed DNN training, OCCL can
improve the peak training throughput by up to 78% compared to statically
sequenced NCCL, while introducing overheads of less than 6.5% across various
distributed DNN training approaches.
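To make the ordering hazard concrete, here is a minimal hypothetical sketch (written against torch.distributed purely for illustration; OCCL's own API is not described in the abstract) of two ranks enqueuing the same pair of collectives in different orders:

    import torch.distributed as dist

    def hazardous_step(rank, grad_a, grad_b):
        # Illustrative only: if rank 0 launches allreduce(A) then allreduce(B)
        # while rank 1 launches them in the opposite order, each rank blocks on a
        # collective the other has not yet entered, and training can hang.
        # OCCL's deadlock-free execution removes the need for this consistent ordering.
        if rank == 0:
            dist.all_reduce(grad_a)
            dist.all_reduce(grad_b)
        else:
            dist.all_reduce(grad_b)
            dist.all_reduce(grad_a)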
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing
samples across nodes usually yields the best performance, but poses scaling
challenges due to global information dissemination and load imbalance across
uneven sample lengths. State-of-the-art decentralized optimizers mitigate the
problem, but require more iterations to achieve the same accuracy as their
globally-communicating counterparts. We present Wait-Avoiding Group Model
Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global
communication via subgroup weight exchange. The key insight is a combination of
algorithmic changes to the averaging scheme and the use of a group allreduce
operation. We prove the convergence of WAGMA-SGD, and empirically show that it
retains convergence rates similar to Allreduce-SGD. For evaluation, we train
ResNet-50 on ImageNet; Transformer for machine translation; and deep
reinforcement learning for navigation at scale. Compared with state-of-the-art
decentralized SGD variants, WAGMA-SGD significantly improves training
throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves
the fastest time-to-solution (e.g., the highest score using the shortest
training time for Transformer).
Comment: Published in IEEE Transactions on Parallel and Distributed Systems
(IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021
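A minimal sketch of the subgroup averaging idea (assuming one torch.distributed process group per subgroup; this is a simplification for illustration, not the WAGMA-SGD implementation, and it omits the wait-avoiding, asynchronous aspects):

    import torch.distributed as dist

    def group_average(model, group, group_size):
        # Average model weights only within the caller's subgroup: a group
        # allreduce over each parameter, followed by division by the group size.
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
            p.data.div_(group_size)

    # Subgroups would be created once, e.g. group = dist.new_group(ranks=[0, 1, 2, 3]),
    # and periodically re-formed so information still spreads across all workers.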
Spartan Daily, September 23, 1986
Volume 87, Issue 18
Ongoing data reduction, theoretical studies
A nonspecific review of theory, correlative data analysis, and supporting research and technology is presented. Title pages in some of the following areas are included: (1) magnetosphere boundary observations; (2) Venus ionosphere and solar wind interaction; (3) ISEE-C plasma wave investigation; and (4) solar system plasmas.
The joint US/UK 1990 epoch world magnetic model
A detailed summary of the data used, analyses performed, modeling techniques employed, and results obtained in the course of the 1990 Epoch World Magnetic Modeling effort is given. The use and limitations of the GEOMAG algorithm are also presented. Charts and tables related to the 1990 World Magnetic Model (WMM-90) for the Earth's main field and secular variation in Mercator and polar stereographic projections are presented, along with useful tables of several magnetic field components and their secular variation on a 5-degree worldwide grid.
TorchRL: A data-driven decision-making library for PyTorch
Striking a balance between integration and modularity is crucial for a
machine learning library to be versatile and user-friendly, especially in
handling decision and control tasks that involve large development teams and
complex, real-world data and environments. To address this issue, we propose
TorchRL, a generalistic control library for PyTorch that provides
well-integrated, yet standalone components. With a versatile and robust
primitive design, TorchRL facilitates streamlined algorithm development across
the many branches of Reinforcement Learning (RL) and control. We introduce a
new PyTorch primitive, TensorDict, as a flexible data carrier that empowers the
integration of the library's components while preserving their modularity.
Hence replay buffers, datasets, distributed data collectors, environments,
transforms and objectives can be effortlessly used in isolation or combined. We
provide a detailed description of the building blocks, supporting code examples
and an extensive overview of the library across domains and tasks. Finally, we
show comparative benchmarks to demonstrate its computational efficiency.
TorchRL fosters long-term support and is publicly available on GitHub for
greater reproducibility and collaboration within the research community. The
code is open-sourced at https://github.com/pytorch/rl.
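As a brief usage sketch of the TensorDict primitive described above (assuming the tensordict package is installed; the keys and shapes below are illustrative choices, not something the abstract prescribes):

    import torch
    from tensordict import TensorDict

    # A TensorDict carries a batch of named tensors sharing leading batch dims,
    # so components can exchange data without agreeing on a fixed signature.
    batch = TensorDict(
        {
            "observation": torch.randn(4, 3),
            "action": torch.randn(4, 2),
            "reward": torch.zeros(4, 1),
        },
        batch_size=[4],
    )
    sub = batch[:2]          # indexes every entry along the batch dimension
    batch = batch.to("cpu")  # moves every entry with a single call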
The Use of Teams Games Tournament (TGT) to Develop Students’ Reading Skill at the First Grade of SMAN 4 Bone
The instrument of this research was a reading test. The test results indicated a significant difference between students' post-test scores in the experimental and control classes. In the experimental class, the total mean post-test score was 72.02, greater than the control class's total mean score of 61.62. From the t-test, the researcher found that the t-test value for the post-test was greater than the t-table value (5.94 > 2.000).
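The decision rule applied here can be reproduced with a standard independent-samples t-test; the following is a hedged sketch with made-up score arrays standing in for the study's data (the real study reports means of 72.02 and 61.62):

    from scipy import stats

    # Hypothetical post-test scores for the experimental and control classes.
    experimental = [75, 70, 72, 68, 74, 73]
    control = [60, 63, 59, 64, 61, 62]

    t_value, p_value = stats.ttest_ind(experimental, control)
    t_table = 2.000  # critical value used in the study at its degrees of freedom
    significant = abs(t_value) > t_table  # same comparison as 5.94 > 2.000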