Automating Topology Aware Mapping for Supercomputers
Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap, and communication over the network is becoming the bottleneck for scaling parallel applications. Network contention, in particular, is an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention.
Most parallel applications have a characteristic communication topology. Mapping the tasks of a parallel application, based on their communication graph, to the physical processors of a machine can lead to performance improvements. The research problem under consideration is mapping an application's communication graph onto the interconnect topology of a machine while trying to localize communication.
The farther messages travel on the network, the greater the chance of resource sharing between messages. On networks in common use today, this sharing creates contention. Evaluation studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Recognizing this, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placing communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention.
Performance improvements through topology-aware placement for applications such as NAMD and OpenAtom motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for the automatic mapping of parallel applications, relieving application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms. The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology-aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with the capability to choose the best mapping for a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques, which will be imperative for machines of the future.
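To make the hop-bytes metric mentioned above concrete, here is a minimal sketch (not the dissertation's implementation) that scores a task-to-processor mapping on a 3D torus by summing message size times hops traveled over all messages; the communication graph, torus dimensions, and the two candidate mappings are made-up examples.

```python
# Minimal sketch of the hop-bytes metric on a 3D torus (illustrative only).
# hop-bytes = sum over messages of (bytes sent) * (hops traveled on the torus).

def torus_hops(a, b, dims):
    """Shortest-path hop count between coordinates a and b on a wraparound torus."""
    hops = 0
    for x, y, d in zip(a, b, dims):
        delta = abs(x - y)
        hops += min(delta, d - delta)  # wraparound links allow going either way
    return hops

def hop_bytes(comm_graph, mapping, dims):
    """comm_graph: {(src_task, dst_task): bytes}, mapping: task -> torus coordinate."""
    return sum(size * torus_hops(mapping[s], mapping[d], dims)
               for (s, d), size in comm_graph.items())

# Toy example: a 4-task ring mapped onto a 4x4x4 torus in two different ways.
dims = (4, 4, 4)
ring = {(0, 1): 1024, (1, 2): 1024, (2, 3): 1024, (3, 0): 1024}
compact = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 0), 3: (0, 1, 0)}
spread  = {0: (0, 0, 0), 1: (2, 0, 0), 2: (2, 2, 0), 3: (0, 2, 0)}
print(hop_bytes(ring, compact, dims))  # 4096: fewer hop-bytes, less link sharing
print(hop_bytes(ring, spread, dims))   # 8192
```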
Mapping applications with collectives over sub-communicators on torus networks
The placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, this task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing between messages and contention. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth, which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface to create a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations, such as splits, tilts, or shifts, that can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving the performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators.
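As a rough illustration of the cube-versus-plane argument above, the following sketch (not Rubik itself) counts the torus links internal to a 64-task sub-communicator when it is laid out as an 8x8x1 plane versus a 4x4x4 cube; the shapes and the link-counting criterion are illustrative assumptions.

```python
# Sketch: compare the number of torus links internal to a sub-communicator's
# block when it is laid out as a plane vs. a cube (illustrative, not Rubik).

def internal_links(shape):
    """Count mesh links strictly inside an axis-aligned block of a 3D torus."""
    a, b, c = shape
    return (a - 1) * b * c + a * (b - 1) * c + a * b * (c - 1)

plane = (8, 8, 1)   # 64 tasks placed as a plane
cube = (4, 4, 4)    # the same 64 tasks placed as a cube
print(internal_links(plane))  # 112 links
print(internal_links(cube))   # 144 links: more paths, more bandwidth for collectives
```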
AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning
In the last few years, the memory requirements to train state-of-the-art
neural networks have far exceeded the DRAM capacities of modern hardware
accelerators. This has necessitated the development of efficient algorithms to
train these neural networks in parallel on large-scale GPU-based clusters.
Since computation is relatively inexpensive on modern GPUs, designing and
implementing extremely efficient communication in these parallel training
algorithms is critical for extracting the maximum performance. This paper
presents AxoNN, a parallel deep learning framework that exploits asynchrony and
message-driven execution to schedule neural network operations on each GPU,
thereby reducing GPU idle time and maximizing hardware efficiency. By using the
CPU memory as a scratch space for offloading data periodically during training,
AxoNN is able to reduce GPU memory consumption by four times. This allows us to
increase the number of parameters per GPU by four times, thus reducing the
amount of communication and increasing performance by over 13%. When tested
against large transformer models with 12-100 billion parameters on 48-384
NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of
theoretical peak and reduces the training time by 22-37 days (15-25% speedup)
as compared to the state-of-the-art.Comment: Proceedings of the IEEE International Parallel & Distributed
Processing Symposium (IPDPS). IEEE Computer Society, May 202
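The following is a minimal PyTorch-style sketch of the general offloading idea described above, not AxoNN's actual implementation: tensors are staged in pinned CPU buffers on a side CUDA stream so the copies can overlap with compute, and are prefetched back before reuse. The helper names and sizes are hypothetical.

```python
# Sketch of using CPU memory as a scratch space for GPU tensors (illustrative).
import torch

offload_stream = torch.cuda.Stream()   # side stream so copies overlap with compute

def offload_to_cpu(gpu_tensor):
    """Asynchronously copy a GPU tensor into a pinned CPU buffer on the side stream."""
    cpu_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                          device="cpu", pin_memory=True)
    with torch.cuda.stream(offload_stream):
        cpu_buf.copy_(gpu_tensor, non_blocking=True)
    return cpu_buf

def prefetch_to_gpu(cpu_buf, device="cuda"):
    """Asynchronously copy a pinned CPU buffer back to the GPU on the side stream."""
    with torch.cuda.stream(offload_stream):
        return cpu_buf.to(device, non_blocking=True)

# Usage sketch: stash a tensor on the host while it is not needed,
# then bring it back before it is used again.
w = torch.randn(4096, 4096, device="cuda")
host_copy = offload_to_cpu(w)
offload_stream.synchronize()   # ensure the device-to-host copy finished
w = prefetch_to_gpu(host_copy)
offload_stream.synchronize()   # ensure the host-to-device copy finished
```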
Communication-minimizing Asynchronous Tensor Parallelism
As state-of-the-art neural networks scale to billions of parameters,
designing parallel algorithms that can train these networks efficiently on
multi-GPU clusters has become critical. This paper presents Tensor3D, a novel
three-dimensional (3D) approach to parallelizing tensor computations that
strives to minimize the idle time incurred due to communication in parallel
training of large multi-billion parameter models. First, we introduce an
intelligent distribution of neural network parameters across GPUs that
eliminates communication required for satisfying data dependencies of
individual layers. Then, we propose a novel overdecomposition of the parallel
training process, using which we achieve significant overlap of communication
with computation, thereby reducing GPU idle time. Finally, we present a
communication model, which helps users identify communication optimal
decompositions of available hardware resources for a given neural network. For
a 28B parameter CNN on 256 A100 GPUs, Tensor3D improves the training time by
nearly 60% as compared to Megatron-LM.
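To illustrate the overlap of communication with computation that overdecomposition enables, here is a hedged sketch using torch.distributed (not Tensor3D's code): the all-gather for the next chunk is launched asynchronously while the current chunk is being computed. The function and argument names are hypothetical, and a process group is assumed to be initialized already.

```python
# Sketch of hiding communication behind computation via overdecomposition:
# the all-gather for chunk i+1 is in flight while chunk i is being computed.
import torch
import torch.distributed as dist  # assumes dist.init_process_group was called

def forward_chunks(chunks, gather_bufs, compute_fn):
    """chunks: list of local shards; gather_bufs: per-chunk lists of output tensors."""
    handle = dist.all_gather(gather_bufs[0], chunks[0], async_op=True)
    outputs = []
    for i in range(len(chunks)):
        next_handle = None
        if i + 1 < len(chunks):
            # start communication for chunk i+1 before computing chunk i
            next_handle = dist.all_gather(gather_bufs[i + 1], chunks[i + 1],
                                          async_op=True)
        handle.wait()                                   # chunk i's data is complete
        outputs.append(compute_fn(torch.cat(gather_bufs[i])))
        handle = next_handle
    return outputs
```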
Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization
Despite their better convergence properties compared to first-order
optimizers, second-order optimizers for deep learning have been less popular
due to their significant computational costs. The primary efficiency bottleneck
in such optimizers is matrix inverse calculations in the preconditioning step,
which are expensive to compute on GPUs. In this paper, we introduce Jorge, a
second-order optimizer that promises the best of both worlds -- rapid
convergence benefits of second-order methods, and high computational efficiency
typical of first-order methods. We address the primary computational bottleneck
of computing matrix inverses by completely eliminating them using an
approximation of the preconditioner computation. This makes Jorge extremely
efficient on GPUs in terms of wall-clock time. Further, we describe an approach
to determine Jorge's hyperparameters directly from a well-tuned SGD baseline,
thereby significantly reducing tuning efforts. Our empirical evaluations
demonstrate the distinct advantages of Jorge, which outperforms
state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple
deep learning models, in terms of both sample efficiency and wall-clock time.
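As a generic illustration of replacing an explicit matrix inverse in a preconditioning step with a cheap, GPU-friendly approximation, the sketch below uses a truncated Neumann series built purely from matrix multiplies; this shows the general idea only and is not Jorge's actual update rule.

```python
# Sketch: approximate a preconditioner inverse without an explicit matrix
# inverse, using a truncated Neumann series (matrix multiplies only).
import torch

def approx_inverse(A, num_terms=4, eps=1e-6):
    """Approximate A^{-1} for a symmetric positive-definite matrix A.

    Uses A^{-1} ~= (1/c) * sum_k (I - A/c)^k, with c chosen so the series
    converges (any upper bound on the largest eigenvalue works).
    """
    n = A.shape[0]
    c = A.abs().sum(dim=1).max() + eps      # cheap bound on the largest eigenvalue
    eye = torch.eye(n, device=A.device, dtype=A.dtype)
    M = eye - A / c
    acc = eye.clone()
    term = eye.clone()
    for _ in range(num_terms):
        term = term @ M                     # (I - A/c)^k, built incrementally
        acc = acc + term
    return acc / c

# Toy check against the exact inverse.
A = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
print(approx_inverse(A, num_terms=20))
print(torch.linalg.inv(A))
```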
A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks
The field of deep learning has witnessed a remarkable shift towards extremely
compute- and memory-intensive neural networks. These newer larger models have
enabled researchers to advance state-of-the-art tools across a variety of
fields. This phenomenon has spurred the development of algorithms for
distributed training of neural networks over a larger number of hardware
accelerators. In this paper, we discuss and compare current state-of-the-art
frameworks for large scale distributed deep learning. First, we survey current
practices in distributed learning and identify the different types of
parallelism used. Then, we present empirical results comparing their
performance on large image and language training tasks. Additionally, we
address their statistical efficiency and memory consumption behavior. Based on
our results, we discuss the algorithmic and implementation aspects of each
framework that hinder performance.
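For readers unfamiliar with the parallelism types compared in the survey, here is a minimal sketch of the simplest one, data parallelism, using PyTorch's DistributedDataParallel; the model, data loader, and hyperparameters are placeholders, and the sketch is not tied to any framework evaluated in the paper.

```python
# Minimal sketch of data parallelism: each rank holds a full model replica and
# gradients are averaged across ranks during backward.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, loader, epochs=1, lr=1e-3):
    dist.init_process_group(backend="nccl")      # assumes launch via torchrun
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(ddp_model(x.cuda()), y.cuda())
            loss.backward()                       # DDP all-reduces gradients here
            opt.step()
    dist.destroy_process_group()
```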
ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems
Parallel applications can spend a significant amount of time performing I/O
on large-scale supercomputers. Fast near-compute storage accelerators called
burst buffers can reduce the time a processor spends performing I/O and
mitigate I/O bottlenecks. However, determining if a given application could be
accelerated using burst buffers is not straightforward even for storage
experts. The relationship between an application's I/O characteristics (such as
I/O volume, processes involved, etc.) and the best storage sub-system for it
can be complicated. As a result, adapting parallel applications to use burst
buffers efficiently is a trial-and-error process. In this work, we present a
Python-based tool called PrismIO that enables programmatic analysis of I/O
traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel
file systems and explain why certain I/O patterns perform poorly. Further, we
use machine learning to model the relationship between I/O characteristics and
burst buffer selections. We run IOR (an I/O benchmark) with various I/O
characteristics on different storage systems and collect performance data. We
use the data as the input for training the model. Our model can predict
whether a file of an application should be placed on burst buffers with an
accuracy of 94.47% for unseen IOR scenarios and 95.86% for four real applications.
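A hedged sketch of the modeling step described above: train a classifier on I/O characteristics to predict whether a file benefits from burst buffer placement. The feature names, synthetic data, and random-forest choice are assumptions for illustration, not PrismIO's actual schema or model.

```python
# Sketch: learn a mapping from I/O characteristics to the better storage
# sub-system (burst buffer vs. parallel file system). Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: I/O volume (MB), number of processes, transfer size (KB)
X = np.column_stack([
    rng.uniform(1, 10_000, n),      # io_volume_mb
    rng.integers(1, 1024, n),       # num_processes
    rng.uniform(4, 4096, n),        # transfer_size_kb
])
# Hypothetical label: 1 if the burst buffer was the faster placement
y = (X[:, 0] * X[:, 1] > 1_000_000).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("placement prediction accuracy:", clf.score(X_test, y_test))
```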
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training
A new neural network architecture called Mixture-of-Experts (MoE) has been
proposed recently that increases the parameters of a neural network (the base
model) by adding sparsely activated expert blocks, without changing the total
number of floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to 128
expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high-quality MoE models requires
us to scale the size of the base models, along with the number of expert
blocks. In this work, we propose a novel, three-dimensional, hybrid parallel
algorithm that combines tensor, expert, and data parallelism to enable the
training of MoE models with 4-8x larger base models than the current
state-of-the-art -- DeepSpeed-MoE. We propose memory optimizations in the
optimizer step, and communication optimizations that eliminate redundant
movement of data. Removing these redundancies provides a speedup of nearly 21%.
When training a 40 billion parameter MoE model (6.7 billion base model with 16
experts) on 128 V100 GPUs, our optimizations significantly improve the achieved
half-precision flop/s from 20% to 27% of peak.
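To show the structure of a tensor-expert-data decomposition, the sketch below carves a set of ranks into the three kinds of groups such a hybrid algorithm needs; in a real system each list would back a torch.distributed process group. The group sizes are illustrative, and this is not the authors' implementation.

```python
# Sketch of a 3D decomposition for tensor-expert-data hybrid parallelism:
# every rank belongs to one tensor-parallel group, one expert-parallel group,
# and one data-parallel group.
import numpy as np

def build_groups(world_size, tensor_par, expert_par):
    """Return (tensor, expert, data) group membership lists for all ranks."""
    assert world_size % (tensor_par * expert_par) == 0
    data_par = world_size // (tensor_par * expert_par)
    # Arrange ranks in a (data, expert, tensor) grid; each group varies one axis.
    grid = np.arange(world_size).reshape(data_par, expert_par, tensor_par)
    tensor_groups = [list(grid[d, e, :]) for d in range(data_par)
                     for e in range(expert_par)]
    expert_groups = [list(grid[d, :, t]) for d in range(data_par)
                     for t in range(tensor_par)]
    data_groups = [list(grid[:, e, t]) for e in range(expert_par)
                   for t in range(tensor_par)]
    return tensor_groups, expert_groups, data_groups

# Example: 16 GPUs, tensor-parallel 2, expert-parallel 4, data-parallel 2.
t, e, d = build_groups(16, tensor_par=2, expert_par=4)
print(t[0], e[0], d[0])
```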
Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers
Large parallel machines with hundreds of thousands of processors are being built. Recent studies have shown that ensuring good load balance is critical for scaling certain classes of parallel applications, even on thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to yield poor load balance on very large machines. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and the poor solutions of traditional distributed schemes. This is done by creating multiple levels of aggressive load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We present techniques to deal with the scalability challenges of load balancing at very large scale. We show performance data for the hierarchical load balancing method on up to 16,384 cores of Ranger (at TACC) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on the Blue Gene/P machine at ANL.
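A simplified two-level sketch of the tree idea described above, not Charm++'s measurement-based framework: tasks are first partitioned across load balancing domains, and each domain then balances its share across its own processors, so no single component has to compute the full task-to-processor assignment at once.

```python
# Toy sketch of hierarchical load balancing with two levels of domains.
import heapq

def greedy_balance(tasks, num_bins):
    """Place each task (largest first) on the currently least-loaded bin."""
    heap = [(0.0, b) for b in range(num_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(num_bins)]
    for load in sorted(tasks, reverse=True):
        total, b = heapq.heappop(heap)
        bins[b].append(load)
        heapq.heappush(heap, (total + load, b))
    return bins

def hierarchical_balance(tasks, domains, procs_per_domain):
    # Upper level: partition tasks across domains.
    domain_bins = greedy_balance(tasks, domains)
    # Lower level: each domain balances its own tasks across its processors.
    return [greedy_balance(d, procs_per_domain) for d in domain_bins]

loads = [5, 3, 8, 1, 9, 2, 7, 4, 6, 2, 3, 5]
print(hierarchical_balance(loads, domains=2, procs_per_domain=4))
```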