23 research outputs found
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly-scalable algorithms. Large datasets are most
commonly processed "data parallel" distributed across many nodes. Each node's
contribution to the overall gradient is summed using a global allreduce. This
allreduce is the single communication and thus scalability bottleneck for most
machine learning workloads. We observe that frequently, many gradient values
are (close to) zero, leading to sparse of sparsifyable communications. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly-scalable machine learning frameworks
Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems
Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models
Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort
MPI uses the concept of communicators to connect groups of processes. It
provides nonblocking collective operations on communicators to overlap
communication and computation. Flexible algorithms demand flexible
communicators. E.g., a process can work on different subproblems within
different process groups simultaneously, new process groups can be created, or
the members of a process group can change. Depending on the number of
communicators, the time for communicator creation can drastically increase the
running time of the algorithm. Furthermore, a new communicator synchronizes all
processes as communicator creation routines are blocking collective operations.
We present RBC, a communication library based on MPI, that creates
range-based communicators in constant time without communication. These RBC
communicators support (non)blocking point-to-point communication as well as
(non)blocking collective operations. Our experiments show that the library
reduces the time to create a new communicator by a factor of more than 400
whereas the running time of collective operations remains about the same. We
propose Janus Quicksort, a distributed sorting algorithm that avoids any load
imbalances. We improved the performance of this algorithm by a factor of 15 for
moderate inputs by using RBC communicators. Finally, we discuss different
approaches to bring nonblocking (local) communicator creation of lightweight
(range-based) communicators into MPI
Optimizing computation-communication overlap in asynchronous task-based programs
Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.Peer ReviewedPostprint (author's final draft
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 202