Accelerating Data Loading in Deep Neural Network Training
Data loading can dominate deep neural network training time on large-scale
systems. We present a comprehensive study on accelerating data loading
performance in large-scale distributed training. We first identify performance
and scalability issues in current data loading implementations. We then propose
optimizations to the data loader design that better utilize CPU resources. We use an
analytical model to characterize the impact of data loading on the overall
training time and establish the performance trend as we scale up distributed
training. Our model suggests that I/O rate limits the scalability of
distributed training, which inspires us to design a locality-aware data loading
method. By utilizing software caches, our method can drastically reduce the
data loading communication volume in comparison with the original data loading
implementation. Finally, we evaluate the proposed optimizations with various
experiments. We achieve a more than 30x speedup in data loading using 256 nodes
with 1,024 learners.
Comment: 11 pages, 12 figures, accepted for publication in IEEE International Conference on High Performance Computing, Data and Analytics (HiPC) 201
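The locality-aware idea described above can be sketched as a per-worker software cache sitting in front of shared storage. The class, method names, and the sample-to-rank mapping below are illustrative assumptions, not the paper's actual implementation:

```python
from collections import OrderedDict

class LocalityAwareSampleCache:
    """Sketch of a software cache in front of shared storage; names and the
    sample-to-rank mapping are illustrative, not the paper's implementation."""

    def __init__(self, rank, world_size, read_fn, cache_size=1024):
        self.rank = rank
        self.world_size = world_size
        self.read_fn = read_fn            # the expensive I/O path (shared filesystem)
        self.cache = OrderedDict()        # LRU software cache: sample index -> data
        self.cache_size = cache_size
        self.io_reads = 0                 # counts reads that actually hit storage

    def owner(self, idx):
        # Deterministic sample-to-rank mapping: repeated epochs send the same
        # samples to the same rank, so its local cache keeps absorbing the I/O.
        return idx % self.world_size

    def get(self, idx):
        if idx in self.cache:
            self.cache.move_to_end(idx)   # LRU hit: no storage traffic
            return self.cache[idx]
        data = self.read_fn(idx)          # miss: one read from shared storage
        self.io_reads += 1
        self.cache[idx] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least-recently-used sample
        return data
```

Because the sample-to-rank mapping is fixed across epochs, each rank re-reads only the samples it owns, which is how a software cache can cut the data loading communication volume.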
An Empirical Analysis of Parallel Random Permutation Algorithms on SMPs
We compare parallel algorithms for random permutation generation on symmetric multiprocessors
(SMPs). Algorithms considered are the sorting-based algorithm, Anderson's shuffling
algorithm, the dart-throwing algorithm, and Sanders' algorithm. We investigate the impact
of synchronization method, memory access pattern, cost of generating random numbers
and other parameters on the performance of the algorithms. Within the range of inputs used and
processors employed, Anderson's algorithm is preferable due to its simplicity when random
number generation is relatively costly, while Sanders' algorithm has superior performance due
to good cache performance when a fast random number generator is available. There is no definite
winner across all settings; in fact, we predict that our new dart-throwing algorithm performs
best when synchronization among processors becomes costly and memory access is relatively
fast.
We also compare the performance of our parallel implementations with the sequential implementation.
Owing to the mismatch between the theoretical model and the actual architecture, it is unclear,
without extensive experimental study, whether fast parallel algorithms beat efficient sequential algorithms.
Our implementations achieve speedups of up to 6 with 12 processors on the Sun E4500.
This work was supported in part by NSF Grants CAREER ACI-00-93039, NSF DBI-0420513, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004
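As a concrete illustration of the dart-throwing idea discussed above, here is a sequential Python sketch. In the parallel version each processor throws its own items and resolves slot collisions with atomic operations; the `slack` factor is an illustrative parameter, not a value from the study:

```python
import random

def dart_throwing_permutation(n, slack=2, seed=None):
    """Sequential sketch of dart throwing: throw each of the n items at a
    random slot in an array of size slack*n, retrying on collision, then
    compact the occupied slots left to right."""
    rng = random.Random(seed)
    m = slack * n
    slots = [None] * m            # the oversized "dartboard"
    for item in range(n):
        while True:
            j = rng.randrange(m)  # throw a dart at a random slot
            if slots[j] is None:  # empty slot: the dart sticks
                slots[j] = item
                break
            # collision: retry (in parallel, this is where atomics are needed)
    return [x for x in slots if x is not None]
```

A larger `slack` means fewer collisions (less retry work, and less synchronization in the parallel setting) at the cost of more memory and a longer compaction pass.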
An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs)
We present an experimental study of parallel biconnected components algorithms
employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies
of these primitives demonstrate reasonable parallel speedups. However, when these
algorithms are used as subroutines to solve higher-level problems, there are two factors that hinder fast parallel implementations. One is parallel overhead, i.e., the large
constant factors hidden in the asymptotic bounds; the other is the discrepancy among
the data structures used by the primitives, which incurs non-negligible conversion costs.
We present various optimization techniques and a new parallel algorithm that significantly improve the performance of finding biconnected components of a graph
on symmetric multiprocessors (SMPs). Finding biconnected components has applications in fault-tolerant network design, and is also used in graph planarity testing.
Our parallel implementation achieves speedups of up to 4 using 12 processors on a Sun
E4500 for large, sparse graphs, and the source code is freely available at our web site
http://www.ece.unm.edu/~dbader.
This work was supported in part by NSF Grants CAREER ACI-00-93039, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, DBI-0420513, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004
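For reference, the classical sequential Hopcroft-Tarjan algorithm against which such parallel implementations are typically measured can be sketched as follows. This is a recursive, simple-graph sketch, not the paper's parallel algorithm: it tracks discovery times and low-link values during a DFS, pushing tree and back edges on a stack and popping off one biconnected component whenever an articulation condition is detected.

```python
from collections import defaultdict

def biconnected_components(n, edges):
    """Sequential Hopcroft-Tarjan sketch for a simple undirected graph:
    returns a list of biconnected components, each a list of edges."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [0] * n        # DFS discovery time (0 = unvisited)
    low = [0] * n         # lowest discovery time reachable via one back edge
    timer = [1]
    edge_stack = []       # edges of the component being assembled
    comps = []

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if v == parent:
                continue                       # skip the tree edge back to the parent
            if not disc[v]:                    # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:          # u separates v's subtree: pop a component
                    comp = []
                    while True:
                        e = edge_stack.pop()
                        comp.append(e)
                        if e == (u, v):
                            break
                    comps.append(comp)
            elif disc[v] < disc[u]:            # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if not disc[s]:
            dfs(s, -1)
    return comps
```

For example, two triangles sharing a single vertex form two biconnected components, with the shared vertex as the articulation point.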
Prediction of Adsorption in Nano-Pores with Graph Neural Networks
We investigate the graph-based convolutional neural network approach for
predicting and ranking gas adsorption properties of crystalline Metal-Organic
Framework (MOF) adsorbents for application in post-combustion capture of
CO2. Our model is based solely on standard structural input files
containing atomistic descriptions of the adsorbent material candidates. We
construct novel methodological extensions to match the prediction accuracy of
classical machine learning models that were built with hundreds of features at
much higher computational cost. Our approach can be more broadly applied to
optimize gas capture processes at industrial scale.
Comment: AAAI Conference on Artificial Intelligence (2022)
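The first step of any such graph-based approach is turning an atomistic description into a graph. A minimal sketch of that step, assuming raw Cartesian coordinates and a distance cutoff (the cutoff value is illustrative, and periodic boundary conditions, which a real MOF pipeline must handle, are ignored here):

```python
import numpy as np

def atoms_to_graph(coords, species, cutoff=3.0):
    """Illustrative sketch, not the paper's pipeline: connect atom pairs
    closer than `cutoff` to form a GNN-ready graph. Ignores periodicity."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distance matrix
    n = len(coords)
    # keep pairs within the cutoff, excluding self-loops on the diagonal
    src, dst = np.where((dist < cutoff) & ~np.eye(n, dtype=bool))
    edge_index = np.stack([src, dst])                # 2 x num_edges, both directions
    edge_attr = dist[src, dst]                       # distances as edge features
    node_feat = np.array(species)                    # atomic numbers as node features
    return node_feat, edge_index, edge_attr
```

The resulting node features, edge index, and edge distances are the standard inputs to a graph convolutional model.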
Designing Irregular Parallel Algorithms With Mutual Exclusion and Lock-free Protocols
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty of predicting memory access patterns and execution paths.
Within an irregular application, fine-grained synchronization is one technique for managing the coordination of work, but in practice the actual performance for irregular
problems depends on the input, the access pattern to shared data structures, the relative speed of processors, and the hardware support of synchronization primitives. In
this paper, we focus on lock-free and mutual exclusion protocols for handling fine-
grained synchronization. Mutual exclusion and lock-free protocols have received a fair
amount of attention in coordinating accesses to shared data structures from concurrent
processes. Mutual exclusion offers a simple programming abstraction, while lock-free
data structures provide better fault tolerance and eliminate problems associated with
critical sections such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms
under the SPMD paradigm, as their implementations are highly hardware dependent
and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors,
the IBM pSeries 570 and the Sun Enterprise 4500, that irregular parallel algorithms
with efficient fine-grained synchronization may yield good performance.
This work was supported in part by NSF Grants CAREER ACI-00-93039, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, DBI-0420513, ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004
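The lock-free style discussed above can be illustrated with a Treiber stack, where push and pop retry a compare-and-swap on the head pointer instead of holding a critical section. Python has no user-level CAS instruction, so the sketch below emulates the hardware primitive with a lock purely for illustration; on real hardware it is a single atomic instruction:

```python
import threading

class _EmulatedCAS:
    """Stand-in for a hardware compare-and-swap; the lock here only emulates
    the atomicity that a real CAS instruction provides."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:   # identity check, as hardware CAS compares words
                self._value = new
                return True
            return False

class _Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class TreiberStack:
    """Lock-free stack sketch: no critical sections, so no priority inversion
    or deadlock; contended operations simply retry the CAS."""

    def __init__(self):
        self.head = _EmulatedCAS(None)

    def push(self, value):
        while True:
            old = self.head.load()
            if self.head.compare_and_swap(old, _Node(value, old)):
                return                    # CAS succeeded; retry loop on contention

    def pop(self):
        while True:
            old = self.head.load()
            if old is None:
                return None               # empty stack
            if self.head.compare_and_swap(old, old.next):
                return old.value
```

The retry loops are where the input-dependent cost mentioned in the abstract shows up: under heavy contention many CAS attempts fail, whereas a mutex would instead serialize the threads.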