
    Accelerating Data Loading in Deep Neural Network Training

    Data loading can dominate deep neural network training time on large-scale systems. We present a comprehensive study of accelerating data loading in large-scale distributed training. We first identify performance and scalability issues in current data loading implementations. We then propose optimizations to the data loader design that make better use of CPU resources. We use an analytical model to characterize the impact of data loading on overall training time and to establish the performance trend as distributed training scales up. The model suggests that I/O rate limits the scalability of distributed training, which motivates a locality-aware data loading method. By using software caches, our method drastically reduces the data loading communication volume compared with the original data loading implementation. Finally, we evaluate the proposed optimizations with various experiments, achieving more than a 30x speedup in data loading using 256 nodes with 1,024 learners.
    Comment: 11 pages, 12 figures, accepted for publication in IEEE International Conference on High Performance Computing, Data and Analytics (HiPC) 2019.
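
    The locality-aware idea can be illustrated with a small sketch (not the paper's implementation): each learner keeps a software cache of samples it has already fetched, so later epochs hit local memory instead of the shared file system. The read_sample callable, the LRU policy, and the cache capacity are assumptions made purely for illustration.

    from collections import OrderedDict

    class CachingLoader:
        """Toy per-node data loader with an LRU software cache."""
        def __init__(self, read_sample, capacity=10_000):
            self.read_sample = read_sample        # callable: sample index -> sample data
            self.capacity = capacity
            self.cache = OrderedDict()            # software cache kept in LRU order

        def get(self, idx):
            if idx in self.cache:                 # cache hit: no I/O or communication
                self.cache.move_to_end(idx)
                return self.cache[idx]
            sample = self.read_sample(idx)        # miss: fetch from storage (or a peer node)
            self.cache[idx] = sample
            if len(self.cache) > self.capacity:   # evict the least-recently-used sample
                self.cache.popitem(last=False)
            return sample

    # Usage: only the first access to a sample pays the I/O cost.
    loader = CachingLoader(read_sample=lambda i: f"sample-{i}".encode())
    assert loader.get(3) == loader.get(3)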

    An Empirical Analysis of Parallel Random Permutation Algorithms on SMPs

    We compare parallel algorithms for random permutation generation on symmetric multiprocessors (SMPs). The algorithms considered are the sorting-based algorithm, Anderson's shuffling algorithm, the dart-throwing algorithm, and Sanders' algorithm. We investigate the impact of the synchronization method, memory access pattern, cost of generating random numbers, and other parameters on the performance of the algorithms. Within the range of inputs used and processors employed, Anderson's algorithm is preferable for its simplicity when random number generation is relatively costly, while Sanders' algorithm performs best when a fast random number generator is available, owing to its good cache behavior. There is no definite winner across all settings; in fact, we predict that our new dart-throwing algorithm performs best when synchronization among processors is costly and memory access is relatively fast. We also compare the performance of our parallel implementations with the sequential implementation. Because of the mismatch between model and architecture, it is unclear without extensive experimental study whether fast parallel algorithms beat efficient sequential algorithms. Our implementations achieve speedups of up to 6 with 12 processors on the Sun E4500.
    This work was supported in part by NSF Grants CAREER ACI-00-93039, NSF DBI-0420513, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004.
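
    The dart-throwing approach can be sketched as follows; this is a sequential toy version rather than the paper's SMP implementation, and the oversampling factor of 2 is an assumption for illustration. Each element is thrown at a random slot of an oversized target array, retrying on collision; compacting the occupied slots yields a random permutation.

    import random

    def dart_throwing_permutation(n, oversample=2, seed=None):
        """Generate a random permutation of range(n) by dart throwing."""
        rng = random.Random(seed)
        m = oversample * n
        slots = [None] * m                  # oversized target array keeps collisions rare
        for x in range(n):
            while True:
                i = rng.randrange(m)        # throw a dart at a random slot
                if slots[i] is None:        # empty slot: claim it (in the parallel version
                    slots[i] = x            # this claim is the synchronization point)
                    break
        return [x for x in slots if x is not None]   # compact the occupied slots

    print(dart_throwing_permutation(10, seed=42))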

    An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs)

    We present an experimental study of parallel biconnected components algorithms employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies of these primitives demonstrate reasonable parallel speedups. However, when these primitives are used as subroutines to solve higher-level problems, two factors hinder fast parallel implementations. One is parallel overhead, i.e., the large constant factors hidden in the asymptotic bounds; the other is the discrepancy among the data structures used by the primitives, which introduces non-negligible conversion costs. We present various optimization techniques and a new parallel algorithm that significantly improve the performance of finding the biconnected components of a graph on symmetric multiprocessors (SMPs). Finding biconnected components has applications in fault-tolerant network design and is also used in graph planarity testing. Our parallel implementation achieves speedups of up to 4 using 12 processors on a Sun E4500 for large, sparse graphs, and the source code is freely available at our web site, http://www.ece.unm.edu/~dbader.
    This work was supported in part by NSF Grants CAREER ACI-00-93039, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, DBI-0420513, ITR EF/BIO 03-31654, and DBI-04-20513; and DARPA Contract NBCH30390004.
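
    For reference, a sequential Hopcroft-Tarjan-style decomposition can be sketched in a few lines; this minimal illustrative version does not attempt to reproduce the paper's parallel SMP algorithm built from primitives such as spanning tree and list ranking.

    from collections import defaultdict

    def biconnected_components(n, edges):
        """Return the biconnected components of an undirected graph as edge sets."""
        adj = defaultdict(list)
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        disc = [0] * n                        # discovery times (0 = unvisited)
        low = [0] * n                         # low-link values
        timer = [1]
        edge_stack, comps = [], []

        def dfs(u, parent):
            disc[u] = low[u] = timer[0]; timer[0] += 1
            for v in adj[u]:
                if v == parent:
                    continue
                if not disc[v]:
                    edge_stack.append((u, v))
                    dfs(v, u)
                    low[u] = min(low[u], low[v])
                    if low[v] >= disc[u]:     # u separates v's subtree: pop one component
                        comp = set()
                        while True:
                            e = edge_stack.pop()
                            comp.add(e)
                            if e == (u, v):
                                break
                        comps.append(comp)
                elif disc[v] < disc[u]:       # back edge
                    edge_stack.append((u, v))
                    low[u] = min(low[u], disc[v])

        for s in range(n):
            if not disc[s]:
                dfs(s, -1)
        return comps

    # A triangle attached to a pendant edge splits into two components.
    print(biconnected_components(4, [(0, 1), (1, 2), (2, 0), (2, 3)]))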

    Prediction of $\textrm{CO}_2$ Adsorption in Nano-Pores with Graph Neural Networks

    We investigate the graph-based convolutional neural network approach for predicting and ranking gas adsorption properties of crystalline Metal-Organic Framework (MOF) adsorbents for application in post-combustion capture of $\textrm{CO}_2$. Our model is based solely on standard structural input files containing atomistic descriptions of the adsorbent material candidates. We construct novel methodological extensions to match the prediction accuracy of classical machine learning models that were built with hundreds of features at much higher computational cost. Our approach can be more broadly applied to optimize gas capture processes at industrial scale.
    Comment: AAAI Conference on Artificial Intelligence (2022).
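
    A minimal sketch of the first step of such an approach, turning an atomistic structure (atom types plus coordinates) into a graph whose nodes are atoms and whose edges connect atoms within a distance cutoff, is shown below. The 4 Å cutoff and the toy coordinates are assumptions for illustration; the paper builds its graphs from standard MOF structure files.

    import numpy as np

    def build_crystal_graph(symbols, coords, cutoff=4.0):
        """Build a simple distance-cutoff graph from atom symbols and 3-D coordinates."""
        coords = np.asarray(coords, dtype=float)
        n = len(symbols)
        edges, distances = [], []
        for i in range(n):
            for j in range(i + 1, n):
                d = float(np.linalg.norm(coords[i] - coords[j]))
                if d <= cutoff:              # neighbour edge for the GNN
                    edges.append((i, j))
                    distances.append(d)      # interatomic distance as an edge feature
        return {"nodes": symbols, "edges": edges, "edge_dist": distances}

    graph = build_crystal_graph(["Zn", "O", "C"], [[0, 0, 0], [1.9, 0, 0], [3.1, 1.0, 0]])
    print(graph["edges"])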

    Designing Irregular Parallel Algorithms With Mutual Exclusion and Lock-free Protocols

    Get PDF
    Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty of predicting memory access patterns or execution paths. Within an irregular application, fine-grained synchronization is one technique for coordinating work, but in practice the performance achieved on irregular problems depends on the input, the access pattern to shared data structures, the relative speed of the processors, and the hardware support for synchronization primitives. In this paper, we focus on lock-free and mutual exclusion protocols for handling fine-grained synchronization. Mutual exclusion and lock-free protocols have received a fair amount of attention for coordinating access to shared data structures from concurrent processes. Mutual exclusion offers a simple programming abstraction, while lock-free data structures provide better fault tolerance and eliminate problems associated with critical sections, such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms under the SPMD paradigm, because their implementations are highly hardware-dependent and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors, the IBM pSeries 570 and the Sun Enterprise 4500, demonstrating that irregular parallel algorithms with efficient fine-grained synchronization can yield good performance.
    This work was supported in part by NSF Grants CAREER ACI-00-93039, ITR ACI-00-81404, DEB-99-10123, ITR EIA-01-21377, Biocomplexity DEB-01-20709, DBI-0420513, and ITR EF/BIO 03-31654; and DARPA Contract NBCH30390004.
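
    The fine-grained synchronization pattern can be sketched with Python threads (illustrative only, since Python's global interpreter lock prevents true SMP parallelism): each vertex carries its own tiny lock, and a worker claims a vertex with a non-blocking acquire, playing the role that a hardware test-and-set or compare-and-swap plays in the paper's implementations. The breadth-first claiming traversal below is an assumed example, not one of the paper's benchmark algorithms.

    import threading
    from collections import deque

    def claim_vertices(adj, source, num_workers=4):
        """Each reachable vertex is claimed by exactly one worker via a per-vertex lock."""
        n = len(adj)
        claims = [threading.Lock() for _ in range(n)]   # one fine-grained lock per vertex
        owner = [None] * n
        frontier = deque([source])
        claims[source].acquire()
        owner[source] = -1                              # source is claimed up front

        def worker(wid):
            while True:
                try:
                    u = frontier.popleft()              # deque operations are thread-safe
                except IndexError:
                    return                              # no work left for this worker
                for v in adj[u]:
                    if claims[v].acquire(blocking=False):   # non-blocking claim of v
                        owner[v] = wid                      # only one worker ever wins v
                        frontier.append(v)

        threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return owner

    # A 4-cycle: every vertex ends up owned by exactly one worker.
    print(claim_vertices([[1, 2], [0, 3], [0, 3], [1, 2]], source=0))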