4 research outputs found

    Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters

    In this paper, we evaluate training of deep recurrent neural networks with half-precision floats. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and to make use of high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to O(100) workers. Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show linear runtime and logarithmic communication time scaling for both single- and mixed-precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions, and on the benchmark Large Movie Review Dataset (IMDB). Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to single precision.
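
    To make the training setup concrete, the following is a minimal, hedged sketch of mixed-precision, data-parallel synchronous training of an LSTM classifier in TensorFlow. It is not the authors' implementation: Horovod is used here as a stand-in for the custom TensorFlow + CUDA-aware MPI integration described in the abstract, and the linear learning-rate scaling with worker count is an illustrative choice rather than the paper's exact schedule.

```python
# Hedged sketch only: data-parallel synchronous training of an LSTM classifier
# with half-precision compute in TensorFlow. Horovod stands in for the custom
# TensorFlow + CUDA-aware MPI integration the paper describes; the learning-rate
# scaling below is illustrative, not the paper's schedule.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Pin each rank to one local GPU.
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Mixed precision: float16 compute/activations with float32 master weights.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(20000, 128),
    tf.keras.layers.LSTM(256),
    # Keep the output layer in float32 for numerical stability.
    tf.keras.layers.Dense(1, activation='sigmoid', dtype='float32'),
])

# Illustrative linear learning-rate scaling with the number of workers.
opt = tf.keras.optimizers.Adam(1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)  # synchronous allreduce of gradients
# Under the mixed_float16 policy, compile() adds dynamic loss scaling automatically.
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
```

    In practice such a script would be launched with one process per GPU (e.g. via mpirun or horovodrun), with gradients averaged across all ranks at every step.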

    Convergence of Artificial Intelligence and High Performance Computing on NSF-supported Cyberinfrastructure

    Significant investments to upgrade or construct large-scale scientific facilities demand commensurate investments in R&D to design algorithms and computing approaches that enable scientific and engineering breakthroughs in the big-data era. The remarkable success of Artificial Intelligence (AI) algorithms in turning big-data challenges in industry and technology into transformational digital solutions that drive a multi-billion dollar industry, and that play an ever-increasing role in shaping human social patterns, has promoted AI as the most sought-after signal processing tool in big-data research. As AI continues to evolve into a computing tool endowed with statistical and mathematical rigor, one that encodes domain expertise to inform and inspire AI architectures and optimization algorithms, it has become apparent that single-GPU solutions for training, validation, and testing are no longer sufficient. This realization has been driving the confluence of AI and high performance computing (HPC) to reduce time-to-insight and to produce robust, reliable, trustworthy, and computationally efficient AI solutions. In this white paper, we present a summary of recent developments in this field and discuss avenues to accelerate and streamline the use of HPC platforms to design accelerated AI algorithms.
    Comment: White paper accepted to the NSF Workshop on Smart Cyberinfrastructure, February 25-27, 2020, http://smartci.sci.utah.edu

    EZLDA: Efficient and Scalable LDA on GPUs

    LDA is a statistical approach to topic modeling with a wide range of applications. However, there have been very few attempts to accelerate LDA on GPUs, which offer exceptional compute and memory throughput. To this end, we introduce EZLDA, which achieves efficient and scalable LDA training on GPUs through three contributions. First, EZLDA introduces a three-branch sampling method that exploits the convergence heterogeneity of different tokens to reduce redundant sampling work. Second, to enable sparsity-aware formats for both the document-topic matrix D and the word-topic matrix W on GPUs with fast sampling and updating, we introduce a hybrid format for W along with a corresponding token partition to T and inverted index designs. Third, we design a hierarchical workload balancing solution to address the extremely skewed workload imbalance on GPUs and to scale EZLDA across multiple GPUs. Taken together, EZLDA achieves superior performance over state-of-the-art attempts with lower memory consumption.
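
    For context, the sketch below shows the per-token collapsed Gibbs sampling update that GPU LDA systems such as EZLDA accelerate. It is a dense, single-threaded NumPy illustration of the baseline computation only; it does not reproduce EZLDA's three-branch sampling, sparse D/W formats, or hierarchical workload balancing, and the hyperparameters alpha and beta are illustrative.

```python
# Baseline per-token collapsed Gibbs sampling for LDA (dense, single-threaded).
# This is the computation that GPU LDA systems accelerate, not EZLDA's kernels.
import numpy as np

def gibbs_sweep(tokens, docs, z, D, W, n_k, alpha=0.1, beta=0.01):
    """tokens[i]: word id, docs[i]: document id, z[i]: current topic of token i.
    D: doc-topic counts (num_docs x K), W: word-topic counts (vocab x K),
    n_k: total token count per topic, shape (K,)."""
    K = n_k.shape[0]
    V = W.shape[0]
    for i in range(len(tokens)):
        w, d, k_old = tokens[i], docs[i], z[i]
        # Remove the token's current assignment from the count matrices.
        D[d, k_old] -= 1; W[w, k_old] -= 1; n_k[k_old] -= 1
        # Full conditional p(z_i = k | rest), up to normalization.
        p = (D[d] + alpha) * (W[w] + beta) / (n_k + V * beta)
        p /= p.sum()
        k_new = np.random.choice(K, p=p)
        # Add the token back under its newly sampled topic.
        D[d, k_new] += 1; W[w, k_new] += 1; n_k[k_new] += 1
        z[i] = k_new
    return z, D, W, n_k
```

    Each token requires a sweep over all K topics, which is why reducing redundant sampling and exploiting sparsity in D and W are the main levers for GPU acceleration.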

    A Survey on Deep Neural Network Compression: Challenges, Overview, and Solutions

    Deep Neural Networks (DNNs) have achieved unprecedented performance due to their automated feature extraction capability, which has led to the widespread incorporation of DNN models in Internet of Things (IoT) applications over the past decade. However, the colossal computation, energy, and storage requirements of DNN models make their deployment prohibitive on resource-constrained IoT devices. Therefore, several compression techniques have been proposed in recent years to reduce the storage and computation requirements of DNN models. These techniques approach DNN compression from different perspectives while aiming for minimal loss of accuracy, which motivates a comprehensive overview of DNN compression techniques. In this paper, we present a comprehensive review of the existing literature on compressing DNN models to reduce both storage and computation requirements. We divide the existing approaches into five broad categories, i.e., network pruning, sparse representation, bits precision, knowledge distillation, and miscellaneous, based on the mechanism used to compress the DNN model. The paper also discusses the challenges associated with each category of DNN compression techniques. Finally, we provide a quick summary of existing work under each category together with future directions in DNN compression.
    Comment: 19 pages, 9 figures
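
    As a concrete instance of the first category in the survey (network pruning), the sketch below applies global magnitude pruning to a small Keras model. The model architecture and the 50% sparsity target are illustrative assumptions, not taken from the survey.

```python
# Hedged sketch: global magnitude pruning, one instance of the "network pruning"
# category. The model and the 50% sparsity level are illustrative assumptions.
import numpy as np
import tensorflow as tf

def prune_by_magnitude(model, sparsity=0.5):
    """Zero out the smallest `sparsity` fraction of weights across all kernels."""
    kernels = [w for layer in model.layers for w in layer.weights if 'kernel' in w.name]
    all_vals = np.concatenate([np.abs(w.numpy()).ravel() for w in kernels])
    threshold = np.quantile(all_vals, sparsity)  # global magnitude cutoff
    for w in kernels:
        vals = w.numpy()
        vals[np.abs(vals) < threshold] = 0.0     # zero out small-magnitude weights
        w.assign(vals)
    return model

# Usage: prune an illustrative two-layer classifier to 50% sparsity.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
prune_by_magnitude(model, sparsity=0.5)
```

    In practice, pruning is usually followed by fine-tuning to recover accuracy, and the resulting sparse weights only reduce storage and computation when paired with a sparse representation or hardware support.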