Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters
In this paper, we evaluate training of deep recurrent neural networks with
half-precision floats. We implement a distributed, data-parallel, synchronous
training algorithm by integrating TensorFlow and CUDA-aware MPI to enable
execution across multiple GPU nodes while making use of high-speed interconnects.
We introduce a learning rate schedule facilitating neural network convergence
at up to workers.
Strong scaling tests performed on clusters of NVIDIA Pascal P100 GPUs show
linear runtime and logarithmic communication time scaling for both single and
mixed precision training modes. Performance is evaluated on a scientific
dataset taken from the Joint European Torus (JET) tokamak, containing
multi-modal time series of sensory measurements leading up to deleterious
events called plasma disruptions, and the benchmark Large Movie Review
Dataset [imdb]. Half-precision training significantly reduces memory and
network bandwidth requirements, allowing training of state-of-the-art models
with over 70 million trainable parameters while achieving test set performance
comparable to single precision.
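
As a rough illustration of the ingredients above, the following is a minimal
sketch of synchronous, data-parallel mixed precision training of a small
recurrent model in TensorFlow. It is not the paper's implementation: the paper
couples TensorFlow with CUDA-aware MPI across GPU nodes, whereas this sketch
uses TensorFlow's built-in MirroredStrategy for single-node synchronous data
parallelism, and it trains on randomly generated stand-in data rather than the
JET or IMDB datasets.

# Minimal sketch (not the paper's code): synchronous data-parallel training of a
# small recurrent model with mixed precision in TensorFlow.
import numpy as np
import tensorflow as tf

# Compute in float16 where safe; keep variables and loss scaling in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

strategy = tf.distribute.MirroredStrategy()  # synchronous all-reduce across local GPUs

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, input_shape=(200, 14)),  # stand-in multi-modal time series
        tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32"),  # keep output in fp32
    ])
    # Under the mixed_float16 policy, compile() wraps the optimizer with loss scaling.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy", metrics=["accuracy"])

# Randomly generated stand-in for a binary disruption-prediction task.
x = np.random.randn(1024, 200, 14).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model.fit(x, y, batch_size=128, epochs=1)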
Convergence of Artificial Intelligence and High Performance Computing on NSF-supported Cyberinfrastructure
Significant investments to upgrade or construct large-scale scientific
facilities demand commensurate investments in R&D to design algorithms and
computing approaches to enable scientific and engineering breakthroughs in the
big data era. The remarkable success of Artificial Intelligence (AI) algorithms
at turning big-data challenges in industry and technology into transformational
digital solutions that drive a multi-billion dollar industry and play an
ever-increasing role in shaping human social patterns has promoted AI as the
most sought-after signal processing tool in big-data research. As AI continues to
evolve into a computing tool endowed with statistical and mathematical rigor,
and which encodes domain expertise to inform and inspire AI architectures and
optimization algorithms, it has become apparent that single-GPU solutions for
training, validation, and testing are no longer sufficient. This realization
has been driving the confluence of AI and high performance computing (HPC) to
reduce time-to-insight and to produce robust, reliable, trustworthy, and
computationally efficient AI solutions. In this white paper, we present a
summary of recent developments in this field, and discuss avenues to accelerate
and streamline the use of HPC platforms to design accelerated AI algorithms.
Comment: White paper accepted to the NSF Workshop on Smart Cyberinfrastructure,
February 25-27, 2020, http://smartci.sci.utah.edu
EZLDA: Efficient and Scalable LDA on GPUs
LDA is a statistical approach for topic modeling with a wide range of
applications. However, there have been very few attempts to accelerate LDA on
GPUs, which offer exceptional compute and memory throughput. To this end, we
introduce EZLDA, which achieves efficient and scalable LDA training on GPUs
through three contributions. First, EZLDA introduces a three-branch sampling
method that exploits the convergence heterogeneity of different tokens to
reduce redundant sampling work. Second, to enable a sparsity-aware format for
both D and W on GPUs with fast sampling and updating, we introduce a hybrid
format for W along with a corresponding token partition to T and inverted-index
designs. Third, we design a hierarchical workload balancing solution to address
the extremely skewed workload imbalance problem on a GPU and to scale EZLDA
across multiple GPUs. Taken together, EZLDA achieves superior performance over
state-of-the-art attempts with lower memory consumption.
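
For readers unfamiliar with the kernel being accelerated, the following is a
minimal CPU sketch of collapsed Gibbs sampling for LDA, i.e., the token-by-token
sampling loop over the D (document-topic) and W (word-topic) count matrices that
EZLDA restructures for GPUs. It is not the paper's three-branch sampler, and the
toy corpus, topic count, and hyperparameters are illustrative assumptions.

# Baseline collapsed Gibbs sampling for LDA on a toy corpus (not EZLDA's sampler).
import numpy as np

rng = np.random.default_rng(0)
K, alpha, beta = 4, 0.1, 0.01                      # topics and symmetric priors (assumed)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]  # toy corpus: token ids per document
V = 5                                              # vocabulary size

D = np.zeros((len(docs), K), dtype=np.int64)       # document-topic counts
W = np.zeros((V, K), dtype=np.int64)               # word-topic counts
totals = np.zeros(K, dtype=np.int64)               # tokens assigned to each topic
assign = []

for d, doc in enumerate(docs):                     # random initialization
    zs = rng.integers(0, K, size=len(doc))
    assign.append(zs)
    for w, z in zip(doc, zs):
        D[d, z] += 1; W[w, z] += 1; totals[z] += 1

for _ in range(50):                                # Gibbs sweeps over every token
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assign[d][i]
            D[d, z] -= 1; W[w, z] -= 1; totals[z] -= 1   # remove current assignment
            # full conditional: p(z=k) ~ (D[d,k]+alpha) * (W[w,k]+beta) / (totals[k]+V*beta)
            p = (D[d] + alpha) * (W[w] + beta) / (totals + V * beta)
            z = rng.choice(K, p=p / p.sum())
            D[d, z] += 1; W[w, z] += 1; totals[z] += 1   # record new assignment
            assign[d][i] = z

print(W)                                           # word-topic counts after sampling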
A Survey on Deep Neural Network Compression: Challenges, Overview, and Solutions
Deep Neural Networks (DNNs) have achieved unprecedented performance thanks to
their automated feature extraction capability. This high performance has led to
the widespread incorporation of DNN models in Internet of Things (IoT)
applications over the past decade. However, the colossal computation, energy,
and storage requirements of DNN models make their deployment prohibitive on
resource-constrained IoT devices. Therefore, several compression techniques
have been proposed in recent years to reduce the storage and computation
requirements of DNN models. These techniques approach DNN compression from
different perspectives while aiming for minimal loss of accuracy, which
motivates a comprehensive overview of them. In this paper, we present a
comprehensive review of the existing literature on compressing DNN models to
reduce both storage and computation requirements. We divide the existing
approaches into five broad categories, i.e., network pruning, sparse
representation, bits precision, knowledge distillation, and miscellaneous,
based upon the mechanism used to compress the DNN model. The paper also
discusses the challenges associated with each category of DNN compression
techniques. Finally, we provide a quick summary of existing work under each
category along with future directions in DNN compression.
Comment: 19 pages, 9 figures
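
As a small illustration of two of the surveyed categories, network pruning and
bits precision, the following sketch applies magnitude-based pruning and uniform
8-bit quantization to a random weight matrix in NumPy. The layer shape, sparsity
target, and quantization scheme are illustrative assumptions and do not
correspond to any specific method reviewed in the paper.

# Illustrative sketch of magnitude pruning and 8-bit quantization (assumed values).
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)).astype(np.float32)  # a dense layer's weight matrix

# Network pruning: zero out the 90% of weights with the smallest magnitude.
sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = W * (np.abs(W) >= threshold)

# Bits precision: symmetric uniform quantization of the remaining weights to int8.
scale = np.abs(W_pruned).max() / 127.0
W_int8 = np.clip(np.round(W_pruned / scale), -127, 127).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale            # dequantized for inference

print(f"nonzero weights: {np.count_nonzero(W_pruned) / W.size:.1%}")
print(f"max dequantization error: {np.abs(W_dequant - W_pruned).max():.4f}")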