Truly Sparse Neural Networks at Scale
Recently, sparse training methods have begun to establish themselves as a de facto approach to training and inference efficiency in artificial neural networks. Yet this efficiency holds only in theory: in practice, sparsity is simulated with a binary mask, since typical deep learning software and hardware are optimized for dense matrix operations. In this paper, we take an orthogonal approach and show that we can train truly sparse neural networks to harvest their full potential. To achieve this goal, we introduce three novel contributions designed specifically for sparse neural networks: (1) a parallel training algorithm and its corresponding sparse implementation built from scratch, (2) an activation function with non-trainable parameters that favours gradient flow, and (3) a hidden-neuron importance metric to eliminate redundancies. Altogether, we are able to break the record and train the largest neural network to date in terms of representational power, reaching the size of a bat brain. The results show that our approach achieves state-of-the-art performance while opening the path towards an environmentally friendly artificial intelligence era.
Comment: 30 pages, 17 figures
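As a loose illustration of the contrast between masked dense weights and truly sparse weights, the sketch below stores a layer's weight matrix in a compressed sparse format so that memory and compute scale with the number of non-zero weights; it is a minimal, hypothetical stand-alone example, not the authors' parallel implementation.

```python
# Minimal sketch (hypothetical, not the paper's implementation): a layer that
# stores its weights as a SciPy CSR matrix, so only the non-zero weights are
# ever materialized, instead of a dense matrix multiplied by a binary mask.
import numpy as np
from scipy import sparse


class SparseLinear:
    def __init__(self, n_in, n_out, density=0.001, seed=0):
        # Only about density * n_in * n_out weights exist in memory.
        self.w = sparse.random(n_in, n_out, density=density,
                               random_state=seed, format="csr")

    def forward(self, x):
        # x: (batch, n_in) dense activations. Computing W^T x^T and
        # transposing keeps the sparse operand on the left, which SciPy
        # handles natively; the cost is proportional to nnz(W).
        return (self.w.T @ x.T).T


layer = SparseLinear(10_000, 10_000)          # 10^8 dense weights, ~10^5 stored
out = layer.forward(np.random.rand(8, 10_000))
print(out.shape)  # (8, 10000)
```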
Performance modelling for scalable deep learning
Performance modelling for scalable deep learning is essential for quantifying the efficiency of large parallel workloads. Performance models provide run-time estimates by modelling various aspects of an application on a target system, and building accurate models requires comprehensive analysis. Current performance models suffer from poor explainability of the computation time of a neural network model's internal processes and from limited applicability to particular architectures.
Existing performance models for deep learning fall broadly into two methodologies: analytical modelling and empirical modelling. Analytical modelling takes a transparent approach, converting the internal mechanisms of the model or application into a mathematical model that corresponds to the goals of the system. Empirical modelling predicts outcomes from observation and experimentation, characterizing algorithm performance using sample data, and is a good alternative to analytical modelling. Both approaches, however, have limitations, such as poor explainability of the computation time of a neural network model's internal processes and poor generalisation. To address these issues, this work hybridizes the analytical and empirical approaches, leading to a novel generic performance model that provides a general expression of a deep neural network framework in a distributed environment and allows accurate performance analysis and prediction.
The contributions can be summarized as follows:
In the initial study, a comprehensive literature review led to the development of a performance
model based on synchronous stochastic gradient descent (S-SGD) for analysing
the execution time performance of deep learning frameworks in a multi-GPU environment.
The model was evaluated with three deep learning models (Convolutional Neural Network (CNN), Autoencoder (AE), and Multilayer Perceptron (MLP)), implemented respectively in three popular deep learning frameworks (MXNet, Chainer, and TensorFlow), following an analytical approach. Additionally, a generic expression for the performance model was formulated that accounts for intrinsic parameters and extrinsic scaling factors affecting computing time in a distributed environment. This formulation poses a global optimization problem whose cost function depends on unknown constants in the generic expression; differential evolution was used to identify the best-fitting values against experimentally measured computation times, and regularization techniques were applied to improve the accuracy and stability of the performance model. Lastly, the proposed generic performance model was evaluated experimentally in a real-world application. The results of this evaluation provided valuable insights into the influence of hyperparameters on performance, demonstrating the robustness and applicability of the performance model for understanding and optimizing model behavior.
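As a rough illustration of the fitting step described above, the sketch below uses SciPy's differential evolution to match the unknown constants of an assumed performance-model expression to synthetic measured step times, with an L2 regularization term for stability; the functional form, data, and bounds are placeholders rather than the thesis's actual generic expression.

```python
# Illustrative only: fit unknown constants of an assumed performance-model
# expression to (synthetic) measured step times with differential evolution.
import numpy as np
from scipy.optimize import differential_evolution

workers = np.array([1, 2, 4, 8, 16])             # number of GPUs
measured = np.array([10.2, 5.6, 3.4, 2.4, 2.1])  # synthetic step times (s)
minibatch = 256

def predicted(theta, w):
    # Assumed form: a compute term that shrinks with more workers, a
    # communication term that grows logarithmically, and a constant overhead.
    a, b, c = theta
    return a * minibatch / w + b * np.log2(w + 1) + c

def loss(theta, lam=1e-3):
    resid = predicted(theta, workers) - measured
    # L2 regularization on the constants stabilizes the fit, mirroring the
    # regularization applied in the thesis.
    return np.sum(resid ** 2) + lam * np.sum(np.asarray(theta) ** 2)

result = differential_evolution(loss, bounds=[(0, 1), (0, 5), (0, 5)], seed=0)
print("fitted constants:", result.x)
```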
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Deep Neural Networks (DNNs) are becoming an important tool in modern
computing applications. Accelerating their training is a major challenge and
techniques range from distributed algorithms to low-level circuit design. In
this survey, we describe the problem from a theoretical perspective, followed
by approaches for its parallelization. We present trends in DNN architectures
and the resulting implications for parallelization strategies. We then review
and model the different types of concurrency in DNNs: from the single operator,
through parallelism in network inference and training, to distributed deep
learning. We discuss asynchronous stochastic optimization, distributed system
architectures, communication schemes, and neural architecture search. Based on
those approaches, we extrapolate potential directions for parallelism in deep
learning.
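To make the most common form of concurrency surveyed here concrete, the sketch below shows the basic data-parallel pattern: each process computes gradients on its own shard of the mini-batch and an allreduce averages them so every replica applies the same update. It is a minimal illustration assuming mpi4py and a run under mpirun, not code from the survey.

```python
# Minimal data-parallel SGD sketch (illustrative; assumes mpi4py, run with
# e.g. `mpirun -np 4 python this_file.py`).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())

def local_gradient(params):
    # Stand-in for backpropagation on this rank's shard of the mini-batch.
    return rng.normal(size=params.shape)

params = np.zeros(1000)
lr = 0.1
for step in range(10):
    grad = local_gradient(params)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)    # sum gradients across replicas
    params -= lr * (avg / comm.Get_size())   # average, then apply the update
```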
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.
Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61, 2020.
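The toy, single-process simulation below sketches the semantics of these partial collectives: a quorum of one corresponds to solo allreduce (the fastest worker alone can trigger the accumulation), a quorum of at least half corresponds to majority allreduce, and straggler gradients are carried over into a later step. It is only a conceptual illustration, not the paper's MPI-based implementation.

```python
# Toy simulation of quorum-based partial gradient accumulation (conceptual
# sketch only; the real eager-SGD uses non-blocking MPI collectives).
import numpy as np

rng = np.random.default_rng(0)
P, dim, lr = 8, 4, 0.1
params = np.zeros(dim)
carried = []                      # straggler gradients awaiting a later step

def partial_accumulate(ready, quorum):
    global carried
    available = carried + ready
    if len(available) < quorum:   # not enough contributors yet
        carried = available
        return None
    carried = []
    return np.mean(available, axis=0)

for step in range(20):
    # Each worker finishes this step with 70% probability (load imbalance).
    ready = [rng.normal(size=dim) for _ in range(P) if rng.random() < 0.7]
    g = partial_accumulate(ready, quorum=P // 2)  # "majority"; use 1 for "solo"
    if g is not None:
        params -= lr * g
print(params)
```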
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale 3D
convolutional neural networks. Deep learning-based emerging scientific
workflows often require model training with large, high-dimensional samples,
which can make training much more costly and even infeasible due to excessive
memory usage. We solve these challenges by extensively applying hybrid
parallelism throughout the end-to-end training pipeline, including both
computations and I/O. Our hybrid-parallel algorithm extends the standard data
parallelism with spatial parallelism, which partitions a single sample in the
spatial domain, realizing strong scaling beyond the mini-batch dimension with a
larger aggregated memory capacity. We evaluate our proposed training algorithms
with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive
performance studies show that good weak and strong scaling can be achieved for
both networks using up to 2K GPUs. More importantly, we enable training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy.
Comment: 12 pages, 10 figures
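The sketch below illustrates the core idea of spatial parallelism on a single 3D sample: the sample is split along one spatial axis and each shard carries a one-voxel halo, so a local 3x3x3 stencil (a uniform filter standing in for a convolution) reproduces the result of filtering the whole sample. It is a single-process NumPy/SciPy illustration under assumed shapes; in the actual system each shard would live on a different GPU and the halos would be exchanged over the network.

```python
# Spatial partitioning of one 3D sample with halos (illustrative only).
import numpy as np
from scipy.ndimage import uniform_filter

sample = np.random.rand(64, 64, 64)   # one high-resolution 3D sample
nparts, halo = 4, 1                   # shards along axis 0; 3x3x3 stencil radius
part = sample.shape[0] // nparts

def shard_with_halo(x, rank):
    lo = max(rank * part - halo, 0)
    hi = min((rank + 1) * part + halo, x.shape[0])
    return x[lo:hi], rank * part - lo  # shard plus offset of its "own" region

full = uniform_filter(sample, size=3)  # reference: filter the whole sample
for rank in range(nparts):
    shard, offset = shard_with_halo(sample, rank)
    local = uniform_filter(shard, size=3)   # each rank filters its shard only
    piece = local[offset:offset + part]     # drop the halo region
    assert np.allclose(piece, full[rank * part:(rank + 1) * part])
print("sharded result matches the full-sample result")
```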
Breast Histopathology with High-Performance Computing and Deep Learning
The increasingly intensive collection of digitized images of tumor tissue over the last decade has made histopathology a demanding application in terms of computational and storage resources. With images containing billions of pixels, the need to optimize and adapt histopathology for large-scale data analysis is compelling. This paper presents a modular pipeline with three independent layers for detecting tumorous regions in digital specimens of breast lymph nodes with deep learning models. The pipeline can be deployed either on local machines or on high-performance computing resources using a containerized approach. The self-sufficient structure of Docker containers removes the need for high-performance computing expertise, while ample room for customization remains in terms of deep learning models and hyperparameter optimization. We show that by deploying the software layers on different infrastructures we optimize both the data preprocessing and the network training times, further increasing the scalability of the application to datasets of approximately 43 million images. The code is open source and available on GitHub.
Distributed Equivalent Substitution Training for Large-Scale Recommender Systems
We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for large-scale recommender systems with dynamic sparse features. By reducing communication, DES introduces fully synchronous training to large-scale recommendation systems for the first time, making the training of commercial recommender systems converge faster and reach a better CTR. DES requires far less communication by substituting weights-rich operators with computationally equivalent sub-operators and aggregating partial results instead of transmitting the huge sparse weights directly over the network. Thanks to synchronous training on large-scale Deep Learning Recommendation Models (DLRMs), DES achieves a higher AUC (Area Under the ROC curve). We successfully apply DES training to multiple popular DLRMs from industrial scenarios. Experiments show that our implementation outperforms the state-of-the-art PS-based training framework, achieving up to 68.7% communication savings and higher throughput than other PS-based recommender systems.
Comment: Accepted by SIGIR '2020. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020.
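The single-process sketch below illustrates the kind of equivalent substitution the abstract describes for a weights-rich operator: rather than gathering the active entries of a huge sparse weight vector to one place, each simulated worker reduces over the feature ids that fall in its shard and contributes only a scalar partial result to the aggregation. Names and sizes are illustrative; this is not the DES framework's code.

```python
# Equivalent substitution, sketched: aggregate partial results instead of
# shipping sparse weights (illustrative simulation, not the DES framework).
import numpy as np

rng = np.random.default_rng(0)
n_features, n_workers = 1_000_000, 4
w = rng.normal(size=n_features)          # huge weight vector, conceptually sharded
active = rng.choice(n_features, size=64, replace=False)  # this sample's sparse ids

# Naive pattern: gather every active weight to one place, then reduce there.
y_gathered = w[active].sum()

# DES-style pattern: each worker reduces over the active ids in its own shard
# and only the scalar partials cross the (simulated) network.
bounds = np.linspace(0, n_features, n_workers + 1, dtype=int)
partials = [w[active[(active >= bounds[k]) & (active < bounds[k + 1])]].sum()
            for k in range(n_workers)]
y_des = float(np.sum(partials))

assert np.isclose(y_gathered, y_des)     # the two operators are equivalent
print(y_des)
```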