30 research outputs found

    Truly Sparse Neural Networks at Scale

    Recently, sparse training methods have started to be established as a de facto approach for training and inference efficiency in artificial neural networks. Yet, this efficiency is just in theory. In practice, everyone uses a binary mask to simulate sparsity since the typical deep learning software and hardware are optimized for dense matrix operations. In this paper, we take an orthogonal approach, and we show that we can train truly sparse neural networks to harvest their full potential. To achieve this goal, we introduce three novel contributions, specially designed for sparse neural networks: (1) a parallel training algorithm and its corresponding sparse implementation from scratch, (2) an activation function with non-trainable parameters to favour the gradient flow, and (3) a hidden neurons importance metric to eliminate redundancies. All in all, we are able to break the record and to train the largest neural network ever trained in terms of representational power -- reaching the bat brain size. The results show that our approach has state-of-the-art performance while opening the path for an environmentally friendly artificial intelligence era. Comment: 30 pages, 17 figures
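The contrast the abstract draws between mask-simulated sparsity and truly sparse storage can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the list-of-(column, value) layout, and the toy sizes are all assumptions chosen for illustration. Both routines compute the same matrix-vector product, but the masked version stores and multiplies every entry, while the sparse version keeps only the surviving connections.

```python
import random

def masked_dense_matvec(weights, mask, x):
    """Simulated sparsity: every weight is stored and multiplied, then masked."""
    return [sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]

def to_sparse_rows(weights, mask):
    """Keep only unmasked connections as (column, value) pairs per output row."""
    return [[(j, w) for j, (w, m) in enumerate(zip(w_row, m_row)) if m]
            for w_row, m_row in zip(weights, mask)]

def sparse_matvec(sparse_rows, x):
    """True sparsity: work is proportional to the number of stored connections."""
    return [sum(w * x[j] for j, w in row) for row in sparse_rows]

# Hypothetical toy layer: 4 outputs, 6 inputs, roughly 25% of connections kept.
random.seed(0)
n_out, n_in, density = 4, 6, 0.25
weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
mask = [[1 if random.random() < density else 0 for _ in range(n_in)]
        for _ in range(n_out)]
x = [random.uniform(-1, 1) for _ in range(n_in)]

dense_out = masked_dense_matvec(weights, mask, x)
sparse_rows = to_sparse_rows(weights, mask)
sparse_out = sparse_matvec(sparse_rows, x)

stored_dense = n_out * n_in                        # entries kept by the masked layer
stored_sparse = sum(len(row) for row in sparse_rows)  # entries kept by the sparse layer
```

In this sketch the memory and multiply count of the sparse layer scale with `stored_sparse` rather than `stored_dense`, which is the saving a mask alone cannot deliver.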

    Performance modelling for scalable deep learning

    Performance modelling for scalable deep learning is very important to quantify the efficiency of large parallel workloads. Performance models are used to obtain run-time estimates by modelling various aspects of an application on a target system. Designing performance models requires comprehensive analysis in order to build accurate models. Limitations of current performance models include poor explainability in the computation time of the internal processes of a neural network model and limited applicability to particular architectures. Existing performance models in deep learning have been proposed, which are broadly categorized into two methodologies: analytical modelling and empirical modelling. Analytical modelling utilizes a transparent approach that involves converting the internal mechanisms of the model or applications into a mathematical model that corresponds to the goals of the system. Empirical modelling predicts outcomes based on observation and experimentation, characterizes algorithm performance using sample data, and is a good alternative to analytical modelling. However, both these approaches have limitations, such as poor explainability in the computation time of the internal processes of a neural network model and poor generalisation. To address these issues, hybridization of the analytical and empirical approaches has been applied, leading to the development of a novel generic performance model that provides a general expression of a deep neural network framework in a distributed environment, allowing for accurate performance analysis and prediction. The contributions can be summarized as follows: In the initial study, a comprehensive literature review led to the development of a performance model based on synchronous stochastic gradient descent (S-SGD) for analysing the execution time performance of deep learning frameworks in a multi-GPU environment. 
This model’s evaluation involved three deep learning models (Convolutional Neural Networks (CNN), Autoencoder (AE), and Multilayer Perceptron (MLP)), implemented in three popular deep learning frameworks (MXNet, Chainer, and TensorFlow) respectively, with a focus on following an analytical approach. Additionally, a generic expression for the performance model was formulated, considering intrinsic parameters and extrinsic scaling factors that impact computing time in a distributed environment. This formulation involved a global optimization problem with a cost function dependent on unknown constants within the generic expression. Differential evolution was utilized to identify the best fitting values, matching experimentally determined computation times. Furthermore, to enhance the accuracy and stability of the performance model, regularization techniques were applied. Lastly, the proposed generic performance model underwent experimental evaluation in a real-world application. The results of this evaluation provided valuable insights into the influence of hyperparameters on performance, demonstrating the robustness and applicability of the performance model in understanding and optimizing model behavior.

    Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

    Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

    Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy. Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 2020
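The difference between solo, majority, and fully synchronous allreduce can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the eager-SGD implementation: the function names and the arrival-time model are hypothetical, and treating late processes as contributing nothing in the current step is a simplification, since the abstract does not say how late gradients are handled.

```python
import math

def partial_allreduce(grads, arrival_times, min_contributors):
    """Sum the gradients of the processes that are ready when the collective fires.

    The collective fires as soon as `min_contributors` processes have arrived.
    Assumption of this sketch: processes arriving later contribute nothing.
    """
    n = len(grads)
    order = sorted(range(n), key=lambda i: arrival_times[i])
    fire_time = arrival_times[order[min_contributors - 1]]
    contributors = [i for i in range(n) if arrival_times[i] <= fire_time]
    dim = len(grads[0])
    total = [sum(grads[i][d] for i in contributors) for d in range(dim)]
    return total, contributors, fire_time

def solo_allreduce(grads, arrival_times):
    # Fires when the first process is ready.
    return partial_allreduce(grads, arrival_times, 1)

def majority_allreduce(grads, arrival_times):
    # Fires when at least half of the processes are ready.
    return partial_allreduce(grads, arrival_times, math.ceil(len(grads) / 2))

def synchronous_allreduce(grads, arrival_times):
    # Baseline: waits for every process.
    return partial_allreduce(grads, arrival_times, len(grads))

# Hypothetical gradients and arrival times for four imbalanced processes.
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [1.0, 1.0]]
arrival = [0.2, 0.9, 0.1, 0.5]

solo = solo_allreduce(grads, arrival)
majority = majority_allreduce(grads, arrival)
sync = synchronous_allreduce(grads, arrival)
```

With these toy numbers the solo collective completes at the fastest process's arrival time, the majority collective waits for the second arrival, and only the synchronous baseline is held back by the slowest process.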

    Large Scale Sparse Neural Networks

    The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism

    We present scalable hybrid-parallel algorithms for training large-scale 3D convolutional neural networks. Deep learning-based emerging scientific workflows often require model training with large, high-dimensional samples, which can make training much more costly and even infeasible due to excessive memory usage. We solve these challenges by extensively applying hybrid parallelism throughout the end-to-end training pipeline, including both computations and I/O. Our hybrid-parallel algorithm extends the standard data parallelism with spatial parallelism, which partitions a single sample in the spatial domain, realizing strong scaling beyond the mini-batch dimension with a larger aggregated memory capacity. We evaluate our proposed training algorithms with two challenging 3D CNNs, CosmoFlow and 3D U-Net. Our comprehensive performance studies show that good weak and strong scaling can be achieved for both networks using up to 2K GPUs. More importantly, we enable training of CosmoFlow with much larger samples than previously possible, realizing an order-of-magnitude improvement in prediction accuracy. Comment: 12 pages, 10 figures

    FC2: Cloud-based cluster provisioning for distributed machine learning

    Breast Histopathology with High-Performance Computing and Deep Learning

    The increasingly intensive collection of digitalized images of tumor tissue over the last decade made histopathology a demanding application in terms of computational and storage resources. With images containing billions of pixels, the need for optimizing and adapting histopathology to large-scale data analysis is compelling. This paper presents a modular pipeline with three independent layers for the detection of tumorous regions in digital specimens of breast lymph nodes with deep learning models. Our pipeline can be deployed either on local machines or high-performance computing resources with a containerized approach. The need for expertise in high-performance computing is removed by the self-sufficient structure of Docker containers, whereas a large possibility for customization is left in terms of deep learning models and hyperparameter optimization. We show that by deploying the software layers in different infrastructures we optimize both the data preprocessing and the network training times, further increasing the scalability of the application to datasets of approximately 43 million images. The code is open source and available on GitHub.

    Distributed Equivalent Substitution Training for Large-Scale Recommender Systems

    We present Distributed Equivalent Substitution (DES) training, a novel distributed training framework for large-scale recommender systems with dynamic sparse features. DES introduces fully synchronous training to large-scale recommendation systems for the first time by reducing communication, thus making the training of commercial recommender systems converge faster and reach better CTR. DES requires much less communication by substituting the weights-rich operators with the computationally equivalent sub-operators and aggregating partial results instead of transmitting the huge sparse weights directly through the network. Due to the use of synchronous training on large-scale Deep Learning Recommendation Models (DLRMs), DES achieves higher AUC (Area Under ROC). We successfully apply DES training on multiple popular DLRMs of industrial scenarios. Experiments show that our implementation outperforms the state-of-the-art PS-based training framework, achieving up to 68.7% communication savings and higher throughput compared to other PS-based recommender systems. Comment: Accepted by SIGIR '2020. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020
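The idea of aggregating partial results instead of shipping sparse weights can be illustrated with a toy sharded sum. This is a sketch of the general principle only, not the DES framework: the modulo sharding rule, the first-order (sum of weights) operator, and all names are assumptions for illustration. Each worker holds a shard of the weights, computes a partial result over the active features it owns, and sends back a single number.

```python
def full_sum(weights, active_features):
    """Reference: the operator evaluated with all weights in one place."""
    return sum(weights[f] for f in active_features)

def shard_weights(weights, num_workers):
    """Assign each feature's weight to one worker (hypothetical modulo rule)."""
    shards = [dict() for _ in range(num_workers)]
    for feature_id, w in weights.items():
        shards[feature_id % num_workers][feature_id] = w
    return shards

def partial_sum(shard, active_features):
    """Sub-operator: a worker sums only the active weights it owns."""
    return sum(shard[f] for f in active_features if f in shard)

# Hypothetical sparse model: ten integer feature ids, four active in this sample.
weights = {i: 0.5 * i for i in range(10)}
active = [1, 4, 7, 8]

shards = shard_weights(weights, num_workers=3)
partials = [partial_sum(shard, active) for shard in shards]  # one scalar per worker
aggregated = sum(partials)                                   # what crosses the network
reference = full_sum(weights, active)
```

Here the aggregated value matches the reference while each worker communicates one scalar rather than its weight shard; how DES decomposes richer operators is described in the paper itself.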