365 research outputs found

    One weird trick for parallelizing convolutional neural networks

    Full text link
    I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks

    Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

    Full text link
    The past few years have witnessed growth in the computational requirements for training deep convolutional neural networks. Current approaches parallelize training onto multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, these approaches result in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different parallelization strategies. In this paper, we propose layer-wise parallelism that allows each layer in a network to use an individual parallelization strategy. We jointly optimize how each layer is parallelized by solving a graph search problem. Our evaluation shows that layer-wise parallelism outperforms state-of-the-art approaches by increasing training throughput, reducing communication costs, achieving better scalability to multiple GPUs, while maintaining original network accuracy

    User-transparent Distributed TensorFlow

    Full text link
    Deep Learning (DL) algorithms have become the {\em de facto} choice for data analysis. Several DL implementations -- primarily limited to a single compute node -- such as Caffe, TensorFlow, Theano and Torch have become readily available. Distributed DL implementations capable of execution on large scale systems are becoming important to address the computational needs of large data produced by scientific simulations and experiments. Yet, the adoption of distributed DL implementations faces significant impediments: 1) most implementations require DL analysts to modify their code significantly -- which is a show-stopper, 2) several distributed DL implementations are geared towards cloud computing systems -- which is inadequate for execution on massively parallel systems such as supercomputers. This work addresses each of these problems. We provide a distributed memory DL implementation by incorporating required changes in the TensorFlow runtime itself. This dramatically reduces the entry barrier for using a distributed TensorFlow implementation. We use Message Passing Interface (MPI) -- which provides performance portability, especially since MPI specific changes are abstracted from users. Lastly -- and arguably most importantly -- we make our implementation available for broader use, under the umbrella of Machine Learning Toolkit for Extreme Scale (MaTEx) at {\texttt http://hpc.pnl.gov/matex}. We refer to our implementation as MaTEx-TensorFlow.Comment: 9 pages, 8 figure

    HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array

    Full text link
    With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to explore the coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. It poses the key research question to seek the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution HyPar to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search a partition that minimizes the total communication during training a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communications. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.Comment: To appear in the 2019 25th International Symposium on High-Performance Computer Architecture (HPCA 2019

    DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters

    Full text link
    The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling massive data and parameters involved in DNN training. Distributed computing platforms and GPGPU-based acceleration provide a mainstream solution to this computational challenge. In this paper, we propose DeepSpark, a distributed and parallel deep learning framework that exploits Apache Spark on commodity clusters. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/Tensorflow-running nodes using Spark, and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent based update scheme, effectively complementing the synchronized processing capabilities of Spark. DeepSpark is an on-going project, and the current release is available at http://deepspark.snu.ac.kr

    ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks

    Full text link
    In this paper, we propose a deep neural network architecture for object recognition based on recurrent neural networks. The proposed network, called ReNet, replaces the ubiquitous convolution+pooling layer of the deep convolutional neural network with four recurrent neural networks that sweep horizontally and vertically in both directions across the image. We evaluate the proposed ReNet on three widely-used benchmark datasets; MNIST, CIFAR-10 and SVHN. The result suggests that ReNet is a viable alternative to the deep convolutional neural network, and that further investigation is needed

    Parle: parallelizing stochastic gradient descent

    Full text link
    We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations

    Image Classification at Supercomputer Scale

    Full text link
    Deep learning is extremely computationally intensive, and hardware vendors have responded by building faster accelerators in large clusters. Training deep learning models at petaFLOPS scale requires overcoming both algorithmic and systems software challenges. In this paper, we discuss three systems-related optimizations: (1) distributed batch normalization to control per-replica batch sizes, (2) input pipeline optimizations to sustain model throughput, and (3) 2-D torus all-reduce to speed up gradient summation. We combine these optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million images/second and no accuracy drop.Comment: Presented as part of Systems for ML Workshop @ NIPS 201

    DyVEDeep: Dynamic Variable Effort Deep Neural Networks

    Full text link
    Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine learning tasks and are deployed in increasing numbers of products and services. However, the computational requirements of training and evaluating large-scale DNNs are growing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. In this work, we propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep) to reduce the computational requirements of DNNs during inference. Previous efforts propose specialized hardware implementations for DNNs, statically prune the network, or compress the weights. Complementary to these approaches, DyVEDeep is a dynamic approach that exploits the heterogeneity in the inputs to DNNs to improve their compute efficiency with comparable classification accuracy. DyVEDeep equips DNNs with dynamic effort mechanisms that, in the course of processing an input, identify how critical a group of computations are to classify the input. DyVEDeep dynamically focuses its compute effort only on the critical computa- tions, while skipping or approximating the rest. We propose 3 effort knobs that operate at different levels of granularity viz. neuron, feature and layer levels. We build DyVEDeep versions for 5 popular image recognition benchmarks - one for CIFAR-10 and four for ImageNet (AlexNet, OverFeat and VGG-16, weight-compressed AlexNet). Across all benchmarks, DyVEDeep achieves 2.1x-2.6x reduction in the number of scalar operations, which translates to 1.8x-2.3x performance improvement over a Caffe-based implementation, with < 0.5% loss in accuracy

    Deep convolutional networks for pancreas segmentation in CT imaging

    Full text link
    Automatic organ segmentation is an important prerequisite for many computer-aided diagnosis systems. The high anatomical variability of organs in the abdomen, such as the pancreas, prevents many segmentation methods from achieving high accuracies when compared to other segmentation of organs like the liver, heart or kidneys. Recently, the availability of large annotated training sets and the accessibility of affordable parallel computing resources via GPUs have made it feasible for "deep learning" methods such as convolutional networks (ConvNets) to succeed in image classification tasks. These methods have the advantage that used classification features are trained directly from the imaging data. We present a fully-automated bottom-up method for pancreas segmentation in computed tomography (CT) images of the abdomen. The method is based on hierarchical coarse-to-fine classification of local image regions (superpixels). Superpixels are extracted from the abdominal region using Simple Linear Iterative Clustering (SLIC). An initial probability response map is generated, using patch-level confidences and a two-level cascade of random forest classifiers, from which superpixel regions with probabilities larger 0.5 are retained. These retained superpixels serve as a highly sensitive initial input of the pancreas and its surroundings to a ConvNet that samples a bounding box around each superpixel at different scales (and random non-rigid deformations at training time) in order to assign a more distinct probability of each superpixel region being pancreas or not. We evaluate our method on CT images of 82 patients (60 for training, 2 for validation, and 20 for testing). Using ConvNets we achieve average Dice scores of 68%+-10% (range, 43-80%) in testing. This shows promise for accurate pancreas segmentation, using a deep learning approach and compares favorably to state-of-the-art methods.Comment: SPIE Medical Imaging conference, Orlando, FL, USA: SPIE Proceedings | Volume 9413 | Classificatio
    corecore