4 research outputs found

    Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications for parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
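
    The survey treats data parallelism as one of the basic concurrency schemes in training: each worker computes gradients on its own shard of a mini-batch, the gradients are averaged across workers (the role of an allreduce), and every replica applies the same synchronous update. A minimal sketch of that pattern on a toy linear least-squares model follows; the model, worker count, and learning rate are illustrative placeholders, not details from the paper.

        # Synchronous data-parallel SGD on a toy linear model (illustrative only).
        import numpy as np

        def gradient(w, x, y):
            # Gradient of the squared error 0.5 * ||x @ w - y||^2 with respect to w.
            return x.T @ (x @ w - y)

        rng = np.random.default_rng(0)
        x = rng.standard_normal((64, 8))            # one mini-batch of 64 examples
        y = rng.standard_normal(64)
        w = np.zeros(8)

        def loss(w):
            return 0.5 * np.sum((x @ w - y) ** 2)

        initial_loss = loss(w)
        num_workers, lr = 4, 0.01
        shards_x = np.array_split(x, num_workers)   # shard the mini-batch across workers
        shards_y = np.array_split(y, num_workers)

        for step in range(100):
            # Each worker computes a gradient on its local shard.
            local_grads = [gradient(w, sx, sy) for sx, sy in zip(shards_x, shards_y)]
            # Allreduce: average per-worker gradients so all replicas apply the same update.
            g = np.mean(local_grads, axis=0)
            w -= lr * g

        assert loss(w) < initial_loss               # the synchronous updates reduce the training loss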

    Performance Primitives for Artificial Neural Networks

    Optimized software implementations of artificial neural networks leverage primitives from performance libraries, such as the BLAS. However, these primitives were prototyped decades ago and do not necessarily reflect the patterns of computation in neural networks. I propose modifications to common primitives provided by performance libraries to make them better building blocks for artificial neural networks, with a focus on inference, i.e., evaluation of a pre-trained artificial neural network. I suggest three classes of performance primitives for convolutional operators and two optimized building blocks for softmax operators.

    High-intensity convolutional operators with large kernel sizes and unit stride benefit from asymptotically fast convolution algorithms based on the Winograd transform and the Fast Fourier transform. I jointly consider the Fourier or Winograd transform and the matrix-matrix multiplication of blocks of transformed coefficients, and suggest a tuple-GEMM primitive that balances the number of irregular memory writes in the transformation against sufficient register blocking and instruction-level parallelism in the matrix-matrix multiplication. The tuple-GEMM primitive can be thought of as a batched GEMM with a fixed, architecture-dependent batch size and can be efficiently implemented as a modification of the Goto matrix-matrix multiplication algorithm. I additionally analyze small 2D Fast Fourier transforms and suggest options that work best for modern wide-SIMD processors.

    Lower-intensity convolutional operators with small kernel sizes, non-unit strides, or dilation do not benefit from the fast convolution algorithms and require a different set of optimizations. To accelerate these cases I suggest replacing the traditional GEMM primitive with a novel Indirect GEMM primitive. The Indirect GEMM primitive is a slight modification of GEMM and can leverage the extensive research on efficient GEMM implementations. I further introduce the Indirect Convolution algorithm, which builds on top of the Indirect GEMM primitive, eliminates the runtime overhead of patch-building memory transformations, and substantially reduces the memory complexity of convolutional operators compared to traditional GEMM-based algorithms.

    Pointwise, or 1x1, convolutional operators map directly to matrix-matrix multiplication and prompt yet another approach to optimization. I demonstrate that neural networks heavy on pointwise convolutions can greatly benefit from sparsifying the weights tensor and representing the operation as sparse-matrix-dense-matrix multiplication (SpMM), and I introduce neural network-optimized SpMM primitives. While the SpMM primitives in Sparse BLAS libraries target problems with extremely high sparsity (commonly 99+%) and non-random sparsity patterns, the proposed SpMM primitive is demonstrated to work well with moderate sparsity in the 70-95% range and unpredictable sparsity patterns.

    The softmax operator is light on elementary floating-point operations but involves evaluation of the exponential function, which in many implementations becomes the bottleneck. I demonstrate that with a high-throughput vector exponential function the softmax computation saturates memory bandwidth and can be further improved only by reducing the number of memory accesses. I then constructively prove that it is possible to replace the traditional three-pass softmax algorithm with a novel two-pass algorithm, for up to a 28% runtime reduction.
    I implemented the proposed ideas in the open-source NNPACK, QNNPACK, and XNNPACK libraries for the acceleration of neural networks on CPUs, which at the time of release delivered state-of-the-art performance on mobile, server, and Web platforms.
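
    The fast-convolution path above starts from Winograd-transform kernels such as F(2,3), where two outputs of a 3-tap, stride-1 correlation are obtained from four inputs with 4 elementwise multiplications instead of 6; across many tiles and channels these elementwise products become the matrix multiplications of transformed coefficients that the tuple-GEMM primitive targets. The sketch below shows only the scalar 1D transform, with the standard F(2,3) matrices, as an assumed illustration rather than the thesis's implementation.

        # Winograd F(2,3): 2 outputs of a 3-tap correlation from 4 inputs with 4 multiplies.
        import numpy as np

        BT = np.array([[1,  0, -1,  0],
                       [0,  1,  1,  0],
                       [0, -1,  1,  0],
                       [0,  1,  0, -1]], dtype=float)     # input transform
        G = np.array([[1.0,  0.0, 0.0],
                      [0.5,  0.5, 0.5],
                      [0.5, -0.5, 0.5],
                      [0.0,  0.0, 1.0]])                  # filter transform
        AT = np.array([[1, 1,  1,  0],
                       [0, 1, -1, -1]], dtype=float)      # output transform

        def winograd_f23(d, g):
            # d: 4 input samples, g: 3 filter taps -> 2 outputs of the sliding dot product.
            return AT @ ((G @ g) * (BT @ d))              # 4 multiplications in the pointwise product

        d = np.array([1.0, 2.0, 3.0, 4.0])
        g = np.array([0.5, -1.0, 2.0])
        direct = np.array([d[0:3] @ g, d[1:4] @ g])       # direct correlation: 6 multiplications
        assert np.allclose(winograd_f23(d, g), direct)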
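
    The Indirect Convolution algorithm replaces the im2col patch matrix with an indirection buffer: a small table of pointers to input pixels that a GEMM-like kernel dereferences as it accumulates. The sketch below imitates that data flow in NumPy, using flat row indices in place of pointers; the layout, names, and loop structure are assumptions for illustration, not the Indirect GEMM primitive itself.

        # Indirection-buffer convolution (HWC layout, stride 1, no padding), checked
        # against a naive im2col-style reference.
        import numpy as np

        def indirect_conv2d(x, w, stride=1):
            # x: input of shape (H, W, C); w: filters of shape (KH, KW, C, M).
            H, W, C = x.shape
            KH, KW, _, M = w.shape
            OH = (H - KH) // stride + 1
            OW = (W - KW) // stride + 1

            pixels = x.reshape(H * W, C)          # flat view: one row per input pixel
            w_mat = w.reshape(KH * KW, C, M)

            # Indirection buffer: for every output pixel, the flat indices of the
            # KH*KW input pixels contributing to it (a pointer table in a C kernel).
            indirection = np.empty((OH * OW, KH * KW), dtype=np.intp)
            for oy in range(OH):
                for ox in range(OW):
                    indirection[oy * OW + ox] = [(oy * stride + ky) * W + (ox * stride + kx)
                                                 for ky in range(KH) for kx in range(KW)]

            # GEMM-like accumulation that gathers input rows through the buffer
            # instead of reading a pre-built patch matrix.
            out = np.zeros((OH * OW, M))
            for k in range(KH * KW):
                out += pixels[indirection[:, k]] @ w_mat[k]
            return out.reshape(OH, OW, M)

        rng = np.random.default_rng(1)
        x = rng.standard_normal((6, 6, 3))
        w = rng.standard_normal((3, 3, 3, 4))
        ref = np.zeros((4, 4, 4))
        for oy in range(4):
            for ox in range(4):
                ref[oy, ox] = x[oy:oy + 3, ox:ox + 3, :].reshape(-1) @ w.reshape(-1, 4)
        assert np.allclose(indirect_conv2d(x, w), ref)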
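
    For pointwise (1x1) convolutions, the abstract argues for sparsifying the weight matrix and computing a sparse-matrix-dense-matrix product at moderate (70-95%) sparsity with an unpredictable pattern. The following is only a plain CSR walk-through of that data flow; the thesis's SpMM primitives are hand-tuned CPU kernels, not this code, and the 80% sparsity level below is an arbitrary example.

        # CSR SpMM: moderately sparse weights (rows) times dense activations (columns).
        import numpy as np

        def to_csr(w):
            # Compress the weight matrix row by row into values, column indices, row pointers.
            values, col_idx, row_ptr = [], [], [0]
            for row in w:
                nz = np.nonzero(row)[0]
                values.extend(row[nz])
                col_idx.extend(nz)
                row_ptr.append(len(values))
            return np.array(values), np.array(col_idx), np.array(row_ptr)

        def spmm(values, col_idx, row_ptr, x):
            # y[m, :] = sum over stored nonzeros of w[m, k] * x[k, :].
            y = np.zeros((len(row_ptr) - 1, x.shape[1]))
            for m in range(len(row_ptr) - 1):
                for j in range(row_ptr[m], row_ptr[m + 1]):
                    y[m] += values[j] * x[col_idx[j]]
            return y

        rng = np.random.default_rng(2)
        w = rng.standard_normal((16, 32))
        w[rng.random(w.shape) < 0.8] = 0.0      # roughly 80% sparsity, random pattern
        x = rng.standard_normal((32, 10))       # dense activations: 32 channels, 10 pixels
        assert np.allclose(spmm(*to_csr(w), x), w @ x)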
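
    The traditional softmax reads its input three times: once for the maximum, once for the sum of shifted exponentials, and once to normalize. One way to get down to two reads, sketched below, keeps a running maximum during the first pass and rescales the partial sum whenever the maximum grows. This is an assumed illustration of the pass-count reduction; the thesis's own two-pass algorithm instead accumulates the exponentials in a custom (value, scale) representation, which this sketch does not reproduce.

        # Two-pass softmax via online rescaling; numerically safe for large inputs.
        import numpy as np

        def softmax_two_pass(x):
            # Pass 1: running maximum m and running sum s of exp(x_i - m),
            # rescaled whenever a new maximum appears.
            m, s = -np.inf, 0.0
            for v in x:
                if v > m:
                    s = s * np.exp(m - v) + 1.0   # rescale the old sum to the new maximum
                    m = v
                else:
                    s += np.exp(v - m)
            # Pass 2: normalize.
            return np.exp(x - m) / s

        x = np.array([1.0, 2.0, 3.0, 1000.0])     # naive exp(x) would overflow here
        ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
        assert np.allclose(softmax_two_pass(x), ref)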