Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
INTRODUCTION
Machine Learning, and in particular Deep Learning [149] , is rapidly taking over a variety of aspects in our daily lives. At the core of deep learning lies the Deep Neural Network (DNN), a construct inspired by the interconnected nature of the human brain. Trained properly, the expressiveness of DNNs provides accurate solutions for problems previously thought to be unsolvable, merely by observing large amounts of data. Deep learning has been successfully implemented for a plethora of fields, ranging from image classification [110] , through speech recognition [6] and medical diagnosis [44] , to autonomous driving [22] and defeating human players in complex games [223] .
Since the 1980s, neural networks have attracted the attention of the machine-learning community [150] . However, DNNs' rise into prominence was tightly coupled to the available computational power, which allowed to exploit their inherent parallelism. Consequently, deep learning managed to outperform all existing approaches in speech recognition [152] and image classification [142] , where the latter increased the accuracy by a factor of two, sparking academic and industrial interest. As datasets increase in size and DNNs in complexity, the computational intensity and memory demands of deep learning increase proportionally. Training a DNN to competitive accuracy today essentially requires a high-performance computing cluster. To harness such systems, different aspects of training and inference (evaluation) of DNNs are modified to increase concurrency.
In this survey, we discuss the variety of topics in the context of parallelism and distribution in deep learning, spanning from vectorization to efficient use of supercomputers. In particular, we present parallelism strategies for DNN evaluation and implementations thereof, as well as extensions to training algorithms and systems targeted at supporting distributed environments. To provide comparative measures on the approaches, we analyze their concurrency and average parallelism using the Work-Depth model [21] .
This article surveys 252 other works, obtained by recursively tracking relevant bibliography from seminal papers in the field, dating back to the year 1984. We include additional papers resulting from keyword searches on Google Scholar 1 and arXiv. 2 Due to the quadratic increase in deep learning papers on the latter source (Table 1) , some works may not have been included. The full list of categorized papers in this survey can be found online. 3 Figure 1 shows an overview of this survey.
Related Surveys
Bengio [17] discusses scaling deep learning from various perspectives, focusing on models, optimization algorithms, and datasets. The paper also overviews some aspects of distributed computing, including asynchronous and sparse communication. Surveys of hardware architectures mostly focus on the computational side of training rather than the optimization. This includes a recent survey [233] that reviews computation techniques for DNN operators (layer types) and mapping computations to hardware. The survey also includes discussion on data representation reduction (e.g., via quantization) to reduce overall memory bandwidth within the hardware. Other surveys
where : H × X → R + is the loss of an individual sample. In this work, we consider two types of supervised learning problems, from which the sample loss functions are derived: (multi-class) classification and regression. In the former, the goal is to identify which class a sample most likely belongs to, e.g., inferring what type of animal appears in an image. In regression, the goal is to find a relation between the domains X and Y , predicting values in Y for new samples in X . For instance, such a problem might predict the future temperature of a region, given past observations.
For minimization purposes, a sample loss function should be continuous and differentiable. In regression problems, it is possible to use straightforward loss functions such as the squared difference (w, z) = ( f w (z) − h(z)) 2 . However, in classification problems, a simple definition of loss such as (w, z) = 0 if f w (z) = h(z) or 1 otherwise (also known as binary or 0-1 loss), does not match the continuity and differentiability criteria.
To resolve this issue, prominent multi-class classification problems define Y as a probability distribution of the inferred class types (see Figure 2 ), instead of a single label. The model output is typically normalized to a distribution using the softmax function σ (z) i = exp(z i ) k exp(z k ) . The loss function then computes the difference of the prediction from the true label "distribution," e.g., using cross-entropy: (w, z) = − i h(z) i log σ ( f w (z)) i . The cross-entropy loss can be seen as a generalization of logistic regression, inducing a continuous loss function for multi-class classification.
Minimizing the loss function can be performed by using different approaches, such as iterative methods (e.g., BFGS [185] ) or meta-heuristics (e.g., evolutionary algorithms [211] ). Optimization in machine learning is prominently performed via Gradient Descent. Since the full D is, however, never observed, it is necessary to obtain an unbiased estimator of the gradient. Observe that ∇L D (w ) = E z∼D [∇ (w, z)] (Equation (1), linearity of the derivative). Thus, in expectation, we can descend using randomly sampled data in each iteration, applying Stochastic Gradient Descent (SGD) [215] .
ALGORITHM 1: Stochastic Gradient Descent (SGD) 1: for t = 0 to T do SGD iterations 2: z ← Random element from S Sample dataset S 3: д ← ∇ (w (t ) , z) Compute gradient of 4: w (t +1) ← w (t ) + u (д, w (0, ...,t ) , t ) Update weights with function u 5: end for SGD (Algorithm 1) iteratively optimizes parameters defined by the sequence {w (t ) } T t =0 , using samples from a dataset S sampled from D with replacement. SGD is proven to converge at a rate of O(1/ √ T ) for convex functions with Lipschitz-continuous and bounded gradient [181] . Prior to running SGD, one must choose an initial estimate for the weights w (0) . Due to the illposed nature of some problems, the selection of w (0) is important and may reflect on the final quality of the result. The choice of initial weights can originate from random values, informed decisions (e.g., Xavier initialization [81] ), or from pre-trained weights in a methodology called Transfer Learning [197] . In deep learning, recent works state that the optimization space is riddled with saddle points [149] , and assume that the value of w (0) does not affect the final loss. In practice, improper initialization may have an adverse effect on generalization as networks become deeper [93] .
In line 1, T denotes the number of steps to run SGD for (known as the stopping condition or computational budget). Typically, real-world instances of SGD run for a constant number of steps, for a fixed period of time, or until a desired accuracy is achieved. Line 2 then samples random elements from the dataset, to provide the unbiased loss estimator. The gradient of the loss function with respect to the weights w (t ) is subsequently computed (line 3). In deep neural networks, the gradient is obtained with respect to each layer (w (t ) l ) using backpropagation (Section 4.2). This gradient is then used for updating the weights, using a weight update rule (line 4).
Weight Update
Rules. The weight update rule, denoted as u in Algorithm 1, can be defined as a function of the gradient д, the previous weight values w (0) , . . . ,w (t ) , and the current iteration t. Table 3 summarizes the popular u functions used in training. In the table, the basic SGD update 
where η represents the learning rate. η controls how much the gradient values will overall affect the next estimate w (t +1) , and in iterative nonlinear optimization methods finding the correct η is a considerable part of the computation [185] . In machine-learning problems, it is customary to fix η, or set an iteration-based weight update rule u alr (д, t ) = −η t · д, where η t decreases (decays) over time to bound the modification size and avoid local divergence.
Other popular weight update rules include Momentum, which uses the difference between current and past weights w (t ) − w (t −1) to avoid local minima and redundant steps with natural motion [182, 205] . More recent update rules, such as RMSProp [96] and Adam [136] , use the first and second moments of the gradient to adapt the learning rate per-weight, enhancing sparser updates over others.
Factors such as the learning rate and other symbols found in Table 3 are called hyper-parameters, and are set before the optimization process begins. In the table, μ, β, β 1 , and β 2 represent the momentum, RMS decay rate, and first and second moment decay rate hyper-parameters, respectively. To obtain the best results, hyper-parameters must be tuned, which can be performed by value sweeps or by meta-optimization (Section 7.5.2). The multitude of hyper-parameters and the reliance upon them is considered problematic by a part of the community [207] .
Minibatch SGD.
When performing SGD, it is common to decrease the number of weight updates by computing the sample loss in minibatches (Algorithm 2), averaging the gradient with respect to subsets of the data [147] . Minibatches represent a tradeoff between traditional SGD, which is proven to converge when drawing one sample at a time, and batch methods [185] , which make use of the entire dataset at each iteration. In practice, minibatch sampling is implemented by shuffling the dataset S, and processing that permutation by obtaining contiguous segments of size B from it. An entire pass over the dataset is 65:6 T. Ben-Nun and T. Hoefler called an epoch, and a full training procedure usually consists of tens to hundreds of such epochs [84, 260] . As opposed to the original SGD, shuffle-based processing entails without-replacement sampling. Nevertheless, minibatch SGD was proven [221] to provide similar convergence guarantees.
Parallel Computer Architecture
We continue with a brief overview of parallel hardware architectures that are used to execute learning problems in practice. They can be roughly classified into single-machine (often shared memory) and multi-machine (often distributed memory) systems.
Single-machine Parallelism.
Parallelism is ubiquitous in today's computer architectures, internally on the chip in the form of pipelining and out-of-order execution, as well as exposed to the programmer in the form of multi-core or multi-socket systems. Multi-core systems have a long tradition and can be programmed with either multiple processes (different memory domains), multiple threads (shared memory domains), or a mix of both. The main difference is that multiprocess parallel programming forces the programmer to consider the distribution of the data as a first-class concern while multi-threaded programming allows the programmer to only reason about the parallelism, leaving the data shuffling to the hardware system (often through hardware cache-coherence protocols).
Out of the 252 reviewed papers, 159 papers present empirical results and provide details about their hardware setup. Figure 3 (a) shows a summary of the machine architectures used in research papers over the years. We see a clear trend toward GPUs, which dominate the publications beginning from 2013. However, even accelerated nodes are not sufficient for the large computational workload. Figure 3 (b) illustrates the quickly growing multi-node parallelism in those works. This shows that, beginning from 2015, distributed-memory architectures with accelerators such as GPUs have become the default option for machine learning at all scales today.
Multi-machine Parallelism.
Training large-scale models is a very compute-intensive task. Thus, single machines are often not capable to finish this task in a desired time-frame. To accelerate the computation further, it can be distributed across multiple machines connected by a network. The most important metrics for the interconnection network (short: interconnect) are latency, bandwidth, and message-rate. Different network technologies provide different performance. For example, both modern Ethernet and InfiniBand provide high bandwidth but InfiniBand has significantly lower latencies and higher message rates. Special-purpose HPC interconnection networks can achieve higher performance in all three metrics. Yet, network communication remains generally slower than intra-machine communication. Out of the 252 reviewed papers, 80 make use of distributed-memory systems and provide details about their hardware setup. We observe that large-scale setups, similar to HPC machines, are commonplace and essential in today's training.
Parallel Programming
Programming techniques to implement parallel learning algorithms on parallel computers depend on the target architecture. They range from simple threaded implementations to OpenMP on single machines. Accelerators are usually programmed with special languages such as NVIDIA's CUDA, OpenCL, or in the case of FPGAs using hardware design languages. Yet, the details are often hidden behind library calls (e.g., cuDNN or MKL-DNN) that implement the time-consuming primitives.
On distributed memory clusters, one can either use primitive communication mechanisms such as TCP/IP and Remote Direct Memory Access (RDMA), or programmable networks [14] . One can also use more convenient libraries such as the Message Passing Interface (MPI) or Apache Spark. MPI is a low level library focused on providing portable performance while Spark is a higher-level framework that focuses more on programmer productivity. Figure 4 (b) shows a breakdown of the different communication mechanisms that were specified in 55 of the 80 papers using multi-node parallelism. It shows how the community quickly recognized that deep learning has very similar characteristics than large-scale HPC applications. Thus, beginning from 2016, the established MPI interface became the de-facto portable communication standard in distributed deep learning.
Parallel Algorithms
We now briefly discuss some key concepts in parallel computing that are needed to understand parallel machine learning. Every computation on a computer can be modeled as a directed acyclic graph (DAG). The vertices of the DAG are the computations and the edges are the data dependencies (or data flow). The computational parallelism in such a graph can be characterized by two main parameters: the graph's work W, which corresponds to the total number of vertices, and the graph's depth D, which is the number of vertices on any longest path in the DAG. These two parameters allow us to characterize the computational complexity on a parallel system. For example, assuming we can process one operation per time unit, then the time needed to process the graph on a single processor is T 1 = W and the time needed to process the graph on an infinite number of processes is T ∞ = D. The average parallelism in the computation is W/D, which is often a good number of processes to execute the graph with. Furthermore, we can show that the execution time of such a DAG on p processors is bounded by: min{W/p, D} ≤ T p ≤ O(W/p + D) [8, 26] .
Most of the operations in learning can be modeled as operations on tensors (typically tensors as a parallel programming model [228] ). Such operations are highly data-parallel and only summations introduce dependencies. Thus, we will focus on parallel reduction operations in the following.
In a reduction, we apply a series of binary operators ⊕ to combine n values into a single value, e.g., y = x 1 ⊕ x 2 ⊕ x 3 · · · ⊕ x n−1 ⊕ x n . If the operation ⊕ is associative, then we can change its application, which changes the DAG from a linear-depth line-like graph as shown in Figure 5 (a) to a logarithmic-depth tree graph as shown in Figure 5 (b). It is simple to show that the work and depth for reducing n numbers is W = n − 1 and D = log 2 n , respectively. In deep learning, one often needs to reduce (sum) large tables of m independent parameters and return the result to all processes. This is called allreduce in the MPI specification [86, 173] .
In multi-machine environments, these tables are distributed across the machines that participate in the overall reduction operation. Due to the relatively low bandwidth between the machines (compared to local memory bandwidths), this operation is often most critical for distributed learning. Even if we fully overlap communication and computation [102] , (nonblocking) allreductions often turn into a bottleneck. We analyze the algorithms in a simplified LogP model [54] , where we ignore injection rate limitations (o = д = 0), which makes it similar to the simple α-β model: L = α models the point-to-point latency in the network, G = β models the cost per byte, and P ≤ p is the number of networked machines. Based on the DAG model from above, it is simple to show a lower bound for the reduction time T r ≥ L log 2 (P ) in this simplified model. Furthermore, because each element of the table has to be sent at least once, the second lower bound is T r ≥ γmG, where γ represents the size of a single data value and m is the number of values sent. This bound can be strengthened to T r ≥ L log 2 (P ) + 2γmG (P − 1)/P if we disallow redundant computations [29] .
Several practical algorithms exist for the parallel allreduce operation in different environments and the best algorithm depends on the system, the number of processes, and the message size. We refer to Chan et al. [29] and Hoefler and Moor [103] for surveys of collective algorithms. Here, we summarize key algorithms that have been rediscovered in the context of parallel learning, and illustrate them in Figure 5 (c). The simplest algorithm is to combine two trees, one for summing the values to one process, similar to Figure 5 (b), and one for broadcasting the values back to all processes; its complexity is T tree = 2 log 2 (P )(L + γmG). Yet, this algorithm is inefficient and can be optimized with a simple butterfly pattern, reducing the time to T bfly = log 2 (P )(L + γmG). The butterfly algorithm is efficient (near-optimal) for small γm. For large γm and small P, a simple linear pipeline that splits the message into P segments is bandwidth-optimal and performs well in practice, even though it has a linear component in P: T pipe = 2(P − 1)(L + γ m P G). For most ranges of γm and P, one could use Rabenseifner's algorithm [206] , which combines reduce-scatter with allgather, running in time T rabe = 2L log 2 (P ) + 2γmG (P − 1)/P. This algorithm achieves the lower bound [198] but may be harder to implement and tune.
Other communication problems needed for convolutions and pooling, illustrated in Figure 5 (d), exhibit high spatial locality due to strict neighbor interactions. They can be optimized using wellknown HPC techniques for stencil computations such as MPI Neighborhood Collectives [104] (formerly known as sparse collectives [106] ) or optimized Remote Memory Access programming [15] . In general, exploring different low-level communication, message scheduling, and topology mapping [105] strategies that are well-known in the HPC field could significantly speed up the communication in distributed deep learning.
THE EFFICIENCY TRADEOFF: GENERALIZATION VS. UTILIZATION
In the previous section, we mentioned that SGD can be executed concurrently through the use of minibatches. However, setting the minibatch size is a complex optimization space on its own merit, as it affects both statistical accuracy (generalization) and hardware efficiency (utilization) of the model. As illustrated in Figure 6 (a), minibatches should not be too small (region A), to harness inherent concurrency in evaluation of the loss function; nor should they be too large (region C), as the quality of the result decays once increased beyond a certain point.
We can show the existence of region C by combining SGD with the descent lemma for a function f with L-Lipschitz gradient:
, where z ∼ D and ∇f z is the stochastic subgradient for z. This indicates that a large minibatch (with adjusted learning rate) can increase the convergence rate (negative term), but along with it the gradient variance and learning rate, which causes the last term to hinder convergence.
Indeed, the illustrated behavior is empirically shown for larger minibatch sizes in Figure 6 (b), and typical sizes lie between the orders of 10 and 10,000. Also, large-batch methods only converge and generalize when: (a) learning rates are adjusted statically [84, 141] or adaptively [259] ; (b) using a "warmup" phase [84] ; (c) using the batch size to control gradient variance [78] ; (d) adaptively increasing minibatch size during training [226] ; or (e) when using specific learning-rate schedules [169] . The use of large-batch methods and reduced floating-point precision has sparked a "race" toward accelerating the training process of specific model/dataset combinations in image processing [84, 123, 176, 191, 256] and language modeling [192, 203, 257] . Overall, such works increase the upper bound on feasible minibatch sizes but do not remove it.
DEEP NEURAL NETWORKS
We now describe the anatomy of a Deep Neural Network (DNN). In Figure 7 , we see a DNN in two scales: the single operator (Figure 7 (a), also ambiguously called layer) and the composition of such operators in a layered deep network (Figure 7(b) ). In the rest of this section, we describe popular operator types and their properties, followed by the computational description of deep networks and the backpropagation algorithm. Then, we study several examples of popular neural networks, highlighting the computational trends driven by their definition.
Neurons
The basic building block of a deep neural network is the neuron. Modeled after the brain, an artificial neuron (Figure 7 (a)) accumulates signals from other neurons connected by synapses. An activation function (or axon) is applied on the accumulated value, which adds nonlinearity to the network and determines the signal this neuron "fires" to its neighbors. In feed-forward neural networks, the neurons are grouped to layers strictly connected to neurons in subsequent layers. In contrast, recurrent neural networks allow back-connections within the same layer.
Feed-forward Operators.
Neural network operators are implemented as weighted sums, using the synapses as weights. Activations (denoted σ ) can be implemented as different functions, such as Sigmoid, Softmax, hyperbolic tangents, Rectified Linear Units (ReLU), or variants thereof [93] . When color images are used as input (as is commonly the case in computer vision), they are usually represented as a four-dimensional tensor-sized N ×C×H ×W . As shown in Figure 8 , N is number of images in the minibatch, where each H ×W image contains C channels (e.g., image RGB components). If an operator disregards spatial locality in the image tensor (e.g., a fully connected layer), then the dimensions are flattened to N × (C · H · W ). In typical DNN and CNN constructions, the number of features (channels in subsequent layers), as well as the width and height of an image, change from layer to layer using the operators defined below. We denote the input and output features of a layer by C in and C out , respectively.
A fully connected layer (Figure 7 (a)) is defined on a group of neurons x (sized N × C in , disregarding spatial properties) by y i, * = σ (wx i, * + b), where w is the weight matrix (sized C in × C out ) and b is a per-layer trainable bias vector (sized C out ). While this inner product is usually implemented with multiplication and addition, some works use other operators, such as similarity [46] . Not all operators in a neural network are fully connected. Sparsely connecting neurons and sharing weights is beneficial for reducing the number of parameters; as is the case in the popular convolutional operator. In a convolutional operator, every 3D tensor x (i.e., a slice of the 4D minibatch tensor representing one image) is convolved with C out kernels of size C in ×K y ×K x , where the base formula for a minibatch is given by
where y's dimensions are N × C out × H × W , H = H − K y + 1, and W = W − K x + 1, accounting for the size after the convolution, which does not consider cases where the kernel is out of the image bounds. Note that the formula omits various extensions of the operator [71] , such as variable stride, padding, and dilation [262] , each of which modifies the accessed indices and H ,W . The two inner summations of Equation (2) are called the convolution kernel and the kernel (or filter) size is K x × K y . While convolutional operators are the most computationally demanding in CNNs, other operator types are prominently used in networks. Two such operators are pooling and batch normalization. The former reduces an input tensor in the width and height dimensions, performing an operation on contiguous sub-regions of the reduced dimensions, such as maximum (called maxpooling) or average, and is given by
The goal of this operator is to reduce the size of a tensor by sub-sampling it while emphasizing important features. Applying subsequent convolutions of the same kernel size on a sub-sampled tensor enables learning high-level features that correspond to larger regions in the original data. Batch Normalization (BN) [121] is an example of an operator that creates inter-dependencies between samples in the same minibatch. Its role is to center the samples around a zero mean and a variance of one, which, according to the authors, reduces the internal covariate shift. BN is given by the following transformation:
where γ , β are scaling factors, and ϵ is added to the denominator for numerical stability.
Recurrent
Operators. Recurrent Neural Networks (RNNs) [72] enable connections from a layer's output to its own inputs. These connections create "state" in the neurons, retaining persistent information in the network and allowing it to process data sequences instead of a single tensor. We denote the input tensor at time point t as x (t ) .
The standard Elman RNN layer is defined as
) (omitting bias, illustrated in Figure 9 (a)), where h t represents the "hidden" data at time-point t and is carried over to the next time-point. Despite the initial success of these operators, it was found that they tend to "forget" information quickly (as a function of sequence length) [19] . To address this issue, Long-Short Term Memory (LSTM) [100] (Figure 9 (b)) units redesign the structure of the recurrent connection to resemble memory cells. Several variants of LSTM exist, such as the Gated Recurrent Unit (GRU) [40] (Figure 9 (c)), which simplifies the LSTM gates to reduce the number of parameters. 
Deep Networks
According to the definition of a fully connected layer, the expressiveness of a "shallow" neural network is limited to a separating hyperplane, skewed by the nonlinear activation function. When composing layers one after another, we create deep networks (as shown in Figure 7 (b)) that can approximate arbitrarily complex continuous functions. While the exact class of expressible functions is currently an open problem, it has been shown [47, 60] that neural network depth can reduce breadth requirements exponentially with each additional layer.
A Deep Neural Network (DNN) can be represented as a function composition, e.g., (L M (w M , . . . L 2 (w 2 , L 1 (w 1 , x )))), where each function L i is an operator, and each vector w i represents operator i's weights (parameters). In addition to direct composition, a DNN DAG might reuse the output values of a layer in multiple subsequent layers, forming shortcut connections [94, 110] .
Computation of the DNN loss gradient ∇ , which is necessary for SGD, can be performed by repeatedly applying the chain rule in a process commonly referred to as backpropagation. As shown in Figure 10 , the process of obtaining ∇ (w, x ) is performed in two steps. First, (w, x ) is computed by forward evaluation (top portion of the figure), computing each layer of operators after its dependencies in a topological ordering. After computing the loss, information is propagated backward through the network (bottom portion of the figure), computing two gradients -∇x (w.r.t. input data), and ∇w i (w.r.t. layer weights). Note that some operators do not maintain mutable parameters (e.g., pooling, concatenation), and thus ∇w i is not always computed.
In terms of concurrency, we use the Work-Depth (W-D) model to formulate the costs of computing the forward evaluation and backpropagation of different layer types. Table 4 shows that the work (W) performed in each layer asymptotically dominates the maximal operation dependency path (D), which is at most logarithmic in the parameters. This result reaffirms the state of the practice, in which parallelism plays a major part in the feasibility of evaluating and training DNNs.
As opposed to feed-forward networks, RNNs contain self-connections and thus cannot be trained with backpropagation alone. The most popular way to solve this issue is by applying backpropagation through time (BPTT) [246] , which unrolls the recurrent layer up to a certain amount 
Operator Type
Eval. of sequence length, using the same weights for each time-point. This creates a larger, feed-forward network that can be trained with the usual means.
Work (W) D e p t h ( D)
Activation y O(NCHW ) O(1) ∇w O(NCHW ) O(1) ∇x O(NCHW ) O(1) Fully Connected y O(C out · C in · N ) O(log C in ) ∇w O(C in · N · C out ) O(log N ) ∇x O(C in · C out · N ) O(log C out ) Convolution (Direct) y O(N · C out · C in · H · W · K x · K y ) O(log K x + log K y + log C in ) ∇w O(N · C out · C in · H · W · K x · K y ) O(log K x + log K y + log C in ) ∇x O(N · C out · C in · H · W · K x · K y ) O(log K x + log K y + log C in ) Pooling y O(NCHW ) O(log K x + log K y ) ∇w - - ∇x O(NCHW ) O(1) Batch Normalization y O(NCHW ) O(log N ) ∇w O(NCHW ) O(log N ) ∇x O(NCHW ) O(log N )
Trends in DNN Characteristics
To understand how successful neural architectures orchestrate the aforementioned operators, we discuss five influential convolutional networks and highlight trends in their characteristics over the past years. Each of the networks, listed in Table 5 , has achieved state-of-the-art performance upon publication. The table summarizes these networks, their concurrency characteristics, and their achieved test accuracy on the ImageNet [62] (1,000 class challenge) and CIFAR-10 [140] datasets. The listed networks, as well as other works [37, 45, 57, 91, 114, 162, 224, 280] , indicate three periods in the history of classification neural networks: experimentation (∼1985-2010), growth (2010-2015), and resource conservation (2015-today).
In the experimentation period, different types of neural network structures (e.g., Deep Belief Networks [18] ) were researched, and the methods to optimize them (e.g., backpropagation) were developed. Once the neural network community has converged on deep feed-forward networks (with the success of AlexNet cementing this decision), research during the growth period yielded networks with larger sizes and more operations, in an attempt to both increase model parallelism and solve increasingly complex problems. This trend was supported by the advent of GPUs and Fig. 11 . Performance of cublasSgemm on a tesla K80 GPU for various matrix sizes (adapted from Reference [193] ).
other large computational resources (e.g., the Google Brain cluster), increasing the available processing elements toward the average parallelism (W/D).
However, as over-parameterization leads to overfitting, and since the resulting networks were too large to fit into consumer devices, efforts to decrease resource usage started around 2015, and so did the average parallelism (see table) . Research has since focused on increasing expressiveness, mostly by producing deeper networks, while also reducing the number of parameters and operations required to forward-evaluate the given network. Parallelization efforts have thus shifted toward concurrency within minibatches (data parallelism, see Section 6). By reducing memory and increasing energy efficiency, the resource conservation trend aims to move neural processing to the end user, i.e., to embedded and mobile devices. At the same time, smaller networks are faster to prototype and require less information to communicate when training on distributed platforms.
CONCURRENCY IN OPERATORS
Given that neural network layers operate on four-dimensional tensors (Figure 8(a) ) and the high locality of the operations, there are several opportunities for parallelizing layer execution. In most cases, computations (e.g., in the case of pooling operators) can be directly parallelized. However, to expose parallelism in other operator types, computations have to be reshaped. Below, we list efforts to model DNN performance, followed by a concurrency analysis of three popular operators.
Performance Modeling
Even with work and depth models, it is hard to estimate the runtime of a single DNN operator, let alone an entire network. Figure 11 presents measurements of the performance of the highly-tuned matrix multiplication implementation in the NVIDIA CUBLAS library [189] , which is at the core of many operators. The figure shows that as the dimensions are modified, the performance does not change linearly, and that in practice the system internally chooses from one of 15 implementations for the operation, where the left-hand side of the figure depicts the segmentation.
In spite of the above observation, other works still manage to approximate the runtime of a given DNN with performance modeling. Using the values in the figure as a lookup table, it was possible to predict the time to compute and backpropagate through minibatches of various sizes with ∼5-19% error, even on clusters of GPUs with asynchronous communication [193] . The same was achieved for CPUs in a distributed environment [255] , using a similar approach, and for Intel Xeon Phi accelerators [244] strictly for training time estimation (i.e., not individual layers or DNN evaluation). Paleo [204] derives a performance model from operation counts alone (with 10-30% prediction error), and Pervasive CNNs [229] uses performance modeling to select networks with decreased accuracy to match real-time requirements from users. To further understand the performance characteristics of DNNs, Demmel and Dinh [61] provide lower bounds on communication requirements for the convolution and pooling operators. 
Fully Connected Layers
As described in Section 4.1, a fully connected layer can be expressed and modeled (see Table 4 ) as a matrix-matrix multiplication of the weights and the neuron values (column per minibatch sample). To that end, efficient linear algebra libraries, such as CUBLAS [189] , MKL [119] , and ESSL [116] , can be used. The BLAS [183] GEneral Matrix-Matrix multiplication (GEMM) operator, used for this purpose, also includes scalar factors that enable matrix scaling and accumulation, which can be used when batching groups of neurons.
Vanhoucke et al. [239] present a variety of methods to further optimize CPU implementations of fully connected layers. In particular, the paper shows efficient loop construction, vectorization, blocking, unrolling, and batching. The paper also demonstrates how weights can be quantized to use fixed-point math instead of floating-point.
Convolution
Convolutions constitute the majority of computations involved in training and inference of DNNs. As such, the research community and the industry have invested considerable efforts into optimizing their computation on all platforms. Figure 12 depicts the convolution methods detailed below, and Table 6 summarizes their work and depth characteristics.
While a convolution operator (Equation (2)) can be computed directly, it will not fully utilize the resources of vector processors (e.g., Intel's AVX registers) and many-core architectures (e.g., GPUs), which are geared toward many parallel multiplication-accumulation operations. It is possible, however, to increase the utilization by ordering operations to maximize data reuse [61] , introducing data redundancy, or via basis transformation.
The first algorithmic change proposed for convolutional operators was the use of the wellknown technique to transform a discrete convolution into matrix multiplication, using Toeplitz matrices (colloquially known as im2col). The first occurrence of unrolling convolutions in CNNs [31] used both CPUs and GPUs for training (since the work precedes CUDA, it uses Pixel Shaders for GPU computations). The method was subsequently popularized by Coates et al. [45] , and it consists of reshaping the images in the minibatch from 3D tensors to 2D matrices. Each 1D row in the matrix contains an unrolled 2D patch that would usually be convolved (possibly with overlap), generating redundant information (see Figure 12 (a)). The convolution kernels are then stored as a 2D matrix, where each column represents an unrolled kernel (one convolution filter). Multiplying those two matrices results in a matrix that contains the convolved tensor in 2D format, which can be reshaped to 3D for subsequent operations. Note that this operation can be generalized to 4D tensors (an entire minibatch), converting it into a single matrix multiplication. Alternatively, the kernels can be unrolled to rows (kn2row) for the matrix multiplication [242] .
While processor-friendly, the GEMM method (as described above) consumes a considerable amount of memory, and thus was not scalable. Practical implementations of the GEMM method, such as in cuDNN [38] , implement "implicit GEMM," in which the Toeplitz matrix is never materialized. It was also reported [49] that the Strassen matrix multiplication [231] can be used for the underlying computation, reducing the number of operations by up to 47%.
A second method to compute convolutions is to make use of the Fourier domain, in which convolution is defined as an element-wise multiplication [172, 241] . In this method, both the data and the kernels are transformed using FFT, multiplied, and the inverse FFT is applied on the result:
where F denotes the Fourier Transform and • is element-wise multiplication. Note that for a single minibatch, it is enough to transform w once and reuse the results.
Experimental results [241] have shown that the larger the convolution kernels are, the more beneficial FFT becomes, yielding up to 16× performance over the GEMM method, which has to process patches of proportional size to the kernels. Additional optimizations were made to the FFT and IFFT operations [241] , using DNN-specific knowledge: (a) The process uses decimationin-frequency for FFT and decimation-in-time for IFFT to mitigate bit-reversal instructions; (b) multiple FFTs with sizes ≤32 are batched together and performed at the warp-level on the GPU; and (c) pre-computation of twiddle factors.
Working with DNNs, FFT-based convolution can be optimized further. In ZNNi [278] , the authors observed that due to zero-padding, the convolutional kernels, which are considerably smaller than the images, mostly consist of zeros. Thus, pruned FFT [230] can be executed for transforming the kernels, reducing the number of operations by 3×. In turn, the paper reports 5× and 10× speedups for CPUs and GPUs, respectively.
The prevalent method used today to perform convolutions is Winograd's algorithm for minimal filtering [249] . First proposed by Lavin and Gray [146] , the method modifies the original algorithm for multiple filters (as is the case in convolutions), performing the following computation for one tile:
with the matrices A, G, B constructed as in Winograd's algorithm.
Since the number of operations in Winograd convolutions grows quadratically with filter size, the convolution is decomposed into a sum of tiled, small convolutions, and the method is strictly used for small kernels (e.g., 3 × 3). Additionally, because the magnitude of elements in the expression increases with filter size, the numerical accuracy of Winograd convolution is generally lower than the other methods, and decreases as larger filters are used. Table 6 lists the concurrency characteristics of the aforementioned convolution implementations, using the Work-Depth model. From the table, we can see that each method exhibits different behavior, where the average parallelism (W/D) can be determined by the kernel size or by image size (e.g., FFT). This coincides with experimental results [38, 146, 241] , which show that there is no "one-size-fits-all" convolution method. We can also see that the Work and Depth metrics are not always sufficient to reason about absolute performance, as the Direct and im2col methods exhibit the same concurrency characteristics, even though im2col is faster in many cases, due to high processor utilization and memory reuse (e.g., caching) opportunities.
Data layout also plays a role in convolution performance. Li et al. [154] assert that convolution and pooling operators can be computed faster by transposing the data from N ×C×H ×W tensors to C×H ×W ×N . The paper reports up to 27.9× performance increase over the state-of-the-art for a single operator, and 5.6× for a full DNN (AlexNet). The paper reports speedup even in the case of transposing the data during the computation of the DNN, upon inputting the tensor to the operator.
DNN primitive libraries, such as cuDNN [38] and MKL-DNN [120], provide a variety of convolution methods and data layouts. To assist users in choosing an algorithm, such libraries provide functions that select the best-performing method, given tensor sizes and memory constraints. Internally, the libraries may run all methods and pick the fastest one.
Recurrent Units
The complex gate systems that occur within RNN units (e.g., LSTMs, see Figure 9 (b)) contain multiple operations, each of which incurs a small matrix multiplication or an element-wise operation. Due to this reason, these layers were traditionally implemented as a series of high-level operations, such as GEMMs. However, further acceleration of such layers is possible. Moreover, since RNN units are usually chained together (forming consecutive recurrent layers), two types of concurrency can be considered: within the same layer, and between consecutive layers.
Appleyard et al. [7] describe several optimizations that can be implemented for GPUs. The first optimization fuses all computations (GEMMs and otherwise) into one function (kernel), saving intermediate results in scratch-pad memory. This both reduces the kernel scheduling overhead, and conserves round-trips to the global memory, using the multi-level memory hierarchy of the massively parallel GPU. Other optimizations include pre-transposition of matrices and enabling concurrent execution of independent recurrent units on different multi-processors on the GPU.
Inter-layer concurrency is achieved through pipeline parallelism, with which Appleyard et al. implement stacked RNN unit computations, immediately starting to propagate through the next layer once its data dependencies have been met. Overall, these optimizations result in ∼ 11× performance increase over the high-level implementation.
From the memory consumption perspective, dynamic programming was proposed [87] for RNNs (see Figure 13 (a)) to balance between caching intermediate results and recomputing forward inference for backpropagation. For long sequences (1,000 time-points), the algorithm conserves 95% memory over standard BPTT, while adding ∼33% time per iteration. A similar result has been achieved when re-computing convolutional operators as well [36] , yielding memory costs sublinear in the number of layers.
Persistent RNNs [65] are an optimization that addresses two limitations of GPU utilization: small minibatch sizes and long sequences of inputs. By caching the weights of standard RNN units on the GPU registers, they optimize memory round-trips between timesteps (x (t ) ) during training (Figure 13(b) ). For the registers not to be scheduled out, this requires the GPU kernels that execute the RNN layers to be "persistent," performing global synchronization on their own and circumventing the normal GPU programming model. The approach attains up to ∼30× speedup over previous state-of-the-art for low minibatch sizes, performing on the order of multiple TFLOP/s per-GPU, even though it does not execute GEMM operations and loads more memory for each multi-processor. Additionally, the approach reduces the total memory footprint of RNNs, allowing users to stack more layers using the same resources.
CONCURRENCY IN NETWORKS
The high average parallelism (W/D) in neural networks may not only be harnessed to compute individual operators efficiently but also to evaluate the whole network concurrently with respect to different dimensions. Owing to the use of minibatches, the breadth (∝ W) of the layers, and the depth of the DNN (∝ D), it is possible to partition both the forward evaluation and the backpropagation phases (lines 4-5 in Algorithm 2) among parallel processors. Below, we discuss three prominent partitioning strategies, illustrated in Figure 14 : partitioning by input samples (data parallelism), partitioning by network structure (model parallelism), and partitioning by layer (pipelining).
Data Parallelism
In minibatch SGD (Section 2.1.2), data is processed in increments of N samples. As most of the operators are independent with respect to N (Section 4), a straightforward approach for parallelization is to partition the work of the minibatch samples among multiple computational resources (cores or devices). This method (initially named pattern parallelism, as input samples were called patterns), dates back to the first practical implementations of artificial neural networks [273] . We refer to Shallue et al. [220] for an in-depth study of the effects of data parallelism on training.
It could be argued that the use of minibatches in SGD for neural networks was initially driven by data parallelism. Farber and Asanović [75] used multiple vector accelerator microprocessors (Spert-II) to parallelize error backpropagation for neural network training. To support data parallelism, the paper presents a version of delayed gradient updates called "bunch mode," where the gradient is updated several times prior to updating the weights, essentially equivalent to minibatch SGD.
One of the earliest occurrences of mapping DNN computations to data parallel architectures (e.g., GPUs) were performed by Raina et al. [208] . The paper focuses on the problem of training Deep Belief Networks [98] , mapping the unsupervised training procedure to GPUs by running minibatch SGD. The paper shows speedup of up to 72.6× over CPU when training Restricted Boltzmann Machines. Today, data parallelism is supported by the vast majority of deep learning frameworks, using a single GPU, multiple GPUs, or a cluster of multi-GPU nodes.
The scaling of data parallelism is naturally defined by the minibatch size (Table 4 ). Apart from Batch Normalization (BN) [121] , all operators mentioned in Section 4 operate on a single sample at a time, so forward evaluation and backpropagation are almost completely independent. In the weight update phase, however, the results of the partitions have to be averaged to obtain the gradient w.r.t. the whole minibatch, which potentially induces an allreduce operation. Furthermore, in this partitioning method, all DNN parameters have to be accessible for all participating devices, which means that they should be replicated.
Neural Architecture Support for Large Minibatches. By applying various modifications to
the training process, recent works have successfully managed to increase minibatch size to 8k samples [84] , 32k samples [259] , and even 64k [226] without losing considerable accuracy. While the generalization issue still exists (Section 3), it is not as severe as claimed in prior works [218] . One bottleneck that hinders scaling of data parallelism, however, is the BN operator, which requires a full synchronization point upon invocation. Since BN recurs multiple times in some DNN architectures [94] , this is too costly. Thus, popular implementations of BN follow the approach driven by large-batch papers [84, 107, 259] , in which small subsets (e.g., 32 samples) of the minibatch are normalized independently. If at least 32 samples are scheduled to each processor, then this synchronization point is local, which in turn increases scaling.
Another approach to the BN problem is to define a different operator altogether. Weight Normalization (WN) [216] proposes to separate the parameter (w) norm from its directionality by way of re-parameterization. In WN, the weights are defined as w = ( д v ) · v, where д represents weight magnitude and v a normalized direction (as changing the magnitude of v will not introduce changes in ∇ ). WN decreases the depth (D) of the operator from O(log N ) to O(1), removing inter-dependencies within the minibatch. According to the authors, WN reduces the need for BN, achieving comparable accuracy using a simplified version of BN (without variance correction).
Coarse-and Fine-grained Data Parallelism.
Additional approaches for data parallelism were proposed in literature. In ParallelSGD [277] , SGD is run (possibly with minibatches) k times in parallel, dividing the dataset among the processors. After the convergence of all SGD instances, the resulting weights are aggregated and averaged to obtain w, exhibiting coarse-grained parallelism.
ParallelSGD, as well as other deep learning implementations [125, 147, 267] , were designed with the MapReduce [58] programming paradigm. Using MapReduce, it is easy to schedule parallel tasks onto multiple processors, as well as distributed environments. Prior to these works, the potential scaling of MapReduce was studied [42] on a variety of machine-learning problems, including NNs, promoting the need to shift from single-processor learning to distributed memory systems. While the MapReduce model was successful for deep learning at first, its generality hindered DNN-specific optimizations. Therefore, current implementations make use of high-performance communication interfaces (e.g., MPI) to implement fine-grained parallelism features, such as reducing latencies via asynchronous execution and pipelining [79] (Figure 15(a) ), sparse communication (see Section 7.3), and exploiting parallelism within a given computational resource Fig. 15 . Data parallelism schemes. [194, 278] . In the last category, minibatches are fragmented into micro-batches ( Figure 15(b) ) that are decomposed [278] or computed sequentially [194] . This reduces the required memory footprint, thus making it possible to choose faster methods that require more memory, as well as enabling hybrid CPU-GPU inference.
Model Parallelism
The second partitioning strategy for DNN training is model parallelism [76] (also known as network parallelism). This strategy divides the work according to the neurons in each layer, namely, the C, H , orW dimensions in a four-dimensional tensor. In this case, the sample minibatch is copied to all processors, and different parts of the DNN are computed on different processors, which can conserve memory (since the full network is not stored in one place) but incurs additional communication after every layer.
Since the minibatch size does not change in model parallelism, the utilization vs. generalization tradeoff (Section 3) does not apply. Nevertheless, the DNN architecture creates layer interdependencies, which, in turn, generate communication that determines the overall performance. Fully connected layers, for instance, incur all-to-all communication (as opposed to allreduce in data parallelism), as neurons connect to all the neurons of the next layer.
To reduce communication costs in fully connected layers, it has been proposed [179] to introduce redundant computations to neural networks. In particular, the proposed method partitions an NN, such that each processor will be responsible for twice the neurons (with overlap) and thus would need to compute more but communicate less.
Another method proposed for reducing communication in fully connected layers is to use Cannon's matrix multiplication algorithm, modified for DNNs [74] . The paper reports that Cannon's algorithm produces better efficiency and speedups over simple partitioning on small-scale multilayer fully connected networks.
Model parallelism in convolutions may scale less efficiently, depending on input dimensions. If samples are partitioned across processors by feature (channel), then each convolution requires a collective summation over all processors (resulting in allreduce or allgather operations depending on the decomposition). Exploiting parallelism across the spatial dimensions (H ,W ) can yield speedups for certain input sizes, since halo exchanges can be overlapped by other computations in the forward and backward passes [68] . To mitigate the halo exchange issue, convolution filters can be decomposed over the processor in Tiled and Locally-Connected Networks [184] (LCNs, Figure 16(a) ), where the former partially share weights, whereas LCNs do not. Using LCNs and model parallelism, the work presented by Coates et al. [45] managed to outperform a CNN of the same size running on 5,000 CPU nodes with a three-node multi-GPU cluster. Due to the lack of weight sharing (apart from spatial image boundaries), training is not communication-bound, and scaling can be achieved. Successfully applying the same techniques on CNNs requires fine-grained control over parallelism, as we shall show in Section 6.4. Unfortunately, weight sharing is an important part of CNNs, contributing to memory footprint reduction as well as improving generalization, and thus standard convolutional operators are used more frequently than LCNs.
A second form of model parallelism is the replication of DNN elements. In TreeNets [153] , the authors study ensembles of DNNs (groups of separately trained networks whose results are averaged, rather than their parameters), and propose a mid-point between ensembles and training a single model: a certain layer creates a "junction," from which multiple copies of the network are trained (see Figure 16 (b)). The paper defines ensemble-aware loss functions and backpropagation techniques, to regularize the training process. The training process, in turn, is parallelized across the network copies, assigning each copy to a different processor. The results presented in the paper for three datasets indicate that TreeNets essentially train an ensemble of expert DNNs.
Pipelining
In deep learning, pipelining can either refer to overlapping computations, i.e., between one layer and the next (as data becomes ready); or to partitioning the DNN according to depth, assigning layers to specific processors (Figure 14(c) ). Pipelining can be viewed as a form of data parallelism, since elements (samples) are processed through the network in parallel but also as model parallelism, since the length of the pipeline is determined by the DNN structure.
The first form of pipelining can be used to overlap forward evaluation, backpropagation, and weight updates. This scheme is widely used in practice [1, 48, 124, 238] and increases utilization by mitigating processor idle time. In a finer granularity, neural network architectures can be designed around the principle of overlapping layer computations, as is the case with Deep Stacking Networks (DSN) [63] . In DSNs, each step computes a different fully connected layer of the data. However, the results of all previous steps are concatenated to the layer inputs (see Figure 17(a) ). This enables each layer to be partially computed in parallel, due to the relaxed data dependencies.
GPipe [111] uses layer partitioning to train many-parameter DNNs, achieving state-of-the-art accuracy on ImageNet and CIFAR-10. There are several advantages for a multi-processor pipeline over both data and model parallelism: (a) there is no need to store all parameters on all processors during forward evaluation and backpropagation (as with model parallelism); (b) there is a fixed number of communication points between processors (at layer boundaries), and the source and destination processors are always known. Moreover, since the processors always compute the same layers, the weights can remain cached to decrease memory round-trips. Two disadvantages of pipelining, however, are that data (samples) have to arrive at a specific rate to fully utilize the system, and that latency proportional to the number of processors is incurred.
In the following section, we discuss two implementations of layer partitioning-DistBelief [57] and Project Adam [39] -which combine the advantages of pipelining with data and model parallelism.
Hybrid Parallelism
The combination of multiple parallelism schemes can overcome the drawbacks of each scheme. Below, we overview successful instances of such hybrids.
In AlexNet, most of the computations are performed in the convolutional layers, but most of the parameters belong to the fully connected layers. When mapping AlexNet to a multi-GPU node using data or model parallelism separately, the best reported speedup for 4 GPUs over one is ∼2.2× [254] . One successful example [141] of a hybrid scheme applies data parallelism to the convolutional layer, and model parallelism to the fully connected part (see Figure 17 (b)). Using this hybrid approach, a speedup of up to 6.25× can be achieved for 8 GPUs over one, with less than 1% accuracy loss (due to an increase in minibatch size). These results were also reaffirmed in other hybrid implementations [16] , in which 3.1× speedup was achieved for 4 GPUs using the same approach, and derived theoretically using communication cost analysis [80] , promoting 1.5D matrix multiplication algorithms for integrated data/model parallelism.
AMPNet [79] is an asynchronous implementation of DNN training on CPUs, which uses an intermediate representation to implement fine-grained model parallelism. In particular, internal parallel tasks within and between layers are identified and scheduled asynchronously. Additionally, asynchronous execution of dynamic control flow enables pipelining the tasks of forward evaluation, backpropagation, and weight update (Figure 15(a) , right). The main advantage of AMPNet is in recurrent, tree-based, and gated-graph neural networks, all of which exhibit heterogeneous characteristics, i.e., variable length for each sample and dynamic control flow (as opposed to homogeneous CNNs). The paper shows speedups of up to 3.94× over the TensorFlow [1] framework.
Last, the DistBelief [57] distributed system combines the three parallelism strategies. In the implementation, training is performed on multiple model replicas simultaneously, where each replica is trained on different samples (data parallelism). Within each replica, the DNN is distributed according to neurons in the same layer (model parallelism), and according to the different layers (pipelining). Project Adam [39] extends upon the ideas of DistBelief and exhibits the same types of parallelism. However, in Project Adam pipelining is restricted to CPU cores on the same node.
CONCURRENCY IN TRAINING
So far, we have discussed training algorithms where there is only one copy of w, and its up-todate value is directly visible to all processors. In distributed environments, there may be multiple instances of SGD (training agents) running independently, and thus the overall algorithm has to be adapted. Distribution schemes for deep learning can be categorized along three axes: model consistency, parameter distribution, and training distribution; where Figures 18 and 19 summarize the applied techniques and optimizations.
Model Consistency
We denote training algorithms in which the up-to-date w is observed by everyone as consistent model methods (see Figures 20(a) and 20(b) ). Directly dividing the computations among nodes creates a distributed form of data parallelism (Section 6), where all nodes have to communicate their updates to the others before fetching a new minibatch. To support distributed, data parallel SGD, we can modify Algorithm 2 by changing lines 3 and 7 to read (write) weights from (to) a parameter store, which may be centralized or decentralized (see Section 7.2). This incurs a substantial overhead on the overall system, which hinders training scaling. Recent works relax the synchronization restriction, creating an inconsistent model (Figure 20(c) ). As a result, a training agent i at time t contains a copy of the weights, denoted as w (τ ,i ) for τ ≤ t, where t − τ is called the staleness (or lag). A well-known instance of inconsistent SGD is the HOG-WILD shared-memory algorithm [213] , which allows training agents to read parameters and update gradients at will, overwriting existing progress. HOGWILD has been proven to converge for sparse learning problems [213] , where updates only modify small subsets of w, for general convex optimization [56] , and nonconvex optimization [160] . Based on foundations of distributed asynchronous SGD [237] , the proofs impose that (a) write-accesses (adding gradients) are always atomic; (b) Lipschitz continuous differentiability and strong convexity on f w ; and (c) that the staleness, i.e., the maximal number of iterations between reading w and writing ∇w, is bounded.
The HOGWILD algorithm was originally designed for shared-memory architectures but has since been extended [57, 186] to distributed-memory systems, in which it still attains convergence for deep learning problems. To mitigate the interference effect of overwriting w at each step, the implementation transfers the gradient ∇w instead of w from the training agents. Asymptotically, the lack of synchronization in HOGWILD and its gradient-communicating variants admits an optimal SGD convergence rate of O(1/ √ mT ) for m participating nodes [2, 59, 160] , as well as linear scaling, as every agent can train almost independently.
To provide correctness guarantees in spite of asynchrony, Stale-Synchronous Parallelism (SSP) [99] proposes a compromise between consistent and inconsistent models. In SSP (Figure 20(d) ), the gradient staleness is enforced to be bounded by performing a global synchronization step after a maximal staleness may have been reached by one of the nodes. This approach works especially well in heterogeneous environments, where lagging agents (stragglers) are kept in check. To that end, distributed asynchronous processing has the additional advantage of adding and removing nodes on-the-fly, allowing users to add more resources, introduce node redundancy, and remove straggling nodes [57, 195] .
Centralization
The choice between designing a centralized and a decentralized network architecture for DNN training depends on multiple factors [159] , including the network topology, bandwidth, communication latency, parameter update frequency, and desired fault tolerance. A centralized network architecture would typically include a parameter server (PS) infrastructure (e.g., Figures 20(a) , 20(c), and 21), which may consist of one or more specialized nodes; whereas a decentralized architecture (Figures 20(b) and 20(d) ) would rely on allreduce to communicate parameter updates among the nodes. Following communication, centralized parameter update is performed by the PS, whereas the decentralized update is computed by each node separately. In the latter case, every node creates its own optimizer.
The tradeoff of the distribution schemes can be modeled by the communication cost per global update. While the allreduce operation can be implemented efficiently for different message sizes and nodes (Section 2.4), the PS scheme requires each training agent to send and receive information to/from the PS nodes. Thus, not all network routes are used, and in terms of communication the operation is equivalent to a reduce-then-broadcast implementation of allreduce, taking T tree time. However, the PS can keep track of a "global view" of training, averaging the gradients at one location and enabling asynchronous operation of the agents. This, in turn, allows nodes to communicate less information by performing some of the computations on the PS [39] , as well as increases fault tolerance by dynamic spin-up and removal of nodes during training.
The PS infrastructure is an abstract concept, and is not necessarily represented by one physical server. Sharded parameter servers [39, 57] divide the ownership of w over multiple nodes, each containing a segment of its elements. In conjunction with model parallelism and layer pipelining (Sections 6.2 and 6.3), this alleviates some of the congestion at the PS, as shown in Figure 21 (a), in which each portion of a "model replica" (training agent) transmits its gradients and receives its weights from a different shard. Hierarchical parameter servers [89, 263] (Figure 21(b) ) further alleviate resource contention by assigning training agents with PS "leaves," propagating weights and gradients from specific agent groups up to the global parameter store. Rudra [89] also studies the tradeoff in allowed staleness, number of agents, and minibatch size, showing that SSP performs better, but requires adapting the learning rate accordingly.
A PS infrastructure is not only beneficial for performance but also for fault tolerance. The simplest form of fault tolerance in machine learning is checkpoint/restart, in which w (t ) is periodically synchronized and persisted to a non-volatile data store (e.g., a hard drive). This is performed locally in popular deep learning frameworks, and globally in frameworks such as Poseidon [265] . Besides checkpoints, fault tolerance in distributed deep learning has first been tackled by Dist-Belief [57, 148] . In the system, training resilience is increased by both introducing computational redundancy in the training agents (using different nodes that handle the same data), as well as replicating parameter server shards. In the former, an agent, which is constructed from multiple physical nodes in DistBelief via hybrid parallelism (Section 6.4), is assigned multiple times to separate groups of nodes. Allocating redundant agents enables handling slow and faulty replicas ("stragglers") by cancelling their work upon completion of the faster counterpart. As for the latter resilience technique, in DistBelief and Project Adam [39] , the parameters on the PS are replicated and persisted on non-volatile memory using a dedicated manager. Project Adam further increases the resilience of distributed training by using separate communication endpoints for replication and using Paxos consensus between PS nodes.
Applying weight updates in a distributed environment is another issue to be addressed. In Section 2.1, we establish that all popular weight rules are first-order with respect to the required gradients ( Table 3 ). As such, both centralized and decentralized schemes can perform weight updates by storing the last gradient and parameter values. Since GPUs are commonly used when training DNNs (Figure 3(a) ), frameworks such as GeePS [52] implement a specialized PS for acceleratorbased training agents. In particular, GeePS incorporates additional components over a general CPU PS, including CPU-GPU memory management components for weight updates.
In addition to reducing local (e.g., CPU-GPU) memory copies, PS infrastructures enable reducing the amount of information communicated over the network. Project Adam utilizes the fact that the PS is a compute-capable node to offload computation in favor of communicating less. In particular, it implements two different weight update protocols. For convolutional operators, in which the weights are sparse, gradients are communicated directly. However, in fully connected layers, the output of the previous layer x ∈ X C in ×N and error ∂ ∂y ∈ X C out ×N are transmitted instead, and ∇w is computed on the PS. Therefore, with m nodes communication is modified from m · C out · C in to m · N · (C out + C in ), which may be significantly smaller, and balances the load between the agents and the normally under-utilized PS.
Parameter servers also enable handling heterogeneity, both in training agents [125] and in network settings (e.g., latency) [109] . The former work models distributed SGD over clusters with heterogeneous computing resources, and proposes two distributed algorithms based on stalesynchronous parallelism. Specifically, by decoupling global and local learning rates, unstable convergence caused by stragglers is mitigated. The latter work [109] acknowledges that training may be geo-distributed, i.e., originating from different locations, and proposes a hierarchical PS infrastructure that only synchronizes "significant" (large enough gradient) updates between data centers. To support this, the Approximate Synchronous Parallel model is defined, proven to retain convergence for SGD, and empirically shown to converge up to 5.6× faster with GoogLeNet.
In a decentralized setting, load balancing can be achieved using asynchronous training. However, performing the parameter exchange cannot use the allreduce operation, as it incurs global synchronization. One approach to inconsistent decentralized parameter update is to use a fixed communication graph, via neighbor-based gradient/parameter exchange [161] . Another approach is to use Gossip Algorithms [24] , in which each node communicates with a fixed number of random nodes (typically in the order of 3). In the random case, with very high probability [67] , communicating for 1.639 · loд 2 m time-steps with m nodes will disseminate the data across all others. As strong consistency is not required for distributed deep learning, this method has shown marginal success for SGD [55, 126, 209] , attaining both convergence and faster performance than allreduce SGD up to 32 nodes. On larger systems, the resulting test accuracy degrades. One approach to improve this could be to employ deterministic correction protocols [101] .
Parameter and Gradient Compression
The distributed SGD algorithm requires global reduction operations to converge. As discussed above, reducing the number of messages (via an inconsistent view of w or efficient collective operations) is possible. Here, we discuss reducing the size of each message.
There are two general ways to conserve communication bandwidth in distributed deep learning: compressing the parameters with efficient data representations, and avoiding sending unnecessary information altogether, resulting in communication of sparse data structures. While the methods in the former category are orthogonal to the network infrastructure, the methods applied in the latter category differ when implemented using centralized (PS) and decentralized topologies.
Quantization.
A prominent data representation for gradient (or parameter) compression is quantization, i.e., mapping continuous information into buckets that represent sets of values (usually ranges). It has been shown [138] that the distributions of parameter and gradient values are narrowly dispersed (Figure 22(a) ), thus these methods are effective in representing the working range to reduce the number of bits per parameter. This method has been successfully utilized in deep learning, both during training [64, 88, 112] and for inference, where values are quantized post-training [210, 276] . Some papers go so far as to quantize gradients to binary [51, 217] or ternary [156] values, while still attaining convergence with marginally reduced accuracy. Quantization is commonly performed by way of reducing floating-point dynamic range [56, 64, 88] . In particular, such methods represent IEEE 754 32-bit floating-point values with fewer bits. While already applied to inference hardware [35] , the first successful instance of reduced precision for training [88] was performed with IEEE 754 16-bit float values ("half precision"). As evaluated in the paper, quantized training does not work "out-of-the-box" for lossy compression such as reduced precision. Rather, it depends on rounding the parameters in a way that preserves the expected value (E) of the parameters. To resolve this issue, the paper proposes Stochastic Rounding [88] , which randomly rounds numbers down or up, providing correct values in expectation.
Other forms of quantization extend upon these ideas. QSGD [5] generalizes stochastic rounding to stochastic quantization, and proposes multi-level gradient quantization schemes. Deep Compression [91] also employs the lossless Huffman Coding [113] to further increase storage efficiency without impairing convergence. Binarized Neural Networks (BNNs) [50] , Ternary Weight Networks [156] , TernGrad [245] , and DoReFa-Net [276] quantize networks to binary parameters, ternary parameters, ternary gradients, and binary parameters+ternary gradients, respectively. Both BNNs (in some cases) and TernGrad use stochastic rounding to lower the input representation. Last, FlexPoint [138] implements block floating-point arithmetic [247] , which computes the mantissa portion of the floating-point values as fixed-point math, and shares the exponent part among multiple parameters/gradients. To accommodate changes to the exponents, predictive analysis is used for estimating subsequent values.
Essential to the convergence of SGD with lossy quantization is local gradient accumulation. Particularly in distributed environments, where the updates are inconsistent, it is important to carry the quantization error to the next gradient, accumulating error to avoid drift. The idea originates from Sigma-Delta Modulation [217] and has proven to be successful in many cases. Deep Gradient Compression [163] extends this idea by accumulating momentum locally as well, further decreasing the loss in accuracy to become non-negligible and even resulting in a minor accuracy increase.
Sparsification.
DNNs (and CNNs in particular) exhibit sparse gradients during parameter updates. This is primarily due to the very large number of parameters that do not necessarily change all at once; and operators such as convolutions, in which the optimization process may improve the accuracy of certain convolution kernels. Therefore, the full gradient is not necessary to retain convergence, and various methods that leverage this feature have been proposed.
The first application of gradient sparsification [232] prunes gradients using a static threshold, below which an element should not be sent. Results show up to 54× speedup for 80 nodes and up to 1.8% reduction in error. The authors achieved a compression ratio (which also includes 32-bit fixed point quantization) of 846-2,871× for a non-convolutional DNN. Subsequent works propose relative (e.g., top 1%) [3, 32, 222] and adaptive thresholds [69] to transmit only the "important" gradients, based on their absolute value. To counter the accuracy loss as a result of sparsification, some works suggest to condition gradient values by changing the DNN architecture, adding normalization operators [3] ; whereas others [163] propose local gradient clipping and warm-up training.
In a centralized setting (Section 7.2), distributing sparse gradients is straightforward-sparse messages are sent between the training agents and the PS. However, implementing the necessary allreduce in a decentralized setting is not as simple, because each agent may contribute different non-zero indices (dimensions) in its gradient. Kylix [274] implements sparse allreduce in two steps, first exchanging the indices and then the data. While this is desirable for systems where the sparsity pattern per node does not change, in deep learning the gradient indices differ with each iteration. SparCML [214] targets the specifics of deep learning explicitly by supporting arbitrarily changing indices in a framework for sparse allreduce operations, as illustrated in Figure 22 (b). SparCML combines sending only the top-k most significant indices with quantization and supports sparse vectors of heterogeneous sizes. The system switches between a sparse and dense representation automatically, informed by a simple performance model. SparCML achieves a speedup of more than 20× over a well-tuned CNTK implementation on Ethernet.
Other
Techniques. In Section 7.2, we discuss Project Adam sending activations and errors instead of parameters, decreasing the overall footprint for fully connected layers in favor of redundant computation on the PS. The Poseidon (formerly Petuum) framework [252, 264, 265] extends the idea of transmitting decomposed outer products u · v T of w, generalizing the concept to other fields in machine learning as Sufficient Factor Broadcasting (SFB) [251] . With SFB, the activations are not sent to the PS, but rather disseminated among the training agents for local recomposition. SFB works best in centralized topologies, as recomposing the gradients in a decentralized environment causes each agent to process m − 1 additional outer products with each step, where m is the number of agents.
A different approach to reduce DNN memory footprint is to design them specifically for that purpose [41, 114, 135, 155] . Such works make use of memory-efficient operators and techniques, mostly applied to convolutions, to train networks that fit on devices such as FPGAs and mobile phones. Applied techniques include layers constructed from a multitude of 1 × 1 convolutions [114] , reshaping [155] or applying Tucker Decomposition [135] on convolution tensors, and separable convolutions (sequential application of reduced-dimension convolutions) [41, 108] . The papers show that DNNs can decrease in size (up to 50×) and evaluation time (6.13×), exhibiting minor reduction in accuracy.
Model Consolidation
In this section, we discuss the far (inconsistent) end of the parameter consistency spectrum (shown in Figure 23 ) in distributed deep learning. In such cases, parameter updates are highly infrequent (or nonexistent), and thus precautions must be taken with the received values. In particular, rather than running data-parallel SGD on multiple nodes, distributed deep learning can be achieved by assigning training agents with different copies of w and combining the resulting models, either post-training or several times during training. While the latter can be seen as a generalization of an inconsistent view of w, the former may entirely change the training and inference processes, depending on the method.
Ensemble Learning and Knowledge Distillation.
A widely used technique for post-training consolidation is ensemble learning [118, 153, 225] . With ensembles, multiple instances of w are trained separately on the same dataset, and the overall prediction is the average of the predictions of the ensemble members, i.e., f (
. Ensemble learning has been used extensively in machine learning before the deep learning era [66] as a form of boosting, and typically increases the overall accuracy over a single model. Thus, it is routinely applied in machine-learning competitions such as ILSVRC [62] and in industrial applications. Distributed training of ensembles is a completely parallel process, requiring no communication between the agents. However, works such as TreeNets [153] (Section 6.2) combine ensemble learning with custom (ensemble-aware) loss functions to promote diversity between ensemble members.
Given that ensembles consume a factor of m more memory and compute power, another posttraining model consolidation technique is to reduce the size of a DNN using knowledge distillation [10, 97] . In this scheme, training is performed in two steps: in the first step, a large network or an ensemble is trained normally; and the second step trains a single neural network to mimic the output of the large ensemble. Results [97] show that the second network is easier to train on the ensemble than on a labeled dataset, attaining the same word error rate as an ensemble of 10 DNNs.
Model
Averaging. Another technique for consolidating models is model averaging [201] . Such methods may separately run m SGD instances on different machines, aggregating the parameters only once (post-training) [277] or every few iterations [33, 174] . While these methods are proven to converge, applying stale-synchronous SGD (Section 7.1) leads to higher overall accuracy.
To overcome accuracy degradation as a result of infrequent averaging, more sophisticated consolidation methods include Elastic Averaging SGD (EASGD) [268] and Natural Gradient Descent [202] . EASGD is based on a centralized environment (i.e., including a PS), extending direct averaging by using elastic forces between the training agents' view of w (w (t,i ) ) and the PS's view (w). This allows the agents to "explore" further by increasing the possible distance of each agent from the average, and also allows to communicate sparsely with respect to time (iterations). EASGD was reported [268] to outperform the DistBelief [57] SGD method in terms of accuracy, shown to be tolerant in terms of update delay, and was used successfully in practice for communication reduction by other works [159, 258] .
Natural Gradient Descent (NG-SGD) can also be used to deal with diverging parameters in different agents [202] . NG-SGD modifies SGD to define learning-rate matrices, approximating the inverse Fisher information matrix and thus natural gradients. By averaging agent parameters only every k samples (typically in the order of hundreds of thousands), the algorithm allows agents to gradually diverge and synchronize less than traditional SGD. Natural Gradients were also approximated for distributed deep learning using Kronecker Factorization (K-FAC) [11] , where the work is divided between gradient-and statistics-computing agents (for Fisher matrix blocks).
In distributed settings, algorithms are also inspected w.r.t. fault tolerance. Krum [20] is a Byzantine fault-tolerant [145] SGD algorithm, allowing up to f Byzantine training agents. In particular, the paper shows that any gradient aggregation rule based on linear combination cannot sustain a single Byzantine agent. By combining specific m − f gradients (that are closest to each other), Krum is able to overcome adversarial gradient inputs from the network.
Optimization Algorithms and Architecture Search
As training in deep learning is a nonlinear optimization problem, other algorithms that exhibit concurrency can be used in SGD's stead [23] . Furthermore, it is possible to use excess computational power to perform meta-optimization, searching for better hyper-parameters and DNN architectures.
Parameter Search.
Supervised learning can either be viewed as a stochastic optimization process that uses one or a minibatch of samples at a time, or it can be expressed as a batch optimization problem, where the entire dataset is necessary to obtain gradients for descent. Batch optimization has been used for deep learning since the inception of DNNs [151] , using first-and second-order methods [185] such as Levenberg-Marquardt, Conjugate Gradient (CG), and L-BFGS. Although considerably more computationally expensive than SGD, there are several advantages to such approaches, including increased concurrency (as data-parallelism increases) and better theoretical convergence guarantees (e.g., second-order methods converge locally at a quadratic rate). As mentioned in Sections 3 and 6.1, large-minibatch training represents a middle ground between SGD and batch methods. Such methods combine the "best of both worlds"-on one hand, they exhibit increased inherent concurrency (as higher-order methods); on the other hand, they employ stochasticity, which, despite the sublinear rate of convergence, works well in practice.
For distributed deep learning, batch methods [147] (specifically CG and L-BFGS) and Hessianfree second-order optimization [43, 95, 171] have initially been favored due to their apparent scalability compared to traditional SGD (Algorithm 1). However, due to the superior generalization properties of first-order stochastic optimization, and the successful DistBelief [57] implementation of inconsistent SGD (called Downpour SGD, based on HOGWILD [213] ); it was found that the quadratic increase of batch methods in memory, communication, and computations due to high dimensionality is not desirable. To overcome these issues, stochastic variants of L-BFGS have been proposed [28, 177] that estimate the inverse Hessian matrix and proven to converge at a linear rate in strongly-convex, Lipschitz-continuous settings [177] .
Other optimization algorithms applied to deep learning attempt to: (a) reduce the variance of SGD incurred by random sampling [128] , (b) use the Alternating Direction Method of Multipliers (ADMM) [25] to skip the backpropagation altogether [235] , (c) use K-FAC to approximate second order information [191] , or (d) use the Neumann series expansion to approximate the Hessian matrix [139] , scaling to large minibatch sizes (32k with no accuracy loss, 131k with minimal loss) without substantial computational overhead.
Gradient-free evolutionary algorithms have also been employed for deep learning, where examples include Genetic Algorithms [199, 250] , Neuro-Evolution [175, 243] , and Particle-Swarm Optimization [168] . Apart from recombination/evolution steps, training behavior is similar to ensemble learning, and thus these algorithms are more amenable to parallelism than traditional gradient descent. As we show in the rest of this section, the gradient-independent nature of such algorithms enable their use for meta-optimization of both hyper-parameters and DNN architectures.
7.5.2
Hyper-parameter Search. The multitude of hyper-parameters in SGD (e.g., learning rate, momentum, maximal staleness) and their adverse effect on the resulting accuracy hinders research efforts into new techniques in machine learning. Until recently, the prominent method for hyperparameter search was to perform parameter sweeps (i.e., grid search over feasible ranges). Since this method increases the overall time exponentially with the number of hyper-parameters, its effectiveness is limited by the availability of computing power.
Several methods try to expand beyond simple parameter sweeps by making educated guesses and tuning hyper-parameters during training. In the former class, methods include Bayesian optimization [227] , predictive analysis of the learning curves (e.g., training error, validation error) [13, 137] for dropping undesirable configurations, and sampling the hyper-parameter space efficiently using spectral methods such as Compressed Sensing [92] . As for tuning hyper-parameters during training, Omnivore [90] employs predictive analysis and grid searches every predetermined number of minutes to modify the momentum and a hyperparameter controlling local gradient staleness. The paper shows that in distributed environments, controlling the synchronous SGD node-group size during training can increase both accuracy and performance. YellowFin [266] uses the local gradient curvature and variance to tune momentum, working especially well on LSTM-based models and asynchronous environments, performing up to 3.28× faster than the Adam optimizer (Table 3) .
Metaheuristic optimization algorithms can inherently integrate hyper-parameter tuning with training and are thus used for DNNs. Such methods include Particle Swarm Optimization (PSO)based deep learning [168] ; and CoDeepNEAT [175] , a modification of the NEAT algorithm that simultaneously searches for hyper-parameter and architecture configurations (see below). Such methods scale almost linearly, due to abundance of independent computations.
Last, Population-based Training [122] (PBT) uses a reinforcement learning approach to "explore" and "exploit" the hyper-parameter space. As illustrated in Figure 24 (a), each training agent independently samples (exploits) information from other agents every few SGD iterations. The information is then used to select the best configuration (e.g., using a t-test), and hyper-parameters are in turn perturbed (explored) to continue learning. This creates a decentralized topology where communication is nondeterministic (i.e., exploitation is performed with a randomly sampled agent), which may scale better as the number of training agents increases. SMBO-based search methods rely on optimizing an architecture candidate, defining a finite set of states to explore (e.g., search tree children), and traversing those sets. As a result, concurrency depends on the number of points in the search space at a given time. Examples of SMBO include DeepArchitect [180] , which proposes a DNN definition language that allows programmers to explicitly define the space; PNASNet [164] , which searches for networks ordered by increasing complexity using a search algorithm based on A*, conserving half the evaluated models compared to an equivalent RL approach [280] ; SMASH [27] , which assesses optimality (fitness) of candidate networks using another CNN that maps the given DNN architecture to weights for testing; and DARTS [166] , which formulates architecture search as a bi-level, differentiable optimization problem.
Many recent DNN architectures (Section 4.3) exhibit self-similarity and repeating sub-units (modules). This observation can be leveraged to dramatically reduce the number of explored architectures, composing networks hierarchically out of modules and basic blocks (e.g., convolution). This approach has been used successfully in the community, creating new candidates for both CNN modules [164, 165, 175, 211, 275, 280] and RNN units [200, 279] .
RL-based architecture search uses the accuracy of the resulting network as a reward function, whereas modifications to the DNN or its hyper-parameters are actions. In Neural Architecture Search (NAS) [279] , the parameters of each layer can be modified, but the number of layers is fixed. A sharded PS-based distributed system, in conjunction with policy gradient optimization [248] , is used for training. Other examples include MetaQNN [12] and BlockQNN [275] , which operate similarly to NAS, but use Q-learning for optimization; and ENAS [200] , which significantly reduces computational time over NAS (by three orders of magnitude) by sharing parameters across children DNNs (i.e., networks in the immediate search space).
Evolutionary Algorithms (EA) are advantageous for architecture search, as any function (not necessarily differentiable) can be optimized using these methods. HyperNEAT was the first EA successfully applied [243] to deep learning, used for training weights and DNN architecture at the same time; and CoDeepNEAT [175] defines a variant of the NEAT algorithm to optimize hyperparameters and architecture, using the self-similarity feature of DNNs by optimizing "blueprints" that are composed of modules. Genetic CNNs [250] uses Genetic Algorithms (GAs) by encoding the DNN connections as binary genes (as required in GAs, shown in Figure 24(b) ), and training the population of DNNs with every time-step, using the final accuracy as the fitness function. GAs are highly amenable to parallelism, and have been successfully used for very large-scale training [261] , where 18,000 nodes were used on the Titan supercomputer for 24h to obtain state-of-the-art accuracy for segmentation and reconstruction problems.
Large-Scale Evolution [212] also uses GAs, but defines a set of specific mutations (e.g., insert convolution, alter stride) that can be applied. Large-Scale Evolution outperforms some existing RL-based methods in terms of accuracy, as well as in terms of scalability, as GAs can run the entire population in parallel (where accuracy increases with population size in expectation). However, in the general case GA requires synchronous reductive communication between time-steps for selection of the fittest candidates. To overcome this issue, the paper employs tournament selection [82] , which only performs pairwise comparisons between population members.
Additional GA architecture search methods include the use of multi-level hierarchical representations of DNNs [165] , which implement an asynchronous distributed tournament selection (centralized, queue-based implementation) with specialized mutation. Regularized Evolution (Amoe-baNets) [211] further extends GA with tournament selection by removing the oldest sample from the population each iteration (akin to death in nature), thus regularizing the optimization process. AmoebaNets outperform all existing methods [111] , including manually engineered DNNs and RL-based searches, with 3% top-5 error for ImageNet and 1.5% top-1 error for CIFAR-10 (compared to 5.29% and 3.62% on the best instances of DenseNet, see Table 5 ).
CONCLUDING REMARKS
The world of deep learning is brimming with concurrency. Nearly every aspect of training, from the computation of a convolution to the meta-optimization of DNN architectures, is inherently parallel. Even if an aspect is sequential, its consistency requirements can be reduced, due to the robustness of nonlinear optimization, to increase concurrency while still attaining reasonable accuracy, if not better. In this article, we give an overview of many of these aspects, the respective approaches documented in literature, and we provide concurrency analysis using the W-D model when possible.
It is hard to predict what the future holds for this highly active field of research (many have tried over the years). Below, we highlight potential directions for future research in parallel and distributed deep learning.
As research progresses, DNN architectures are becoming deeper and more interconnected, between consecutive and non-consecutive layers ("skip connections"). Apart from accuracy, considerable effort is devoted to reducing the memory footprint and number of operations [108, 211] , to successfully run inference on mobile devices. This also means that post-training DNN compression [91] will likely be researched further, and training compressible networks will be desirable. Since mobile hardware is limited in memory capacity and has to be energy efficient, specialized DNN computational hardware is frequently proposed [233] . We see this trend with the NVIDIA Tensor Cores [188] , the Tensor Processing Unit [129] , other ASICs and FPGAs [35, 187] , and even neuromorphic computing [4] . Handling DNN sparsity (e.g., after compression) is a focus for some ASICs [269] , and advances in recurrent networks and attention learning [30, 253] indicate that training and inference hardware would also need to work efficiently with variable-length inputs.
Computing individual operators is highly optimized today (Section 5), and thus current research is oriented toward inter-layer and whole-DNN optimization. TensorFlow XLA [83] , Tensor Comprehensions [240] , Latte [236] , and TVM [34] compile entire neural network graphs at once, performing a variety of transformations (e.g., fusion) to optimize execution time, achieving 4× speedup over manually tuned individual operators. We expect research to continue in this direction to the point where DNN evaluation is close to optimal in terms of operations and shared-memory optimizations.
Techniques applied in distributed deep learning are converging to the point where a standard programming interface (or framework) can be designed. In the future, ecosystems such as Ease.ml [158] may make the definition of a training scheme (e.g., with respect to centralization and gradient consistency) easier, hiding most of the low-level infrastructure setup. Combining the increasing support for cloud systems [271] and elastic training [195] (where nodes can be spun up and removed at will) with the latest developments in evolutionary algorithms (see Section 7.5), we may see adaptive and financially viable optimization methods rising to prominence.
Finally, deep learning is being used to solve increasingly complex problems such as routing algorithms [85] and hierarchical task combination [77] . Research toward Artificial General Intelligence is now focusing on multi-purpose networks [127, 130] , which creates new, unexplored opportunities for model parallelism and different training algorithms. Searching for adequate multi-purpose networks may be beyond the ingenuity of a human team, and as meta-optimization (specifically, architecture search) and progressive training [132] increase in usability and quality; parameter sweeps and manual DNN architecture engineering will become obsolete. Supporting this claim is the fact that the current state-of-the-art CNN in computer vision [211] (CIFAR-10 and ImageNet datasets) is the result of an automated architecture search. Exploiting parallelism is necessary for such breakthroughs and others, going hand in hand with the advancement of deep learning as a field.
