Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting the minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
INTRODUCTION
In the last five years, neural networks and deep architectures have proven very effective in application areas such as computer vision, speech recognition, and machine translation. The recent breakthroughs of AlphaGo further cement interest in employing deep architectures to develop intelligent machines. Although deep architectures such as convolutional neural networks (CNNs) [22, 32, 34], recurrent neural networks (RNNs) [21, 45], and restricted Boltzmann machines (RBMs) [19, 31] have been around since the 1980s, they have never been under the spotlight. Why are they thriving now? The convincing factor this time around is scale, in both data volume and computation resources.
When the scale of training data is small, all supervised learning algorithms (e.g., decision trees, support vector machines, and logistic regression) achieve the same level of classification accuracy. In 2012, AlexNet [32] demonstrated that with millions of training images from ImageNet [16] , CNNs substantially outperform all prior works on image classification. Since then it has been shown in several vertical domains that large training datasets can improve the accuracy of classification tasks.
Since the computation complexity of a deep learning algorithm is high (e.g., the convolution stage of CNNs requires a six-level nested loop), the scale of data demands scalable computation resources. Fortunately, processor speed has increased more than a thousandfold in the last three decades. In addition, with specialized arrays of processors (e.g., GPUs) and accessibility of parallel computing infrastructures via the cloud, millions of cores can be utilized simultaneously for training. However, scaling up computation is not merely a matter of throwing in an unlimited number of cores. As Amdahl's law [4] states, the non-parallelizable portion of a computation task, such as communication, I/O, and interprocess synchronization, may cap computation speedup. For instance, if the non-parallelizable portion is 50%, reducing the time of the parallelizable portion to zero achieves only a speedup factor of two. All deep learning frameworks involve substantial non-parallelizable overheads, which must be carefully mitigated to speed up training.
Several open-source projects (e.g., Caffe [29], MXNet [9], TensorFlow [3], and Torch [12]) have been devoted to speeding up the training of deep networks. Their efforts can be summarized into two approaches: deep-learning algorithm optimization and algorithm parallelization (details of related work are presented in Section 1.1). The former includes using better convolution algorithms, improving stochastic gradient descent (SGD) with faster methods, employing compression/quantization, and tuning the learning rate with advanced optimization techniques. Indeed, most open-source libraries have quickly adopted available state-of-the-art optimizations. However, most users in academia and industry do not know how to set parameters, both algorithmic and system, to conduct cost-effective training. Researchers and professionals face at least the following questions at three levels, which are intra-GPU, inter-GPU, and inter-machine:
(1) What is the bottleneck of speeding up deep learning training by parallelism?
(2) With X amount of data, what is the size of each mini-batch (X mini) and how to maximize GPU utilization?
(3) How many GPUs (G) should be employed, and how should such a system be configured?
(4) How many parameter servers (N ps) should be deployed when building a distributed system?
In this work, we aim to answer the above questions by providing system configuration guidelines given the characteristics of the training data (the number of training instances and the size of each training instance), as well as hardware parameters (such as GPU memory size, internal transmission bandwidth, e.g. bus bandwidth, and external transmission bandwidth, e.g. network bandwidth). We identify computation bottlenecks and I/O overheads of representative frameworks. From the insights we observed in benchmarking, we propose guidelines allowing users to configure a high-performance deep learning system for their target tasks.
Related Work
Since deep-learning training is time-consuming, many previous studies have been devoted to speeding up training. These prior contributions can be divided into two approaches: algorithmic and system. The algorithmic approach accelerates the training algorithm, whereas the system approach focuses on employing improved resources to achieve parallel training. To ensure scalability, the system approach may require enhancing the training algorithm to take full advantage of the increased resources.
1.1.1 Algorithmic Approach. Stochastic gradient descent (SGD) is the de facto optimization algorithm for training a deep architecture. Many SGD techniques have been developed for achieving faster convergence to the global minimum. The settings of hyper-parameters such as learning rate and mini-batch size are crucial to the training performance. Hinton and Bengio [6, 25] provide recommendations on setting hyper-parameters commonly used in gradient-based training. Batch renormalization can be an effective strategy to train a network with small or non-i.i.d minibatches [27] . Momentum-based acceleration schemes increase the speed of learning and damp oscillations in directions of high curvature [41] . Per-parameter adaptive learning rate methods help reduce large gradients and decrease the learning rate over time [17] .
More efficient algorithms can improve speed. Convolution consumes 70% to 90% of the execution time of CNN-based training. Some FFT-based convolution schemes were proposed [37] to achieve speedup. Additionally, Firas et al. proposed three matrix layout schemes using lowering operations [23]. Caffe con Troll implements a CPU-GPU hybrid system that contains several lowering operations and, at the same time, employs a simple automatic optimizer to select the best lowering. Some compression algorithms [18] are designed for both good compression ratios and fast decompression speed, so that block-wise operations such as matrix multiplication can be executed directly on the compressed representations.
System Approach.
A deep learning training job consists of two computationally intensive arithmetic operations: matrix multiplication and convolution. A GPU is well-suited for speeding up such operations since they are easy to parallelize. To achieve further speedup, the next logical step is to employ multiple GPUs and to configure a distributed cluster of CPUs and GPUs. The computation time can be largely reduced via data parallelism and/or model parallelism. Many projects have proven parallelism to be helpful [11, 15, 26, 30, 40, 46].
According to Amdahl's law, the peak performance of a parallel architecture is capped by the overhead portion of the computation task. In the context of deep learning, the training overhead includes synchronization between distributed threads, disk I/O, communication I/O, and memory access. To reduce synchronization delay, Zinkevich et al. [48] proposed an asynchronous distributed SGD algorithm to guarantee parallel acceleration without tight latency constraints. Chen et al. [8] proposed adding backup workers to the synchronous SGD algorithm to mitigate the bottleneck. To reduce the impact of I/O on the overall speedup, most open-source frameworks (see Section 1.1.3) attempt to conceal I/O behind computation via the pipeline approach proposed in [36]. Such an approach requires a computation unit to be sufficiently long so as to hide I/O overheads as much as possible. The pipeline approach, however, demands carefully setting the unit size of computation (or mini-batch size) and the number of parameter servers. We will propose how to best estimate these configuration parameters in Section 3.
CPUs), and efficiency. Additionally, MXNet provides a superset programming interface to be compatible with other frameworks. MXNet is lightweight and supports multiple programming languages, e.g., Python, R, Julia, and Scala.
• TensorFlow: TensorFlow [3] , which supports distributed computation, is an open-source framework developed by Google. TensorFlow's design philosophy is flexibility, portability, and high efficiency. TensorFlow takes computations described by using a dataflow model and maps them onto a wide variety of hardware platforms. TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow graph, with many different computational devices all collaborating to update a set of shared parameters or states.
• Torch: Torch [12] is designed to make developing and extending numerical algorithms easy. Based on this philosophy, Torch leverages Lua, a fast interpreted language (with a fast just-in-time (JIT) compiler) that can be embedded in a C application and provides APIs in C, making it easy to wrap libraries behind a unified interface to C/C++.
Among the introduced frameworks, MXNet and TensorFlow have built-in support for distributed training. Users can easily develop algorithms running on computing clusters with thousands of CPUs or GPUs. Several works have been proposed to give users a glimpse of the factors that they must take into consideration. Bahrampour et al. [5] provide a comparative study of different frameworks with respect to extensibility, hardware utilization, and performance. Shi et al. [42] provide a performance study of selected frameworks. These works offer practitioners high-level guidelines for selecting an appropriate framework. Given a selected framework, our work aims to provide further configuration guidelines to make training both fast and cost-effective.
Contribution Summary
In summary, this work makes the following contributions:
(1) Identifying computation bottlenecks and devising their remedies. We benchmark representative networks and datasets to identify the typical bottlenecks of large-scale training. We then devise remedies to reduce or mask computation overheads (I/O and communication) to improve training speed.
(2) Quantifying remedies into an optimization model. We formulate our remedies into an optimization model to determine the optimal mini-batch size and carefully balance memory and speed tradeoffs so as to employ the fastest algorithms given the memory constraint.
(3) Recommending distributed configuration involving multiple GPUs and parameter servers. When the workload cannot be handled by a single GPU or machine, we propose lemmas to recommend the number of GPUs and parameter servers to configure so as to achieve cost-effective speedup.
Both real-world deployment and empirical studies attest that our remedies are very effective.
PRELIMINARIES
This section presents a typical deep learning training process including performance factors and their relevant parameters. We then show the setup of the evaluation environment.
Deep Learning Training Process
Figure 1 depicts a general architecture of deep-learning training and data flow. A local architecture is basically a commodity computer equipped with G GPUs. When aiming to improve parallelism via a distributed architecture, a worker and a parameter server can be replicated into multiple copies connected by a network. The mini-batch processing pipeline in the training process consists of seven steps. After the model parameters W and the data processing pipeline are initialized, the training process repeats until all training data is seen.
(1) Parameter refresh. In distributed training, the latest copy of model parameters W is pulled from parameter servers at the beginning of each mini-batch processing. W is then loaded onto GPU memory. A distributed environment consists of N w workers and N ps parameter servers for managing shared parameters.
(2) Data loading. A subset of the X training instances, called a mini-batch of size X mini, is loaded from persistent storage into main memory.
(3) Data preparation. X mini instances are transformed into the required input format. These instances may be augmented to mitigate the over-fitting problem and enrich sample diversity.
Among the seven steps, step 5 performs computation, and the other steps that cannot be hidden behind step 5 are considered overheads. The larger the fraction of time that those overhead steps take, the less effective parallelism becomes. Therefore, our tasks are minimizing overhead time and hiding overheads via pipelining as much as possible.
The remainder of this paper demonstrates how the following parameters can be carefully tuned to achieve such goals, organized into four sections. In Section 3.1, we provide a procedure to recommend a mini-batch size that leads to maximum training performance. Section 3.2 provides an in-depth analysis of training in a multi-GPU environment. We provide a lemma to estimate the number of GPUs G for a desired factor of speedup. Increasing the number of GPUs not only improves speedup, but also induces communication overheads between GPUs. We also discuss how to alleviate the impact of these overheads. In Section 3.3, we address issues involving distributed workers. When the training system scales horizontally, we need an extra cluster to manage the parameters in addition to training hosts in the distributed environment. The communication between training hosts and parameter servers is an overhead that could seriously degrade training speedup. We propose a scheme to estimate the number of parameter servers N ps given network bandwidth B ps .
Evaluation Environment
We set up our evaluation environment with Elastic Compute Cloud (EC2) of Amazon Web Services (AWS). All experiments run on EC2 P2 instances equipped with NVIDIA Tesla K80 accelerators, each of which contains a pair of NVIDIA GK210 GPUs. Each GPU provides 12 GB of memory and 2,496 parallel processing cores. The CPU is a customized version of the Intel Broadwell processor running at 2.7 GHz. Table 1 shows the hardware configurations of P2 type instances. To avoid unexpected GPU clock rate adjustment in our experiments, we disable the GPU autoboost function. We perform experiments and demonstrate our ideas using MXNet and TensorFlow. Virtual machines are launched from the Amazon deep learning AMI (Amazon Machine Image) v2.1 preloaded with NVIDIA CUDA toolkit v7.5 and cuDNN v5.1. We conduct experiments on the ILSVRC-2012 dataset, a subset of ImageNet [16] containing 1,000 categories and 1.2 million images, stored on SSD. A separate set containing 50,000 labeled images is used as validation data.
CONFIGURATION OF HIGH PERFORMANCE TRAINING SYSTEM
We study configuration in three incremental steps, starting from a single GPU, then expanding our benchmarking to multiple GPUs, and finally to distributed nodes where each node consists of multiple GPUs. Each of these three steps focuses on analyzing one system configuration.
In the single GPU study, we analyze how the mini-batch size X mini can be decided to achieve fast training speed. Most prior studies only consider tuning X mini algorithmically, that is, selecting a size that can achieve fast convergence. However, taking the minimum number of epochs to reach convergence does not directly translate to shortest training time. In Section 3.1 we provide system analysis to determine X mini and solve optimized minibatch selection with integer linear programming.
As multiple GPUs are employed to conduct training, data moving is the major bottleneck, which caps the speedup performance according to Amdahl's law. Therefore, to be cost-effective, we should not use more GPUs when speedup improvement has saturated. Section 3.2 presents a systematic procedure to estimate an effective number of GPUs G.
When training is conducted in a distributed environment, we further study communication overhead. Section 3.3 depicts the distributed training process and provides a lemma to estimate the required number of parameter servers in a cost-effective system configuration.
Training on single GPU instance
In this section, we first point out the common performance pitfalls in designing neural networks. We illustrate that the setting of mini-batch size is the primary factor that determines training speed. We then formulate selecting the mini-batch size X mini as an optimization problem and provide a procedure to solve for X mini that can achieve fastest training speed.
3.1.1 Identifying System Issues. Most neural networks are initially designed according to some heuristics. Researchers may not have the full picture of their model's feasibility, convergence quality, and prediction quality unless they conduct some experiments. During the experimental process, various hyper-parameter values may be tested exhaustively by trial and error. In our own experience, it is typically unknown at the beginning how long one round of a training job will take, let alone how to configure a cost-effective system that maximizes training speed. A suboptimal system configuration can lead to excessive execution time because of the following issues:
• Shortage of GPU memory space. A GPU cannot commence computation without the data, including model parameters, gradients, computation workspace, etc, being loaded into GPU memory. A neural network designed without system knowledge may require more memory capacity than available memory. This excessive memory use may cause unnecessary thrashing and prolong training time.
• Ineffective tradeoff between speed and memory. Deep learning frameworks may execute operations of a training task by using different algorithms, which have different speed and memory-use trade-offs. The choice of algorithm is a layer-dependent decision. The selection factors include input data size, layer parameters, mini-batch size, and available GPU memory space. Consider the convolution operation as an example. An FFT-based algorithm runs faster than a GEMM-based one but requires more memory. Training speed may degrade when a large X mini leaves too little memory to run the faster FFT-based algorithm. Thus, when tuning the factors mentioned above, we should consider the impact on memory consumption because the memory budget affects the selection of algorithm.
Both training convergence and training speed are affected by the mini-batch size. In other words, to select a good mini-batch size, one must examine both the algorithmic and the system aspects. From the algorithmic aspect, it is suggested that the mini-batch size be larger than the number of output classes and that a mini-batch contain at least one sample from each class [25]. Such diversified training data leads to more stable convergence. From the system aspect, a proper mini-batch size helps to improve parallelism inside the GPU and enables a faster implementation of an operator. Starting from the mini-batch size suggested by the algorithmic aspect, we introduce the system aspect into deciding X mini .
Choosing Convolution Algorithms.
There are two time-consuming operations in deep learning: matrix multiplication and convolution. Parallelizing matrix multiplication is rather straightforward, whereas speeding up convolution involves a memory and speed trade-off. Two representative families of convolution algorithms are GEMM-based [10] and FFT-based [37]. GEMM-based algorithms convert convolution to a matrix multiplication, which can be slow, but the upside is that it requires less memory space. FFT-based algorithms run faster than GEMM-based ones by using efficient matrix multiplication and reducing the number of floating point operations. However, FFT-based algorithms demand substantially more memory as the filters are padded to be the same size as the input. In addition, FFT-based algorithms require extra memory space for feature maps during domain transformation. Table 2 shows five convolution layers of AlexNet and their memory-usage ratios of FFT over GEMM given mini-batch size 128. The memory space required by the first layer with FFT is 11.6 times that required by GEMM. (The parameters B i × H i and B i+1 × H i+1 represent the number of pixels of the inputs and outputs at the i th layer, respectively. Similarly, the parameters D i and D i+1 represent the depths of the inputs and outputs at the i th layer, respectively. The parameter F represents the size of filters.)
To further understand the impact of X mini , we experimented with MXNet and TensorFlow, and plot system throughput (y-axis) versus X mini (x-axis) in Figure 2. Although different frameworks may yield different throughputs, the trend remains the same: the system throughput degrades once X mini reaches a threshold. The throughput drops because MXNet and TensorFlow choose to run a slower convolution algorithm when the increased X mini constrains free memory. How can the optimal X mini be determined? We next formulate the problem of determining X mini as an optimization problem.
Optimizing Mini-batch Size. In order to formulate the problem of determining X mini , we first define a memory constraint M bound , which is built into the later optimization formulas for X mini . In our formulation, most of the symbols follow the conventions of [2].
We assume that a CNN such as AlexNet [32] consists of two major components: feature extraction and classification. Further, we assume that the feature extraction part comprises n layers where stacked convolution layers are optionally followed by pooling layers, and the classification part consists of m fully-connected layers. We use B i × H i × D i and B i+1 × H i+1 × D i+1 , where i ∈ {0, 1, . . . , n}, to represent the sizes of inputs and outputs of convolution layers (or pooling layers), respectively. In particular, B 0 × H 0 × D 0 represents the size of the input data. If we take training AlexNet on ImageNet [16] as the example, B 0 × H 0 × D 0 is equal to 224 × 224 × 3. For the i th layer of convolution and pooling layers, we denote its spatial extent (i.e., the size of filters) as F i , its stride as S i , its amount of padding as P i , and its number of filters as K i . Please note that if the i th layer is a pooling layer, K i = 0. Thus, the inputs and outputs in the feature extraction part have the following relations:
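As an illustration of how layer sizes propagate, the following minimal Python sketch assumes the commonly used convolution arithmetic, in which each spatial dimension becomes (size − F i + 2 P i )/S i + 1 and the output depth equals K i for a convolution layer or the input depth for a pooling layer. This form is an assumption based on standard practice and the notation above; the function name and the padding value in the example are illustrative only.

```python
def conv_output_shape(b, h, d, f, s, p, k):
    """Return (B_next, H_next, D_next) for one convolution or pooling layer.

    b, h, d: input width, height, and depth (B_i, H_i, D_i)
    f, s, p: filter size F_i, stride S_i, and padding P_i
    k:       number of filters K_i (0 marks a pooling layer, as in the text)

    Assumes the standard output-size arithmetic with integer (floor) division.
    """
    b_next = (b - f + 2 * p) // s + 1
    h_next = (h - f + 2 * p) // s + 1
    d_next = k if k > 0 else d  # a pooling layer keeps the input depth
    return b_next, h_next, d_next


# Illustrative AlexNet-like first layers (padding of 2 is an assumed setting).
conv1 = conv_output_shape(224, 224, 3, f=11, s=4, p=2, k=96)
pool1 = conv_output_shape(*conv1, f=3, s=2, p=0, k=0)
print(conv1, pool1)
```

Chaining the function over all n layers yields every B i × H i × D i needed for the memory estimates that follow.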
The memory allocated for the feature extraction part of CNNs includes the input data, outputs (i.e. feature maps) of all the layers, model parameters, and gradients. We assume that all the values are stored by using single precision floating point (32bits). Based on the aforementioned notations and Equation (1), the memory usage for the input data and outputs of all layers in the feature extraction part can be calculated as follows:
Regarding the model parameters, there are two kinds of parameters: weights and biases. Though the biases are often omitted for simplicity in the literature, we take them into account here in order to estimate the memory usage precisely. In addition, we assume that the size of the gradients is twice the size of the model parameters 4 . Thus, we can derive the memory usage for the model parameters and their related gradients by the following equation:
Furthermore, the memory allocated for the classification part of CNNs contains the outputs of all neurons and model parameters. We use L j , where j ∈ {1, . . . , m}, to denote the number of neurons at the j th layer. Again, we make the same assumption that the size of the gradients is twice the size of the model parameters. Therefore, the memory usage for the classification part of CNNs is as follows:
According to Equations (2) to (4), the memory constraint M bound can be approximately determined by the following equation:
where M GPU is the total memory of a GPU in terms of bits.
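The memory accounting described in this subsection can be sketched in Python as follows. This is a hedged reconstruction from the prose, not the exact Equations (2) to (5): it assumes that feature maps scale linearly with X mini , that each filter has F i × F i × D i weights plus one bias, that gradients take twice the space of the parameters, and that M bound is what remains of M GPU after these allocations. All function names and the layer-list formats are hypothetical.

```python
BITS = 32  # single-precision values, as assumed in the text


def feature_map_bits(shapes, x_mini):
    """Bits for the input and all layer outputs; shapes = [(B_i, H_i, D_i), ...]."""
    return x_mini * sum(b * h * d for b, h, d in shapes) * BITS


def conv_param_bits(layers):
    """Bits for conv weights, biases, and gradients (gradients = 2x parameters).

    layers = [(F_i, D_i, K_i), ...]; pooling layers (K_i = 0) contribute nothing.
    """
    params = sum((f * f * d + 1) * k for f, d, k in layers)
    return 3 * params * BITS


def fc_bits(widths, x_mini):
    """Bits for fully-connected outputs, parameters, and gradients.

    widths = [L_0, L_1, ..., L_m], where L_0 is the flattened feature size
    (an assumed convention for this sketch).
    """
    outputs = x_mini * sum(widths[1:])
    params = sum((n_in + 1) * n_out for n_in, n_out in zip(widths, widths[1:]))
    return (outputs + 3 * params) * BITS


def memory_bound_bits(m_gpu_bits, shapes, conv_layers, fc_widths, x_mini):
    """Memory left for convolution workspace after the fixed allocations."""
    used = (feature_map_bits(shapes, x_mini)
            + conv_param_bits(conv_layers)
            + fc_bits(fc_widths, x_mini))
    return m_gpu_bits - used
```

Under these assumptions only the feature-map and fully-connected output terms grow with X mini , which is why a larger mini-batch shrinks the memory left for faster convolution algorithms.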
Deriving X mini .
Assume that there are p kinds of convolution algorithms and q layers in the CNN. (In the case that we have illustrated so far, p = 2. Other choices of convolution algorithms include the Winograd minimal convolution algorithm [33], the Strassen algorithm [13], fbfft [44], etc.) The parameter x k,l ∈ {0, 1} indicates whether the k th layer uses the l th convolution algorithm: x k,l = 1 means that the k th layer uses the l th algorithm to compute convolution. The value T k,l is the time consumption at the k th layer for the l th algorithm, and the value M k,l is the memory consumption at the k th layer for the l th algorithm. Thus, the problem of determining X mini can be formulated as an optimization problem as follows:
4 For each training instance, we need to store the gradients of all model parameters. The aggregated gradients of all model parameters are also required for a specific batch.
Equation (6) is an integer linear programming (ILP) problem [38], which is NP-hard. However, there are several off-the-shelf heuristic methods and libraries (e.g., GLPK [1]) for solving ILP problems. Given a range of mini-batch sizes that can attain good accuracy, we can derive the estimated training time for each mini-batch size by solving Equation (6). The mini-batch size that leads to the minimal training time is then the suggested X mini .
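The selection procedure can be illustrated with a small Python sketch. It is not the ILP formulation itself: it swaps in exhaustive search, which is only practical for a handful of layers, and it assumes that exactly one algorithm is chosen per layer and that the per-layer memory costs add up against M bound . The function names, the profile format, and all numbers in the usage example are hypothetical.

```python
from itertools import product


def best_assignment(times, mems, m_bound):
    """Pick one algorithm per layer to minimize total time within m_bound.

    times[k][l], mems[k][l]: time and memory of algorithm l at layer k.
    Returns (total_time, choice) or None if no assignment fits in memory.
    """
    best = None
    for choice in product(*(range(len(row)) for row in times)):
        mem = sum(mems[k][l] for k, l in enumerate(choice))
        if mem > m_bound:
            continue  # violates the assumed memory constraint
        total = sum(times[k][l] for k, l in enumerate(choice))
        if best is None or total < best[0]:
            best = (total, choice)
    return best


def suggest_minibatch(profiles, n_samples):
    """Sweep candidate mini-batch sizes and keep the fastest epoch.

    profiles: {x_mini: (times, mems, m_bound)} with per-batch layer times.
    Returns (epoch_time, x_mini, choice) or None if nothing is feasible.
    """
    best = None
    for x_mini, (times, mems, m_bound) in profiles.items():
        solution = best_assignment(times, mems, m_bound)
        if solution is None:
            continue
        batches = -(-n_samples // x_mini)  # ceiling division
        epoch_time = solution[0] * batches
        if best is None or epoch_time < best[0]:
            best = (epoch_time, x_mini, solution[1])
    return best


# Hypothetical profile: algorithm 0 is slow but small, algorithm 1 fast but large.
profiles = {
    64: ([[10, 4], [8, 3]], [[1, 6], [1, 4]], 10),
    128: ([[18, 7], [15, 5]], [[2, 12], [2, 8]], 6),
}
print(suggest_minibatch(profiles, n_samples=1280))
```

In this made-up profile the larger mini-batch leaves too little memory for the fast algorithm at either layer, so the smaller mini-batch yields the shorter epoch, mirroring the throughput drop discussed for Figure 2. For realistic layer counts an ILP solver would replace the exhaustive loop.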
Refining Model for Speed. Thus far, we have assumed that a CNN model is given, and we determine X mini and layer-dependent convolution algorithms to maximize training speed. We can make two further adjustments:
• Permit X mini reduction. Researchers may need to compromise on a smaller mini-batch size if the target one is not feasible or does not deliver acceptable performance under the constraint of GPU memory size. Ghadimi et al. [20] show that the convergence rate of SGD on a non-convex function is bounded by O(1/√K), where K is the number of samples seen, i.e., mini-batch size. It can be interpreted that a range of mini-batch sizes can deliver similar convergence quality. In Figure 3, the x-axis depicts the epoch number and the y-axis depicts the top-5 validation error rate 5 . The figure shows that indeed a range of mini-batch sizes enjoy similar convergence quality. Therefore, we could reduce X mini to increase M bound , freeing memory space to run a faster convolution algorithm and achieve overall speedup.
• Permit model adjustment. Suppose that the constrained memory space prevents us from running a faster algorithm. We could adjust the CNN model to free up some memory. For instance, suppose the i th layer can be sped up ten times and the j th layer only twice. To accommodate running a faster algorithm for the i th layer, we could adjust both layers to, e.g., use a larger stride or memory-efficient filters.
5 AlexNet achieved an 18.2% top-5 error rate in the ILSVRC-2012 competition, whereas we obtained 21% in our experiments. This is because we did not perform all the tricks for data augmentation and fine-tuning. We choose 25% as the termination criterion to demonstrate convergence behavior when mini-batch sizes are different.
Scale with Multiple GPUs
When one GPU cannot complete the training task in a timely manner, employing multiple GPUs is the next logical step to share the workload and achieve speedup. When G GPUs are used and the maximal 100% efficiency is achieved, the speedup is G times. Let α denote the system efficiency between 0% and 100%. Lemma 3.1 provides the estimated efficiency given G GPUs.
Lemma 3.1. Let T denote the total training time, where T can be divided into computation time T C and overhead T O . Let R O denote the ratio of overhead or R O = T O /T C . Suppose the desired efficiency of the system is α, where α ≤ 100%. The efficiency can be estimated as
Proof. Details of the proof are documented in Appendix A.1. □
Lemma 3.1 can be used to estimate system efficiency given R O and G, and can also be used to estimate the acceptable R O given α and G. For example, given four GPUs and target efficiency α = 80%, the ratio of overhead that cannot be hidden behind computation must not exceed 9%.
To estimate R O , a practitioner can quickly profile the training program for a couple of epochs. Some frameworks such as MXNet and TensorFlow provide the capability to visualize the execution of a training task, which can be used to derive R O . If a computation framework is not equipped with a profiling tool, one can visualize program execution using nvprof 6 . Suppose a practitioner is asked to achieve a 3x speedup of a training task, and she measures R O = 10%. According to the lemma, she can configure a 4-GPU system to achieve the performance objective.
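A minimal Python sketch of how Lemma 3.1 can be applied is given here. It assumes an Amdahl-style model in which only T C is divided among the G GPUs while T O stays fixed, which gives α = (1 + R O )/(1 + G × R O ); this assumed form reproduces both worked examples in the text (about 9% for α = 80% with four GPUs, and four GPUs for a 3x speedup at R O = 10%). The function names and the search limit g_max are hypothetical.

```python
def efficiency(r_o, g):
    """Estimated efficiency alpha for overhead ratio r_o = T_O / T_C and g GPUs.

    Assumed model: T(g) = T_C / g + T_O, so alpha = (1 + r_o) / (1 + g * r_o).
    """
    return (1.0 + r_o) / (1.0 + g * r_o)


def speedup(r_o, g):
    """Estimated speedup over a single GPU: g times the efficiency."""
    return g * efficiency(r_o, g)


def max_overhead_ratio(alpha, g):
    """Largest r_o that still reaches efficiency alpha with g GPUs.

    Obtained by solving alpha = (1 + r_o) / (1 + g * r_o) for r_o.
    """
    if alpha * g <= 1.0:
        raise ValueError("any overhead ratio satisfies this target")
    return (1.0 - alpha) / (alpha * g - 1.0)


def min_gpus(r_o, target_speedup, g_max=1024):
    """Smallest g whose estimated speedup meets the target, or None."""
    for g in range(1, g_max + 1):
        if speedup(r_o, g) >= target_speedup:
            return g
    return None  # target exceeds the (1 + r_o) / r_o ceiling or g_max


print(round(max_overhead_ratio(0.80, 4), 3))  # overhead budget for 80% on 4 GPUs
print(min_gpus(0.10, 3.0))                    # GPUs needed for 3x at R_O = 10%
```

Under this model the speedup can never exceed (1 + R O )/R O , so `min_gpus` returns `None` for targets beyond that ceiling, which signals that overheads must be reduced before adding hardware.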
To evaluate Lemma 3.1, we conduct training on four neural networks to compare the estimated speedup with the actual speedup. Though the estimated R O is a constant and real overheads can be stochastic, Figure 4 shows that in all cases the estimated speedup matches the actual speedup. Therefore, the lemma can be used to estimate the performance gain of using G GPUs and to devise a cost-effective training plan including system configuration and parameter settings.
The overall speedup can be improved by reducing computation overheads. We conclude this subsection by providing two overhead reduction suggestions.
• Data transfer pipelining. Low throughput of feeding training data is a major bottleneck that degrades multi-GPU training performance, as the demand for bus bandwidth for loading data grows with the number of GPUs. Pipelining data loading (I/O) with computation is an effective way to reduce the overhead brought by data preparation. The impact of disk I/O can be further alleviated by using faster disks or reducing expensive file operations such as seeks. Modern frameworks such as TensorFlow and MXNet provide ways to rearrange training samples so that the data can be read sequentially. The load of decoding and augmenting training data may cause extremely high CPU usage and drag down the performance of data provision. Other computation-intensive jobs should be avoided on CPUs.
6 nvprof only profiles GPU activities, so CPU activities cannot be analyzed.
• Peer-to-peer parameter updates. Synchronizing parameter updates among GPUs, as indicated in step 6 in Figure 1, is another common bottleneck in a multi-GPU training environment. A naive implementation keeps the latest model in main memory, transfers the latest copy to GPUs at the beginning of batch processing, and aggregates updates from all GPUs. This leads to bus contention and a huge data load between main memory and GPUs under the CUDA programming model. To alleviate the hot spot issue, the weight updates can be completed via GPU high-speed DMA if the GPUs support peer-to-peer transfer.
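The data transfer pipelining suggestion can be sketched with a generic producer-consumer loader in Python. This is an illustration of the idea only, not the mechanism used inside MXNet or TensorFlow; `prefetching_loader`, `load_batch`, and `depth` are hypothetical names, and the sketch assumes that loading is I/O-bound so that a background thread can overlap with computation.

```python
import queue
import threading


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads the next ones.

    load_batch(i) is a user-supplied function that reads and prepares batch i.
    depth bounds how many prepared batches may wait in memory at once.
    """
    buf = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for i in range(num_batches):
                buf.put(load_batch(i))  # blocks while the buffer is full
        finally:
            buf.put(sentinel)  # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item


# Usage: the loop body stands in for the computation step on each batch.
for batch in prefetching_loader(lambda i: [i] * 4, num_batches=3):
    print(sum(batch))
```

While the loop body processes one batch, the producer thread is already loading the next, so loading time is hidden whenever it is shorter than the computation time per batch.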
If multiple GPUs with low computing overhead still cannot meet the desired performance, distributed training is the next option to consider. We discuss this topic in the next section.
Distributed Training
Distributed training has become increasingly important because of the growth of dataset size and model complexity. To effectively orchestrate multiple machines for a training task, the system must provide a way to manage the globally shared model parameters. The parameter server architecture, i.e., a cluster of machines to manage parameters, is widely-used to reduce I/O latency for handling parameter updates [35, 36] . As shown in Figure 1 , parameter servers maintain latest parameter values and serve all workers. The workers retrieve updated parameters from the cluster, complete computation, and then push updates back to the cluster of parameter servers.
Parameter updates can be performed either synchronously or asynchronously. Synchronous updates ensure consistency but suffer from the straggler problem, in which the slowest worker drags down overall performance. Updating parameters asynchronously gains training speed and, according to prior studies [15], may not significantly affect training accuracy. When I/Os are performed asynchronously, fetching and updating parameters can be hidden behind computation, and hence communication overhead can be mitigated. We assume that an asynchronous update policy is employed.
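As a rough illustration of the pull, compute, and push cycle under an asynchronous policy, consider the following toy sketch. It makes several assumptions that are not in the text: threads in one process stand in for worker machines, a single ParameterServer object stands in for the cluster, and the objective is a simple quadratic whose gradient is w - t. All names are hypothetical.

```python
import threading


class ParameterServer:
    """Toy in-process parameter server holding one shared weight vector."""

    def __init__(self, size):
        self.weights = [0.0] * size
        self._lock = threading.Lock()

    def pull(self):
        """Return a copy of the latest parameters."""
        with self._lock:
            return list(self.weights)

    def push(self, gradient, lr=0.1):
        """Apply an update as soon as it arrives, without waiting for other workers."""
        with self._lock:
            self.weights = [w - lr * g for w, g in zip(self.weights, gradient)]


def worker(server, target, rounds):
    for _ in range(rounds):
        weights = server.pull()  # may already be stale by the time it is used
        # Gradient of the assumed objective 0.5 * (w - t)^2 per coordinate.
        gradient = [w - t for w, t in zip(weights, target)]
        server.push(gradient)


def train(num_workers=4, rounds=200):
    """Run asynchronous workers against one server and return the final weights."""
    server = ParameterServer(size=2)
    target = [1.0, -2.0]
    threads = [
        threading.Thread(target=worker, args=(server, target, rounds))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return server.pull()
```

In this sketch no worker ever waits for another, which is the property that avoids stragglers; the price is that a gradient may be computed from parameters that other workers have since changed.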
Let $N_{ps}$ denote the number of parameter servers. How many parameter servers should be configured to hide the communication overhead? We select the value of $N_{ps}$ beyond which adding one more server ($N_{ps} + 1$) no longer speeds up the training task. Before we prove our lemma that derives the most effective $N_{ps}$, we enumerate two desired subgoals or conditions.
The first subgoal is that the computation duration of a worker should be longer than its communication time with the parameter cluster. In other words, the I/O time between a worker thread and its designated parameter servers should be shorter than the computation time of that worker. This condition allows parameters to be prefetched before a new round of computation commences, so the I/O overhead can be hidden behind computation. The second subgoal is to distribute the parameter-update workload evenly among parameter servers. We assume that a dynamic load-balancing policy (e.g., [7]) can be employed to distribute parameter retrieval and update workload almost evenly among the $N_{ps}$ servers.
Lemma 3.2. Given the GPU computation time $T_C$ of one round on a worker, the number of workers $N_w$, the network bandwidth $B_{ps}$, and the parameter size $S_p$, the minimum number of parameter servers $N_{ps}$ required to mask communication I/O is the smallest integer satisfying
\[ N_{ps} \geq \frac{2 S_p N_w}{B_{ps} T_C}. \]
Proof. The total communication I/O load generated in one round of pulling from and pushing to the parameter servers is $2 \times S_p \times N_w$. Given that the network bandwidth is $B_{ps}$ per server and the load is evenly distributed among $N_{ps}$ servers, the communication time can be written as $\frac{2 S_p N_w}{N_{ps} B_{ps}}$. The ideal pipeline case [36] is when the I/O time can be hidden behind computation time. Therefore, the I/O time must be smaller than or equal to the computation time $T_C$. (The parameter update time on a parameter server is ignored because it is relatively small compared with the network transmission time.) We can write the constraint as
\[ \frac{2 S_p N_w}{N_{ps} B_{ps}} \leq T_C. \]
Isolating $N_{ps}$ on the left-hand side of the inequality, we obtain
\[ N_{ps} \geq \frac{2 S_p N_w}{B_{ps} T_C}. \]
□
Lemma 3.2 suggests a back-of-the-envelope estimate of $N_{ps}$ under two ideal conditions. When the conditions do not hold, more parameter servers should be employed to mask the I/O overhead. Three measures are recommended:
(1) Increase $T_C$. When the workload cannot be evenly distributed, the computation time should be longer to mask most I/Os. Therefore, a good strategy is to maintain a large $T_C$. In other words, using a larger mini-batch size when memory capacity permits is helpful. In addition, a larger mini-batch leads to fewer parameter updates and improves overall performance.
(2) Improve $B_{ps}$. Increasing network bandwidth reduces I/O time, whereas insufficient bandwidth in the communication channel may throttle training performance. Take AlexNet as an example: pushing parameter updates produces around 180 MB of network traffic, more than commonly used 1 Gbit Ethernet (at most 125 MB/s) can carry in one second. Thus, high-speed networking is highly recommended for distributed training.
(3) Balance workload. Prior works [7, 36] propose effective data placement methods to balance dynamic workload. Such load-balancing schemes can avoid I/O bottlenecks and reduce overall overhead.
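The estimate in Lemma 3.2 and the effect of the first two measures can be checked numerically with a short sketch. The function names and all input values below are hypothetical: the roughly 180 MB AlexNet figure is treated as the parameter size purely for illustration, and the 4 workers and 2-second computation round are assumptions, not measurements.

```python
import math


def min_parameter_servers(t_c, n_w, b_ps, s_p):
    """Smallest integer N_ps with 2 * S_p * N_w / (N_ps * B_ps) <= T_C.

    t_c:  computation time of one round on a worker, in seconds
    n_w:  number of workers
    b_ps: network bandwidth per parameter server, in bytes per second
    s_p:  parameter size, in bytes
    """
    if min(t_c, n_w, b_ps, s_p) <= 0:
        raise ValueError("all inputs must be positive")
    return max(1, math.ceil(2 * s_p * n_w / (b_ps * t_c)))


def communication_time(n_ps, n_w, b_ps, s_p):
    """Time for one round of pull and push, assuming evenly balanced load."""
    return 2 * s_p * n_w / (n_ps * b_ps)


# Hypothetical inputs (assumptions for illustration only).
S_P = 180_000_000          # bytes
N_W = 4                    # workers
GIGABIT = 125_000_000      # bytes per second on 1 Gbit Ethernet
TEN_GIGABIT = 1_250_000_000

baseline = min_parameter_servers(2.0, N_W, GIGABIT, S_P)
longer_round = min_parameter_servers(4.0, N_W, GIGABIT, S_P)      # measure (1)
faster_net = min_parameter_servers(2.0, N_W, TEN_GIGABIT, S_P)    # measure (2)
```

Under these assumed inputs, doubling $T_C$ or moving to a faster network both lower the required number of servers, which matches the direction of the first two measures.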
CONCLUDING REMARKS
In this work, we investigated typical deep learning frameworks running on representative deep learning models and datasets. From our analyses, we studied the computation bottlenecks in single-GPU, multi-GPU, and distributed configurations. Furthermore, we derived a back-of-the-envelope estimate of the number of GPUs needed to configure a training system, given a budget or deadline. Finally, for distributed training, we suggested a formula for estimating the number of parameter servers to be configured to reduce communication overhead.
AlphaGo showed that more training data can only be helpful in improving machine intelligence and competitiveness. Recently, Residual Neural Networks [24, 43] showed that, in both theory and practice, more layers of neural networks correlate with higher accuracy achieved by a trained classifier. At a 2016 machine learning workshop [39], Andrew Ng observed that the traditional bias-variance tradeoff has not appeared in training large-scale deep architectures. In other words, the larger the scale, the better suited the architecture is for improving the intelligence of a "machine".
This "larger the better" conclusion certainly demands that database and machine learning communities devise data management and data mining systems that can handle an ever increasing workload. We foresee that not only will algorithmic research continue flourishing, but system research and development will as well.
We have already seen that GPU vendors are enhancing distributed GPU implementations. Advances in interconnect technology and implementation will help reduce I/O overhead in both data loading and parameter updates.
In this work, we provided practical guidelines that help practitioners configure a system to speed up training. Our future work will focus on effectively managing such large-scale training systems to achieve both high accuracy and cost-effectiveness in three specific areas:
• Flexibility. Prior work [47] provided the flexibility to work with any compatible open-source framework. For example, we expect to work simultaneously with multiple frameworks such as MXNet and TensorFlow to complete a large-scale training task running on Azure, AWS, GCE, and other available commercial clouds.
• Scalability and elasticity. In addition to the parameter estimation performed in this work, we will research dynamic schemes that adjust allocation and scheduling parameters according to the dynamic nature of workloads in distributed systems.
• Ease of management. We plan to devise tools with a good user experience for monitoring and managing the training system.
A APPENDICES
A.1 Proof of Lemma 3.1
According to Amdahl's law, given $G$ GPUs and the fraction $P$ of the task's execution time that can be parallelized, the theoretical speedup is $\frac{1}{(1 - P) + \frac{P}{G}}$. The maximum speedup $G$ cannot be achieved if some parts of the task cannot be parallelized. Thus:
\[ \alpha G = \frac{1}{(1 - P) + \frac{P}{G}} \tag{9} \]
$P$ can be expressed as:
Substituting P into equation 9 yields:
Then:
By rearranging equation 12, $\alpha$ can be expressed in terms of $G$ and $R_O$ as follows:

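As a numerical companion to the starting point of this proof, the sketch below evaluates Amdahl's law and the factor $\alpha$ defined by equation 9, that is, the achieved speedup divided by the ideal speedup $G$. It works directly with the parallelizable fraction $P$ and does not reproduce the later steps of the derivation; the function names are hypothetical.

```python
def amdahl_speedup(p, g):
    """Theoretical speedup with g GPUs when fraction p of the run time is parallelizable."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if g < 1:
        raise ValueError("g must be at least 1")
    return 1.0 / ((1.0 - p) + p / g)


def efficiency(p, g):
    """Alpha in equation 9: theoretical speedup divided by the ideal speedup g."""
    return amdahl_speedup(p, g) / g


def efficiency_table(p, gpu_counts):
    """Map each GPU count to (speedup, alpha) for a fixed parallel fraction p."""
    return {g: (amdahl_speedup(p, g), efficiency(p, g)) for g in gpu_counts}
```

For example, calling efficiency_table(0.9, [1, 2, 4, 8]) shows $\alpha$ shrinking as GPUs are added whenever part of the task remains serial.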