2 research outputs found

    Improving the performance of dataflow systems for deep neural network training

    Deep neural networks (DNNs) have led to significant advancements in machine learning. With deep structure and flexible model parameterisation, they exhibit state-of-the-art accuracies for many complex tasks, e.g. image recognition. To achieve this, models are trained iteratively over large datasets. This process involves expensive matrix operations, making it time-consuming to obtain converged models. To accelerate training, dataflow systems parallelise computation. A scalable approach is to use the parameter server framework: it has workers that train model replicas in parallel and parameter servers that synchronise the replicas to ensure convergence. With distributed DNN systems, there are three challenges that determine the training completion time. In this thesis, we propose practical and effective techniques to address each of these challenges. Since frequent model synchronisation results in high network utilisation, the parameter server approach can suffer from network bottlenecks, thus requiring decisions on resource allocation. Our idea is to use all available network bandwidth and synchronise subject to the available bandwidth. We present Ako, a DNN system that uses partial gradient exchange for synchronising replicas in a peer-to-peer fashion. We show that our technique exhibits a 25% lower convergence time than hand-tuned parameter-server deployments. For long training runs, the compute efficiency of worker nodes is important. We argue that processing hardware should be fully utilised for the best speed-up. The key observation is that it is possible to overlap the execution of several matrix operations with other workloads. We describe Crossbow, a GPU-based system that maximises hardware utilisation. By using a multi-streaming scheduler, multiple models are trained in parallel on a GPU and achieve a 2.3x speed-up compared to a state-of-the-art system. The choice of model configuration for replicas also directly determines convergence quality. Dataflow systems are used for exploring promising configurations but provide little support for efficient exploratory workflows. We present Meta-dataflow (MDF), a dataflow model that expresses complex workflows. By treating all configurations as a unified workflow, MDFs reduce the time spent on configuration exploration.
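
The parameter server framework described in the abstract (workers train replicas in parallel, a server synchronises them) can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy linear model, the function names `gradient`, `parameter_server_step` and `train`, and the learning rate are all assumptions chosen for illustration, and the workers run sequentially here rather than in parallel.

```python
# Toy sketch of synchronous parameter-server training (all names hypothetical).
# Model: y = w * x with squared-error loss; each worker holds one data shard.

def gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def parameter_server_step(w, shards, lr):
    """One synchronous round: every worker computes a gradient on its
    replica of w, then the server averages them and updates w."""
    grads = [gradient(w, shard) for shard in shards]  # parallel in a real system
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

def train(shards, rounds=200, lr=0.05):
    """Run a fixed number of synchronisation rounds starting from w = 0."""
    w = 0.0
    for _ in range(rounds):
        w = parameter_server_step(w, shards, lr)
    return w

# Two workers, each with a shard drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

Every round requires each worker to send a full gradient to the server, which is the synchronisation traffic the abstract identifies as a potential network bottleneck.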

    Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

    Distributed deep learning has become very common as a way to reduce overall training time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the size of deep models and data sets increases. However, data communication between computing devices can be a bottleneck that limits system scalability. How to address the communication problem in distributed deep learning has recently become a hot research topic. In this paper, we provide a comprehensive survey of communication-efficient distributed training algorithms, covering both system-level and algorithmic-level optimizations. At the system level, we demystify the system design and implementation choices that reduce the communication cost. At the algorithmic level, we compare different algorithms in terms of theoretical convergence bounds and communication complexity. Specifically, we first propose a taxonomy of data-parallel distributed training algorithms, which contains four main dimensions: communication synchronization, system architectures, compression techniques, and parallelism of communication and computing. Then we discuss the studies that address the problems of the four dimensions and compare their communication cost. We further compare the convergence rates of different algorithms, which tells us how fast the algorithms converge to the solution in terms of iterations. Based on the system-level communication cost analysis and the theoretical convergence speed comparison, we help readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimizations.
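
The abstract lists compression techniques as one of the four dimensions without naming a specific method. As a rough illustration only, the sketch below uses top-k gradient sparsification, a commonly used compression scheme picked here as an assumption; the function name `top_k_sparsify` and the example values are hypothetical and not taken from the survey.

```python
# Sketch of top-k gradient sparsification (an assumed example of compression).

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad.

    Returns (message, residual): message maps index -> value for the entries
    that would be sent; residual holds the unsent entries, which a worker
    could add to its next gradient so that no update is permanently lost.
    """
    by_magnitude = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = set(by_magnitude[:k])
    message = {i: grad[i] for i in sorted(kept)}
    residual = [0.0 if i in kept else g for i, g in enumerate(grad)]
    return message, residual

grad = [0.1, -2.0, 0.05, 1.5, -0.3]
message, residual = top_k_sparsify(grad, k=2)
```

Only `k` index-value pairs are transmitted instead of the full gradient, trading communication volume against the accuracy of each individual update.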