
    Parity Models: A General Framework for Coding-Based Resilience in ML Inference

    Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches to reducing tail latency are inadequate for the latency targets of prediction serving, incur high resource overhead, or are inapplicable to the computations performed during inference. We present ParM, a novel, general framework that uses ideas from erasure coding and machine learning to achieve low-latency, resource-efficient resilience to slowdowns and failures in prediction serving systems. ParM encodes multiple queries together into a single parity query and performs inference on the parity query using a parity model. A decoder uses the output of the parity model to reconstruct approximations of unavailable predictions. ParM uses neural networks to learn parity models that enable simple, fast encoders and decoders to reconstruct unavailable predictions for a variety of inference tasks such as image classification, speech recognition, and object localization. We build ParM atop an open-source prediction serving system and through extensive evaluation show that ParM improves overall accuracy in the face of unavailability with low latency while using 2-4× less additional resources than replication-based approaches. ParM reduces the gap between 99.9th percentile and median latency by up to 3.5× compared to approaches that use an equal amount of resources, while maintaining the same median latency. Comment: This paper is superseded by the ACM SOSP 2019 paper "Parity Models: Erasure-Coded Resilience for Prediction Serving Systems".
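
    As a concrete illustration of the coding idea, the sketch below uses an additive encoder and a subtractive decoder (an assumption about the simple encoder/decoder pair the abstract refers to) with a linear toy model standing in for the learned parity model; all names are illustrative.

    import numpy as np

    def encode(queries):
        # Additive encoder (assumption): combine k queries into one parity query.
        return np.sum(queries, axis=0)

    def decode(parity_pred, available_preds):
        # Subtractive decoder (assumption): approximate the missing prediction as
        # parity_model(parity_query) minus the predictions that did arrive.
        return parity_pred - np.sum(available_preds, axis=0)

    # Toy example with a linear "model"; in the real system the parity model is a
    # neural network trained so that parity_model(encode(X)) approximates the sum
    # of the base model's predictions, so the decoder yields an approximation.
    W = np.random.randn(4, 3)
    model = lambda x: x @ W          # base model
    parity_model = lambda x: x @ W   # exact for a linear toy model

    queries = [np.random.randn(4) for _ in range(3)]
    preds = [model(q) for q in queries]

    parity_pred = parity_model(encode(queries))
    # Suppose the prediction for queries[1] is unavailable (slow or failed worker):
    approx = decode(parity_pred, [preds[0], preds[2]])
    print(np.allclose(approx, preds[1]))  # True in this linear toy case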

    Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

    In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but the weight update computation is repeated on all replicas, because the weights do not have a batch dimension to partition. This can be a bottleneck for performance and scalability in typical language models with large weights, and in models with small per-replica batch sizes, as is typical in large-scale training. This paper presents an approach to automatically shard the weight update computation across replicas with efficient communication primitives and data formatting, using static analysis and transformations on the training computation graph. We show that this technique achieves substantial speedups on typical image and language models on Cloud TPUs, requiring no change to model code. This technique helps close the gap between traditionally expensive (ADAM) and cheap (SGD) optimizers, as they will take only a small part of the training step time and have similar peak memory usage. It helped us achieve state-of-the-art training performance in Google's MLPerf 0.6 submission. Comment: 12 pages, 23 figures, 1 table
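
    The following sketch simulates on one host what the sharded update does across replicas: gradients are reduce-scattered, each replica applies the optimizer to only its own shard of the weights, and the updated shards are all-gathered. The plain-SGD update and shard layout are illustrative assumptions, not the paper's exact graph transformation.

    import numpy as np

    def sharded_weight_update(weights, per_replica_grads, lr=0.1):
        # Instead of all-reducing full gradients and repeating the update on every
        # replica, reduce-scatter the gradients, update one shard per replica, and
        # all-gather the updated shards back into full weights.
        num_replicas = len(per_replica_grads)
        grad_shards = [np.array_split(g, num_replicas) for g in per_replica_grads]
        weight_shards = np.array_split(weights, num_replicas)

        new_shards = []
        for r in range(num_replicas):
            # Reduce-scatter: replica r receives only the r-th shard, summed.
            reduced = sum(grad_shards[s][r] for s in range(num_replicas))
            # Each replica applies the optimizer to 1/num_replicas of the weights.
            new_shards.append(weight_shards[r] - lr * reduced / num_replicas)
        # All-gather: every replica reassembles the full updated weights.
        return np.concatenate(new_shards)

    weights = np.random.randn(8)
    grads = [np.random.randn(8) for _ in range(4)]  # one gradient per replica
    print(sharded_weight_update(weights, grads))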

    Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce

    Training large deep learning models inevitably requires a large-scale cluster equipped with accelerators. Deep gradient compression can greatly improve bandwidth utilization and speed up training, but it is hard to implement on a ring structure. In this paper, we find that redundant gradients and gradient staleness have a negative effect on training. We observe that in different epochs and at different steps, the neural network focuses on updating different layers and different parameters. In order to save communication bandwidth and preserve accuracy on a ring structure, which breaks this restriction as the number of nodes increases, we propose a new algorithm to measure the importance of gradients on a large-scale cluster implementing ring all-reduce, based on the ratio of a parameter's computed gradient to the parameter's value. Our importance-weighted pruning approach achieves gradient compression ratios of 64x and 58.8x on AlexNet and ResNet50 on ImageNet. Meanwhile, in order to maintain the sparseness of the gradient propagation, each node randomly broadcasts the indices of its important gradients, while the remaining nodes gather the indexed gradients and perform the all-reduce update. This speeds up the convergence of the model and preserves the training accuracy.
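
    A minimal sketch of the importance measure described above: each parameter is scored by the ratio of its gradient magnitude to its value, and only the top fraction of gradients (plus their indices) would be exchanged over the ring. The keep ratio and the epsilon guard are illustrative assumptions.

    import numpy as np

    def select_important_gradients(params, grads, keep_ratio=0.01):
        # Score each parameter by |gradient| / |value| and keep the top-k entries;
        # only these gradients and their indices are communicated.
        importance = np.abs(grads) / (np.abs(params) + 1e-12)
        k = max(1, int(keep_ratio * grads.size))
        idx = np.argpartition(importance, -k)[-k:]
        return idx, grads[idx]

    params = np.random.randn(10000)
    grads = 0.01 * np.random.randn(10000)
    idx, sparse_grads = select_important_gradients(params, grads)
    # Only (idx, sparse_grads) would be exchanged in the ring all-reduce,
    # roughly 1% of the dense gradient in this toy setting.
    print(len(idx), grads.size)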

    Slim-DP: A Light Communication Data Parallelism for DNN

    Data parallelism has emerged as a necessary technique to accelerate the training of deep neural networks (DNNs). In a typical data parallelism approach, the local workers push the latest updates of all the parameters to the parameter server and periodically pull all merged parameters back. However, with the increasing size of DNN models and the large number of workers in practice, this typical data parallelism cannot achieve satisfactory training acceleration, since it usually suffers from heavy communication cost due to transferring a huge amount of information between workers and the parameter server. In-depth studies of DNNs have revealed that they are usually highly redundant: deleting a considerable proportion of the parameters does not significantly degrade model performance. This redundancy exposes a great opportunity to reduce the communication cost by transferring only the information of the significant parameters during parallel training. However, if we only transfer information about the parameters that are significant in the latest snapshot, we may miss parameters that are insignificant now but have the potential to become significant as training goes on. To this end, we design an Explore-Exploit framework to dynamically choose the subset to be communicated, which comprises the significant parameters in the latest snapshot together with a randomly explored set of other parameters. We propose to measure the significance of a parameter by the combination of its magnitude and gradient. Our experimental results demonstrate that the proposed Slim-DP achieves better training acceleration than standard data parallelism and its communication-efficient variant by saving communication time without loss of accuracy.
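
    A minimal sketch of the Explore-Exploit selection: the transferred subset combines the currently most significant parameters with a small, randomly explored set of the rest. The significance score (product of magnitude and gradient magnitude) and the subset sizes below are illustrative assumptions, not the paper's exact settings.

    import numpy as np

    def choose_transfer_set(params, grads, exploit_frac=0.05, explore_frac=0.01):
        n = params.size
        # Exploit: parameters that look significant in the latest snapshot.
        significance = np.abs(params) * np.abs(grads)
        k_exploit = int(exploit_frac * n)
        exploit = np.argpartition(significance, -k_exploit)[-k_exploit:]

        # Explore: a random sample of the remaining parameters, so that ones
        # that may become significant later still get communicated occasionally.
        rest = np.setdiff1d(np.arange(n), exploit)
        k_explore = int(explore_frac * n)
        explore = np.random.choice(rest, size=k_explore, replace=False)

        return np.concatenate([exploit, explore])

    params = np.random.randn(100000)
    grads = np.random.randn(100000)
    subset = choose_transfer_set(params, grads)
    print(subset.size, "of", params.size, "parameters communicated this round")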

    Data Management in Industry 4.0: State of the Art and Open Challenges

    Information and communication technologies are permeating all aspects of industrial and manufacturing systems, expediting the generation of large volumes of industrial data. This article surveys the recent literature on data management as it applies to networked industrial environments and identifies several open research challenges for the future. First, we extract important data properties (volume, variety, traffic, criticality) and identify the corresponding data-enabling technologies of diverse fundamental industrial use cases, based on practical applications. Second, we provide a detailed outline of recent industrial architectural designs with respect to their data management philosophy (data presence, data coordination, data computation) and the extent of their distributiveness. Then, we conduct a holistic survey of the recent literature, from which we derive a taxonomy of the latest advances in industrial data-enabling technologies and data-centric services, spanning all the way from the field level deep in the physical deployments up to the cloud and applications level. Finally, motivated by the rich conclusions of this critical analysis, we identify interesting open challenges for future research. The concepts presented in this article thematically cover the largest part of the industrial automation pyramid layers. Our approach is multidisciplinary, as the selected publications were drawn from two fields: the communications, networking and computation field, as well as the industrial, manufacturing and automation field. The article can help readers understand in depth how data management is currently applied in networked industrial environments, and select interesting open research opportunities to pursue.

    Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing

    With the breakthroughs in deep learning, recent years have witnessed a booming of artificial intelligence (AI) applications and services, spanning from personal assistants to recommendation systems to video/audio surveillance. More recently, with the proliferation of mobile computing and the Internet of Things (IoT), billions of mobile and IoT devices are connected to the Internet, generating zillions of bytes of data at the network edge. Driven by this trend, there is an urgent need to push the AI frontiers to the network edge so as to fully unleash the potential of edge big data. To meet this demand, edge computing, an emerging paradigm that pushes computing tasks and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new interdiscipline, edge AI or edge intelligence, is beginning to receive a tremendous amount of interest. However, research on edge intelligence is still in its infancy, and a dedicated venue for exchanging the recent advances of edge intelligence is highly desired by both the computer systems and artificial intelligence communities. To this end, we conduct a comprehensive survey of the recent research efforts on edge intelligence. Specifically, we first review the background and motivation for artificial intelligence running at the network edge. We then provide an overview of the overarching architectures, frameworks and emerging key technologies for deep learning model training and inference at the network edge. Finally, we discuss future research opportunities on edge intelligence. We believe that this survey will elicit escalating attention, stimulate fruitful discussions and inspire further research ideas on edge intelligence. Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang, "Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing," Proceedings of the IEEE

    Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification

    MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming increasingly popular. Hosting pre-trained machine learning models in the cloud enables elastic scalability as demand grows, but providing low latency and reducing latency variance is a key requirement. Variance is harder to control in a cloud deployment due to uncertainties in resource allocation across many virtual instances. We propose the collage inference technique, which uses a novel convolutional neural network model, the collage-cnn, to provide low-cost redundancy. A collage-cnn model takes a collage image formed by combining multiple images and performs multi-image classification in one shot, albeit at slightly lower accuracy. We augment a collection of traditional single-image classifier models with a single collage-cnn classifier which acts as their low-cost redundant backup. The collage-cnn provides backup classification results if any single-image classification request experiences a slowdown. Deploying the collage-cnn models in the cloud, we demonstrate that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication-based approaches while providing high accuracy. Variation in inference latency can be reduced by 1.8x to 15x. Comment: 10 pages, Under submission
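
    A minimal sketch of how the collage path could work: single-image requests are tiled into one collage that is sent to the backup collage-cnn, and its per-slot predictions fill in for any primary request that is late. The grid size, plain tiling, and helper names are illustrative assumptions.

    import numpy as np

    def make_collage(images, grid=(2, 2)):
        # Tile single-image requests into one collage image that a single
        # collage-cnn replica can classify in one shot as a redundant backup.
        rows, cols = grid
        h, w, c = images[0].shape
        collage = np.zeros((rows * h, cols * w, c), dtype=images[0].dtype)
        for i, img in enumerate(images[:rows * cols]):
            r, col = divmod(i, cols)
            collage[r * h:(r + 1) * h, col * w:(col + 1) * w] = img
        return collage

    def predict_with_backup(single_preds, collage_preds):
        # Use the primary single-image prediction when it arrived in time,
        # otherwise fall back to the collage-cnn's prediction for that slot.
        return [p if p is not None else b for p, b in zip(single_preds, collage_preds)]

    imgs = [np.random.rand(224, 224, 3) for _ in range(4)]
    collage = make_collage(imgs)                     # shape (448, 448, 3)
    single_preds = ["cat", None, "car", "dog"]       # request 1 timed out
    collage_preds = ["cat", "ship", "car", "dog"]    # per-slot backup labels
    print(predict_with_backup(single_preds, collage_preds))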

    Slack Squeeze Coded Computing for Adaptive Straggler Mitigation

    While performing distributed computations in today's cloud-based platforms, execution speed variations among compute nodes can significantly reduce performance and create bottlenecks such as stragglers. Coded computation techniques leverage coding theory to inject computational redundancy and mitigate stragglers in distributed computations. In this paper, we propose a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S²C²). S²C² squeezes the compute slack (i.e., overhead) that is built into coded computing frameworks by efficiently assigning work to all fast and slow nodes according to their speeds, without needing to redistribute data. We implement an LSTM-based speed prediction algorithm to predict the speeds of compute nodes. We evaluate S²C² on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms. We demonstrate a 19% to 39% reduction in total computation latency using S²C² compared to job replication and coded computation. We further show how S²C² can be applied beyond matrix-vector multiplication. Comment: 13 pages, SC 201
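
    A minimal sketch of the load-skewing step: rows of a coded matrix-vector multiply are assigned to workers in proportion to their predicted speeds rather than evenly, which is the slack being squeezed. The rounding scheme and the example speeds are illustrative assumptions; the LSTM speed predictor is not shown.

    import numpy as np

    def assign_rows(total_rows, predicted_speeds):
        # Split the rows of a coded matrix-vector multiply across workers in
        # proportion to their predicted speeds, instead of giving every worker
        # the same conservative share.
        speeds = np.asarray(predicted_speeds, dtype=float)
        shares = speeds / speeds.sum()
        rows = np.floor(shares * total_rows).astype(int)
        rows[np.argmax(speeds)] += total_rows - rows.sum()  # remainder to the fastest
        return rows

    # e.g. 1200 coded rows over 4 workers whose relative speeds a predictor rated
    # (these numbers are made up for the example):
    print(assign_rows(1200, [1.0, 0.9, 0.5, 0.6]))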

    Private Machine Learning in TensorFlow using Secure Computation

    We present a framework for experimenting with secure multi-party computation directly in TensorFlow. By doing so we benefit from several properties valuable to both researchers and practitioners, including tight integration with ordinary machine learning processes, existing optimizations for distributed computation in TensorFlow, high-level abstractions for expressing complex algorithms and protocols, and an expanded set of familiar tooling. We give an open-source implementation of a state-of-the-art protocol and report on concrete benchmarks using typical models from private machine learning.
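
    The framework itself is not shown here; instead, the sketch below illustrates additive secret sharing, the kind of multi-party computation primitive such a framework builds on. The field modulus and helper names are assumptions made for illustration and are not the framework's actual API.

    import numpy as np

    MODULUS = 2 ** 61 - 1  # field modulus; an arbitrary choice for this sketch

    def share(x):
        # Additively secret-share an integer vector between two parties:
        # neither share reveals anything about x on its own.
        r = np.random.randint(0, MODULUS, size=x.shape, dtype=np.int64)
        return r, (x - r) % MODULUS

    def reconstruct(s0, s1):
        return (s0 + s1) % MODULUS

    x = np.array([3, 1, 4], dtype=np.int64)
    w = np.array([2, 7, 1], dtype=np.int64)
    x0, x1 = share(x)   # party 0 holds x0, party 1 holds x1
    w0, w1 = share(w)

    # Linear operations can be done locally on shares; e.g. a shared addition:
    s0, s1 = (x0 + w0) % MODULUS, (x1 + w1) % MODULUS
    print(reconstruct(s0, s1), x + w)  # both print [5 8 5]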

    Speeding Up Distributed Machine Learning Using Codes

    Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems there are several types of noise that can affect the performance of distributed machine learning algorithms -- straggler nodes, system failures, or communication bottlenecks -- but there has been little interaction cutting across codes, machine learning, and distributed systems. In this work, we provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, and show that if the number of homogeneous workers is n, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of log n. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction α of the data matrix can be cached at each worker, and n is the number of workers, coded shuffling reduces the communication cost by a factor of (α + 1/n)γ(n) compared to uncoded shuffling, where γ(n) is the ratio of the cost of unicasting n messages to n users to that of multicasting a common message (of the same size) to n users. For instance, γ(n) ≈ n if multicasting a message to n users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of our coded algorithms. Comment: This work is published in IEEE Transactions on Information Theory and presented in part at the NIPS 2015 Workshop on Machine Learning Systems and at IEEE ISIT 201
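
    A minimal sketch of coded matrix multiplication with a (3, 2) MDS code: two data workers and one parity worker, so the product can be recovered from any two results and a single straggler is tolerated. Worker assignment and names are illustrative.

    import numpy as np

    def coded_matvec_tasks(A, x):
        # Split A into two halves and add a parity half (A1 + A2); each worker
        # computes one block, and any 2 of the 3 results suffice to recover A @ x.
        A1, A2 = np.array_split(A, 2)
        return [A1 @ x, A2 @ x, (A1 + A2) @ x]   # work for workers 0, 1, 2

    A = np.random.randn(6, 4)
    x = np.random.randn(4)
    y1, y2, y_parity = coded_matvec_tasks(A, x)

    # Suppose worker 1 straggles: recover its block from the parity result.
    recovered = np.concatenate([y1, y_parity - y1])
    print(np.allclose(recovered, A @ x))  # True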