Parity Models: A General Framework for Coding-Based Resilience in ML Inference
Machine learning models are becoming the primary workhorses for many
applications. Production services deploy models through prediction serving
systems that take in queries and return predictions by performing inference on
machine learning models. In order to scale to high query rates, prediction
serving systems are run on many machines in cluster settings, and thus are
prone to slowdowns and failures that inflate tail latency and cause violations
of strict latency targets. Current approaches to reducing tail latency are
inadequate for the latency targets of prediction serving, incur high resource
overhead, or are inapplicable to the computations performed during inference.
We present ParM, a novel, general framework for making use of ideas from
erasure coding and machine learning to achieve low-latency, resource-efficient
resilience to slowdowns and failures in prediction serving systems. ParM
encodes multiple queries together into a single parity query and performs
inference on the parity query using a parity model. A decoder uses the output
of a parity model to reconstruct approximations of unavailable predictions.
ParM uses neural networks to learn parity models that enable simple, fast
encoders and decoders to reconstruct unavailable predictions for a variety of
inference tasks such as image classification, speech recognition, and object
localization. We build ParM atop an open-source prediction serving system and
through extensive evaluation show that ParM improves overall accuracy in the
face of unavailability with low latency while using 2-4x less additional
resources than replication-based approaches. ParM reduces the gap between
99.9th percentile and median latency compared to approaches that use an equal
amount of resources, while maintaining the same median.
Comment: This paper is superseded by the ACM SOSP 2019 paper "Parity Models:
Erasure-Coded Resilience for Prediction Serving Systems".
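As a rough illustration of the encode/decode path described above, here is a minimal sketch assuming an addition-based encoder and subtraction-based decoder, with a linear toy model standing in for the neural networks (in ParM the parity model is trained so the identity below holds only approximately):

```python
import numpy as np

# Toy "models": in ParM the deployed model and the parity model are neural
# networks; a linear map is used here so the parity identity holds exactly.
W = np.random.randn(10, 4)

def model(x):
    return x @ W          # prediction for a single query

def parity_model(xp):
    return xp @ W         # trained so parity_model(x1 + x2) ~ model(x1) + model(x2)

# Encoder: combine k queries into one parity query (here k = 2, simple addition).
x1, x2 = np.random.randn(10), np.random.randn(10)
x_parity = x1 + x2

y1 = model(x1)                      # suppose this prediction arrives on time
y_parity = parity_model(x_parity)   # parity prediction computed in parallel

# Decoder: if the prediction for x2 is slow or lost, approximate it by subtraction.
y2_approx = y_parity - y1
print(np.allclose(y2_approx, model(x2)))  # exact here only because the toy model is linear
```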
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
In data-parallel synchronous training of deep neural networks, different
devices (replicas) run the same program with different partitions of the
training batch, but weight update computation is repeated on all replicas,
because the weights do not have a batch dimension to partition. This can be a
bottleneck for performance and scalability in typical language models with
large weights, and in models with a small per-replica batch size, which is
typical in large-scale training. This paper presents an approach to
automatically shard
the weight update computation across replicas with efficient communication
primitives and data formatting, using static analysis and transformations on
the training computation graph. We show this technique achieves substantial
speedups on typical image and language models on Cloud TPUs, requiring no
change to model code. This technique helps close the gap between traditionally
expensive (ADAM) and cheap (SGD) optimizers, as they will only take a small
part of the training step time and have similar peak memory usage. It helped us
achieve state-of-the-art training performance in Google's MLPerf 0.6
submission.
Comment: 12 pages, 23 figures, 1 table
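A minimal numpy sketch of the idea, under the assumption that the sharded weight update amounts to a reduce-scatter of gradients, a per-shard update, and an all-gather of updated shards; the real system applies such transformations statically to the training graph rather than in Python:

```python
import numpy as np

num_replicas = 4
shard_size = 8
weights = np.random.randn(num_replicas * shard_size)      # full weight vector
per_replica_grads = [np.random.randn(weights.size) for _ in range(num_replicas)]
lr = 0.1

# Plain data parallelism: every replica all-reduces the gradient and then
# repeats the identical update on the full weight vector.
full_update = weights - lr * sum(per_replica_grads) / num_replicas

# Sharded weight update: reduce-scatter the gradient so each replica averages
# and updates only its own 1/num_replicas shard, then all-gather the shards.
shards = []
for r in range(num_replicas):
    sl = slice(r * shard_size, (r + 1) * shard_size)
    grad_shard = sum(g[sl] for g in per_replica_grads) / num_replicas  # reduce-scatter result
    shards.append(weights[sl] - lr * grad_shard)                       # per-replica partial update

updated = np.concatenate(shards)                                       # all-gather of updated shards
assert np.allclose(updated, full_update)
```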
Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce
Training large deep learning models inevitably requires a large-scale cluster
equipped with accelerators. Deep gradient compression can greatly increase
bandwidth utilization and speed up training, but it is hard to implement on a
ring structure. In this paper, we find that redundant gradients and gradient
staleness have a negative effect on training. We have observed that in
different epochs and at different steps, the neural network focuses on updating
different layers and different parameters. To save more communication
bandwidth and preserve accuracy on a ring structure, which removes this
restriction as the number of nodes increases, we propose a new algorithm for
measuring the importance of gradients on a large-scale cluster implementing
ring all-reduce, based on the magnitude of the ratio of a parameter's gradient
to its value. Our importance weighted pruning approach achieves gradient
compression ratios of 64x and 58.8x on AlexNet and ResNet50 on ImageNet.
Meanwhile, in order to maintain the sparseness of the gradient propagation, we
randomly broadcast the indices of important gradients from each node, while the
remaining nodes receive the indexed gradients and perform the all-reduce
update. This speeds up the convergence of the model and preserves the training
accuracy.
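A hedged sketch of the importance measure as described in the abstract (the ratio of a parameter's gradient to its value); the `keep_ratio` value and helper names are chosen for illustration, not taken from the paper:

```python
import numpy as np

def prune_gradient(grad, param, keep_ratio=1/64, eps=1e-12):
    """Keep only the gradients whose |grad / param| ratio is largest."""
    importance = np.abs(grad) / (np.abs(param) + eps)
    k = max(1, int(grad.size * keep_ratio))
    idx = np.argpartition(importance, -k)[-k:]   # indices of the top-k important gradients
    return idx, grad[idx]                        # only these values enter the all-reduce

def apply_sparse_update(param, idx, values, lr=0.01):
    param[idx] -= lr * values
    return param

param = np.random.randn(1_000_000)
grad = np.random.randn(1_000_000)
idx, vals = prune_gradient(grad, param)          # roughly 64x fewer values to communicate
param = apply_sparse_update(param, idx, vals)
```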
Slim-DP: A Light Communication Data Parallelism for DNN
Data parallelism has emerged as a necessary technique to accelerate the
training of deep neural networks (DNN). In a typical data parallelism approach,
the local workers push the latest updates of all the parameters to the
parameter server and pull all merged parameters back periodically. However,
with the increasing size of DNN models and the large number of workers in
practice, this typical data parallelism cannot achieve satisfactory training
acceleration, since it usually suffers from the heavy communication cost due to
transferring huge amount of information between workers and the parameter
server. In-depth understanding of DNNs has revealed that they are usually highly
redundant, such that deleting a considerable proportion of the parameters will
not significantly degrade model performance. This redundancy property exposes a
great opportunity to reduce the communication cost by only transferring the
information of those significant parameters during the parallel training.
However, if we only transfer the information of the parameters that are
significant in the latest snapshot, we may miss parameters that are
insignificant now but have the potential to become significant as training goes
on. To this end, we design an Explore-Exploit framework to dynamically choose
the subset to be communicated, which consists of the significant parameters in
the latest snapshot together with a randomly explored set of other parameters. We
propose to measure the significance of a parameter by a combination of its
magnitude and gradient. Our experimental results demonstrate that our proposed
Slim-DP achieves better training acceleration than standard data parallelism
and its communication-efficient variant, saving communication time without
loss of accuracy.
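To make the explore-exploit selection concrete, here is a toy sketch; the weighting `beta` and the exploit/explore ratios are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def select_subset(params, grads, exploit_ratio=0.05, explore_ratio=0.01, beta=0.5):
    """Pick parameters to communicate: the most 'significant' ones (exploit)
    plus a random sample of the rest (explore)."""
    # Illustrative significance score combining magnitude and gradient.
    score = beta * np.abs(params) + (1 - beta) * np.abs(grads)
    n = params.size
    k_exploit = int(n * exploit_ratio)
    exploit_idx = np.argpartition(score, -k_exploit)[-k_exploit:]

    remaining = np.setdiff1d(np.arange(n), exploit_idx)
    k_explore = int(n * explore_ratio)
    explore_idx = np.random.choice(remaining, size=k_explore, replace=False)

    return np.concatenate([exploit_idx, explore_idx])   # indices pushed/pulled this round

params = np.random.randn(100_000)
grads = np.random.randn(100_000)
subset = select_subset(params, grads)   # ~6% of parameters exchanged instead of 100%
```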
Data Management in Industry 4.0: State of the Art and Open Challenges
Information and communication technologies are permeating all aspects of
industrial and manufacturing systems, expediting the generation of large
volumes of industrial data. This article surveys the recent literature on data
management as it applies to networked industrial environments and identifies
several open research challenges for the future. As a first step, we extract
important data properties (volume, variety, traffic, criticality) and identify
the corresponding data enabling technologies of diverse fundamental industrial
use cases, based on practical applications. Secondly, we provide a detailed
outline of recent industrial architectural designs with respect to their data
management philosophy (data presence, data coordination, data computation) and
the extent of their distributiveness. Then, we conduct a holistic survey of the
recent literature from which we derive a taxonomy of the latest advances on
industrial data enabling technologies and data centric services, spanning all
the way from the field level deep in the physical deployments, up to the cloud
and applications level. Finally, motivated by the rich conclusions of this
critical analysis, we identify interesting open challenges for future research.
The concepts presented in this article thematically cover the largest part of
the industrial automation pyramid layers. Our approach is multidisciplinary, as
the selected publications were drawn from two fields: the communications,
networking, and computation field, as well as the industrial, manufacturing, and
automation field. The article can help readers to deeply understand how
data management is currently applied in networked industrial environments, and
select interesting open research opportunities to pursue.
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
With the breakthroughs in deep learning, the recent years have witnessed a
booming of artificial intelligence (AI) applications and services, spanning
from personal assistant to recommendation systems to video/audio surveillance.
More recently, with the proliferation of mobile computing and
Internet-of-Things (IoT), billions of mobile and IoT devices are connected to
the Internet, generating zillions of bytes of data at the network edge. Driven
by this trend, there is an urgent need to push the AI frontiers to the network
edge so as to fully unleash the potential of the edge big data. To meet this
demand, edge computing, an emerging paradigm that pushes computing tasks and
services from the network core to the network edge, has been widely recognized
as a promising solution. The resulting new interdiscipline, edge AI or edge
intelligence, is beginning to receive a tremendous amount of interest. However,
research on edge intelligence is still in its infancy, and a dedicated
venue for exchanging the recent advances of edge intelligence is highly desired
by both the computer system and artificial intelligence communities. To this
end, we conduct a comprehensive survey of the recent research efforts on edge
intelligence. Specifically, we first review the background and motivation for
artificial intelligence running at the network edge. We then provide an
overview of the overarching architectures, frameworks, and emerging key
technologies for deep learning model training and inference at the network
edge. Finally, we discuss future research opportunities on edge intelligence.
We believe that this survey will elicit escalating attention, stimulate
fruitful discussions, and inspire further research ideas on edge intelligence.
Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang,
"Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge
Computing," Proceedings of the IEEE
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming
increasingly popular. Hosting pre-trained machine learning models in the cloud
enables elastic scalability as the demand grows. But providing low latency and
reducing latency variance are key requirements. Variance is harder to
control in a cloud deployment due to uncertainties in resource allocations
across many virtual instances. We propose the collage inference technique which
uses a novel convolutional neural network model, collage-cnn, to provide
low-cost redundancy. A collage-cnn model takes a collage image formed by
combining multiple images and performs multi-image classification in one shot,
albeit at slightly lower accuracy. We augment a collection of traditional
single image classifier models with a single collage-cnn classifier which acts
as their low-cost redundant backup. Collage-cnn provides backup classification
results if any single image classification requests experience slowdown.
Deploying the collage-cnn models in the cloud, we demonstrate that the 99th
percentile tail latency of inference can be reduced by 1.2x to 2x compared to
replication-based approaches while providing high accuracy. Variation in
inference latency can be reduced by 1.8x to 15x.
Comment: 10 pages, Under submission
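A minimal sketch of the collage idea, assuming a simple 2x2 tiling of equally sized images; `single_cnn` and `collage_cnn` are hypothetical stand-ins for the deployed single-image models and the collage-cnn backup:

```python
import numpy as np

def make_collage(images, grid=(2, 2)):
    """Tile single-image requests into one collage image (a 2x2 grid here),
    so one collage-cnn pass can back up four single-image requests."""
    rows = [np.concatenate(images[r * grid[1]:(r + 1) * grid[1]], axis=1)
            for r in range(grid[0])]
    return np.concatenate(rows, axis=0)

# Four pending single-image requests (toy 32x32 RGB images).
requests = [np.random.rand(32, 32, 3) for _ in range(4)]
collage = make_collage(requests)            # shape (64, 64, 3)

# In a real deployment, single_cnn(img) for each request and collage_cnn(collage)
# would run in parallel; if request i is slow, the collage-cnn's prediction for
# cell i is served as the lower-accuracy backup result.
```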
Slack Squeeze Coded Computing for Adaptive Straggler Mitigation
While performing distributed computations in today's cloud-based platforms,
execution speed variations among compute nodes can significantly reduce the
performance and create bottlenecks like stragglers. Coded computation
techniques leverage coding theory to inject computational redundancy and
mitigate stragglers in distributed computations. In this paper, we propose a
dynamic workload distribution strategy for coded computation called Slack
Squeeze Coded Computation. It squeezes the compute slack (i.e., overhead) that
is built into coded computing frameworks by efficiently assigning work to all
fast and slow nodes according to their speeds, without needing to re-distribute
data. We implement an LSTM-based speed prediction algorithm to predict the
speeds of compute nodes. We evaluate our approach on linear algebraic
algorithms, gradient descent, graph ranking, and graph filtering algorithms. We
demonstrate a 19% to 39% reduction in total computation latency compared to job
replication and coded computation. We further show how the approach can be
applied beyond matrix-vector multiplication.
Comment: 13 pages, SC 2019
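A small sketch of the load-assignment step implied by the abstract: split the rows of a coded computation in proportion to predicted node speeds instead of giving every node an equal, slack-padded share. The function and example speeds are illustrative, not the paper's implementation:

```python
import numpy as np

def assign_work(num_rows, predicted_speeds):
    """Split rows across nodes in proportion to their predicted speeds."""
    speeds = np.asarray(predicted_speeds, dtype=float)
    shares = speeds / speeds.sum()
    counts = np.floor(shares * num_rows).astype(int)
    counts[-1] += num_rows - counts.sum()          # hand any rounding remainder to the last node
    starts = np.concatenate([[0], np.cumsum(counts)[:-1]])
    return [(int(s), int(s + c)) for s, c in zip(starts, counts)]

# e.g. a speed predictor (such as the LSTM mentioned above, not shown here)
# forecasts relative node speeds; the slow node receives a smaller row range.
predicted = [1.0, 1.0, 0.5, 0.9]
print(assign_work(1200, predicted))
```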
Private Machine Learning in TensorFlow using Secure Computation
We present a framework for experimenting with secure multi-party computation
directly in TensorFlow. By doing so we benefit from several properties valuable
to both researchers and practitioners, including tight integration with
ordinary machine learning processes, existing optimizations for distributed
computation in TensorFlow, high-level abstractions for expressing complex
algorithms and protocols, and an expanded set of familiar tooling. We give an
open source implementation of a state-of-the-art protocol and report on
concrete benchmarks using typical models from private machine learning.
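As background for what secure multi-party computation means here, a minimal numpy sketch of additive secret sharing, the basic primitive such protocols build on; this is not the paper's TensorFlow-based implementation or its API:

```python
import numpy as np

P = 2**31 - 1    # toy modulus for additive secret sharing

def share(x, num_parties=3):
    """Additively secret-share integer array x: no single party learns x,
    but the shares sum to x modulo P."""
    shares = [np.random.randint(0, P, size=x.shape, dtype=np.int64)
              for _ in range(num_parties - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

def add_shared(a_shares, b_shares):
    """Secure addition: each party adds its own shares locally, no communication."""
    return [(a + b) % P for a, b in zip(a_shares, b_shares)]

x = np.array([3, 14, 15], dtype=np.int64)
y = np.array([9, 26, 5], dtype=np.int64)
assert (reconstruct(add_shared(share(x), share(y))) == (x + y) % P).all()
```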
Speeding Up Distributed Machine Learning Using Codes
Codes are widely used in many engineering applications to offer robustness
against noise. In large-scale systems there are several types of noise that can
affect the performance of distributed machine learning algorithms -- straggler
nodes, system failures, or communication bottlenecks -- but there has been
little interaction cutting across codes, machine learning, and distributed
systems. In this work, we provide theoretical insights on how coded solutions
can achieve significant gains compared to uncoded ones. We focus on two of the
most basic building blocks of distributed learning algorithms: matrix
multiplication and data shuffling. For matrix multiplication, we use codes to
alleviate the effect of stragglers, and show that if the number of homogeneous
workers is n, and the runtime of each subtask has an exponential tail, coded
computation can speed up distributed matrix multiplication by a factor of
log n. For data shuffling, we use codes to reduce communication bottlenecks,
exploiting the excess in storage. We show that when a constant fraction alpha
of the data matrix can be cached at each worker, and n is the number of
workers, coded shuffling reduces the communication cost by a factor of
(alpha + 1/n) * gamma(n) compared to uncoded shuffling, where gamma(n) is the
ratio of the cost of unicasting n messages to n users to multicasting a common
message (of the same size) to n users. For instance, gamma(n) is approximately
n if multicasting a message to n users is as cheap as unicasting a message to
one user. We also provide experiment results corroborating our theoretical
gains of the coded algorithms.
Comment: This work is published in IEEE Transactions on Information Theory and
presented in part at the NIPS 2015 Workshop on Machine Learning Systems and
the IEEE ISIT 2016
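A toy sketch of the coded matrix multiplication idea for n = 3 workers and k = 2 data blocks, using a single parity block so that any one straggling worker's result can be decoded from the other two:

```python
import numpy as np

# Split A into 2 row blocks and add one parity block; A @ x is then recoverable
# from any 2 of the 3 worker results, masking one straggler.
A = np.random.randn(8, 5)
x = np.random.randn(5)
A1, A2 = A[:4], A[4:]

tasks = [A1, A2, A1 + A2]            # work sent to the 3 workers
results = [T @ x for T in tasks]     # worker outputs

# Suppose the worker computing A2 @ x straggles; decode its result from the others.
y1, y_parity = results[0], results[2]
y2_decoded = y_parity - y1

assert np.allclose(np.concatenate([y1, y2_decoded]), A @ x)
```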