Fast and Robust Distributed Learning in High Dimension
Could a gradient aggregation rule (GAR) for distributed machine learning be
both robust and fast? This paper answers in the affirmative through
multi-Bulyan. Given $n$ workers, $f$ of which are arbitrary malicious
(Byzantine) and $m = n - f$ are not, we prove that multi-Bulyan can ensure a strong
form of Byzantine resilience, as well as an $\frac{m}{n}$ slowdown, compared
to averaging, the fastest (but non Byzantine resilient) rule for distributed
machine learning. When $m \approx n$ (almost all workers are correct),
multi-Bulyan reaches the speed of averaging. We also prove that multi-Bulyan's
cost in local computation is $O(d)$ (like averaging), an important feature for
ML where $d$ commonly reaches $10^9$, while robust alternatives have at least
quadratic cost in $d$.
Our theoretical findings are complemented with an experimental evaluation
which, in addition to supporting the linear complexity argument, conveys
the fact that multi-Bulyan's parallelisability further adds to its efficiency.
Comment: preliminary theoretical draft, complements the SysML 2019 practical
paper of which the code is provided at
https://github.com/LPD-EPFL/AggregaThor. arXiv admin note: text overlap with
arXiv:1703.0275
On The Robustness of a Neural Network
With the development of neural networks based machine learning and their
usage in mission critical applications, voices are rising against the
\textit{black box} aspect of neural networks as it becomes crucial to
understand their limits and capabilities. With the rise of neuromorphic
hardware, it is even more critical to understand how a neural network, as a
distributed system, tolerates the failures of its computing nodes, neurons, and
its communication channels, synapses. Experimentally assessing the robustness
of neural networks involves the quixotic venture of testing all the possible
failures, on all the possible inputs, which ultimately hits a combinatorial
explosion for the first, and the impossibility to gather all the possible
inputs for the second.
In this paper, we prove an upper bound on the expected error of the output
when a subset of neurons crashes. This bound involves dependencies on the
network parameters that can be seen as being too pessimistic in the average
case. It involves a polynomial dependency on the Lipschitz coefficient of the
neurons' activation function, and an exponential dependency on the depth of the
layer where a failure occurs. We back up our theoretical results with
experiments illustrating the extent to which our prediction matches the
dependencies between the network parameters and robustness. Our results show
that the robustness of neural networks to the average crash can be estimated
without the need to either test the network on all failure configurations or
access the training set used to train the network, both of which are
practically impossible requirements.
Comment: 36th IEEE International Symposium on Reliable Distributed Systems,
26-29 September 2017, Hong Kong, China
Fast Machine Learning with Byzantine Workers and Servers
Machine Learning (ML) solutions are nowadays distributed and are prone to
various types of component failures, which can be encompassed in so-called
Byzantine behavior. This paper introduces LiuBei, a Byzantine-resilient ML
algorithm that does not trust any individual component in the network (neither
workers nor servers), nor does it induce additional communication rounds (on
average), compared to standard non-Byzantine resilient algorithms. LiuBei
builds upon gradient aggregation rules (GARs) to tolerate a minority of
Byzantine workers. Besides, LiuBei replicates the parameter server on multiple
machines instead of trusting it. We introduce a novel filtering mechanism that
enables workers to filter out replies from Byzantine server replicas without
requiring communication with all servers. Such a filtering mechanism is based
on network synchrony, Lipschitz continuity of the loss function, and the GAR
used to aggregate workers' gradients. We also introduce a protocol,
scatter/gather, to bound drifts between models on correct servers with a small
number of communication messages. We theoretically prove that LiuBei achieves
Byzantine resilience to both servers and workers and guarantees convergence. We
build LiuBei using TensorFlow, and we show that LiuBei tolerates Byzantine
behavior with an accuracy loss of around 5% and around 24% convergence overhead
compared to vanilla TensorFlow. We moreover show that the throughput gain of
LiuBei compared to another state-of-the-art Byzantine-resilient ML algorithm
(that assumes network asynchrony) is 70%.
Comment: This paper has been merged with arXiv:1905.03853, which has been
accepted to appear in the ACM Symposium on Principles of Distributed
Computing (PODC) 202
Byzantine-Tolerant Machine Learning
The growth of data, the need for scalability and the complexity of models
used in modern machine learning call for distributed implementations. Yet, as
of today, distributed machine learning frameworks have largely ignored the
possibility of arbitrary (i.e., Byzantine) failures. In this paper, we study
the robustness to Byzantine failures at the fundamental level of stochastic
gradient descent (SGD), the heart of most machine learning algorithms. Assuming
a set of $n$ workers, up to $f$ of them being Byzantine, we ask how robust can
SGD be, without limiting the dimension, nor the size of the parameter space.
We first show that no gradient descent update rule based on a linear
combination of the vectors proposed by the workers (i.e., current approaches)
tolerates a single Byzantine failure. We then formulate a resilience property
of the update rule capturing the basic requirements to guarantee convergence
despite Byzantine workers. We finally propose Krum, an update rule that
satisfies the resilience property aforementioned. For a $d$-dimensional
learning problem, the time complexity of Krum is
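The contrast drawn in this abstract, between linear combinations such as averaging and a selection rule such as Krum, can be illustrated with a small sketch. It follows the commonly cited description of Krum: score each proposed vector by the sum of squared distances to its n - f - 2 closest neighbours, then return the vector with the lowest score. The function names, the toy vectors, and the pure-Python style are assumptions made for illustration; this is not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def average(gradients: List[Vector]) -> List[float]:
    """Plain averaging: a linear combination of the proposals."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]


def krum(gradients: List[Vector], f: int) -> List[float]:
    """Return the proposal with the smallest Krum score.

    The score of a proposal is the sum of squared distances to its
    n - f - 2 closest other proposals (assumes n >= 2f + 3).
    """
    n = len(gradients)
    if n < 2 * f + 3:
        raise ValueError("this sketch assumes n >= 2f + 3 workers")
    scores = []
    for i, g in enumerate(gradients):
        distances = sorted(
            squared_distance(g, h) for j, h in enumerate(gradients) if j != i
        )
        scores.append(sum(distances[: n - f - 2]))
    best = min(range(n), key=scores.__getitem__)
    return list(gradients[best])


# Toy data (hypothetical): four honest proposals and one Byzantine proposal.
honest = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05]]
byzantine = [[1000.0, -1000.0]]
proposals = honest + byzantine  # n = 5, f = 1

print("average:", average(proposals))  # dragged far away by a single outlier
print("krum:   ", krum(proposals, f=1))  # one of the honest proposals
```

On this toy input a single Byzantine vector moves the average arbitrarily far from the honest proposals, whereas the selected vector stays among them. Computing all pairwise distances is what makes the cost quadratic in the number of workers.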
Dynamic Safe Interruptibility for Decentralized Multi-Agent Reinforcement Learning
In reinforcement learning, agents learn by performing actions and observing
their outcomes. Sometimes, it is desirable for a human operator to
\textit{interrupt} an agent in order to prevent dangerous situations from
happening. Yet, as part of their learning process, agents may link these
interruptions, that impact their reward, to specific states and deliberately
avoid them. The situation is particularly challenging in a multi-agent context
because agents might not only learn from their own past interruptions, but also
from those of other agents. Orseau and Armstrong defined \emph{safe
interruptibility} for one learner, but their work does not naturally extend to
multi-agent systems. This paper introduces \textit{dynamic safe
interruptibility}, an alternative definition more suited to decentralized
learning problems, and studies this notion in two learning frameworks:
\textit{joint action learners} and \textit{independent learners}. We give
realistic sufficient conditions on the learning algorithm to enable dynamic
safe interruptibility in the case of joint action learners, yet show that these
conditions are not sufficient for independent learners. We show however that if
agents can detect interruptions, it is possible to prune the observations to
ensure dynamic safe interruptibility even for independent learners.
The Hidden Vulnerability of Distributed Learning in Byzantium
While machine learning is going through an era of celebrated success,
concerns have been raised about the vulnerability of its backbone: stochastic
gradient descent (SGD). Recent approaches have been proposed to ensure the
robustness of distributed SGD against adversarial (Byzantine) workers sending
poisoned gradients during the training phase. Some of these approaches have
been proven Byzantine-resilient: they ensure the convergence of SGD despite the
presence of a minority of adversarial workers.
We show in this paper that convergence is not enough. In high dimension $d \gg 1$, an adversary can build on the loss function's non-convexity to make
SGD converge to ineffective models. More precisely, we bring to light that
existing Byzantine-resilient schemes leave a margin of poisoning of
$\Omega(f(d))$, where $f(d)$ increases at least like $\sqrt{d}$.
Based on this leeway, we build a simple attack, and experimentally show its
strong to utmost effectivity on CIFAR-10 and MNIST.
We introduce Bulyan, and prove it significantly reduces the attackers' leeway
to a narrow $O(1/\sqrt{d})$ bound. We empirically show that Bulyan
does not suffer the fragility of existing aggregation rules and, at a
reasonable cost in terms of required batch size, achieves convergence as if
only non-Byzantine gradients had been used to update the model.
Comment: Accepted to ICML 2018 as a long talk
Distributed Momentum for Byzantine-resilient Learning
Momentum is a variant of gradient descent that has been proposed for its
benefits on convergence. In a distributed setting, momentum can be implemented
either at the server or the worker side. When the aggregation rule used by the
server is linear, commutativity with addition makes both deployments
equivalent. Robustness and privacy are however among motivations to abandon
linear aggregation rules. In this work, we demonstrate the benefits on
robustness of using momentum at the worker side. We first prove that computing
momentum at the workers reduces the variance-norm ratio of the gradient
estimation at the server, strengthening Byzantine resilient aggregation rules.
We then provide an extensive experimental demonstration of the robustness
effect of worker-side momentum on distributed SGD.
Comment: Source code (for academic use only):
https://github.com/LPD-EPFL/ByzantineMomentu
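The equivalence claimed in this abstract for linear aggregation rules can be checked on a toy example. The sketch below applies momentum either after aggregation (server side) or before it (worker side) to a fixed sequence of gradients. Averaging gives the same result in both deployments, while a coordinate-wise median, used here only as a stand-in for a non-linear rule, does not. All names and numbers are illustrative assumptions, and the gradients are held fixed rather than recomputed from the model, which a real training loop would do.

```python
import math
from statistics import median


def mean_rule(vectors):
    """Linear aggregation: coordinate-wise average."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def median_rule(vectors):
    """Non-linear aggregation: coordinate-wise median."""
    return [median(column) for column in zip(*vectors)]


def server_side(steps, rule, beta):
    """Aggregate the workers' gradients first, then apply momentum."""
    dim = len(steps[0][0])
    velocity = [0.0] * dim
    for gradients in steps:
        aggregated = rule(gradients)
        velocity = [beta * v + a for v, a in zip(velocity, aggregated)]
    return velocity


def worker_side(steps, rule, beta):
    """Each worker applies momentum locally, then the server aggregates."""
    n, dim = len(steps[0]), len(steps[0][0])
    velocities = [[0.0] * dim for _ in range(n)]
    aggregated = [0.0] * dim
    for gradients in steps:
        velocities = [
            [beta * v + g for v, g in zip(velocity, gradient)]
            for velocity, gradient in zip(velocities, gradients)
        ]
        aggregated = rule(velocities)
    return aggregated


# Two steps, three workers, two dimensions (hypothetical gradients).
steps = [
    [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]],
    [[0.0, 3.0], [2.0, 0.0], [1.0, 1.0]],
]
beta = 0.9

print(server_side(steps, mean_rule, beta), worker_side(steps, mean_rule, beta))
print(server_side(steps, median_rule, beta), worker_side(steps, median_rule, beta))
```

With averaging, both deployments produce the same update up to floating-point rounding. With the median they diverge, which is why the choice of deployment starts to matter once the aggregation rule is no longer linear.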
Asynchronous Byzantine Machine Learning (the case of SGD)
Asynchronous distributed machine learning solutions have proven very
effective so far, but always assuming perfectly functioning workers. In
practice, some of the workers can however exhibit Byzantine behavior, caused by
hardware failures, software bugs, corrupt data, or even malicious attacks. We
introduce \emph{Kardam}, the first distributed asynchronous stochastic gradient
descent (SGD) algorithm that copes with Byzantine workers. Kardam consists of
two complementary components: a filtering and a dampening component. The first
is scalar-based and ensures resilience against Byzantine workers.
Essentially, this filter leverages the Lipschitzness of cost functions and acts
as a self-stabilizer against Byzantine workers that would attempt to corrupt
the progress of SGD. The dampening component bounds the convergence rate by
adjusting to stale information through a generic gradient weighting scheme. We
prove that Kardam guarantees almost sure convergence in the presence of
asynchrony and Byzantine behavior, and we derive its convergence rate. We
evaluate Kardam on the CIFAR-100 and EMNIST datasets and measure its overhead
with respect to non Byzantine-resilient solutions. We empirically show that
Kardam does not introduce additional noise to the learning procedure but does
induce a slowdown (the cost of Byzantine resilience) that we both theoretically
and empirically show to be less than $f/n$, where $f$ is the number of
Byzantine failures tolerated and $n$ the total number of workers.
Interestingly, we also empirically observe that the dampening component is
of interest in its own right, for it enables building an SGD algorithm that
outperforms alternative staleness-aware asynchronous competitors in
environments with honest workers.
Comment: accepted to ICML 201
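To make the idea of a staleness-aware gradient weighting scheme more concrete, here is a minimal sketch. The inverse weighting 1 / (1 + staleness), the function names, and the numbers are assumptions chosen for illustration. The abstract only states that the weighting scheme is generic, and Kardam's filtering component is not reproduced here.

```python
import math


def dampening(staleness: int) -> float:
    """Hypothetical dampening function: weight shrinks as staleness grows."""
    return 1.0 / (1.0 + staleness)


def apply_gradient(params, gradient, staleness, learning_rate):
    """One SGD update in which a stale gradient is down-weighted."""
    weight = dampening(staleness)
    return [p - learning_rate * weight * g for p, g in zip(params, gradient)]


params = [1.0, 2.0]
gradient = [0.5, -0.5]

fresh = apply_gradient(params, gradient, staleness=0, learning_rate=0.1)
stale = apply_gradient(params, gradient, staleness=3, learning_rate=0.1)

print("fresh update:", fresh)  # full step
print("stale update:", stale)  # a quarter of the step
```

A gradient computed on the current model is applied at full weight, while one computed three model versions ago moves the parameters by only a quarter of the step.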
Host-Pathogen Co-evolution Inspired Algorithm Enables Robust GAN Training
Generative adversarial networks (GANs) are pairs of artificial neural
networks that are trained against each other. The outputs from a generator
are mixed with the real-world inputs to the discriminator and both networks are
trained until an equilibrium is reached, where the discriminator cannot
distinguish generated inputs from real ones. Since their introduction, GANs
have allowed for the generation of impressive imitations of real-life films,
images and texts, whose fakeness is barely noticeable to humans. Despite their
impressive performance, training GANs remains to this day more of an art than a
reliable procedure, in large part due to issues with training process stability.
Generators are susceptible to mode dropping and convergence to random patterns,
which have to be mitigated by computationally expensive multiple restarts.
Curiously, GANs bear an uncanny similarity to a co-evolution of a pathogen and
its host's immune system in biology. In a biological context, the majority of
potential pathogens indeed never make it and are kept at bay by the host's
immune system. Yet some are efficient enough to present a risk of a serious
condition and recurrent infections. Here, we explore that similarity to propose
a more robust algorithm for GAN training. We empirically show the increased
stability and a better ability to generate high-quality images while using less
computational power.
Comment: 8 pages, 10 figures
AKSEL: Fast Byzantine SGD
Modern machine learning architectures distinguish servers and workers. Typically, a d-dimensional model is hosted by a server and trained by n workers, using a distributed stochastic gradient descent (SGD) optimization scheme. At each SGD step, the goal is to estimate the gradient of a cost function. The simplest way to do this is to average the gradients estimated by the workers. However, averaging is not resilient to even one single Byzantine failure of a worker. Many alternative gradient aggregation rules (GARs) have recently been proposed to tolerate a maximum number f of Byzantine workers. These GARs differ according to (1) the complexity of their computation time, (2) the maximal number of Byzantine workers despite which convergence can still be ensured (breakdown point), and (3) their accuracy, which can be captured by (3.1) their angular error, namely the angle with the true gradient, as well as (3.2) their ability to aggregate full gradients. In particular, many are not full gradients for they operate on each dimension separately, which results in a coordinate-wise blended gradient, leading to low accuracy in practical situations where the number (s) of workers that are actually Byzantine in an execution is small (s ≪ f).
We propose Aksel, a new scalable median-based GAR with optimal time complexity (O(nd)), optimal breakdown point (n > 2f) and the lowest upper bound on the expected angular error (O(√d)) among full gradient approaches. We also study the actual angular error of Aksel when the gradient distribution is normal and show that it only grows in O(√d log n), which is the first logarithmic upper bound ever proven on the number of workers n assuming an optimal breakdown point. We also report on an empirical evaluation of Aksel on various classification tasks, which we compare to alternative GARs against state-of-the-art attacks. Aksel is the only GAR reaching top accuracy when there are actually no or few Byzantine workers while maintaining a good defense even under the extreme case (s = f). For simplicity of presentation, we consider a scheme with a single server. However, as we explain in the paper, Aksel can also easily be adapted to multi-server architectures that tolerate the Byzantine behavior of a fraction of the servers.
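The remark about rules that operate on each dimension separately can be illustrated with the coordinate-wise median, a common example of such a rule. The sketch below is not Aksel itself. It only shows, on hypothetical toy vectors, how a per-coordinate rule can return a blended vector that no worker actually proposed.

```python
from statistics import median


def coordinate_wise_median(gradients):
    """Take the median of each coordinate independently."""
    return [median(column) for column in zip(*gradients)]


# Three hypothetical worker gradients in dimension 3.
gradients = [
    [1.0, 9.0, 5.0],
    [5.0, 1.0, 9.0],
    [9.0, 5.0, 1.0],
]

blended = coordinate_wise_median(gradients)
print("blended gradient:", blended)
print("proposed by a worker:", blended in gradients)
```

Each coordinate of the result comes from a different worker, so the output is not a full gradient in the sense used in the abstract.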