693 research outputs found
On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective
We consider training over-parameterized two-layer neural networks with
Rectified Linear Units (ReLU) using the gradient descent (GD) method. Inspired by a
recent line of work, we study the evolutions of network prediction errors
across GD iterations, which can be neatly described in a matrix form. When the
network is sufficiently over-parameterized, these matrices individually
approximate {\em an} integral operator which is determined by the feature
vector distribution only. Consequently, the GD method can be viewed as {\em
approximately} applying the powers of this integral operator on the
underlying/target function that generates the responses/labels.
We show that if the target function admits a low-rank approximation with
respect to the eigenspaces of this integral operator, then the empirical risk
decreases to this low-rank approximation error at a linear rate that is
determined by the target function and the feature distribution only, i.e., the
rate is independent of the sample size. Furthermore, if the target function has
zero low-rank approximation error, then, as long as the width of the neural
network is nearly linear in the sample size, the empirical risk can be driven
arbitrarily close to zero. To the best of our knowledge, this is the first
result showing the sufficiency of nearly-linear network over-parameterization.
We provide an application of our general results to the setting where the
feature distribution is the uniform distribution on the sphere and the target
function is a polynomial. Throughout this paper, we consider the scenario where
the input dimension is fixed.
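As a rough illustration of this viewpoint, one can simulate how the prediction-error vector contracts when each GD step acts (approximately) as one application of the identity minus a step size times a kernel matrix. The kernel form below (the expected ReLU-gating kernel) and all parameter choices are our own illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

# Illustrative sketch: with sufficient over-parameterization, the prediction-
# error vector e_t across GD iterations approximately evolves as
# e_{t+1} = (I - eta*K) e_t, where K is a kernel matrix determined by the
# feature-vector distribution alone.
rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm feature vectors
G = np.clip(X @ X.T, -1.0, 1.0)                  # pairwise inner products
K = G * (np.pi - np.arccos(G)) / (2 * np.pi)     # expected ReLU-gating kernel (PSD)
eta = 0.5 / np.linalg.eigvalsh(K).max()          # conservative step size
e = rng.standard_normal(n)                       # initial prediction errors
init_err = np.linalg.norm(e)
for _ in range(200):
    e = e - eta * (K @ e)                        # GD step = applying (I - eta*K) once
print(init_err, np.linalg.norm(e))
```

Error components aligned with the top eigenspaces of K decay at a linear rate, while components outside them persist, matching the low-rank-approximation picture in the abstract.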
Finite-time Guarantees for Byzantine-Resilient Distributed State Estimation with Noisy Measurements
This work considers resilient, cooperative state estimation in unreliable
multi-agent networks. A network of agents aims to collaboratively estimate the
value of an unknown vector parameter, while an {\em unknown} subset of agents
suffer Byzantine faults. Faulty agents malfunction arbitrarily and may send out
{\em highly unstructured} messages to other agents in the network. As opposed
to fault-free networks, reaching agreement in the presence of Byzantine faults
is far from trivial. In this paper, we propose a computationally-efficient
algorithm that is provably robust to Byzantine faults. At each iteration of the
algorithm, a good agent (1) performs a gradient descent update based on noisy
local measurements, (2) exchanges its update with other agents in its
neighborhood, and (3) robustly aggregates the received messages using
coordinate-wise trimmed means. Under mild technical assumptions, we establish
that good agents learn the true parameter asymptotically in the almost sure
sense. We further complement our analysis by proving a (high-probability) {\em
finite-time} convergence rate that encapsulates the network characteristics.
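The aggregation step (3) can be made concrete. The helper below is our own minimal illustration of a coordinate-wise trimmed mean; the trimming level b and all numbers are hypothetical:

```python
import numpy as np

def trimmed_mean(msgs, b):
    """Coordinate-wise trimmed mean: for each coordinate, drop the b largest
    and b smallest received values, then average the rest. (A minimal sketch
    of the aggregation step; names and values are ours, not the paper's.)"""
    msgs = np.sort(np.asarray(msgs), axis=0)   # sort each coordinate independently
    return msgs[b:len(msgs) - b].mean(axis=0)

# 7 neighbors, 2 of them Byzantine: honest messages cluster near [1, -2]
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.05, -2.05], [0.95, -1.95]]
faulty = [[100.0, 100.0], [-100.0, -100.0]]    # arbitrary, highly unstructured
agg = trimmed_mean(honest + faulty, b=2)
print(agg)   # stays close to [1, -2] despite the outliers
```

Because each coordinate is trimmed separately, a faulty agent cannot pull any single coordinate of the aggregate beyond the range of the honest values once b is at least the number of faults.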
Securing Distributed Gradient Descent in High Dimensional Statistical Learning
We consider unreliable distributed learning systems wherein the training data
is kept confidential by external workers, and the learner has to interact
closely with those workers to train a model. In particular, we assume that
there exists a system adversary that can adaptively compromise some workers;
the compromised workers deviate from their local designed specifications by
sending out arbitrarily malicious messages.
We assume that in each communication round, a bounded number of the workers may
suffer Byzantine faults. Each worker keeps a local sample of the data, and the
local samples together constitute the total sample. We propose a secured
variant of the gradient descent method that can tolerate up to a constant
fraction of Byzantine workers. Moreover, we show that the statistical
estimation error of the iterates converges, within logarithmically many rounds,
to a rate governed by the model dimension and the sample sizes. When the sample
size is sufficiently large relative to the model dimension, our proposed
algorithm achieves the optimal error rate. Our results
are obtained under some technical assumptions. Specifically, we assume
strongly-convex population risk. Nevertheless, the empirical risk (sample
version) is allowed to be non-convex. The core of our method is to robustly
aggregate the gradients computed by the workers based on the filtering
procedure proposed by Steinhardt et al. On the technical front, deviating from
the existing literature on robustly estimating a finite-dimensional mean
vector, we establish a {\em uniform} concentration of the sample covariance
matrix of gradients, and show that the aggregated gradient, as a function of
model parameter, converges uniformly to the true gradient function. To get a
near-optimal uniform concentration bound, we develop a new matrix concentration
inequality, which might be of independent interest.
Byzantine Multi-Agent Optimization: Part I
We study Byzantine fault-tolerant distributed optimization of a sum of convex
(cost) functions with real-valued scalar input/output. In particular, the goal
is to optimize the sum of the local cost functions of the non-faulty agents,
where each agent's local cost function is initially known only to that agent.
In general, when some of the agents may be Byzantine faulty, the above goal is
unachievable, because the identity of the faulty agents is not necessarily
known to the non-faulty agents, and the faulty agents may behave arbitrarily.
Since the above global cost function cannot be optimized exactly in the
presence of Byzantine agents, we define a weaker version of the problem. The
goal for the weaker problem is to generate an output that is an optimum of a
function formed as a convex combination of the local cost functions of the
non-faulty agents. More precisely, for some choice of nonnegative weights
summing to one, the output must be an optimum of the correspondingly weighted
sum of the non-faulty agents' local cost functions. Ideally, we would like all
non-faulty agents' weights to be equal and positive -- however, this cannot be
guaranteed due to the presence of faulty agents. In fact, we show that the
maximum achievable number of nonzero weights decreases with the upper bound on
the number of Byzantine agents. In addition, we present algorithms that ensure
that the weights of a guaranteed number of agents are bounded away from 0. We
also propose a low-complexity suboptimal algorithm with a weaker such
guarantee, which depends on the total number of agents and on the actual number
of Byzantine agents.
Fault-Tolerant Multi-Agent Optimization: Part III
We study fault-tolerant distributed optimization of a sum of convex (cost)
functions with real-valued scalar input/output in the presence of crash faults
or Byzantine faults. In particular, the goal is to optimize the sum of the
agents' local cost functions, where each agent's local cost function is
initially known only to that agent. Since the above global cost function cannot
be optimized exactly in the presence of crash faults or Byzantine faults, we
define two weaker versions of the problem for crash faults and Byzantine
faults, respectively.
When some agents may crash, the goal for the weaker problem is to generate an
output that is an optimum of a normalized weighted sum of the local cost
functions in which every non-faulty agent's function receives full weight and
each crashed agent's function receives a weight between zero and one. We
present an iterative algorithm in which each agent only needs to perform local
computation, and send one message per iteration.
When some agents may be Byzantine, the system cannot take full advantage of
the data kept by non-faulty agents. The goal for the associated weaker problem
is to generate an output that is an optimum of a convex combination of the
non-faulty agents' local cost functions. We present an iterative algorithm,
where only local computation is needed and only one message per agent is sent
in each iteration, that ensures that the weights of a guaranteed number of
agents are bounded away from zero.
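As a loose illustration of the crash-fault setting (our own toy construction, not the paper's algorithm), consider scalar-valued agents that each take a gradient step on a local quadratic cost and then average the estimates of the agents that are still alive:

```python
import numpy as np

# Toy sketch: agent i holds local cost (x - m_i)^2 / 2, so the optimum of the
# averaged costs over the alive agents is the mean of their minimizers m_i.
minima = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # local minimizers m_i
alive = np.array([True, True, True, True, False])  # agent 4 crashes at the start
x = np.zeros(5)                                    # each agent's scalar estimate
eta = 0.1
for _ in range(200):
    grad = x - minima                              # gradient of each local cost
    x = np.where(alive, x - eta * grad, x)         # crashed agents stop updating
    x = np.where(alive, x[alive].mean(), x)        # consensus over alive agents
print(x[alive])   # alive agents agree on the optimum of their averaged costs
```

With the crashed agent excluded, the alive agents converge to the mean of their own minimizers (here 1.5), i.e., the crashed agent's cost is simply unrepresented.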
Defending Non-Bayesian Learning against Adversarial Attacks
This paper addresses the problem of non-Bayesian learning over multi-agent
networks, where agents repeatedly collect partially informative observations
about an unknown state of the world, and try to collaboratively learn the true
state. We focus on the impact of the adversarial agents on the performance of
consensus-based non-Bayesian learning, where non-faulty agents combine local
learning updates with consensus primitives. In particular, we consider the
scenario where an unknown subset of agents suffer Byzantine faults -- agents
suffering Byzantine faults behave arbitrarily. Two different learning rules are
proposed.
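To make the setting concrete, here is a minimal fault-free sketch of consensus-based non-Bayesian learning, in which each agent mixes a local Bayesian-style belief update with geometric averaging of its neighbors' beliefs. The two-hypothesis setup, the likelihoods, and the complete-graph weights are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, true_state = 4, 0
# Per-agent probability of observing 1 under each state; no single agent can
# identify the true state on its own, but collectively the agents can.
lik = np.array([[0.8, 0.8], [0.3, 0.7], [0.6, 0.2], [0.5, 0.5]])
beliefs = np.full((n_agents, 2), 0.5)
W = np.full((n_agents, n_agents), 1.0 / n_agents)  # complete-graph consensus weights
for _ in range(300):
    obs = (rng.random(n_agents) < lik[:, true_state]).astype(int)
    like = np.where(obs[:, None] == 1, lik, 1 - lik)  # P(obs | each state)
    beliefs = beliefs * like                          # local Bayesian-style update
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    beliefs = np.exp(W @ np.log(beliefs))             # geometric averaging (consensus)
    beliefs /= beliefs.sum(axis=1, keepdims=True)
print(beliefs[:, true_state])   # every agent's belief in the true state
```

Even the agents whose own observations are uninformative end up learning the true state, because the consensus step propagates the informative agents' evidence.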
Collaboratively Learning the Best Option, Using Bounded Memory
We consider multi-armed bandit problems in social groups wherein each
individual has bounded memory and shares the common goal of learning the best
arm/option. We say an individual learns the best option if eventually (as time
goes to infinity) it pulls only the arm with the highest average reward. While this
goal is provably impossible for an isolated individual, we show that, in social
groups, this goal can be achieved easily with the aid of social persuasion,
i.e., communication. Specifically, we study the learning dynamics wherein an
individual sequentially decides on which arm to pull next based on not only its
private reward feedback but also the suggestions provided by randomly chosen
peers. Our learning dynamics are hard to analyze via explicit probabilistic
calculations due to the stochastic dependency induced by social interaction.
Instead, we employ the mean-field approximation method from statistical physics
and we show:
(1) With probability approaching one as the social group size grows, every
individual in the social group learns the best option.
(2) Over an arbitrary finite time horizon, with high probability (in the group
size), the fraction of individuals that prefer the best option grows to 1
exponentially fast in time.
A major innovation of our mean-field analysis is a simple yet powerful
technique to deal with absorbing states in the interchange of the
large-population and long-time limits. The mean-field approximation method allows us to
approximate the probabilistic sample paths of our learning dynamics by a
deterministic and smooth trajectory that corresponds to the unique solution of
a well-behaved system of ordinary differential equations (ODEs). Such an
approximation is desired because the analysis of a system of ODEs is relatively
easier than that of the original stochastic system.
Comment: Authors' comments: This is a preliminary preprint of our work on
complete graphs. New aspects of our approach on general graphs have moved to:
Collaboratively Learning the Best Option on Graphs, Using Bounded Local
Memory, arXiv:1811.0396
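A toy simulation conveys the flavor of such bounded-memory dynamics. The imitation rule below (switch to a uniformly random peer's preferred arm after a failed pull) is a simplification of ours, not the paper's exact learning dynamics:

```python
import numpy as np

# Toy sketch: each individual remembers only its currently preferred arm.
# It pulls that arm; on a failure (reward 0) it adopts the preference of a
# uniformly random peer (social persuasion on a complete graph).
rng = np.random.default_rng(2)
N, T = 5000, 60
p = np.array([0.9, 0.5, 0.4])            # arm success probabilities; arm 0 is best
pref = rng.integers(0, len(p), size=N)   # initial preferences, uniform over arms
for _ in range(T):
    success = rng.random(N) < p[pref]
    peers = rng.integers(0, N, size=N)
    pref = np.where(success, pref, pref[peers])  # imitate a random peer on failure
frac_best = (pref == 0).mean()
print(frac_best)   # fraction of the group preferring the best arm
```

In the mean-field limit, the fraction preferring arm i grows in proportion to its success probability, so the best arm's share grows to one exponentially fast, mirroring result (2) of the abstract.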
Collaboratively Learning the Best Option on Graphs, Using Bounded Local Memory
We consider multi-armed bandit problems in social groups wherein each
individual has bounded memory and shares the common goal of learning the best
arm/option. We say an individual learns the best option if eventually (as time
goes to infinity) it pulls only the arm with the highest expected reward. While this
goal is provably impossible for an isolated individual due to bounded memory,
we show that, in social groups, this goal can be achieved easily with the aid
of social persuasion (i.e., communication) as long as the communication
networks/graphs satisfy some mild conditions. To deal with the interplay
between the randomness in the rewards and in the social interaction, we employ
the {\em mean-field approximation} method. Considering the possibility that the
individuals in the networks may not be exchangeable when the communication
networks are not cliques, we go beyond the classic mean-field techniques and
apply a refined version of mean-field approximation:
(1) Using coupling we show that, if the communication graph is connected and
is either regular or has doubly-stochastic degree-weighted adjacency matrix,
with probability approaching one as the social group size grows, every
individual in the social group learns the best option.
(2) If the minimum degree of the graph diverges as the group size grows, over an
arbitrary but given finite time horizon, the sample paths describing the
opinion evolutions of the individuals are asymptotically independent. In
addition, the proportions of the population with different opinions converge to
the unique solution of a system of ODEs. In the solution of the obtained ODEs,
the proportion of the population holding the correct opinion converges to one
exponentially fast in time.
Notably, our results hold even if the communication graphs are highly sparse.
Comment: arXiv admin note: text overlap with arXiv:1802.08159. Authors' note:
This work shares some overlap with our preliminary preprint arXiv:1802.08159,
which focuses on complete graphs. arXiv:1802.08159 is combined with this work.
Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent
We consider the problem of distributed statistical machine learning in
adversarial settings, where some unknown and time-varying subset of working
machines may be compromised and behave arbitrarily to prevent an accurate model
from being learned. This setting captures the potential adversarial attacks
faced by Federated Learning -- a modern machine learning paradigm that is
proposed by Google researchers and has been intensively studied for ensuring
user privacy. Formally, we focus on a distributed system consisting of a
parameter server and a collection of working machines. Each working machine
keeps an equal share of the data samples. The goal is to collectively learn the
underlying true model parameter.
In classical batch gradient descent methods, the gradients reported to the
server by the working machines are aggregated via simple averaging, which is
vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine
gradient descent method based on the geometric median of means of the
gradients. We show that our method can tolerate Byzantine failures on up to a
constant fraction of the working machines, and that the parameter estimate
converges in logarithmically many rounds with an estimation error approaching
the optimal error rate of the centralized and failure-free setting. The
computational overhead of our algorithm at each working machine and at the
central server is modest, as is the total communication cost. We further provide an application of
our general results to the linear regression problem.
A key challenge in the above problem is that Byzantine failures create
arbitrary and unspecified dependency among the iterations and the aggregated
gradients. We prove that the aggregated gradient converges uniformly to the
true gradient function.
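The aggregation rule can be sketched as follows: partition the reported gradients into batches, average within each batch, then take the geometric median of the batch means (computed here with Weiszfeld's algorithm). The batch count, data, and attack values are our own illustrative choices:

```python
import numpy as np

def geometric_median(points, iters=100):
    """Weiszfeld's algorithm for the geometric median (minimal sketch)."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - z, axis=1), 1e-12)
        w = 1.0 / d
        z = (w[:, None] * points).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(3)
true_grad = np.array([1.0, -1.0])
honest = true_grad + 0.1 * rng.standard_normal((18, 2))  # 18 honest reports
byzantine = np.full((2, 2), 1e3)                         # 2 adversarial reports
grads = np.vstack([honest, byzantine])
rng.shuffle(grads)
k = 5                                          # number of batches (our choice)
means = grads.reshape(k, -1, 2).mean(axis=1)   # batch means (geometric median of means)
agg = geometric_median(means)
print(agg)   # close to the true gradient despite the 2 corrupted machines
```

At most two of the five batch means can be corrupted, and the geometric median of five points is unaffected by a minority of arbitrarily placed outliers, which is the robustness property the method exploits.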
Spike-Based Winner-Take-All Computation: Fundamental Limits and Order-Optimal Circuits
Winner-Take-All (WTA) refers to the neural operation that selects a
(typically small) group of neurons from a large neuron pool. It is conjectured
to underlie many of the brain's fundamental computational abilities. However,
not much is known about the robustness of a spike-based WTA network to the
inherent randomness of the input spike trains. In this work, we consider a
spike-based WTA model wherein randomly generated input spike trains compete
with each other based on their underlying statistics, and a prescribed number
of winners are supposed to be selected. We slot time evenly and model the input
spike trains as independent
Bernoulli processes. The Bernoulli process is a good approximation of the
popular Poisson process but is more biologically relevant as it takes the
refractory periods into account. Due to the randomness in the input spike
trains, no circuits can guarantee to successfully select the correct winners in
finite time. We focus on analytically characterizing the minimal amount of time
needed so that a target minimax decision accuracy (success probability) can be
reached.
We first derive an information-theoretic lower bound on the decision time: to
attain a target (minimax) decision error, the computation time of any WTA
circuit is at least a quantity governed by a difficulty parameter of the WTA
task that does not depend on the circuit used. We then design a simple WTA
circuit and show that, for any fixed decision error, its decision time is
order-optimal in its scaling in the problem parameters.
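As a toy illustration of the model, one can simulate Bernoulli spike trains and a simple count-based k-WTA rule that, given enough time slots, selects the highest-rate inputs with high probability. The rates, slot count, and decision rule are our own illustrative choices:

```python
import numpy as np

# Toy sketch: n independent Bernoulli spike trains with distinct per-slot spike
# probabilities; after T slots, declare the k lines with the most spikes the
# winners. Larger T trades decision time for decision accuracy.
rng = np.random.default_rng(4)
n, k, T = 10, 2, 10000
rates = np.linspace(0.05, 0.5, n)        # per-slot spike probabilities per line
spikes = rng.random((T, n)) < rates      # independent Bernoulli spike trains
counts = spikes.sum(axis=0)
winners = set(np.argsort(counts)[-k:])   # k highest spike counts win
print(winners)                           # the k highest-rate lines, w.h.p.
```

Because spike counts concentrate around T times the rates, the required T grows as the gap between the k-th and (k+1)-th largest rates shrinks, which is the kind of task-difficulty dependence the abstract's bounds capture.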
…