693 research outputs found

    On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

    Full text link
We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) activation using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of the network prediction errors across GD iterations, which can be neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate {\em an} integral operator that is determined by the feature vector distribution $\rho$ only. Consequently, the GD method can be viewed as {\em approximately} applying the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate that is determined by $f^*$ and $\rho$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f^*$ has zero low-rank approximation error, then, as long as the width of the neural network is $\Omega(n\log n)$, the empirical risk decreases to $\Theta(1/\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where $\rho$ is the uniform distribution on the sphere and $f^*$ is a polynomial. Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.
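To make the training setup concrete, here is a minimal Python sketch (not the paper's construction) that trains an over-parameterized two-layer ReLU network with full-batch GD and tracks the empirical risk; the width, learning rate, and polynomial target below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch (assumed setup): full-batch GD on an over-parameterized
# two-layer ReLU network, tracking the prediction-error vector.
rng = np.random.default_rng(0)
n, d, width = 50, 5, 2000          # samples, input dim, hidden width

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # features on the sphere
f_star = lambda x: x[:, 0] ** 2 - x[:, 1]        # a polynomial target
y = f_star(X)

W = rng.normal(size=(width, d)) / np.sqrt(d)     # first layer (trained)
a = rng.choice([-1.0, 1.0], size=width)          # fixed output weights

def predict(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(width)

lr = 1.0
for t in range(200):
    err = predict(W) - y                 # prediction-error vector
    act = (X @ W.T > 0).astype(float)    # ReLU activation pattern
    # gradient of the empirical risk 0.5 * ||err||^2 / n w.r.t. W
    grad = ((act * np.outer(err, a)).T @ X) / (n * np.sqrt(width))
    W -= lr * grad
    if t % 50 == 0:
        print(t, 0.5 * np.mean(err ** 2))
```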

    Finite-time Guarantees for Byzantine-Resilient Distributed State Estimation with Noisy Measurements

    Full text link
This work considers resilient, cooperative state estimation in unreliable multi-agent networks. A network of agents aims to collaboratively estimate the value of an unknown vector parameter, while an {\em unknown} subset of agents suffer Byzantine faults. Faulty agents malfunction arbitrarily and may send out {\em highly unstructured} messages to other agents in the network. As opposed to fault-free networks, reaching agreement in the presence of Byzantine faults is far from trivial. In this paper, we propose a computationally efficient algorithm that is provably robust to Byzantine faults. At each iteration of the algorithm, a good agent (1) performs a gradient descent update based on noisy local measurements, (2) exchanges its update with other agents in its neighborhood, and (3) robustly aggregates the received messages using coordinate-wise trimmed means. Under mild technical assumptions, we establish that good agents learn the true parameter asymptotically in the almost sure sense. We further complement our analysis by proving a (high-probability) {\em finite-time} convergence rate that encapsulates network characteristics.
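The robust-aggregation step (3) can be illustrated with a short Python sketch; the trimming level b and the message layout are assumptions for the example, not values from the paper:

```python
import numpy as np

# Coordinate-wise trimmed mean: for each coordinate, drop the b largest
# and b smallest received values, then average the rest.
def trimmed_mean(messages: np.ndarray, b: int) -> np.ndarray:
    """messages: (num_neighbors, d) array of received estimates."""
    sorted_msgs = np.sort(messages, axis=0)      # sort each coordinate
    kept = sorted_msgs[b:messages.shape[0] - b]  # drop b extremes per side
    return kept.mean(axis=0)

# Example: 7 neighbors, 2 of them Byzantine, d = 3.
rng = np.random.default_rng(1)
honest = rng.normal(loc=1.0, scale=0.1, size=(5, 3))
byzantine = np.full((2, 3), 1e6)                 # wildly corrupted messages
print(trimmed_mean(np.vstack([honest, byzantine]), b=2))  # ≈ [1, 1, 1]
```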

    Securing Distributed Gradient Descent in High Dimensional Statistical Learning

    Full text link
We consider unreliable distributed learning systems wherein the training data is kept confidential by external workers, and the learner has to interact closely with those workers to train a model. In particular, we assume that there exists a system adversary that can adaptively compromise some workers; the compromised workers deviate from their designed local specifications by sending out arbitrarily malicious messages. We assume that in each communication round, up to $q$ out of the $m$ workers suffer Byzantine faults. Each worker keeps a local sample of size $n$, and the total sample size is $N=nm$. We propose a secured variant of the gradient descent method that can tolerate up to a constant fraction of Byzantine workers, i.e., $q/m = O(1)$. Moreover, we show that the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, where $d$ is the model dimension. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. Our results are obtained under some technical assumptions. Specifically, we assume a strongly convex population risk. Nevertheless, the empirical risk (sample version) is allowed to be non-convex. The core of our method is to robustly aggregate the gradients computed by the workers based on the filtering procedure proposed by Steinhardt et al. On the technical front, deviating from the existing literature on robustly estimating a finite-dimensional mean vector, we establish a {\em uniform} concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function. To obtain a near-optimal uniform concentration bound, we develop a new matrix concentration inequality, which might be of independent interest.
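The following is a much-simplified caricature, in Python, of spectral-filtering-style robust aggregation in the spirit of Steinhardt et al. (not the paper's exact procedure); the thresholds and loop structure are assumptions:

```python
import numpy as np

# Caricature of spectral filtering: repeatedly center the gradients,
# find the top principal direction, and remove points whose projection
# along it is abnormally large; then average the survivors.
def filtered_mean(grads: np.ndarray, max_iters: int = 10) -> np.ndarray:
    """grads: (m, d) array, one gradient per worker."""
    g = grads.copy()
    for _ in range(max_iters):
        mu = g.mean(axis=0)
        centered = g - mu
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[0]              # top principal direction
        scores = proj ** 2
        cutoff = 4.0 * np.median(scores)     # assumed outlier threshold
        keep = scores <= cutoff
        if keep.all():
            break
        g = g[keep]
    return g.mean(axis=0)

rng = np.random.default_rng(2)
honest = rng.normal(size=(18, 10))
attack = np.tile(rng.normal(size=10) * 50, (6, 1))   # colluding outliers
print(np.linalg.norm(
    filtered_mean(np.vstack([honest, attack])) - honest.mean(axis=0)))
```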

    Byzantine Multi-Agent Optimization: Part I

    Full text link
We study Byzantine fault-tolerant distributed optimization of a sum of convex (cost) functions with real-valued scalar input/output. In particular, the goal is to optimize a global cost function $\frac{1}{|\mathcal{N}|}\sum_{i\in \mathcal{N}} h_i(x)$, where $\mathcal{N}$ is the set of non-faulty agents, and $h_i(x)$ is agent $i$'s local cost function, which is initially known only to agent $i$. In general, when some of the agents may be Byzantine faulty, the above goal is unachievable, because the identity of the faulty agents is not necessarily known to the non-faulty agents, and the faulty agents may behave arbitrarily. Since the above global cost function cannot be optimized exactly in the presence of Byzantine agents, we define a weaker version of the problem. The goal for the weaker problem is to generate an output that is an optimum of a function formed as a convex combination of the local cost functions of the non-faulty agents. More precisely, for some choice of weights $\alpha_i$ for $i\in \mathcal{N}$ such that $\alpha_i\geq 0$ and $\sum_{i\in \mathcal{N}}\alpha_i=1$, the output must be an optimum of the cost function $\sum_{i\in \mathcal{N}} \alpha_i h_i(x)$. Ideally, we would like $\alpha_i=\frac{1}{|\mathcal{N}|}$ for all $i\in \mathcal{N}$; however, this cannot be guaranteed due to the presence of faulty agents. In fact, we show that the maximum achievable number of nonzero weights ($\alpha_i$'s) is $|\mathcal{N}|-f$, where $f$ is the upper bound on the number of Byzantine agents. In addition, we present algorithms that ensure that at least $|\mathcal{N}|-f$ agents have weights that are bounded away from $0$. We also propose a low-complexity suboptimal algorithm, which ensures that at least $\lceil \frac{n}{2}\rceil-\phi$ agents have weights that are bounded away from $0$, where $n$ is the total number of agents, and $\phi$ ($\phi\le f$) is the actual number of Byzantine agents.
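One simple algorithmic template consistent with this abstract (an assumption for illustration, not necessarily the paper's algorithm) is to trim the $f$ largest and $f$ smallest reported gradients at each step and descend along the average of the survivors:

```python
import numpy as np

# Scalar fault-tolerant descent: collect all agents' gradients at the
# current point x, trim f per side, and step along the trimmed mean.
# The limit point optimizes a convex combination of non-faulty costs.
f = 1                                        # bound on Byzantine agents
grad_fns = [lambda x, c=c: 2 * (x - c)       # h_i(x) = (x - c_i)^2
            for c in [0.0, 1.0, 2.0, 3.0]]

def byzantine_grad(x):                       # adversarial agent
    return 1e3 * np.sign(x)

x, lr = 10.0, 0.05
for t in range(500):
    reports = [g(x) for g in grad_fns] + [byzantine_grad(x)]
    reports.sort()
    kept = reports[f:len(reports) - f]       # trim f per side
    x -= lr * float(np.mean(kept))
print(x)   # lands near the minimizer of a weighted sum of good costs
```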

    Fault-Tolerant Multi-Agent Optimization: Part III

    Full text link
We study fault-tolerant distributed optimization of a sum of convex (cost) functions with real-valued scalar input/output in the presence of crash faults or Byzantine faults. In particular, the goal is to optimize a global cost function $\frac{1}{n}\sum_{i\in \mathcal{V}} h_i(x)$, where $\mathcal{V}=\{1, \ldots, n\}$ is the collection of agents, and $h_i(x)$ is agent $i$'s local cost function, which is initially known only to agent $i$. Since the above global cost function cannot be optimized exactly in the presence of crash faults or Byzantine faults, we define two weaker versions of the problem for crash faults and Byzantine faults, respectively. When some agents may crash, the goal for the weaker problem is to generate an output that is an optimum of a function formed as $C\left(\sum_{i\in \mathcal{N}} h_i(x)+\sum_{i\in \mathcal{F}} \alpha_i h_i(x)\right)$, where $\mathcal{N}$ is the set of non-faulty agents, $\mathcal{F}$ is the set of faulty (crashed) agents, $0\le \alpha_i\le 1$ for each $i\in \mathcal{F}$, and $C$ is a normalization constant such that $C(|\mathcal{N}|+\sum_{i\in \mathcal{F}} \alpha_i)=1$. We present an iterative algorithm in which each agent only needs to perform local computation and send one message per iteration. When some agents may be Byzantine, the system cannot take full advantage of the data kept by the non-faulty agents. The goal for the associated weaker problem is to generate an output that is an optimum of a function formed as $\sum_{i\in \mathcal{N}}\alpha_i h_i(x)$, such that $\alpha_i\geq 0$ for each $i\in \mathcal{N}$ and $\sum_{i\in \mathcal{N}}\alpha_i=1$. We present an iterative algorithm, where only local computation is needed and only one message per agent is sent in each iteration, that ensures that at least $|\mathcal{N}|-f$ agents have weights ($\alpha_i$'s) that are lower bounded by $\frac{1}{2(|\mathcal{N}|-f)}$.
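As a tiny numeric illustration (with assumed costs and weights) of the crash-fault objective and its normalization constant $C$:

```python
import numpy as np

# Effective crash-fault objective: C * (sum of non-faulty costs +
# alpha-weighted faulty costs), with C chosen so the weights sum to
# one, i.e. C * (|N| + sum_i alpha_i) = 1.
centers = np.array([0.0, 1.0, 2.0, 3.0])     # h_i(x) = (x - c_i)^2
faulty = [3]                                  # agent 3 crashed mid-run
alpha = {3: 0.4}                              # its partial contribution

good = [i for i in range(len(centers)) if i not in faulty]
C = 1.0 / (len(good) + sum(alpha.values()))   # normalization constant

def effective_objective(x):
    total = sum((x - centers[i]) ** 2 for i in good)
    total += sum(a * (x - centers[i]) ** 2 for i, a in alpha.items())
    return C * total

# The minimizer is the weight-averaged center:
xs = np.linspace(-1, 4, 2001)
print(xs[np.argmin(effective_objective(xs))])  # ≈ (0+1+2+0.4*3)/3.4
```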

    Defending Non-Bayesian Learning against Adversarial Attacks

    Full text link
This paper addresses the problem of non-Bayesian learning over multi-agent networks, where agents repeatedly collect partially informative observations about an unknown state of the world and try to collaboratively learn the true state. We focus on the impact of adversarial agents on the performance of consensus-based non-Bayesian learning, in which non-faulty agents combine local learning updates with consensus primitives. In particular, we consider the scenario where an unknown subset of agents suffer Byzantine faults; agents suffering Byzantine faults behave arbitrarily. Two different learning rules are proposed.

    Collaboratively Learning the Best Option, Using Bounded Memory

    Full text link
We consider multi-armed bandit problems in social groups wherein each individual has bounded memory and shares the common goal of learning the best arm/option. We say an individual learns the best option if eventually (as $t \to \infty$) it pulls only the arm with the highest average reward. While this goal is provably impossible for an isolated individual, we show that, in social groups, this goal can be achieved easily with the aid of social persuasion, i.e., communication. Specifically, we study the learning dynamics wherein an individual sequentially decides which arm to pull next based on not only its private reward feedback but also the suggestions provided by randomly chosen peers. Our learning dynamics are hard to analyze via explicit probabilistic calculations due to the stochastic dependency induced by social interaction. Instead, we employ the mean-field approximation method from statistical physics and show: (1) With probability $\to 1$ as the social group size $N \to \infty$, every individual in the social group learns the best option. (2) Over an arbitrary finite time horizon $[0, T]$, with high probability (in $N$), the fraction of individuals that prefer the best option grows to 1 exponentially fast as $t$ increases ($t\in [0, T]$). A major innovation of our mean-field analysis is a simple yet powerful technique to deal with absorbing states in the interchange of the limits $N \to \infty$ and $t \to \infty$. The mean-field approximation method allows us to approximate the probabilistic sample paths of our learning dynamics by a deterministic and smooth trajectory that corresponds to the unique solution of a well-behaved system of ordinary differential equations (ODEs). Such an approximation is desirable because the analysis of a system of ODEs is relatively easier than that of the original stochastic system.
Comment: Authors' comment: This is a preliminary preprint of our work on complete graphs. New aspects of our approach on general graphs have moved to: Collaboratively Learning the Best Option on Graphs, Using Bounded Local Memory, arXiv:1811.0396
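A toy simulation of such dynamics (with an assumed adoption rule: on a failed pull, copy a uniformly random peer's preference) shows the fraction preferring the best arm rising toward 1:

```python
import numpy as np

# Bounded-memory bandit learning with social persuasion: each
# individual remembers only its currently preferred arm; on a failed
# pull it adopts the preference of a uniformly random peer.
rng = np.random.default_rng(3)
N, T = 10_000, 60
p = np.array([0.9, 0.5])            # arm success probabilities; arm 0 best
pref = rng.integers(0, 2, size=N)   # bounded memory: one remembered arm

for t in range(T):
    rewards = rng.random(N) < p[pref]
    peers = rng.integers(0, N, size=N)
    # on failure, copy a random peer's current preference
    pref = np.where(rewards, pref, pref[peers])
    if t % 20 == 0:
        print(t, (pref == 0).mean())  # fraction preferring the best arm
```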

    Collaboratively Learning the Best Option on Graphs, Using Bounded Local Memory

    Full text link
We consider multi-armed bandit problems in social groups wherein each individual has bounded memory and shares the common goal of learning the best arm/option. We say an individual learns the best option if eventually (as $t\to \infty$) it pulls only the arm with the highest expected reward. While this goal is provably impossible for an isolated individual due to bounded memory, we show that, in social groups, this goal can be achieved easily with the aid of social persuasion (i.e., communication) as long as the communication networks/graphs satisfy some mild conditions. To deal with the interplay between the randomness in the rewards and in the social interaction, we employ the {\em mean-field approximation} method. Considering the possibility that the individuals in the networks may not be exchangeable when the communication networks are not cliques, we go beyond the classic mean-field techniques and apply a refined version of mean-field approximation: (1) Using coupling, we show that if the communication graph is connected and is either regular or has a doubly-stochastic degree-weighted adjacency matrix, then with probability $\to 1$ as the social group size $N \to \infty$, every individual in the social group learns the best option. (2) If the minimum degree of the graph diverges as $N \to \infty$, then over an arbitrary but given finite time horizon, the sample paths describing the opinion evolutions of the individuals are asymptotically independent. In addition, the proportions of the population with different opinions converge to the unique solution of a system of ODEs. In the solution of the obtained ODEs, the proportion of the population holding the correct opinion converges to $1$ exponentially fast in time. Notably, our results hold even if the communication graphs are highly sparse.
Comment: arXiv admin note: text overlap with arXiv:1802.08159. Authors' note: This work shares some overlap with our preliminary preprint arXiv:1802.08159, which focuses on complete graphs. arXiv:1802.08159 is combined with this work.
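The doubly-stochastic condition can be checked numerically; the sketch below assumes one common reading of the degree-weighted adjacency matrix, namely $W = D^{-1}A$ with $D$ the degree matrix, which is an assumption rather than the paper's definition:

```python
import numpy as np

# Check whether the degree-weighted adjacency matrix (assumed reading:
# W = D^{-1} A, rows normalized by degree) is doubly stochastic, i.e.
# every row AND every column sums to 1. Regular graphs pass trivially.
def degree_weighted(A: np.ndarray) -> np.ndarray:
    return A / A.sum(axis=1, keepdims=True)

def is_doubly_stochastic(W: np.ndarray, tol: float = 1e-9) -> bool:
    return (np.allclose(W.sum(axis=1), 1.0, atol=tol)
            and np.allclose(W.sum(axis=0), 1.0, atol=tol))

cycle = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
star = np.zeros((5, 5)); star[0, 1:] = star[1:, 0] = 1.0

print(is_doubly_stochastic(degree_weighted(cycle)))  # True: 2-regular
print(is_doubly_stochastic(degree_weighted(star)))   # False: irregular
```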

    Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

    Full text link
We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning, a modern machine learning paradigm that was proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ of the centralized and failure-free setting. The total computational complexity of our algorithm is $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is $O(md \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge in the above problem is that Byzantine failures create arbitrary and unspecified dependencies among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.
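The aggregation rule named in the abstract, the geometric median of means, admits a short sketch; the batch count $k$ and the use of Weiszfeld's algorithm for the geometric median are illustrative choices:

```python
import numpy as np

# Geometric median of means: partition the m workers' gradients into
# k batches, average within each batch, then take the geometric median
# of the k batch means (here via Weiszfeld's iteration).
def geometric_median(points: np.ndarray, iters: int = 500) -> np.ndarray:
    z = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - z, axis=1), 1e-12)
        w = 1.0 / dist
        z = (points * w[:, None]).sum(axis=0) / w.sum()
    return z

def median_of_means(grads: np.ndarray, k: int) -> np.ndarray:
    batches = np.array_split(grads, k)
    means = np.stack([b.mean(axis=0) for b in batches])
    return geometric_median(means)

rng = np.random.default_rng(4)
good = rng.normal(loc=1.0, size=(24, 5))
bad = np.full((5, 5), -1e6)                  # q = 5 < (m-1)/2 with m = 29
grads = rng.permutation(np.vstack([good, bad]))
print(np.mean(grads, axis=0))                # plain average is destroyed
print(median_of_means(grads, k=19))          # stays near the honest mean ~1
```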

    Spike-Based Winner-Take-All Computation: Fundamental Limits and Order-Optimal Circuits

    Full text link
Winner-Take-All (WTA) refers to the neural operation that selects a (typically small) group of neurons from a large neuron pool. It is conjectured to underlie many of the brain's fundamental computational abilities. However, not much is known about the robustness of a spike-based WTA network to the inherent randomness of the input spike trains. In this work, we consider a spike-based $k$-WTA model wherein $n$ randomly generated input spike trains compete with each other based on their underlying statistics, and $k$ winners are supposed to be selected. We slot time evenly, with each time slot of length $1\,\mathrm{ms}$, and model the $n$ input spike trains as $n$ independent Bernoulli processes. The Bernoulli process is a good approximation of the popular Poisson process but is more biologically relevant, as it takes the refractory periods into account. Due to the randomness in the input spike trains, no circuit can guarantee to successfully select the correct winners in finite time. We focus on analytically characterizing the minimal amount of time needed so that a target minimax decision accuracy (success probability) can be reached. We first derive an information-theoretic lower bound on the decision time. We show that to have a (minimax) decision error $\le \delta$ (where $\delta \in (0,1)$), the computation time of any WTA circuit is at least $\big((1-\delta) \log(k(n-k)+1) -1\big)T_{\mathcal{R}}$, where $T_{\mathcal{R}}$ is a difficulty parameter of a WTA task that is independent of $\delta$, $n$, and $k$. We then design a simple WTA circuit whose decision time is $O\big((\log\frac{1}{\delta}+\log k(n-k))T_{\mathcal{R}}\big)$. It turns out that for any fixed $\delta \in (0,1)$, this decision time is order-optimal in terms of its scaling in $n$, $k$, and $T_{\mathcal{R}}$.
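A small simulation (with an assumed count-and-compare decision rule, not the paper's circuit) illustrates the time/accuracy trade-off:

```python
import numpy as np

# Model n input spike trains as independent Bernoulli processes over
# 1 ms slots; select the k winners by spike counts after a fixed
# decision time T. Longer T drives the decision error down.
rng = np.random.default_rng(5)
n, k = 8, 2
rates = np.array([0.30, 0.28, 0.20, 0.18, 0.15, 0.12, 0.10, 0.08])

def wta_error(T: int, trials: int = 2000) -> float:
    wrong = 0
    for _ in range(trials):
        spikes = rng.random((T, n)) < rates       # Bernoulli spike trains
        counts = spikes.sum(axis=0)
        winners = set(np.argsort(counts)[-k:])
        wrong += winners != {0, 1}                # true top-k inputs
    return wrong / trials

for T in (50, 200, 800):
    print(T, wta_error(T))   # decision error shrinks as T grows
```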