    Distributed Stochastic Optimization over Time-Varying Noisy Network

    This paper is concerned with the distributed stochastic multi-agent optimization problem over a class of time-varying networks with slowly decreasing communication noise effects. The problem is considered in the composite optimization setting, which is more general in noisy network optimization. It is noteworthy that existing methods for noisy network optimization are Euclidean projection based. We present two related classes of non-Euclidean methods and investigate their convergence behavior. One is a distributed stochastic composite mirror descent type method (DSCMD-N), which provides a more general algorithmic framework than earlier works in this literature. As a counterpart, we also consider a composite dual averaging type method (DSCDA-N) for noisy network optimization. The main error bounds for DSCMD-N and DSCDA-N are obtained. The trade-off among stepsizes, noise decreasing rates, and convergence rates of the algorithms is analyzed in detail. To the best of our knowledge, this is the first work to analyze and derive convergence rates of optimization algorithms in noisy network optimization. We show that an optimal rate of $O(1/\sqrt{T})$ in nonsmooth convex optimization can be obtained for the proposed methods under an appropriate communication noise condition. Moreover, convergence rates of different orders are comprehensively derived in both the expectation convergence and high probability convergence sense. Comment: 27 pages
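To illustrate what "non-Euclidean" means in contrast to Euclidean projection, the following is a minimal sketch of one mirror descent step with the negative-entropy mirror map on the probability simplex (often called exponentiated gradient). It is not the DSCMD-N or DSCDA-N method from the abstract: the feasible set, the function name `entropic_mirror_step`, and the example values are assumptions chosen for illustration, and the network averaging and communication noise are omitted.

```python
import math

def entropic_mirror_step(x, grad, stepsize):
    """One mirror descent step on the probability simplex.

    Assumption for illustration: the mirror map is negative entropy,
    so the update is multiplicative and needs no Euclidean projection.
    """
    # Multiplicative update: scale each coordinate by exp(-stepsize * gradient).
    weights = [xi * math.exp(-stepsize * gi) for xi, gi in zip(x, grad)]
    # Normalizing is the Bregman (KL) projection back onto the simplex.
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: uniform start, gradient penalizing the first coordinate.
x0 = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
x1 = entropic_mirror_step(x0, grad=[1.0, 0.0, 0.0], stepsize=0.5)
```

After the step, the iterate stays on the simplex and mass moves away from the coordinate with the larger gradient, without any explicit Euclidean projection.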

    Cooperative Online Learning: Keeping your Neighbors Updated

    We study an asynchronous online learning setting with a network of agents. At each time step, some of the agents are activated, requested to make a prediction, and pay the corresponding loss. The loss function is then revealed to these agents and also to their neighbors in the network. Our results characterize how much knowing the network structure affects the regret as a function of the model of agent activations. When activations are stochastic, the optimal regret (up to constant factors) is shown to be of order $\sqrt{\alpha T}$, where $T$ is the horizon and $\alpha$ is the independence number of the network. We prove that the upper bound is achieved even when agents have no information about the network structure. When activations are adversarial the situation changes dramatically: if agents ignore the network structure, an $\Omega(T)$ lower bound on the regret can be proven, showing that learning is impossible. However, when agents can choose to ignore some of their neighbors based on the knowledge of the network structure, we prove an $O(\sqrt{\overline{\chi} T})$ sublinear regret bound, where $\overline{\chi} \ge \alpha$ is the clique-covering number of the network.

    D$^2$: Decentralized Training over Decentralized Data

    When training a machine learning model using multiple workers, each of which collects data from its own data sources, it would be most useful if the data collected by different workers could be unique and different. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are not too different. In this paper, we ask the question: can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? We present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D$^2$ is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left(\frac{\sigma}{\sqrt{nT}} + \frac{(n\zeta^2)^{1/3}}{T^{2/3}}\right)$ to $O\left(\frac{\sigma}{\sqrt{nT}}\right)$, where $\zeta^2$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.

    Distributed Learning with Infinitely Many Hypotheses

    We consider a distributed learning setup where a network of agents sequentially accesses realizations of a set of random variables with unknown distributions. The network objective is to find a parametrized distribution that best describes their joint observations in the sense of the Kullback-Leibler divergence. Apart from recent efforts in the literature, we analyze the case of countably many hypotheses and the case of a continuum of hypotheses. We provide non-asymptotic bounds for the concentration rate of the agents' beliefs around the correct hypothesis in terms of the number of agents, the network parameters, and the learning abilities of the agents. Additionally, we provide a novel motivation for a general set of distributed non-Bayesian update rules as instances of the distributed stochastic mirror descent algorithm. Comment: Submitted to CDC201
