
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates

    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms with no need to tune a learning rate.
    Comment: ICML 202
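
    A minimal sketch may clarify how coin betting removes the learning rate from a ParVI update. Everything below (the fixed RBF bandwidth, the running bound `L`, the toy Gaussian target, and this particular coin-betting variant) is an illustrative assumption, not the paper's exact algorithm:

```python
import numpy as np

def svgd_direction(X, grad_logp, h=1.0):
    # Standard SVGD update direction with an RBF kernel (bandwidth h assumed fixed).
    diff = X[:, None, :] - X[None, :, :]                  # (n, n, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))    # kernel matrix
    drive = K @ grad_logp(X)                              # kernel-weighted scores
    repulse = (diff / h**2 * K[:, :, None]).sum(axis=1)   # repulsive kernel gradients
    return (drive + repulse) / X.shape[0]

def coin_svgd(grad_logp, X0, n_steps=500):
    """Learning-rate-free particle updates via coin betting (hedged sketch)."""
    X = X0.copy()
    c_sum = np.zeros_like(X0)      # running sum of "coin outcomes"
    reward = np.zeros_like(X0)     # cumulative betting reward
    L = 1e-12                      # running estimate of the outcome bound
    for t in range(1, n_steps + 1):
        c = svgd_direction(X, grad_logp)   # outcome of round t
        L = max(L, np.abs(c).max())
        reward += c * (X - X0)
        c_sum += c
        # Coin-betting wealth update; note there is no step size to tune.
        X = X0 + c_sum / (L * t) * (L + np.maximum(reward, 0.0))
    return X

# Toy usage: particles should settle around a standard 2-D Gaussian,
# whose score function is simply -x.
rng = np.random.default_rng(0)
particles = coin_svgd(lambda X: -X, rng.normal(size=(50, 2)) * 3)
```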

    Best-Arm Identification for Quantile Bandits with Privacy

    We study the best-arm identification problem in multi-armed bandits with stochastic, potentially private rewards, when the goal is to identify the arm with the highest quantile at a fixed, prescribed level. First, we propose a (non-private) successive elimination algorithm for strictly optimal best-arm identification; we show that our algorithm is $\delta$-PAC and characterize its sample complexity. Further, we provide a lower bound on the expected number of pulls, showing that the proposed algorithm is essentially optimal up to logarithmic factors. Both the upper and lower complexity bounds depend on a special definition of the associated suboptimality gap, designed specifically for the quantile bandit problem; as we show, best-arm identification becomes impossible as the gap approaches zero. Second, motivated by applications where the rewards are private, we provide a differentially private successive elimination algorithm whose sample complexity is finite even for distributions with infinite support size, and we characterize its sample complexity as well. Our algorithms do not require prior knowledge of either the suboptimality gap or other statistical information related to the bandit problem at hand.
    Comment: 24 pages, 4 figure
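
    To make the elimination scheme concrete, here is a hedged sketch of non-private successive elimination on quantiles. The DKW-based confidence radius and the batched sampling schedule are simple illustrative choices, not the paper's tuned bounds, and the sampler interface is assumed:

```python
import numpy as np

def quantile_best_arm(samplers, tau=0.5, delta=0.05, batch=200, max_rounds=100):
    """delta-PAC-style successive elimination on quantiles (illustrative sketch).

    samplers: list of zero-argument reward samplers, one per arm.
    tau: target quantile level; delta: allowed failure probability.
    """
    active = list(range(len(samplers)))
    obs = [[] for _ in samplers]
    for r in range(1, max_rounds + 1):
        for a in active:
            obs[a].extend(samplers[a]() for _ in range(batch))
        n = r * batch
        # DKW bound: the empirical CDF is uniformly within eps of the truth,
        # so the true tau-quantile lies between the empirical (tau - eps)-
        # and (tau + eps)-quantiles (union bound over arms and rounds).
        eps = np.sqrt(np.log(2 * len(samplers) * max_rounds / delta) / (2 * n))
        lo = {a: np.quantile(obs[a], max(tau - eps, 0.0)) for a in active}
        hi = {a: np.quantile(obs[a], min(tau + eps, 1.0)) for a in active}
        best_lo = max(lo.values())
        # Eliminate any arm whose upper bound falls below the best lower bound.
        active = [a for a in active if hi[a] >= best_lo]
        if len(active) == 1:
            return active[0]
    return max(active, key=lambda a: np.quantile(obs[a], tau))
```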

    Towards Understanding the Condensation of Neural Networks at Initial Training

    Implicit regularization is important for understanding the learning of neural networks (NNs). Empirical work shows that the input weights of hidden neurons (the input weight of a hidden neuron consists of the weight from the input layer to that neuron together with its bias term) condense on isolated orientations under small initialization. This condensation dynamics implies that training implicitly regularizes a NN towards one with a much smaller effective size. In this work, we use multilayer networks to show that the maximal number of condensed orientations in the initial training stage is twice the multiplicity of the activation function, where the "multiplicity" is the multiplicity of the activation function's root at the origin. Our theoretical analysis confirms experiments in two cases: one for activation functions of multiplicity one with arbitrary-dimensional input, which covers many common activation functions, and the other for layers with one-dimensional input and arbitrary multiplicity. This work takes a step towards understanding how small initialization implicitly leads NNs to condense at the initial training stage, laying a foundation for future study of the nonlinear dynamics of NNs and their implicit regularization effect at later stages of training.
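
    The phenomenon is easy to probe numerically. The toy script below (a two-layer tanh network trained by plain gradient descent; the sizes and the 1e-3 initialization scale are illustrative choices, not the paper's setup) checks whether hidden-neuron input weights align to a few orientations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 5, 50                        # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=(d, 1)))   # teacher targets

scale = 1e-3                               # small initialization drives condensation
W = rng.normal(size=(d, m)) * scale        # input weights (biases omitted for brevity)
a = rng.normal(size=(m, 1)) * scale        # output weights
lr = 0.05

for step in range(5000):                   # plain gradient descent on squared loss
    H = np.tanh(X @ W)
    err = H @ a - y
    a -= lr * H.T @ err / n
    W -= lr * X.T @ ((err @ a.T) * (1 - H**2)) / n

# tanh has multiplicity one, so at most two condensed orientations (one
# direction and its negation) are expected early in training: mean |cosine|
# between neuron directions near 1 signals condensation.
U = W / np.linalg.norm(W, axis=0, keepdims=True)
print("mean |cos| between neuron directions:", np.abs(U.T @ U).mean().round(3))
```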

    Online Learning and Bandits with Queried Hints

    We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step, the online policy can probe and find out which of a small number ($k$) of choices has better reward (or loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with $k=2$ suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from $O(\sqrt{nT})$ to $O(n^2 \log T)$, where $n$ is the number of arms and $T$ is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, $k=3$ probes suffice to achieve parameter-independent constant regret, $O(n^2)$. Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited "advice" via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.
    Comment: To appear in ITCS 202
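
    A hedged sketch of the probe model itself (not the paper's algorithm) may help fix ideas: before each pull, the policy probes $k$ arms, learns which of them has the higher realized reward this round, and then plays the winner. The uniform probing rule below is purely illustrative:

```python
import random

def play_with_probes(arms, T=10_000, k=2):
    """Simulate the k-probe hint model on stochastic arms (illustrative).

    arms: list of zero-argument reward samplers (assumed interface).
    The environment realizes this round's rewards; the hint reveals only
    which probed arm is best, and the policy receives that arm's reward.
    """
    n, total = len(arms), 0.0
    for t in range(T):
        probed = random.sample(range(n), k)        # naive uniform probing rule
        realized = {a: arms[a]() for a in probed}  # simulator internals, not observed
        best = max(realized, key=realized.get)     # the queried hint
        total += realized[best]                    # play the probe's winner
    return total / T

# Toy usage: probing lifts the average reward above the better arm's mean,
# since each round we collect the max of the probed rewards.
mean_reward = play_with_probes([lambda: random.random() * 0.5,
                                lambda: random.random()])
```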

    Differentially private sampling from distributions

    We initiate an investigation of private sampling from distributions. Given a dataset with $n$ independent observations from an unknown distribution $P$, a sampling algorithm must output a single observation from a distribution that is close in total variation distance to $P$ while satisfying differential privacy. Sampling abstracts the goal of generating small amounts of realistic-looking data. We provide tight upper and lower bounds on the dataset size needed for this task for three natural families of distributions: arbitrary distributions on $\{1,\ldots,k\}$, arbitrary product distributions on $\{0,1\}^d$, and product distributions on $\{0,1\}^d$ with the bias in each coordinate bounded away from 0 and 1. We demonstrate that, in some parameter regimes, private sampling requires asymptotically fewer observations than learning a description of $P$ nonprivately; in other regimes, however, private sampling proves to be as difficult as private learning. Notably, for some classes of distributions, the overhead in the number of observations needed for private learning compared to non-private learning is completely captured by the number of observations needed for private sampling.
    https://proceedings.neurips.cc/paper/2021/hash/f2b5e92f61b6de923b063588ee6e7c48-Abstract.htm
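
    As a concrete baseline for the discrete case, the sketch below draws one $\varepsilon$-DP sample from a distribution on $\{0,\ldots,k-1\}$ via a Laplace-noised histogram. This is a simple illustration of the task, not the paper's tight algorithm:

```python
import numpy as np

def dp_sample_discrete(data, k, eps=1.0, rng=None):
    """Return one eps-DP sample from a distribution on {0, ..., k-1}.

    data: array of observed integer draws in {0, ..., k-1}.
    Simple Laplace-histogram baseline, for illustration only.
    """
    rng = rng or np.random.default_rng()
    counts = np.bincount(data, minlength=k).astype(float)
    # Swapping one observation changes the histogram by L1 distance 2,
    # so Laplace(2/eps) noise per bin makes the released histogram eps-DP;
    # sampling from it afterwards is post-processing and costs no privacy.
    noisy = counts + rng.laplace(scale=2.0 / eps, size=k)
    probs = np.clip(noisy, 0.0, None)
    probs = probs / probs.sum() if probs.sum() > 0 else np.full(k, 1.0 / k)
    return rng.choice(k, p=probs)

# Toy usage: sample privately from an empirical distribution on {0,1,2}.
draw = dp_sample_discrete(np.array([0, 0, 1, 2, 2, 2]), k=3, eps=1.0)
```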

    Structured Semidefinite Programming for Recovering Structured Preconditioners

    We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems, including the following. We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star, \epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$, where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call matrix-dictionary approximation SDPs, which we leverage to solve an associated problem we call matrix-dictionary recovery.
    Comment: Merge of arXiv:1812.06295 and arXiv:2008.0172
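
    To see what problem the framework targets, the snippet below measures how a diagonal rescaling changes a condition number. Jacobi scaling ($\mathbf{D} = \mathrm{diag}(\mathbf{K})^{-1}$) is the classical baseline used here for illustration; the paper's SDP-based algorithm instead computes an $\epsilon$-optimal diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
A = rng.normal(size=(d, d))
# A badly scaled positive definite matrix: PSD part plus a wild diagonal.
K = A @ A.T + np.diag(rng.uniform(1, 1000, size=d))

def cond(M):
    # Condition number of a symmetric PD matrix via its eigenvalue extremes.
    w = np.linalg.eigvalsh(M)
    return w[-1] / w[0]

# Jacobi scaling: rescale K to unit diagonal, D^{1/2} K D^{1/2}.
Dhalf = np.diag(1.0 / np.sqrt(np.diag(K)))
print("kappa(K)              =", cond(K))
print("kappa(D^.5 K D^.5)    =", cond(Dhalf @ K @ Dhalf))
```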