
    Coordinate Descent with Bandit Sampling

    Coordinate descent methods usually minimize a cost function by updating a random decision variable (corresponding to one coordinate) at a time. Ideally, we would update the decision variable that yields the largest decrease in the cost function. However, finding this coordinate would require checking all of them, which would effectively negate the improvement in computational tractability that coordinate descent is intended to afford. To address this, we propose a new adaptive method for selecting a coordinate. First, we find a lower bound on the amount the cost function decreases when a coordinate is updated. We then use a multi-armed bandit algorithm to learn which coordinates yield the largest lower bound, interleaving this learning with conventional coordinate descent updates in which the coordinate is selected with probability proportional to its expected decrease. We show that our approach improves the convergence of coordinate descent methods both theoretically and experimentally. Comment: appearing at NeurIPS 2018.
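
    The following is a minimal, illustrative sketch of this idea for a least-squares objective, not the paper's exact algorithm or its guarantees: the per-coordinate decrease bound g_j^2 / (2 L_j) plays the role of the bandit reward, and coordinates are sampled from a running estimate of it mixed with a small uniform component for exploration.

    # Sketch: coordinate descent on f(x) = 0.5 * ||A x - b||^2 where the coordinate
    # is sampled proportionally to an estimated lower bound on its decrease,
    # and the estimates are refreshed bandit-style from the coordinates played.
    import numpy as np

    def bandit_coordinate_descent(A, b, n_iters=2000, eps=0.05, seed=0):
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        L = (A ** 2).sum(axis=0) + 1e-12      # coordinate-wise Lipschitz constants
        est = np.ones(d)                      # estimated decrease per coordinate ("reward")
        residual = A @ x - b
        for _ in range(n_iters):
            # mix the decrease-proportional distribution with uniform exploration
            probs = (1 - eps) * est / est.sum() + eps / d
            j = rng.choice(d, p=probs)
            g_j = A[:, j] @ residual          # partial derivative along coordinate j
            step = g_j / L[j]
            x[j] -= step
            residual -= step * A[:, j]
            # observed lower bound on the decrease, g_j^2 / (2 L_j), updates the estimate
            est[j] = 0.5 * est[j] + 0.5 * (g_j ** 2) / (2 * L[j]) + 1e-12
        return x

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A, x_true = rng.normal(size=(200, 50)), rng.normal(size=50)
        b = A @ x_true
        x_hat = bandit_coordinate_descent(A, b)
        print("final objective:", 0.5 * np.sum((A @ x_hat - b) ** 2))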

    Online Variance Reduction for Stochastic Optimization

    Modern stochastic optimization methods often rely on uniform sampling that is agnostic to the underlying characteristics of the data. This can degrade convergence by yielding estimates that suffer from high variance. A possible remedy is to employ non-uniform importance sampling techniques, which take the structure of the dataset into account. In this work, we investigate a recently proposed setting that poses variance reduction as an online optimization problem with bandit feedback. We devise a novel and efficient algorithm for this setting that finds a sequence of importance sampling distributions competitive with the best fixed distribution in hindsight, the first result of this kind. While we present our method for sampling datapoints, it naturally extends to selecting coordinates or even blocks thereof. Empirical validation underlines the benefits of our method in several settings. Comment: COLT 2018.
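
    The sketch below illustrates the general recipe in a simplified form; it uses logistic-regression SGD and an exponential-moving-average score in place of the paper's regret-optimal sampler. Datapoints are drawn from an adaptive distribution built only from the gradient norms observed for the sampled points (bandit feedback), and importance weights keep the gradient estimate unbiased.

    # Sketch: SGD with adaptive importance sampling over datapoints.
    import numpy as np

    def adaptive_is_sgd(X, y, n_iters=5000, lr=0.1, mix=0.2, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        score = np.ones(n)                    # running estimate of per-point gradient norm
        for _ in range(n_iters):
            # mix the score-proportional distribution with uniform for exploration
            probs = (1 - mix) * score / score.sum() + mix / n
            i = rng.choice(n, p=probs)
            margin = y[i] * (X[i] @ w)
            g = -y[i] * X[i] / (1 + np.exp(margin))   # gradient of the logistic loss at point i
            w -= lr * g / (n * probs[i])              # weight 1/(n p_i) keeps the estimate unbiased
            score[i] = 0.9 * score[i] + 0.1 * np.linalg.norm(g) + 1e-12
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(500, 20))
        y = np.sign(X @ rng.normal(size=20))
        w = adaptive_is_sgd(X, y)
        print("training accuracy:", np.mean(np.sign(X @ w) == y))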

    Proximal Gradient methods with Adaptive Subspace Sampling

    Many applications in machine learning and signal processing involve nonsmooth optimization problems. This nonsmoothness brings a low-dimensional structure to the optimal solutions. In this paper, we propose a randomized proximal gradient method that harnesses this underlying structure. We introduce two key components: i) a random subspace proximal gradient algorithm; ii) an identification-based sampling of the subspaces. Their interplay brings a significant performance improvement on typical learning problems in terms of the number of dimensions explored.
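
    Below is a minimal sketch of the flavor of such a method, under simplifying assumptions (an l1-regularized least-squares problem and coordinate-aligned subspaces); the paper's algorithm and sampling rule are more general. A proximal gradient step is applied only on a random subset of coordinates, and coordinates in the currently identified support are sampled more often than the rest.

    # Sketch: random-subspace proximal gradient for the lasso with
    # identification-biased subspace sampling.
    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def adaptive_subspace_prox_grad(A, b, lam=0.1, n_iters=3000, k=10, seed=0):
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / L with L = ||A||_2^2
        for _ in range(n_iters):
            support = np.abs(x) > 0
            # identification-based sampling: favor coordinates in the current support
            weights = np.where(support, 1.0, 0.1)
            probs = weights / weights.sum()
            S = rng.choice(d, size=min(k, d), replace=False, p=probs)
            grad_S = A[:, S].T @ (A @ x - b)      # gradient restricted to the sampled subspace
            x[S] = soft_threshold(x[S] - step * grad_S, step * lam)
        return x

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A = rng.normal(size=(100, 50))
        x_true = np.zeros(50)
        x_true[:5] = rng.normal(size=5)
        b = A @ x_true
        x_hat = adaptive_subspace_prox_grad(A, b)
        print("recovered support size:", int(np.sum(np.abs(x_hat) > 1e-6)))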

    Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence

    Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper we explore all three of these building blocks and propose variations for each that can lead to significantly faster BCD methods. We (i) propose new greedy block-selection strategies that guarantee more progress per iteration than the Gauss-Southwell rule; (ii) explore practical issues such as how to implement the new rules when using "variable" blocks; (iii) explore the use of message-passing to compute matrix or Newton updates efficiently on huge blocks for problems with a sparse dependency between variables; and (iv) consider optimal active manifold identification, which yields bounds on the "active set complexity" of BCD methods and leads to superlinear convergence for certain problems with sparse solutions (and, in some cases, finite termination at an optimal solution). We support all of our findings with numerical results for the classic machine learning problems of least squares, logistic regression, multi-class logistic regression, label propagation, and L1-regularization.
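
    As a concrete baseline for the rules discussed above, here is a short sketch (a simplification, not the paper's new strategies) of BCD on a positive-definite quadratic with fixed blocks, Gauss-Southwell block selection, and an exact Newton update on the chosen block.

    # Sketch: greedy block coordinate descent on f(x) = 0.5 x^T Q x - c^T x.
    import numpy as np

    def greedy_bcd_quadratic(Q, c, block_size=5, n_iters=200):
        d = Q.shape[0]
        blocks = [np.arange(i, min(i + block_size, d)) for i in range(0, d, block_size)]
        x = np.zeros(d)
        for _ in range(n_iters):
            grad = Q @ x - c
            # Gauss-Southwell rule: pick the block whose gradient has the largest norm
            b = max(range(len(blocks)), key=lambda k: np.linalg.norm(grad[blocks[k]]))
            B = blocks[b]
            # exact (Newton) update on the chosen block: solve Q_BB * delta = grad_B
            x[B] -= np.linalg.solve(Q[np.ix_(B, B)], grad[B])
        return x

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        M = rng.normal(size=(40, 40))
        Q = M @ M.T + 0.1 * np.eye(40)        # positive definite
        c = rng.normal(size=40)
        x = greedy_bcd_quadratic(Q, c)
        print("gradient norm at solution:", np.linalg.norm(Q @ x - c))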

    Multi-Resolution Hashing for Fast Pairwise Summations

    A basic computational primitive in the analysis of massive datasets is summing simple functions over a large number of objects. Modern applications pose an additional challenge in that such functions often depend on a parameter vector $y$ (query) that is unknown a priori. Given a set of points $X\subset \mathbb{R}^{d}$ and a pairwise function $w:\mathbb{R}^{d}\times \mathbb{R}^{d}\to [0,1]$, we study the problem of designing a data structure that enables sublinear-time approximation of the summation $Z_{w}(y)=\frac{1}{|X|}\sum_{x\in X}w(x,y)$ for any query $y\in \mathbb{R}^{d}$. By combining ideas from Harmonic Analysis (partitions of unity and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis FOCS'17], we provide a general framework for designing such data structures through hashing that reaches far beyond what previous techniques allowed. A key design principle is a collection of $T\geq 1$ hashing schemes with collision probabilities $p_{1},\ldots,p_{T}$ such that $\sup_{t\in [T]}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})$. This leads to a data structure that approximates $Z_{w}(y)$ using a sublinear number of samples from each hash family. Using this new framework along with Distance Sensitive Hashing [Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection can be constructed and evaluated efficiently for any log-convex function $w(x,y)=e^{\phi(\langle x,y\rangle)}$ of the inner product on the unit sphere $x,y\in \mathcal{S}^{d-1}$. Our method leads to data structures with sublinear query time that significantly improve upon random sampling and can be used for Kernel Density or Partition Function Estimation. We provide extensions of our result from the sphere to $\mathbb{R}^{d}$ and from scalar functions to vector functions. Comment: 39 pages, 3 figures.
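
    The following is a hedged sketch of the single-resolution Hashing-Based-Estimator primitive that this framework generalizes (the paper's multi-resolution collection with $\sup_{t}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})$ is beyond this snippet). Random-hyperplane LSH on the unit sphere has the closed-form collision probability p(x, y) = (1 - theta(x, y)/pi)^k, which yields an unbiased estimate of $Z_{w}(y)$ from one random colliding point per hash table.

    # Sketch: single-resolution hashing-based estimator for Z_w(y) on the unit sphere.
    import numpy as np

    def angle(u, v):
        return np.arccos(np.clip(u @ v, -1.0, 1.0))

    def hbe_estimate(X, y, w, k=4, n_tables=200, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        estimates = []
        for _ in range(n_tables):
            H = rng.normal(size=(k, d))                 # k random hyperplanes
            codes = X @ H.T > 0                         # n x k sign patterns
            y_code = H @ y > 0
            bucket = np.where((codes == y_code).all(axis=1))[0]   # points colliding with y
            if bucket.size == 0:
                estimates.append(0.0)
                continue
            i = rng.choice(bucket)                      # one random colliding point
            p = (1.0 - angle(X[i], y) / np.pi) ** k     # its collision probability
            estimates.append(bucket.size / n * w(X[i], y) / p)
        return float(np.mean(estimates))

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(2000, 10))
        X /= np.linalg.norm(X, axis=1, keepdims=True)   # project onto the unit sphere
        y = X[0]
        w = lambda x, y: np.exp(x @ y - 1.0)            # a log-convex kernel of <x, y> in [0, 1]
        exact = np.mean([w(x, y) for x in X])
        print("exact:", exact, " hashed estimate:", hbe_estimate(X, y, w))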