Improved asynchronous parallel optimization analysis for stochastic incremental methods
As datasets continue to increase in size and multi-core computer
architectures are developed, asynchronous parallel optimization algorithms
become more and more essential to the field of Machine Learning. Unfortunately,
conducting the theoretical analysis of asynchronous methods is difficult, notably
due to the introduction of delay and inconsistency in inherently sequential
algorithms. Handling these issues often requires resorting to simplifying but
unrealistic assumptions. Through a novel perspective, we revisit and clarify a
subtle but important technical issue present in a large fraction of the recent
convergence rate proofs for asynchronous parallel optimization algorithms, and
propose a simplification of the recently introduced "perturbed iterate"
framework that resolves it. We demonstrate the usefulness of our new framework
by analyzing three distinct asynchronous parallel incremental optimization
algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and
ASAGA, a novel asynchronous parallel version of the incremental gradient
algorithm SAGA that enjoys fast linear convergence rates. We are able to both
remove problematic assumptions and obtain better theoretical results. Notably,
we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on
multi-core systems even without sparsity assumptions. We present results of an
implementation on a 40-core architecture illustrating the practical speedups as
well as the hardware overhead. Finally, we investigate the overlap constant, an
ill-understood but central quantity for the theoretical analysis of
asynchronous parallel algorithms. We find that it encompasses much more
complexity than suggested in previous work, and is often an order of magnitude
bigger than traditionally thought.
Comment: 67 pages, published in JMLR, can be found online at
http://jmlr.org/papers/v19/17-650.html. arXiv admin note: substantial text
overlap with arXiv:1606.0480
Sparsified SGD with Memory
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes have shown very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.
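The error-compensation idea in the abstract above can be illustrated with a minimal single-worker sketch. It is not the authors' implementation: the names `top_k`, `mem_sgd`, `grad_fn`, `lr` and `steps` are hypothetical, and the constant step size is an assumption made for brevity. At each step the entries dropped by the top-k operator are stored in a memory vector and added back to the next update instead of being discarded.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out


def mem_sgd(grad_fn, x0, lr, k, steps):
    """SGD with top-k sparsified updates and an error-compensation memory.

    grad_fn(x, t) is assumed to return a (stochastic) gradient at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    memory = np.zeros_like(x)  # accumulated entries not yet applied
    for t in range(steps):
        update = memory + lr * grad_fn(x, t)  # add back what was dropped earlier
        sparse = top_k(update, k)             # only these k entries are "sent"
        x -= sparse
        memory = update - sparse              # keep the remainder for later
    return x, memory
```

In this sketch the quantity `x - memory` evolves exactly like a plain SGD iterate driven by the same gradients, because every dropped entry is applied later. That is one way to see why keeping the memory can preserve the convergence behaviour while only k entries are communicated per step.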