A Formal Framework for Speedup Learning from Problems and Solutions
Speedup learning seeks to improve the computational efficiency of problem
solving with experience. In this paper, we develop a formal framework for
learning efficient problem solving from random problems and their solutions. We
apply this framework to two different representations of learned knowledge,
namely control rules and macro-operators, and prove theorems that identify
sufficient conditions for learning in each representation. Our proofs are
constructive in that they are accompanied by learning algorithms. Our
framework captures both empirical and explanation-based speedup learning in a
unified fashion. We illustrate our framework with implementations in two
domains: symbolic integration and the Eight Puzzle. This work integrates many
strands of experimental and theoretical work in machine learning, including
empirical learning of control rules, macro-operator learning, Explanation-Based
Learning (EBL), and Probably Approximately Correct (PAC) Learning.
Comment: See http://www.jair.org/ for any accompanying file
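To make the macro-operator representation concrete, the following is a minimal sketch, not the paper's algorithm: it harvests recurring contiguous operator subsequences from solution traces and keeps the frequent ones as candidate macros. The function name and the `min_count`/`max_len` parameters are hypothetical.

```python
from collections import Counter

def learn_macros(solutions, min_count=2, max_len=4):
    """Harvest candidate macro-operators: contiguous operator
    subsequences (length >= 2) that recur across solution traces.
    Toy illustration only, not the paper's learning algorithm."""
    counts = Counter()
    for ops in solutions:
        for i in range(len(ops)):
            for j in range(i + 2, min(i + max_len, len(ops)) + 1):
                counts[tuple(ops[i:j])] += 1
    return [macro for macro, c in counts.items() if c >= min_count]

# Two hypothetical Eight Puzzle solution traces sharing common macros.
traces = [["up", "left", "down", "right"],
          ["left", "up", "left", "down"]]
print(learn_macros(traces))
# [('up', 'left'), ('up', 'left', 'down'), ('left', 'down')]
```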
Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is the state of the art in
large-scale distributed training. The scheme can reach a linear speedup with respect
to the number of workers, but this is rarely seen in practice as the scheme
often suffers from large network delays and bandwidth limits. To overcome this
communication bottleneck, recent works propose to reduce the communication
frequency. One algorithm of this type is local SGD, which runs SGD independently
in parallel on different workers and averages the iterate sequences only once in
a while.
This scheme shows promising results in practice but has so far eluded thorough
theoretical analysis. We prove concise convergence rates for local SGD on
convex problems and show that it converges at the same rate as mini-batch SGD
in terms of the number of evaluated gradients; that is, the scheme achieves a linear
speedup in the number of workers and mini-batch size. The number of
communication rounds can be reduced by up to a factor of T^{1/2}, where T
denotes the total number of steps, compared to mini-batch SGD. This also holds
for asynchronous implementations. Local SGD can also be used for the
large-scale training of deep learning models.
The results shown here aim to serve as a guideline for further exploring the
theoretical and practical aspects of local SGD in these applications.
Comment: to appear at ICLR 2019, 19 pages
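As a rough illustration of the scheme described above, here is a minimal local SGD simulation. It is a sketch under assumed choices (a toy quadratic objective, a fixed step size, and a fixed synchronization period `sync_every`), not the paper's code.

```python
import numpy as np

def local_sgd(grad, x0, workers=4, steps=100, sync_every=10, lr=0.1, rng=None):
    """Simulate local SGD: each worker runs SGD on its own copy of the
    iterate, and all copies are averaged every `sync_every` steps.
    `grad(x, rng)` returns a stochastic gradient at x. Toy sketch only."""
    rng = rng or np.random.default_rng(0)
    xs = [x0.copy() for _ in range(workers)]
    for t in range(1, steps + 1):
        for k in range(workers):
            xs[k] -= lr * grad(xs[k], rng)   # independent local step
        if t % sync_every == 0:              # occasional averaging
            avg = sum(xs) / workers
            xs = [avg.copy() for _ in range(workers)]
    return sum(xs) / workers

# Toy convex problem: f(x) = 0.5*||x||^2 with additive gradient noise.
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_final = local_sgd(noisy_grad, np.ones(5))
print(np.linalg.norm(x_final))  # should be close to zero
```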
Scaling Nonparametric Bayesian Inference via Subsample-Annealing
We describe an adaptation of the simulated annealing algorithm to
nonparametric clustering and related probabilistic models. This new algorithm
learns nonparametric latent structure over a growing and constantly churning
subsample of training data, where the portion of data subsampled can be
interpreted as the inverse temperature beta(t) in an annealing schedule. Gibbs
sampling at high temperature (i.e., with a very small subsample) can more
quickly explore sketches of the final latent state by (a) making longer jumps
around latent space (as in block Gibbs) and (b) lowering energy barriers (as in
simulated annealing). We prove that subsample annealing speeds up mixing time
from N^2 to N in a simple clustering model and from exp(N) to N in another
class of models, where N is the data size. Empirically, subsample annealing
outperforms naive Gibbs sampling in accuracy per wall-clock time, and can scale
to larger datasets and
deeper hierarchical models. We demonstrate improved inference on million-row
subsamples of US Census data and network log data and a 307-row hospital rating
dataset, using a Pitman-Yor generalization of the Cross Categorization model.
Comment: To appear in AISTATS 2014
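The annealing schedule itself is easy to sketch. The loop below is a toy sketch rather than the paper's implementation: it grows a random subsample linearly with time so that the subsampled fraction plays the role of the inverse temperature beta(t), and `gibbs_sweep` is a hypothetical stand-in for one Gibbs sweep over the current subsample.

```python
import numpy as np

def subsample_annealing(data, gibbs_sweep, steps=1000, rng=None):
    """Run Gibbs sampling over a growing, constantly churning random
    subsample of `data` (a NumPy array of rows). beta = t/steps acts as
    the inverse temperature: small subsamples mix quickly, and by the
    final step the sampler sees the full dataset. Toy sketch only."""
    rng = rng or np.random.default_rng(0)
    n = len(data)
    state = None
    for t in range(1, steps + 1):
        beta = t / steps                    # linear inverse-temperature schedule
        m = max(1, int(beta * n))           # subsample size grows with beta
        subset = data[rng.choice(n, size=m, replace=False)]
        state = gibbs_sweep(state, subset)  # hypothetical single Gibbs sweep
    return state
```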