Slice sampling covariance hyperparameters of latent Gaussian models
The Gaussian process (GP) is a popular way to specify dependencies between
random variables in a probabilistic model. In the Bayesian framework the
covariance structure can be specified using unknown hyperparameters.
Integrating over these hyperparameters considers different possible
explanations for the data when making predictions. This integration is often
performed using Markov chain Monte Carlo (MCMC) sampling. However, with
non-Gaussian observations standard hyperparameter sampling approaches require
careful tuning and may converge slowly. In this paper we present a slice
sampling approach that requires little tuning while mixing well in both strong-
and weak-data regimes.
Comment: 9 pages, 4 figures, 4 algorithms. Minor corrections to previous
version. This version to appear in Advances in Neural Information Processing
Systems (NIPS) 23, 201
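For readers unfamiliar with the underlying primitive, the following is a minimal sketch of generic univariate slice sampling (stepping out, then shrinking the bracket). It is not the hyperparameter scheme proposed in the paper; the function name, the `width` parameter, and the standard-normal target in the usage note are assumptions made for illustration only.

```python
import math
import random


def slice_sample(log_density, x0, n_samples, width=1.0, rng=None):
    """Generic univariate slice sampler (illustrative, not the paper's method)."""
    rng = rng or random.Random(0)
    samples = []
    x = x0
    for _ in range(n_samples):
        # Auxiliary height: uniform under the density at x, handled in log space.
        log_y = log_density(x) + math.log(1.0 - rng.random())
        # Randomly position a bracket of size `width` around x, then step out
        # until both ends lie outside the slice.
        left = x - width * rng.random()
        right = left + width
        while log_density(left) > log_y:
            left -= width
        while log_density(right) > log_y:
            right += width
        # Propose uniformly within the bracket, shrinking it on each rejection.
        while True:
            proposal = rng.uniform(left, right)
            if log_density(proposal) > log_y:
                x = proposal
                break
            if proposal < x:
                left = proposal
            else:
                right = proposal
        samples.append(x)
    return samples
```

A call such as `slice_sample(lambda x: -0.5 * x * x, 0.0, 1000)` draws from an unnormalized standard normal. The only tunable quantity is the initial bracket width, which affects cost more than correctness.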
Efficient Optimization of Loops and Limits with Randomized Telescoping Sums
We consider optimization problems in which the objective requires an inner
loop with many steps or is the limit of a sequence of increasingly costly
approximations. Meta-learning, training recurrent neural networks, and
optimization of the solutions to differential equations are all examples of
optimization problems with this character. In such problems, it can be
expensive to compute the objective function value and its gradient, but
truncating the loop or using less accurate approximations can induce biases
that damage the overall solution. We propose randomized telescope (RT) gradient
estimators, which represent the objective as the sum of a telescoping series
and sample linear combinations of terms to provide cheap unbiased gradient
estimates. We identify conditions under which RT estimators achieve
optimization convergence rates independent of the length of the loop or the
required accuracy of the approximation. We also derive a method for tuning RT
estimators online to maximize a lower bound on the expected decrease in loss
per unit of computation. We evaluate our adaptive RT estimators on a range of
applications including meta-optimization of learning rates, variational
inference of ODE parameters, and training an LSTM to model long sequences.
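The telescoping idea in this abstract can be illustrated with a small sketch. It assumes a finite horizon H, approximations L_1..L_H with L_0 taken to be 0, and a user-chosen stopping distribution q with q[H-1] > 0; it uses the Russian-roulette weighting, in which each difference L_n - L_{n-1} is divided by P(N >= n). All function names are hypothetical, and the sketch estimates a scalar value rather than a gradient (the same weights apply to gradients by linearity).

```python
import random


def tail_probabilities(q):
    """Return P(N >= n) for n = 1..len(q), given P(N = n) = q[n - 1]."""
    tails = [0.0] * len(q)
    running = 0.0
    for i in range(len(q) - 1, -1, -1):
        running += q[i]
        tails[i] = running
    return tails


def rt_estimate_given_stop(approx, n_stop, q):
    """Weighted telescoping sum up to n_stop; approx(n) returns L_n, L_0 = 0."""
    total = sum(q)
    tails = tail_probabilities([p / total for p in q])
    estimate, previous = 0.0, 0.0
    for n in range(1, n_stop + 1):
        current = approx(n)
        # Dividing by P(N >= n) compensates for terms that are often skipped.
        estimate += (current - previous) / tails[n - 1]
        previous = current
    return estimate


def rt_estimate(approx, q, rng):
    """Draw a random stopping index from q and return the unbiased estimate."""
    n_stop = rng.choices(range(1, len(q) + 1), weights=q)[0]
    return rt_estimate_given_stop(approx, n_stop, q)
```

Averaging `rt_estimate_given_stop` over the stopping distribution recovers L_H exactly, which is the unbiasedness property: most draws stop early and are cheap, while rare long draws are upweighted.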
Graph-Sparse LDA: A Topic Model with Structured Sparsity
Originally designed to model text, topic modeling has become a powerful tool
for uncovering latent structure in domains including medicine, finance, and
vision. The goals for the model vary depending on the application: in some
cases, the discovered topics may be used for prediction or some other
downstream task. In other cases, the content of the topic itself may be of
intrinsic scientific interest.
Unfortunately, even using modern sparse techniques, the discovered topics are
often difficult to interpret due to the high dimensionality of the underlying
space. To improve topic interpretability, we introduce Graph-Sparse LDA, a
hierarchical topic model that leverages knowledge of relationships between
words (e.g., as encoded by an ontology). In our model, topics are summarized by
a few latent concept-words from the underlying graph that explain the observed
words. Graph-Sparse LDA recovers sparse, interpretable summaries on two
real-world biomedical datasets while matching state-of-the-art prediction
performance.
Practical Bayesian Optimization of Machine Learning Algorithms
Machine learning algorithms frequently require careful tuning of model
hyperparameters, regularization terms, and optimization parameters.
Unfortunately, this tuning is often a "black art" that requires expert
experience, unwritten rules of thumb, or sometimes brute-force search. Much
more appealing is the idea of developing automatic approaches which can
optimize the performance of a given learning algorithm to the task at hand. In
this work, we consider the automatic tuning problem within the framework of
Bayesian optimization, in which a learning algorithm's generalization
performance is modeled as a sample from a Gaussian process (GP). The tractable
posterior distribution induced by the GP leads to efficient use of the
information gathered by previous experiments, enabling optimal choices about
what parameters to try next. Here we show how the effects of the Gaussian
process prior and the associated inference procedure can have a large impact on
the success or failure of Bayesian optimization. We show that thoughtful
choices can lead to results that exceed expert-level performance in tuning
machine learning algorithms. We also describe new algorithms that take into
account the variable cost (duration) of learning experiments and that can
leverage the presence of multiple cores for parallel experimentation. We show
that these proposed algorithms improve on previous automatic procedures and can
reach or surpass human expert-level optimization on a diverse set of
contemporary algorithms including latent Dirichlet allocation, structured SVMs,
and convolutional neural networks.
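To make "choices about what parameters to try next" concrete, here is a minimal sketch of one common acquisition function, expected improvement, computed from a Gaussian posterior mean and standard deviation. The abstract does not name a specific acquisition function, so this choice, the minimization convention, and the helper names are assumptions for illustration.

```python
import math


def expected_improvement(mean, std, best):
    """Expected improvement over `best` for a minimization problem."""
    if std <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(best - mean, 0.0)
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return std * (z * cdf + pdf)


def pick_next(candidates, best):
    """candidates: list of (label, posterior mean, posterior std) tuples."""
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]
```

The function trades off a low predicted value against high uncertainty: a candidate with a slightly worse mean but a much larger standard deviation can still be the preferred next experiment.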