14,785 research outputs found
State Space LSTM Models with Particle MCMC Inference
Long Short-Term Memory (LSTM) is one of the most powerful sequence models.
Despite the strong performance, however, it lacks the nice interpretability as
in state space models. In this paper, we present a way to combine the best of
both worlds by introducing State Space LSTM (SSL) models that generalizes the
earlier work \cite{zaheer2017latent} of combining topic models with LSTM.
However, unlike \cite{zaheer2017latent}, we do not make any factorization
assumptions in our inference algorithm. We present an efficient sampler based
on sequential Monte Carlo (SMC) method that draws from the joint posterior
directly. Experimental results confirms the superiority and stability of this
SMC inference algorithm on a variety of domains
A General Method for Amortizing Variational Filtering
We introduce the variational filtering EM algorithm, a simple,
general-purpose method for performing variational inference in dynamical latent
variable models using information from only past and present variables, i.e.
filtering. The algorithm is derived from the variational objective in the
filtering setting and consists of an optimization procedure at each time step.
By performing each inference optimization procedure with an iterative amortized
inference model, we obtain a computationally efficient implementation of the
algorithm, which we call amortized variational filtering. We present
experiments demonstrating that this general-purpose method improves performance
across several deep dynamical latent variable models.Comment: Advances in Neural Information Processing Systems (NIPS) 201
Gradient Estimation Using Stochastic Computation Graphs
In a variety of problems originating in supervised, unsupervised, and
reinforcement learning, the loss function is defined by an expectation over a
collection of random variables, which might be part of a probabilistic model or
the external world. Estimating the gradient of this loss function, using
samples, lies at the core of gradient-based learning algorithms for these
problems. We introduce the formalism of stochastic computation
graphs---directed acyclic graphs that include both deterministic functions and
conditional probability distributions---and describe how to easily and
automatically derive an unbiased estimator of the loss function's gradient. The
resulting algorithm for computing the gradient estimator is a simple
modification of the standard backpropagation algorithm. The generic scheme we
propose unifies estimators derived in variety of prior work, along with
variance-reduction techniques therein. It could assist researchers in
developing intricate models involving a combination of stochastic and
deterministic operations, enabling, for example, attention, memory, and control
actions.Comment: Advances in Neural Information Processing Systems 28 (NIPS 2015
A Tutorial on Deep Latent Variable Models of Natural Language
There has been much recent, exciting work on combining the complementary
strengths of latent variable models and deep learning. Latent variable modeling
makes it easy to explicitly specify model constraints through conditional
independence properties, while deep learning makes it possible to parameterize
these conditional likelihoods with powerful function approximators. While these
"deep latent variable" models provide a rich, flexible framework for modeling
many real-world phenomena, difficulties exist: deep parameterizations of
conditional likelihoods usually make posterior inference intractable, and
latent variable objectives often complicate backpropagation by introducing
points of non-differentiability. This tutorial explores these issues in depth
through the lens of variational inference.Comment: EMNLP 2018 Tutoria
Advances in Variational Inference
Many modern unsupervised or semi-supervised machine learning algorithms rely
on Bayesian probabilistic models. These models are usually intractable and thus
require approximate inference. Variational inference (VI) lets us approximate a
high-dimensional Bayesian posterior with a simpler variational distribution by
solving an optimization problem. This approach has been successfully used in
various models and large-scale applications. In this review, we give an
overview of recent trends in variational inference. We first introduce standard
mean field variational inference, then review recent advances focusing on the
following aspects: (a) scalable VI, which includes stochastic approximations,
(b) generic VI, which extends the applicability of VI to a large class of
otherwise intractable models, such as non-conjugate models, (c) accurate VI,
which includes variational models beyond the mean field approximation or with
atypical divergences, and (d) amortized VI, which implements the inference over
local latent variables with inference networks. Finally, we provide a summary
of promising future research directions
ZhuSuan: A Library for Bayesian Deep Learning
In this paper we introduce ZhuSuan, a python probabilistic programming
library for Bayesian deep learning, which conjoins the complimentary advantages
of Bayesian methods and deep learning. ZhuSuan is built upon Tensorflow. Unlike
existing deep learning libraries, which are mainly designed for deterministic
neural networks and supervised tasks, ZhuSuan is featured for its deep root
into Bayesian inference, thus supporting various kinds of probabilistic models,
including both the traditional hierarchical Bayesian models and recent deep
generative models. We use running examples to illustrate the probabilistic
programming on ZhuSuan, including Bayesian logistic regression, variational
auto-encoders, deep sigmoid belief networks and Bayesian recurrent neural
networks.Comment: The GitHub page is at https://github.com/thu-ml/zhusua
End-to-end Learning of Deterministic Decision Trees
Conventional decision trees have a number of favorable properties, including
interpretability, a small computational footprint and the ability to learn from
little training data. However, they lack a key quality that has helped fuel the
deep learning revolution: that of being end-to-end trainable, and to learn from
scratch those features that best allow to solve a given supervised learning
problem. Recent work (Kontschieder 2015) has addressed this deficit, but at the
cost of losing a main attractive trait of decision trees: the fact that each
sample is routed along a small subset of tree nodes only. We here propose a
model and Expectation-Maximization training scheme for decision trees that are
fully probabilistic at train time, but after a deterministic annealing process
become deterministic at test time. We also analyze the learned oblique split
parameters on image datasets and show that Neural Networks can be trained at
each split node. In summary, we present the first end-to-end learning scheme
for deterministic decision trees and present results on par with or superior to
published standard oblique decision tree algorithms
Variational Message Passing with Structured Inference Networks
Recent efforts on combining deep models with probabilistic graphical models
are promising in providing flexible models that are also easy to interpret. We
propose a variational message-passing algorithm for variational inference in
such models. We make three contributions. First, we propose structured
inference networks that incorporate the structure of the graphical model in the
inference network of variational auto-encoders (VAE). Second, we establish
conditions under which such inference networks enable fast amortized inference
similar to VAE. Finally, we derive a variational message passing algorithm to
perform efficient natural-gradient inference while retaining the efficiency of
the amortized inference. By simultaneously enabling structured, amortized, and
natural-gradient inference for deep structured models, our method simplifies
and generalizes existing methods.Comment: Added a missing term in the gradient of the lower boun
When Gaussian Process Meets Big Data: A Review of Scalable GPs
The vast quantity of information brought by big data as well as the evolving
computer hardware encourages success stories in the machine learning community.
In the meanwhile, it poses challenges for the Gaussian process (GP) regression,
a well-known non-parametric and interpretable Bayesian model, which suffers
from cubic complexity to data size. To improve the scalability while retaining
desirable prediction quality, a variety of scalable GPs have been presented.
But they have not yet been comprehensively reviewed and analyzed in order to be
well understood by both academia and industry. The review of scalable GPs in
the GP community is timely and important due to the explosion of data size. To
this end, this paper is devoted to the review on state-of-the-art scalable GPs
involving two main categories: global approximations which distillate the
entire data and local approximations which divide the data for subspace
learning. Particularly, for global approximations, we mainly focus on sparse
approximations comprising prior approximations which modify the prior but
perform exact inference, posterior approximations which retain exact prior but
perform approximate inference, and structured sparse approximations which
exploit specific structures in kernel matrix; for local approximations, we
highlight the mixture/product of experts that conducts model averaging from
multiple local experts to boost predictions. To present a complete review,
recent advances for improving the scalability and capability of scalable GPs
are reviewed. Finally, the extensions and open issues regarding the
implementation of scalable GPs in various scenarios are reviewed and discussed
to inspire novel ideas for future research avenues.Comment: 20 pages, 6 figure
Reconciling meta-learning and continual learning with online mixtures of tasks
Learning-to-learn or meta-learning leverages data-driven inductive bias to
increase the efficiency of learning on a novel task. This approach encounters
difficulty when transfer is not advantageous, for instance, when tasks are
considerably dissimilar or change over time. We use the connection between
gradient-based meta-learning and hierarchical Bayes to propose a Dirichlet
process mixture of hierarchical Bayesian models over the parameters of an
arbitrary parametric model such as a neural network. In contrast to
consolidating inductive biases into a single set of hyperparameters, our
approach of task-dependent hyperparameter selection better handles latent
distribution shift, as demonstrated on a set of evolving, image-based, few-shot
learning benchmarks.Comment: updated experimental result
- …