
### Gibbs flow for approximate transport with applications to Bayesian computation

Let $\pi_{0}$ and $\pi_{1}$ be two distributions on the Borel space
$(\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d}))$. Any measurable function
$T:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ such that $Y=T(X)\sim\pi_{1}$ if
$X\sim\pi_{0}$ is called a transport map from $\pi_{0}$ to $\pi_{1}$. For any
$\pi_{0}$ and $\pi_{1}$, if one could obtain an analytical expression for a
transport map from $\pi_{0}$ to $\pi_{1}$, then this could be straightforwardly
applied to sample from any distribution. One would map draws from an
easy-to-sample distribution $\pi_{0}$ to the target distribution $\pi_{1}$
using this transport map. Although it is usually impossible to obtain an
explicit transport map for complex target distributions, we show here how to
build a tractable approximation of a novel transport map. This is achieved by
moving samples from $\pi_{0}$ using an ordinary differential equation with a
velocity field that depends on the full conditional distributions of the
target. Even when this ordinary differential equation is time-discretized and
the full conditional distributions are numerically approximated, the resulting
distribution of mapped samples can be efficiently evaluated and used as a
proposal within sequential Monte Carlo samplers. We demonstrate significant
gains over state-of-the-art sequential Monte Carlo samplers at a fixed
computational complexity on a variety of applications.
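The transport idea above can be illustrated in a toy setting where the map is available analytically (this is a minimal sketch and not the paper's Gibbs flow construction): to move draws from $\pi_0 = \mathcal{N}(0,1)$ to $\pi_1 = \mathcal{N}(\mu,\sigma^2)$, the exact map $T(x) = \mu + \sigma x$ can be realized as the time-one flow of an ODE along the linear interpolation $x(t) = (1-t)x_0 + tT(x_0)$, which one then discretizes with explicit Euler steps.

```python
import numpy as np

# Toy illustration (not the paper's Gibbs flow): transport N(0,1) draws to
# N(mu, sigma^2) by time-discretizing an ODE whose time-one flow realizes the
# exact map T(x) = mu + sigma * x along x(t) = (1 - t) * x0 + t * T(x0).
mu, sigma = 2.0, 0.5

def velocity(t, x):
    # dx/dt = x0 * (sigma - 1) + mu, with x0 recovered from (t, x)
    x0 = (x - t * mu) / (1.0 - t + t * sigma)
    return x0 * (sigma - 1.0) + mu

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # draws from pi_0 = N(0, 1)
n_steps = 100
dt = 1.0 / n_steps
for k in range(n_steps):           # explicit Euler time-discretization
    x = x + dt * velocity(k * dt, x)

print(x.mean(), x.std())           # close to mu and sigma
```

Because each trajectory is linear in $t$ here, Euler is exact; for the intractable targets considered in the paper the velocity field is instead built from the target's full conditional distributions and both time-discretization and numerical approximation introduce error, which is why the mapped samples are used as an SMC proposal rather than as exact draws.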

### Controlled Sequential Monte Carlo

Sequential Monte Carlo methods, also known as particle methods, are a popular
set of techniques for approximating high-dimensional probability distributions
and their normalizing constants. These methods have found numerous applications
in statistics and related fields; e.g. for inference in non-linear non-Gaussian
state space models, and in complex static models. Like many Monte Carlo
sampling schemes, they rely on proposal distributions which crucially impact
their performance. We introduce here a class of controlled sequential Monte
Carlo algorithms, where the proposal distributions are determined by
approximating the solution to an associated optimal control problem using an
iterative scheme. This method builds upon a number of existing algorithms in
econometrics, physics, and statistics for inference in state space models, and
generalizes these methods so as to accommodate complex static models. We
provide a theoretical analysis concerning the fluctuation and stability of this
methodology that also provides insight into the properties of related
algorithms. We demonstrate significant gains over state-of-the-art methods at a
fixed computational complexity on a variety of applications.
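For context, the following is a sketch of the plain tempered SMC sampler that serves as the baseline such methods improve upon (the bridge distributions, target, and tuning below are illustrative assumptions, not the paper's setup): particles are reweighted between successive tempered targets, resampled, and moved with a Metropolis step, yielding an estimate of the normalizing constant.

```python
import numpy as np

# Plain tempered SMC sampler (the kind of baseline that controlled SMC
# refines): bridge from pi_0 = N(0, 1) to the unnormalized target
# gamma(x) = exp(-(x - mu)^2 / (2 s^2)), whose constant Z = s * sqrt(2 pi)
# is known, so the estimate can be checked.
mu, s = 3.0, 0.7
log_gamma = lambda x: -0.5 * (x - mu) ** 2 / s**2
log_pi0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(1)
N, T = 5_000, 50
lams = np.linspace(0.0, 1.0, T + 1)      # tempering schedule
x = rng.standard_normal(N)
log_Z = 0.0
for t in range(1, T + 1):
    # incremental importance weights between successive tempered targets
    lw = (lams[t] - lams[t - 1]) * (log_gamma(x) - log_pi0(x))
    log_Z += np.log(np.mean(np.exp(lw - lw.max()))) + lw.max()
    w = np.exp(lw - lw.max()); w /= w.sum()
    x = x[rng.choice(N, N, p=w)]         # multinomial resampling
    # one random-walk Metropolis move targeting the current tempered target
    prop = x + 0.5 * rng.standard_normal(N)
    log_tgt = lambda y: (1 - lams[t]) * log_pi0(y) + lams[t] * log_gamma(y)
    accept = np.log(rng.random(N)) < log_tgt(prop) - log_tgt(x)
    x = np.where(accept, prop, x)

print(log_Z, np.log(s * np.sqrt(2 * np.pi)))   # estimate vs. truth
```

Controlled SMC replaces the naive proposals implicit in such a scheme with ones obtained by iteratively approximating the solution of an associated optimal control problem.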

### A Multilevel Approach for Stochastic Nonlinear Optimal Control

We consider a class of finite-time-horizon nonlinear stochastic optimal
control problems, where the control acts additively on the dynamics and the
control cost is quadratic. This framework is flexible and has found
applications in many domains. Although the optimal control admits a path
integral representation for this class of control problems, efficient
computation of the associated path integrals remains a challenging Monte Carlo
task. The focus of this article is to propose a new Monte Carlo approach that
significantly improves upon existing methodology. Our proposed methodology
first tackles the issue of exponential growth in variance with the time horizon
by casting optimal control estimation as a smoothing problem for a state space
model associated with the control problem, and applying smoothing algorithms
based on particle Markov chain Monte Carlo. To further reduce computational
cost, we then develop a multilevel Monte Carlo method which allows us to obtain
an estimator of the optimal control with $\mathcal{O}(\epsilon^2)$ mean squared
error with a computational cost of
$\mathcal{O}(\epsilon^{-2}\log(\epsilon)^2)$. In contrast, a computational cost
of $\mathcal{O}(\epsilon^{-3})$ is required for existing methodology to achieve
the same mean squared error. Our approach is illustrated on two numerical
examples, which validate our theory.
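The complexity gain comes from the standard multilevel Monte Carlo telescoping sum; the following generic sketch (a Giles-style MLMC estimator for a simple SDE expectation, not the paper's smoothing-based estimator) shows the key ingredient, namely coupling fine and coarse discretizations with the same Brownian increments so the level corrections have small variance.

```python
import numpy as np

# Generic MLMC sketch (telescoping sum over discretization levels), not the
# paper's particle-smoothing estimator: estimate E[X_T] for the OU process
# dX = -X dt + dW, X_0 = 1, whose exact answer is exp(-T).
rng = np.random.default_rng(2)
T = 1.0

def coupled_diff(l, n_samples):
    """Mean of P_l - P_{l-1}, with both levels driven by the same noise."""
    nf = 2**l
    dt_f = T / nf
    dW = np.sqrt(dt_f) * rng.standard_normal((n_samples, nf))
    xf = np.ones(n_samples)
    for k in range(nf):                       # fine Euler path, step dt_f
        xf += -xf * dt_f + dW[:, k]
    if l == 0:
        return xf.mean()
    xc = np.ones(n_samples)
    dt_c = 2 * dt_f
    for k in range(nf // 2):                  # coarse path, paired increments
        xc += -xc * dt_c + dW[:, 2 * k] + dW[:, 2 * k + 1]
    return (xf - xc).mean()

L = 6
est = sum(coupled_diff(l, 20_000) for l in range(L + 1))
print(est, np.exp(-T))   # MLMC estimate vs. exact E[X_T]
```

Because the coupled differences shrink with the level, most samples can be spent on cheap coarse levels, which is what drives the cost down from $\mathcal{O}(\epsilon^{-3})$ toward $\mathcal{O}(\epsilon^{-2}\log(\epsilon)^2)$ in the setting of the paper.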

### Diffusion Schr\"odinger Bridge with Applications to Score-Based Generative Modeling

Progressively applying Gaussian noise transforms complex data distributions
into approximately Gaussian ones. Reversing this dynamic defines a generative model.
When the forward noising process is given by a Stochastic Differential Equation
(SDE), Song et al. (2021) demonstrate how the time inhomogeneous drift of the
associated reverse-time SDE may be estimated using score-matching. A limitation
of this approach is that the forward-time SDE must be run for a sufficiently
long time for the final distribution to be approximately Gaussian. In contrast,
solving the Schr\"odinger Bridge problem (SB), i.e. an entropy-regularized
optimal transport problem on path spaces, yields diffusions which generate
samples from the data distribution in finite time. We present Diffusion SB
(DSB), an original approximation of the Iterative Proportional Fitting (IPF)
procedure to solve the SB problem, and provide theoretical analysis along with
generative modeling experiments. The first DSB iteration recovers the
methodology proposed by Song et al. (2021), with the flexibility of using
shorter time intervals, as subsequent DSB iterations reduce the discrepancy
between the final-time marginal of the forward (resp. backward) SDE and the
prior (resp. data) distribution. Beyond generative modeling, DSB
offers a widely applicable computational optimal transport tool as the
continuous state-space analogue of the popular Sinkhorn algorithm (Cuturi,
2013).
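The discrete-state analogy can be made concrete: the Sinkhorn algorithm of Cuturi (2013) solves entropy-regularized optimal transport between two histograms by alternating marginal projections, which is exactly the Iterative Proportional Fitting scheme that DSB approximates on path space. A minimal implementation (with illustrative cost matrix and marginals):

```python
import numpy as np

# Sinkhorn iterations: entropic OT between histograms a and b with cost
# matrix C and regularization eps. The alternating scaling updates are the
# discrete-space IPF procedure.
def sinkhorn(a, b, C, eps=0.5, n_iter=500):
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2       # squared-distance cost
a = np.full(5, 0.2)
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
P = sinkhorn(a, b, C)
print(P.sum(axis=1), P.sum(axis=0))      # recovers marginals a and b
```

DSB plays the same game with SDE half-bridges in place of the scaling vectors `u` and `v`, alternately matching the data and prior marginals.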

### An invitation to sequential Monte Carlo samplers

Sequential Monte Carlo samplers provide consistent approximations of
sequences of probability distributions and of their normalizing constants, via
particles obtained with a combination of importance weights and Markov
transitions. This article presents this class of methods and a number of recent
advances, with the goal of helping statisticians assess the applicability and
usefulness of these methods for their purposes. Our presentation emphasizes the
role of bridging distributions for computational and statistical purposes.
Numerical experiments are provided on simple settings such as multivariate
Normals, logistic regression and a basic susceptible-infected-recovered model,
illustrating the impact of the dimension, the ability to perform inference
sequentially and the estimation of normalizing constants.


### Ensemble generation and compression for speech recognition

For many tasks in machine learning, performance gains can often be obtained by combining together an ensemble of multiple systems. In Automatic Speech Recognition (ASR), a range of approaches can be used to combine an ensemble when performing recognition. However, many of these have computational costs that scale linearly with the ensemble size. One method to address this is teacher-student learning, which compresses the ensemble into a single student. The student is trained to emulate the combined ensemble, and only the student needs to be used when performing recognition. This thesis investigates both methods for ensemble generation and methods for ensemble compression.
The first contribution of this thesis is to explore approaches of generating multiple systems for an ensemble. The combined ensemble performance depends on both the accuracy of the individual members of the ensemble, as well as the diversity between their behaviours. The structured nature of speech allows for many ways that systems can be made different from each other. The experiments suggest that significant combination gains can be obtained by combining systems with different acoustic models, sets of state clusters, and sets of sub-word units. When performing recognition, these ensembles can be combined at the hypothesis and frame levels. However, these combination methods can be computationally expensive, as data is processed by multiple systems.
This thesis also considers approaches to compress an ensemble, and reduce the computational cost when performing recognition. Teacher-student learning is one such method. In standard teacher-student learning, information about the per-frame state cluster posteriors is propagated from the teacher ensemble to the student, to train the student to emulate the ensemble. However, this has two limitations. First, it requires that the teachers and student all use the same set of state clusters. This limits the allowed forms of diversities that the ensemble can have. Second, ASR is a sequence modelling task, and the frame-level posteriors that are propagated may not effectively convey all information about the sequence-level behaviours of the teachers. This thesis addresses both of these limitations.
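The standard frame-level objective described above can be sketched as follows (an assumed minimal form, not the thesis's exact recipe): the student minimizes, per frame, the KL divergence from the combined teacher state-cluster posteriors, which up to a constant is the cross-entropy between the averaged teacher posteriors and the student's log-probabilities.

```python
import numpy as np

# Frame-level teacher-student objective sketch (illustrative shapes and
# names, not the thesis's implementation): the ensemble is combined by
# averaging per-frame state-cluster posteriors, and the student is trained
# toward that target.
def teacher_student_loss(teacher_posteriors, student_log_probs):
    """teacher_posteriors: (n_teachers, n_frames, n_states);
       student_log_probs: (n_frames, n_states), log-softmax outputs."""
    target = teacher_posteriors.mean(axis=0)   # combined ensemble posteriors
    # cross-entropy = KL(target || student) + const (target entropy)
    return -(target * student_log_probs).sum(axis=-1).mean()

rng = np.random.default_rng(3)
t = rng.random((4, 10, 8))
t /= t.sum(-1, keepdims=True)                  # 4 teachers, 10 frames, 8 states
logits = rng.standard_normal((10, 8))
logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
print(teacher_student_loss(t, logp))
```

This formulation makes the two limitations concrete: the average over teachers requires a shared state-cluster axis, and the per-frame sum carries no information about sequence-level behaviour, which motivates the mapping and sequence-level generalizations proposed in the following contributions.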
The second contribution of this thesis is to address the first limitation, and allow for different sets of state clusters between systems. The proposed method maps the state cluster posteriors from the teachers' sets of state clusters to that of the student. The map is derived by considering a distance measure between posteriors of unclustered logical context-dependent states, instead of the usual state cluster. The experiments suggest that this proposed method can allow a student to effectively learn from an ensemble that has a diversity of state cluster sets. However, the experiments also suggest that the student may need to have a large set of state clusters to effectively emulate this ensemble. This thesis proposes to use a student with a multi-task topology, with an output layer for each of the different sets of state clusters. This can capture the phonetic resolution of having multiple sets of state clusters, while having fewer parameters than a student with a single large output layer.
The third contribution of this thesis is to address the second limitation of standard teacher-student learning, that only frame-level information is propagated to emulate the ensemble behaviour for the sequence modelling ASR task. This thesis proposes to generalise teacher-student learning to the sequence level, and propagate sequence posterior information. The proposed methods can also allow for many forms of ensemble diversities. The experiments suggest that by using these sequence-level methods, a student can learn to emulate the ensemble better. Recently, the lattice-free method has been proposed to train a system directly toward a sequence discriminative criterion. Ensembles of these systems can exhibit highly diverse behaviours, because the systems are not biased toward any cross-entropy forced alignments. It is difficult to apply standard frame-level teacher-student learning with these lattice-free systems, as they are often not designed to produce state cluster posteriors. Sequence-level teacher-student learning operates directly on the sequence posteriors, and can therefore be used directly with these lattice-free systems.
The proposals in this thesis are assessed on four ASR tasks. These are the augmented multi-party interaction meeting transcription, IARPA Babel Tok Pisin conversational telephone speech, English broadcast news, and multi-genre broadcast tasks. These datasets provide a variety of quantities of training data, recording environments, and speaking styles.
