121 research outputs found
Efficient Transition Probability Computation for Continuous-Time Branching Processes via Compressed Sensing
Branching processes are a class of continuous-time Markov chains (CTMCs) with
ubiquitous applications. A general difficulty in statistical inference under
partially observed CTMC models arises in computing transition probabilities
when the discrete state space is large or uncountable. Classical methods such
as matrix exponentiation are infeasible for large or countably infinite state
spaces, and sampling-based alternatives are computationally intensive,
requiring a large integration step to impute over all possible hidden events.
Recent work has successfully applied generating function techniques to
computing transition probabilities for linear multitype branching processes.
While these techniques often require significantly fewer computations than
matrix exponentiation, they also become prohibitive in applications with large
populations. We propose a compressed sensing framework that significantly
accelerates the generating function method, decreasing computational cost up to
a logarithmic factor by only assuming the probability mass of transitions is
sparse. We demonstrate accurate and efficient transition probability
computations in branching process models for hematopoiesis and transposable
element evolution.Comment: 18 pages, 4 figures, 2 table
Phylogenetic Stochastic Mapping without Matrix Exponentiation
Phylogenetic stochastic mapping is a method for reconstructing the history of
trait changes on a phylogenetic tree relating species/organisms carrying the
trait. State-of-the-art methods assume that the trait evolves according to a
continuous-time Markov chain (CTMC) and work well for small state spaces. The
computations slow down considerably for larger state spaces (e.g. space of
codons), because current methodology relies on exponentiating CTMC
infinitesimal rate matrices -- an operation whose computational complexity
grows as the size of the CTMC state space cubed. In this work, we introduce a
new approach, based on a CTMC technique called uniformization, that does not
use matrix exponentiation for phylogenetic stochastic mapping. Our method is
based on a new Markov chain Monte Carlo (MCMC) algorithm that targets the
distribution of trait histories conditional on the trait data observed at the
tips of the tree. The computational complexity of our MCMC method grows as the
size of the CTMC state space squared. Moreover, in contrast to competing matrix
exponentiation methods, if the rate matrix is sparse, we can leverage this
sparsity and increase the computational efficiency of our algorithm further.
Using simulated data, we illustrate advantages of our MCMC algorithm and
investigate how large the state space needs to be for our method to outperform
matrix exponentiation approaches. We show that even on the moderately large
state space of codons our MCMC method can be significantly faster than
currently used matrix exponentiation methods.Comment: 33 pages, including appendice
Locally adaptive smoothing with Markov random fields and shrinkage priors
We present a locally adaptive nonparametric curve fitting method that
operates within a fully Bayesian framework. This method uses shrinkage priors
to induce sparsity in order-k differences in the latent trend function,
providing a combination of local adaptation and global control. Using a scale
mixture of normals representation of shrinkage priors, we make explicit
connections between our method and kth order Gaussian Markov random field
smoothing. We call the resulting processes shrinkage prior Markov random fields
(SPMRFs). We use Hamiltonian Monte Carlo to approximate the posterior
distribution of model parameters because this method provides superior
performance in the presence of the high dimensionality and strong parameter
correlations exhibited by our models. We compare the performance of three prior
formulations using simulated data and find the horseshoe prior provides the
best compromise between bias and precision. We apply SPMRF models to two
benchmark data examples frequently used to test nonparametric methods. We find
that this method is flexible enough to accommodate a variety of data generating
models and offers the adaptive properties and computational tractability to
make it a useful addition to the Bayesian nonparametric toolbox.Comment: 38 pages, to appear in Bayesian Analysi
A linear noise approximation for stochastic epidemic models fit to partially observed incidence counts
Stochastic epidemic models (SEMs) fit to incidence data are critical to
elucidating outbreak dynamics, shaping response strategies, and preparing for
future epidemics. SEMs typically represent counts of individuals in discrete
infection states using Markov jump processes (MJPs), but are computationally
challenging as imperfect surveillance, lack of subject-level information, and
temporal coarseness of the data obscure the true epidemic. Analytic integration
over the latent epidemic process is impossible, and integration via Markov
chain Monte Carlo (MCMC) is cumbersome due to the dimensionality and
discreteness of the latent state space. Simulation-based computational
approaches can address the intractability of the MJP likelihood, but are
numerically fragile and prohibitively expensive for complex models. A linear
noise approximation (LNA) that approximates the MJP transition density with a
Gaussian density has been explored for analyzing prevalence data in
large-population settings, but requires modification for analyzing incidence
counts without assuming that the data are normally distributed. We demonstrate
how to reparameterize SEMs to appropriately analyze incidence data, and fold
the LNA into a data augmentation MCMC framework that outperforms deterministic
methods, statistically, and simulation-based methods, computationally. Our
framework is computationally robust when the model dynamics are complex and
applies to a broad class of SEMs. We evaluate our method in simulations that
reflect Ebola, influenza, and SARS-CoV-2 dynamics, and apply our method to
national surveillance counts from the 2013--2015 West Africa Ebola outbreak
Imputation Estimators Partially Correct for Model Misspecification
Inference problems with incomplete observations often aim at estimating
population properties of unobserved quantities. One simple way to accomplish
this estimation is to impute the unobserved quantities of interest at the
individual level and then take an empirical average of the imputed values. We
show that this simple imputation estimator can provide partial protection
against model misspecification. We illustrate imputation estimators' robustness
to model specification on three examples: mixture model-based clustering,
estimation of genotype frequencies in population genetics, and estimation of
Markovian evolutionary distances. In the final example, using a representative
model misspecification, we demonstrate that in non-degenerate cases, the
imputation estimator dominates the plug-in estimate asymptotically. We conclude
by outlining a Bayesian implementation of the imputation-based estimation.Comment: major rewrite, beta-binomial example removed, model based clustering
is added to the mixture model example, Bayesian approach is now illustrated
with the genetics exampl
Efficient data augmentation for fitting stochastic epidemic models to prevalence data
Stochastic epidemic models describe the dynamics of an epidemic as a disease
spreads through a population. Typically, only a fraction of cases are observed
at a set of discrete times. The absence of complete information about the time
evolution of an epidemic gives rise to a complicated latent variable problem in
which the state space size of the epidemic grows large as the population size
increases. This makes analytically integrating over the missing data infeasible
for populations of even moderate size. We present a data augmentation Markov
chain Monte Carlo (MCMC) framework for Bayesian estimation of stochastic
epidemic model parameters, in which measurements are augmented with
subject-level disease histories. In our MCMC algorithm, we propose each new
subject-level path, conditional on the data, using a time-inhomogeneous
continuous-time Markov process with rates determined by the infection histories
of other individuals. The method is general, and may be applied, with minimal
modifications, to a broad class of stochastic epidemic models. We present our
algorithm in the context of multiple stochastic epidemic models in which the
data are binomially sampled prevalence counts, and apply our method to data
from an outbreak of influenza in a British boarding school
Fitting stochastic epidemic models to gene genealogies using linear noise approximation
Phylodynamics is a set of population genetics tools that aim at
reconstructing demographic history of a population based on molecular sequences
of individuals sampled from the population of interest. One important task in
phylodynamics is to estimate changes in (effective) population size. When
applied to infectious disease sequences such estimation of population size
trajectories can provide information about changes in the number of infections.
To model changes in the number of infected individuals, current phylodynamic
methods use non-parametric approaches, parametric approaches, and stochastic
modeling in conjunction with likelihood-free Bayesian methods. The first class
of methods yields results that are hard-to-interpret epidemiologically. The
second class of methods provides estimates of important epidemiological
parameters, such as infection and removal/recovery rates, but ignores variation
in the dynamics of infectious disease spread. The third class of methods is the
most advantageous statistically, but relies on computationally intensive
particle filtering techniques that limits its applications. We propose a
Bayesian model that combines phylodynamic inference and stochastic epidemic
models, and achieves computational tractability by using a linear noise
approximation (LNA) --- a technique that allows us to approximate probability
densities of stochastic epidemic model trajectories. LNA opens the door for
using modern Markov chain Monte Carlo tools to approximate the joint posterior
distribution of the disease transmission parameters and of high dimensional
vectors describing unobserved changes in the stochastic epidemic model
compartment sizes (e.g., numbers of infectious and susceptible individuals). We
apply our estimation technique to Ebola genealogies estimated using viral
genetic data from the 2014 epidemic in Sierra Leone and Liberia.Comment: 43 pages, 6 figures in the main tex
rbrothers: R Package for Bayesian Multiple Change-Point Recombination Detection.
Phylogenetic recombination detection is a fundamental task in bioinformatics and evolutionary biology. Most of the computational tools developed to attack this important problem are not integrated into the growing suite of R packages for statistical analysis of molecular sequences. Here, we present an R package, rbrothers, that makes a Bayesian multiple change-point model, one of the most sophisticated model-based phylogenetic recombination tools, available to R users. Moreover, we equip the Bayesian change-point model with a set of pre- and post- processing routines that will broaden the application domain of this recombination detection framework. Specifically, we implement an algorithm that forms the set of input trees required by multiple change-point models. We also provide functionality for checking Markov chain Monte Carlo convergence and creating estimation result summaries and graphics. Using rbrothers, we perform a comparative analysis of two Salmonella enterica genes, fimA and fimH, that encode major and adhesive subunits of the type 1 fimbriae, respectively. We believe that rbrothers, available at R-Forge: http://evolmod.r-forge.r-project.org/, will allow researchers to incorporate recombination detection into phylogenetic workflows already implemented in R
- …