Optimal detection of changepoints with a linear computational cost
We consider the problem of detecting multiple changepoints in large data
sets. Our focus is on applications where the number of changepoints will
increase as we collect more data: for example in genetics as we analyse larger
regions of the genome, or in finance as we observe time-series over longer
periods. We consider the common approach of detecting changepoints through
minimising a cost function over possible numbers and locations of changepoints.
This includes several established procedures for detecting changepoints,
such as penalised likelihood and minimum description length. We introduce a new
method for finding the minimum of such cost functions and hence the optimal
number and location of changepoints that has a computational cost which, under
mild conditions, is linear in the number of observations. This compares
favourably with existing methods for the same problem whose computational cost
can be quadratic or even cubic. In simulation studies we show that our new
method can be orders of magnitude faster than these alternative exact methods.
We also compare with the Binary Segmentation algorithm for identifying
changepoints, showing that the exactness of our approach can lead to
substantial improvements in the accuracy of the inferred segmentation of the
data.

Comment: 25 pages, 4 figures. To appear in the Journal of the American Statistical Association.
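The penalised-cost minimisation described above can be sketched as a PELT-style pruned dynamic programme. This is an illustrative sketch, not the authors' code: it assumes a squared-error segment cost (a change in mean) and a fixed penalty `beta` per changepoint, and it prunes candidate last-changepoint positions that can never again be optimal.

```python
import math

def seg_cost(csum, csum2, s, t):
    """Squared-error cost of fitting a constant mean to x[s:t] (0-indexed, half-open)."""
    n = t - s
    total = csum[t] - csum[s]
    total2 = csum2[t] - csum2[s]
    return total2 - total * total / n

def pelt(x, beta):
    """PELT-style pruned dynamic programming: minimise the sum of segment costs
    plus beta per changepoint. Returns the optimal changepoint positions."""
    n = len(x)
    csum = [0.0] * (n + 1)
    csum2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        csum[i + 1] = csum[i] + v
        csum2[i + 1] = csum2[i] + v * v
    F = [0.0] * (n + 1)      # F[t] = optimal cost of segmenting x[0:t]
    F[0] = -beta             # so the penalty counts changepoints, not segments
    last = [0] * (n + 1)     # optimal last changepoint before t
    cand = [0]               # pruned set of candidate last-changepoint positions
    for t in range(1, n + 1):
        best, arg = math.inf, 0
        for s in cand:
            c = F[s] + seg_cost(csum, csum2, s, t) + beta
            if c < best:
                best, arg = c, s
        F[t], last[t] = best, arg
        # prune candidates that can never be optimal for any future t
        cand = [s for s in cand if F[s] + seg_cost(csum, csum2, s, t) <= F[t]]
        cand.append(t)
    cps, t = [], n           # back-track the optimal segmentation
    while t > 0:
        cps.append(last[t])
        t = last[t]
    return sorted(cps)[1:]   # drop the leading 0
```

Under mild conditions on the number of changepoints, the pruning keeps the candidate set small, which is what gives the expected linear-time behaviour.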
Bayesian Detection of Changepoints in Finite-State Markov Chains for Multiple Sequences
We consider the analysis of sets of categorical sequences consisting of
piecewise homogeneous Markov segments. The sequences are assumed to be governed
by a common underlying process with segments occurring in the same order for
each sequence. Segments are defined by a set of unobserved changepoints where
the positions and number of changepoints can vary from sequence to sequence. We
propose a Bayesian framework for analyzing such data, placing priors on the
locations of the changepoints and on the transition matrices and using Markov
chain Monte Carlo (MCMC) techniques to obtain posterior samples given the data.
Experimental results using simulated data illustrate how the methodology can
be used to infer posterior distributions for parameters and
changepoints, and how it handles considerable variability in the
locations of the changepoints across different sequences. We also investigate
the application of the approach to sequential data from two domains:
monsoonal rainfall patterns and branching patterns in trees.
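A minimal sketch of the Bayesian idea, greatly simplified from the paper's setting: for a single changepoint in one binary sequence (rather than Markov segments across multiple sequences), conjugate Beta priors on each segment's success probability make the posterior over the changepoint position available by exact enumeration, with no MCMC needed. All names here are illustrative, not from the paper.

```python
from math import lgamma, exp

def log_beta_marginal(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k successes in n Bernoulli trials under a Beta(a, b) prior."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + k) + lgamma(b + n - k) - lgamma(a + b + n))

def changepoint_posterior(x):
    """Posterior over a single changepoint position tau in a binary sequence x,
    assuming a uniform prior on tau and independent Beta(1, 1) priors on the
    success probability of each segment."""
    n = len(x)
    logp = []
    for tau in range(1, n):          # changepoint after position tau
        left, right = x[:tau], x[tau:]
        logp.append(log_beta_marginal(sum(left), len(left))
                    + log_beta_marginal(sum(right), len(right)))
    m = max(logp)                    # normalise in log space for stability
    w = [exp(v - m) for v in logp]
    z = sum(w)
    return [v / z for v in w]        # posterior P(tau = 1..n-1 | x)
```

In the paper's richer model, with unobserved changepoint counts, per-sequence positions, and Markov transition matrices, this enumeration is no longer tractable, which is where MCMC comes in.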
On Optimal Multiple Changepoint Algorithms for Large Data
There is an increasing need for algorithms that can accurately detect
changepoints in long time-series, or other ordered data. Many common approaches
to detecting changepoints, for example based on penalised likelihood or minimum
description length, can be formulated in terms of minimising a cost over
segmentations. Dynamic programming methods exist to solve this minimisation
problem exactly, but these tend to scale at least quadratically in the length
of the time-series. Algorithms, such as Binary Segmentation, exist that have a
computational cost that is close to linear in the length of the time-series,
but these are not guaranteed to find the optimal segmentation. Recently, pruning
ideas have been suggested that can speed up the dynamic programming algorithms,
whilst still being guaranteed to find the true minimum of the cost function. Here
we extend these pruning methods, and introduce two new algorithms for
segmenting data, FPOP and SNIP. Empirical results show that FPOP is
substantially faster than existing dynamic programming methods, and unlike the
existing methods its computational efficiency is robust to the number of
changepoints in the data. We evaluate the method at detecting Copy Number
Variations and observe that FPOP has a computational cost that is competitive
with that of Binary Segmentation.

Comment: 20 pages.
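For contrast with the exact dynamic-programming search, Binary Segmentation can be sketched in a few lines. This is a greedy illustrative sketch, not the paper's implementation: it uses a squared-error cost and recursively splits wherever the cost reduction exceeds a penalty `beta`, which is why it is fast but not guaranteed to find the optimal segmentation.

```python
def sse(x):
    """Sum of squared errors around the mean of x."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x)

def binseg(x, beta, offset=0):
    """Binary Segmentation: find the single split that most reduces the
    squared-error cost; if the reduction beats the penalty beta, keep the
    split and recurse on both halves. Greedy, hence not exact."""
    n = len(x)
    if n < 2:
        return []
    full = sse(x)
    best_gain, best_k = 0.0, None
    for k in range(1, n):
        gain = full - sse(x[:k]) - sse(x[k:])
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None or best_gain <= beta:
        return []
    return (binseg(x[:best_k], beta, offset)
            + [offset + best_k]
            + binseg(x[best_k:], beta, offset + best_k))
```

On well-separated segments the greedy split coincides with the optimal one; the accuracy gap the paper measures appears when changepoints mask each other, so that no single split looks worthwhile even though a pair of splits would.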
Doubly robust Bayesian inference for non-stationary streaming data with β-divergences
We present the first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with β-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: firstly, we make GBI scalable using Structural Variational approximations that are exact as β→0; secondly, we give a principled way of choosing the divergence parameter β by minimizing expected predictive loss on-line. We match the state of the art and improve the False Discovery Rate of CPs by more than 80% on real-world data.
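The β-divergence loss that replaces the log-likelihood in GBI can be written down in closed form for a Gaussian predictive. The function below is an illustrative sketch, not the paper's code; it shows the source of the robustness: unlike the negative log-likelihood, which grows without bound in the observation, this loss stays bounded as an observation moves arbitrarily far from the predictive mean, so a single outlier has limited influence on the posterior.

```python
import math

def beta_loss_gaussian(x, mu, sigma, beta):
    """beta-divergence loss of observation x under a N(mu, sigma^2) predictive:
        -(1/beta) * p(x)^beta + (1/(beta+1)) * integral of p(y)^(beta+1) dy,
    where the integral for a Gaussian is (2*pi*sigma^2)^(-beta/2) / sqrt(beta+1).
    As beta -> 0 this recovers the negative log-likelihood up to a constant;
    for beta > 0 the loss is bounded in x."""
    p = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    integral = (2 * math.pi * sigma ** 2) ** (-beta / 2) / math.sqrt(beta + 1)
    return -(p ** beta) / beta + integral / (beta + 1)
```

Tuning β trades robustness against efficiency, which is why the paper's on-line selection of β by expected predictive loss matters in practice.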
Bayesian Change Point Analysis of Copy Number Variants Using Human Next Generation Sequencing Data
Title from PDF of title page, viewed on June 8, 2015. Dissertation advisor: Jie Chen. Vita. Includes bibliographic references (pages 127-134). Thesis (Ph.D.)--Department of Mathematics and Statistics and School of Biological Sciences, University of Missouri--Kansas City, 2014.

Read count analysis is the principal strategy implemented in detection of copy number variants using human next generation sequencing (NGS) data. Read count data
from NGS has been demonstrated to follow non-homogeneous Poisson distributions.
Current change point analysis methods for the detection of copy number variants are
based on a normal distribution assumption and use an ordinary normal approximation in
their algorithms. To improve sensitivity and reduce the false positive rate in the detection
of copy number variants, we developed three models: a Bayesian Anscombe normal approximation model for a single genome, a Bayesian Poisson model for a single
genome, and a Bayesian Anscombe normal approximation model for paired genomes.
The Bayesian statistics have been optimized for the detection of change points and copy
numbers at single and multiple change points through Monte Carlo simulations. Three
R packages based on these models have been built to simulate Poisson-distributed data and to estimate and display copy number variants in tables and graphics. The high
sensitivity and specificity of these models have been demonstrated in simulated read
count data with a known Poisson distribution, and in human NGS read count data, in comparison to other popular packages.

Background -- Single genome Bayesian approaches in NGS read count analysis -- Normal approximation Bayesian change point model for paired genomes -- Conclusion and future work
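The Anscombe normal approximation rests on a classical variance-stabilising transform: for a Poisson count x, A(x) = 2*sqrt(x + 3/8) is approximately normal with variance 1 once the Poisson mean is moderately large, which is what allows normal-theory change point methods to be applied to read counts. A small illustrative check, not from the dissertation:

```python
import math
import random

def anscombe(x):
    """Anscombe transform of a Poisson count: 2 * sqrt(x + 3/8)."""
    return 2.0 * math.sqrt(x + 0.375)

def poisson_sample(lam, rng):
    """Knuth's multiplication method for sampling Poisson(lam) (fine for moderate lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def transformed_variance(lam, n=20000, seed=0):
    """Empirical variance of Anscombe-transformed Poisson(lam) samples;
    it should sit near 1 regardless of lam."""
    rng = random.Random(seed)
    ys = [anscombe(poisson_sample(lam, rng)) for _ in range(n)]
    m = sum(ys) / n
    return sum((y - m) ** 2 for y in ys) / n
```

Because the transformed variance is roughly constant in the Poisson mean, a change in copy number shows up as a change in the mean of an approximately unit-variance normal sequence, which is the setting the normal change point models assume.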