270 research outputs found

    Optimal detection of changepoints with a linear computational cost

    We consider the problem of detecting multiple changepoints in large data sets. Our focus is on applications where the number of changepoints will increase as we collect more data: for example in genetics as we analyse larger regions of the genome, or in finance as we observe time-series over longer periods. We consider the common approach of detecting changepoints through minimising a cost function over possible numbers and locations of changepoints. This includes several established procedures for detecting changepoints, such as penalised likelihood and minimum description length. We introduce a new method for finding the minimum of such cost functions, and hence the optimal number and location of changepoints, that has a computational cost which, under mild conditions, is linear in the number of observations. This compares favourably with existing methods for the same problem, whose computational cost can be quadratic or even cubic. In simulation studies we show that our new method can be orders of magnitude faster than these alternative exact methods. We also compare with the Binary Segmentation algorithm for identifying changepoints, showing that the exactness of our approach can lead to substantial improvements in the accuracy of the inferred segmentation of the data. Comment: 25 pages, 4 figures. To appear in Journal of the American Statistical Association.
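
A minimal sketch of the idea of minimising a penalised cost over segmentations with pruning. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function name `pruned_partition`, the squared-error cost for a change in mean, and the pruning rule (drop a candidate last changepoint `s` once `F[s] + cost(s, t)` exceeds `F[t]`, which is safe for costs where splitting a segment never increases the total cost).

```python
from itertools import accumulate


def pruned_partition(y, penalty):
    """Return changepoint positions minimising squared-error cost + penalty per changepoint.

    Illustrative sketch only: the cost and the pruning rule are assumptions,
    not details taken from the abstract.
    """
    n = len(y)
    cs = [0.0] + list(accumulate(y))                  # cumulative sums
    cs2 = [0.0] + list(accumulate(v * v for v in y))  # cumulative sums of squares

    def cost(s, t):
        # Sum of squared deviations from the mean of y[s:t].
        total = cs[t] - cs[s]
        return (cs2[t] - cs2[s]) - total * total / (t - s)

    best = [0.0] * (n + 1)   # best[t]: optimal penalised cost of y[:t]
    best[0] = -penalty       # so that the first segment is not penalised
    last = [0] * (n + 1)     # last[t]: optimal last changepoint before t
    candidates = [0]         # candidate last changepoints still in play

    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(scored)
        # Pruning step: keep s only if it could still be optimal at a later time.
        candidates = [s for value, s in scored if value - penalty <= best[t]]
        candidates.append(t)

    # Backtrack through the stored last changepoints.
    changepoints = []
    t = n
    while last[t] > 0:
        t = last[t]
        changepoints.append(t)
    return sorted(changepoints)


signal = [0.0] * 20 + [5.0] * 20 + [0.0] * 20
print(pruned_partition(signal, penalty=10.0))
```

Without the pruning step the loop is the standard quadratic dynamic programme; the pruning only shrinks the candidate set and leaves the minimiser unchanged under the stated assumption on the cost.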

    Bayesian Detection of Changepoints in Finite-State Markov Chains for Multiple Sequences

    We consider the analysis of sets of categorical sequences consisting of piecewise homogeneous Markov segments. The sequences are assumed to be governed by a common underlying process with segments occurring in the same order for each sequence. Segments are defined by a set of unobserved changepoints where the positions and number of changepoints can vary from sequence to sequence. We propose a Bayesian framework for analyzing such data, placing priors on the locations of the changepoints and on the transition matrices and using Markov chain Monte Carlo (MCMC) techniques to obtain posterior samples given the data. Experimental results using simulated data illustrate how the methodology can be used for inference of posterior distributions for parameters and changepoints, as well as its ability to handle considerable variability in the locations of the changepoints across different sequences. We also investigate the application of the approach to sequential data from two applications involving monsoonal rainfall patterns and branching patterns in trees.

    On Optimal Multiple Changepoint Algorithms for Large Data

    There is an increasing need for algorithms that can accurately detect changepoints in long time-series, or equivalent, data. Many common approaches to detecting changepoints, for example based on penalised likelihood or minimum description length, can be formulated in terms of minimising a cost over segmentations. Dynamic programming methods exist to solve this minimisation problem exactly, but these tend to scale at least quadratically in the length of the time-series. Algorithms such as Binary Segmentation exist that have a computational cost that is close to linear in the length of the time-series, but these are not guaranteed to find the optimal segmentation. Recently, pruning ideas have been suggested that can speed up the dynamic programming algorithms whilst still being guaranteed to find the true minimum of the cost function. Here we extend these pruning methods and introduce two new algorithms for segmenting data, FPOP and SNIP. Empirical results show that FPOP is substantially faster than existing dynamic programming methods and that, unlike the existing methods, its computational efficiency is robust to the number of changepoints in the data. We evaluate the method at detecting Copy Number Variations and observe that FPOP has a computational cost that is competitive with that of Binary Segmentation. Comment: 20 pages.
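
For contrast with the exact dynamic programming methods, here is a rough sketch of Binary Segmentation as a greedy procedure. It is not FPOP or SNIP, and the details are assumptions chosen for illustration: the name `binary_segmentation`, the squared-error cost for a change in mean, and the stopping rule (split only while the cost reduction exceeds `penalty`).

```python
from itertools import accumulate


def binary_segmentation(y, penalty):
    """Greedy changepoint search: repeatedly split the segment at its best single cut.

    Illustrative sketch only; it can miss the optimal segmentation.
    """
    n = len(y)
    cs = [0.0] + list(accumulate(y))                  # cumulative sums
    cs2 = [0.0] + list(accumulate(v * v for v in y))  # cumulative sums of squares

    def cost(s, t):
        # Sum of squared deviations from the mean of y[s:t].
        total = cs[t] - cs[s]
        return (cs2[t] - cs2[s]) - total * total / (t - s)

    changepoints = []
    stack = [(0, n)]  # segments still to be examined
    while stack:
        s, t = stack.pop()
        if t - s < 2:
            continue  # a single point cannot be split
        whole = cost(s, t)
        # Best single cut k and the cost reduction it achieves.
        gain, k = max((whole - cost(s, k) - cost(k, t), k) for k in range(s + 1, t))
        if gain > penalty:
            changepoints.append(k)
            stack.extend([(s, k), (k, t)])
    return sorted(changepoints)


signal = [0.0] * 20 + [5.0] * 20 + [0.0] * 20
print(binary_segmentation(signal, penalty=10.0))
```

Each split is chosen without reconsidering earlier ones, which is why the procedure is fast but carries no optimality guarantee.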

    Doubly robust Bayesian inference for non-stationary streaming data with β-divergences

    We present the very first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with β-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: firstly, we make GBI scalable using Structural Variational approximations that are exact as β→0. Secondly, we give a principled way of choosing the divergence parameter β by minimizing expected predictive loss on-line. We achieve state-of-the-art performance and reduce the False Discovery Rate of CPs by more than 80% on real-world data.

    Bayesian Change Point Analysis of Copy Number Variants Using Human Next Generation Sequencing Data

    Title from PDF of title page, viewed on June 8, 2015. Dissertation advisor: Jie Chen. Vita. Includes bibliographic references (pages 127-134). Thesis (Ph.D.)--Department of Mathematics and Statistics and School of Biological Sciences, University of Missouri--Kansas City, 2014. Read count analysis is the principal strategy implemented in detection of copy number variants using human next generation sequencing (NGS) data. Read count data from NGS has been demonstrated to follow non-homogeneous Poisson distributions. The current change point analysis methods for detection of copy number variants are based on a normal distribution assumption and use an ordinary normal approximation in their algorithms. To improve sensitivity and reduce the false positive rate for detection of copy number variants, we developed three models: one Bayesian Anscombe normal approximation model for a single genome, one Bayesian Poisson model for a single genome, and one Bayesian Anscombe normal approximation model for paired genomes. The Bayesian statistics have been optimized for detection of change points and copy numbers at single and multiple change points through Monte Carlo simulations. Three R packages based on these models have been built to simulate Poisson distribution data and to estimate and display copy number variants in tables and graphics. The high sensitivity and specificity of these models have been demonstrated in simulated read count data with known Poisson distribution and in human NGS read count data, as well as in comparison with other popular packages. Background -- Single genome Bayesian approaches in NGS read count analysis -- Normal approximation Bayesian change point model for paired genomes -- Conclusion and future work
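
A small sketch of how Poisson read counts can be mapped to an approximately normal scale. It assumes that the "Anscombe normal approximation" mentioned above refers to the classical Anscombe variance-stabilising transform, 2·sqrt(x + 3/8); the abstract does not spell this out, and the function names below are hypothetical rather than taken from the thesis or its R packages.

```python
import math


def anscombe(count):
    """Classical Anscombe transform: Poisson counts become roughly unit-variance normal."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def inverse_anscombe(z):
    """Algebraic inverse of the transform (not a bias-corrected inverse)."""
    return (z / 2.0) ** 2 - 3.0 / 8.0


read_counts = [12, 15, 9, 40, 38, 43]  # made-up counts for illustration only
transformed = [anscombe(c) for c in read_counts]
print([round(z, 3) for z in transformed])
```

After the transform, a change in mean read depth shows up as a shift in the mean of roughly constant-variance values, which is the setting that normal-based change point methods assume.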