146 research outputs found

    A pruned dynamic programming algorithm to recover the best segmentations with 1 to $K_{max}$ change-points

    A common computational problem in multiple change-point models is to recover the segmentations with 1 to $K_{max}$ change-points of minimal cost with respect to some loss function. Here we present an algorithm to prune the set of candidate change-points which is based on a functional representation of the cost of segmentations. We study the worst-case complexity of the algorithm when there is a unidimensional parameter per segment and demonstrate that it is at worst equivalent to the complexity of the segment neighbourhood algorithm: $\mathcal{O}(K_{max} n^2)$. For a particular loss function we demonstrate that pruning is on average efficient even if there are no change-points in the signal. Finally, we empirically study the performance of the algorithm in the case of the quadratic loss and show that it is faster than the segment neighbourhood algorithm. Comment: 31 pages, An extended version of the pre-prin
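For reference, the unpruned segment neighbourhood recursion that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration for the quadratic loss only, not the pruned algorithm of the paper; the function name `segment_neighbourhood` and its return layout are assumptions made here, and the sketch assumes the signal has more than `k_max` points.

```python
import numpy as np

def segment_neighbourhood(y, k_max):
    """Hypothetical helper: best segmentations with 0..k_max change-points
    under the quadratic loss, via the plain O(k_max * n^2) recursion."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give the cost of any segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations of y[i:j] from its own mean.
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    # C[k, t]: minimal cost of y[:t] with exactly k change-points.
    C = np.full((k_max + 1, n + 1), np.inf)
    back = np.zeros((k_max + 1, n + 1), dtype=int)
    for t in range(1, n + 1):
        C[0, t] = cost(0, t)
    for k in range(1, k_max + 1):
        for t in range(k + 1, n + 1):
            for s in range(k, t):  # s is the position of the last change
                c = C[k - 1, s] + cost(s, t)
                if c < C[k, t]:
                    C[k, t] = c
                    back[k, t] = s

    def changepoints(k):
        # Walk the back-pointers from the end of the signal.
        cps, t = [], n
        for kk in range(k, 0, -1):
            t = int(back[kk, t])
            cps.append(t)
        return cps[::-1]

    return C[:, n], [changepoints(k) for k in range(k_max + 1)]
```

Calling `segment_neighbourhood(y, k_max)` returns the minimal cost for each number of change-points together with the corresponding change-point positions. The triple loop is what makes the cost quadratic in `n`; pruning aims to shrink the innermost loop over `s`.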

    Performance evaluation of DNA copy number segmentation methods

    A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling real SNP microarray data from genomic regions with known copy-number state. The original real data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. In this paper, we describe this framework and illustrate some of the benefits of the proposed data generation approach on a practical use case: a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. Availability: R package jointSeg: http://r-forge.r-project.org/R/?group_id=156

    On Optimal Multiple Changepoint Algorithms for Large Data

    There is an increasing need for algorithms that can accurately detect changepoints in long time-series, or equivalent, data. Many common approaches to detecting changepoints, for example based on penalised likelihood or minimum description length, can be formulated in terms of minimising a cost over segmentations. Dynamic programming methods exist to solve this minimisation problem exactly, but these tend to scale at least quadratically in the length of the time-series. Algorithms, such as Binary Segmentation, exist that have a computational cost that is close to linear in the length of the time-series, but these are not guaranteed to find the optimal segmentation. Recently pruning ideas have been suggested that can speed up the dynamic programming algorithms, whilst still being guaranteed to find the true minimum of the cost function. Here we extend these pruning methods, and introduce two new algorithms for segmenting data, FPOP and SNIP. Empirical results show that FPOP is substantially faster than existing dynamic programming methods, and unlike the existing methods its computational efficiency is robust to the number of changepoints in the data. We evaluate the method at detecting Copy Number Variations and observe that FPOP has a computational cost that is competitive with that of Binary Segmentation. Comment: 20 page
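To make "minimising a cost over segmentations" concrete, here is a minimal sketch of the penalised dynamic programme (often called optimal partitioning) with an optional inequality-based pruning step in the style of PELT. It is not FPOP or SNIP, whose functional pruning is more involved; the quadratic loss, the function name `optimal_partitioning`, and the `prune` flag are all assumptions chosen for illustration.

```python
import numpy as np

def optimal_partitioning(y, penalty, prune=True):
    """Hypothetical helper: minimise (sum of segment costs) +
    penalty * (number of changepoints) under the quadratic loss."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations of y[i:j] from its own mean.
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    F = np.empty(n + 1)        # F[t]: optimal penalised cost of y[:t]
    F[0] = -penalty            # so the first segment carries no penalty
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]           # possible positions of the last changepoint
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + penalty for s in candidates]
        best = int(np.argmin(vals))
        F[t] = vals[best]
        last[t] = candidates[best]
        if prune:
            # Inequality pruning: s can never be optimal again once
            # F[s] + cost(s, t) exceeds F[t].
            candidates = [s for s, v in zip(candidates, vals)
                          if v - penalty <= F[t]]
        candidates.append(t)

    # Recover the changepoints by following the back-pointers.
    cps, t = [], n
    while last[t] > 0:
        t = int(last[t])
        cps.append(t)
    return float(F[n]), cps[::-1]
```

With `prune=False` the loop over `candidates` always has `t` entries, which is the quadratic scaling the abstract refers to; with pruning the candidate set can stay small when the data contain many changes.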

    SegAnnot: an R package for fast segmentation of annotated piecewise constant signals

    We describe and propose an implementation of a dynamic programming algorithm for the segmentation of annotated piecewise constant signals. The algorithm is exact in the sense that it recovers the best possible segmentation w.r.t. the quadratic loss that agrees with the annotations.

    New efficient algorithms for multiple change-point detection with kernels

    Several statistical approaches based on reproducing kernels have been proposed to detect abrupt changes arising in the full distribution of the observations and not only in the mean or variance. Some of these approaches enjoy good statistical properties (oracle inequality, ...). Nonetheless, they have a high computational cost both in terms of time and memory. This makes their application difficult even for small and medium sample sizes ($n < 10^4$). This computational issue is addressed by first describing a new efficient and exact algorithm for kernel multiple change-point detection with an improved worst-case complexity that is quadratic in time and linear in space. It allows dealing with medium-size signals (up to $n \approx 10^5$). Second, a faster but approximate algorithm is described. It is based on a low-rank approximation to the Gram matrix. It is linear in time and space. This approximation algorithm can be applied to large-scale signals ($n \geq 10^6$). These exact and approximation algorithms have been implemented in R and C for various kernels. The computational and statistical performances of these new algorithms have been assessed through empirical experiments. The new algorithms are observed to run faster than the other procedures considered. Finally, simulations confirmed the higher statistical accuracy of kernel-based approaches to detect changes that are not only in the mean. These simulations also illustrate the flexibility of kernel-based approaches to analyze complex biological profiles made of DNA copy number and allele B frequencies. An R package implementing the approach will be made available on GitHub.
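As a rough illustration of what a kernel-based segment cost looks like, the sketch below computes the within-segment scatter in feature space directly from a Gram matrix. It stores the full Gram matrix, so it uses quadratic memory and is not the linear-space algorithm described in the abstract; the Gaussian kernel, the `bandwidth` parameter, and both function names are assumptions made for this example.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Hypothetical helper: Gaussian-kernel Gram matrix of the observations."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_segment_cost(gram, start, end):
    """Scatter of observations start..end-1 around their mean in feature
    space: sum_i k(x_i, x_i) - (1 / m) * sum_{i,j} k(x_i, x_j)."""
    block = gram[start:end, start:end]
    return float(np.trace(block) - block.sum() / (end - start))
```

With a linear kernel (`gram = np.outer(x, x)`) this cost reduces to the usual quadratic loss around the segment mean, which is one way to sanity-check the formula.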

    Changepoint Detection in the Presence of Outliers

    Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is difficult for applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable, as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function. We demonstrate the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices.
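The effect of a bounded loss can be seen in a few lines. The sketch below assumes the biweight loss takes the truncated-quadratic form min(r^2, K^2) for a residual r and threshold K; the abstract does not state the formula, so this form, the function names, and the parameter `k` are assumptions for illustration.

```python
import numpy as np

def quadratic_loss(residual):
    """Unbounded loss: grows without limit as the residual grows."""
    r = np.asarray(residual, dtype=float)
    return r ** 2

def biweight_loss(residual, k):
    """Assumed truncated-quadratic form: quadratic up to |r| = k,
    then constant at k**2, so one outlier adds at most k**2."""
    r = np.asarray(residual, dtype=float)
    return np.minimum(r ** 2, k ** 2)
```

For example, with `k = 3` a residual of 100 contributes 10000 to the quadratic loss but only 9 to the bounded loss, so fitting an extra changepoint around a single outlier saves little cost.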