146 research outputs found
A pruned dynamic programming algorithm to recover the best segmentations with 1 to K_max change-points
A common computational problem in multiple change-point models is to recover
the segmentations with 1 to K_max change-points of minimal cost with
respect to some loss function. Here we present an algorithm to prune the set of
candidate change-points which is based on a functional representation of the
cost of segmentations. We study the worst case complexity of the algorithm when
there is a unidimensional parameter per segment and demonstrate that it is at
worst equivalent to the complexity of the segment neighbourhood algorithm,
O(K_max n^2). For a particular loss function we demonstrate that
pruning is on average efficient even if there are no change-points in the
signal. Finally, we empirically study the performance of the algorithm in the
case of the quadratic loss and show that it is faster than the segment
neighbourhood algorithm. Comment: 31 pages; an extended version of the pre-print.
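The segment neighbourhood recursion that the pruning accelerates can be sketched as follows. This is a minimal illustration of the unpruned dynamic program for the quadratic loss, not the paper's pruned implementation; all function names are assumptions.

```python
# Illustrative sketch (NOT the paper's pruned algorithm): the classic
# O(K_max * n^2) segment neighbourhood dynamic program for the quadratic loss.
# C[k][t] = minimal cost of splitting y[0:t] into k segments (k-1 change-points).
import numpy as np

def segment_neighbourhood(y, k_max):
    """Exact dynamic program; returns the best cost for 1..k_max segments."""
    n = len(y)
    y = np.asarray(y, dtype=float)
    s1 = np.concatenate(([0.0], np.cumsum(y)))        # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))   # prefix sums of squares

    def seg_cost(a, b):
        # quadratic loss of y[a:b] around its empirical mean, in O(1)
        m = b - a
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / m

    C = np.full((k_max + 1, n + 1), np.inf)
    C[0][0] = 0.0
    for k in range(1, k_max + 1):
        for t in range(k, n + 1):
            # try every position a for the last change-point before t
            C[k][t] = min(C[k - 1][a] + seg_cost(a, t) for a in range(k - 1, t))
    return C[1:, n]  # best cost with 1, ..., k_max segments
```

For a signal with one clear jump, the cost drops to zero once two segments are allowed, which is the quantity the pruned algorithm computes faster.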
Performance evaluation of DNA copy number segmentation methods
A number of bioinformatic or biostatistical methods are available for
analyzing DNA copy number profiles measured from microarray or sequencing
technologies. In the absence of rich enough gold standard data sets, the
performance of these methods is generally assessed using unrealistic simulation
studies, or based on small real data analyses. We have designed and implemented
a framework to generate realistic DNA copy number profiles of cancer samples
with known truth. These profiles are generated by resampling real SNP
microarray data from genomic regions with known copy-number state. The original
real data have been extracted from dilutions series of tumor cell lines with
matched blood samples at several concentrations. Therefore, the signal-to-noise
ratio of the generated profiles can be controlled through the (known)
percentage of tumor cells in the sample. In this paper, we describe this
framework and illustrate some of the benefits of the proposed data generation
approach on a practical use case: a comparison study between methods for
segmenting DNA copy number profiles from SNP microarrays. This study indicates
that no single method is uniformly better than all others. It also helps
identify the pros and cons of the compared methods as a function of
biologically informative parameters, such as the fraction of tumor cells in the
sample and the proportion of heterozygous markers. Availability: R package
jointSeg: http://r-forge.r-project.org/R/?group_id=156
On Optimal Multiple Changepoint Algorithms for Large Data
There is an increasing need for algorithms that can accurately detect
changepoints in long time-series or, equivalently, other large data sets. Many common approaches
to detecting changepoints, for example based on penalised likelihood or minimum
description length, can be formulated in terms of minimising a cost over
segmentations. Dynamic programming methods exist to solve this minimisation
problem exactly, but these tend to scale at least quadratically in the length
of the time-series. Algorithms, such as Binary Segmentation, exist that have a
computational cost that is close to linear in the length of the time-series,
but these are not guaranteed to find the optimal segmentation. Recently, pruning
ideas have been suggested that can speed up the dynamic programming algorithms
whilst still being guaranteed to find the true minimum of the cost function. Here
we extend these pruning methods, and introduce two new algorithms for
segmenting data, FPOP and SNIP. Empirical results show that FPOP is
substantially faster than existing dynamic programming methods, and unlike the
existing methods its computational efficiency is robust to the number of
changepoints in the data. We evaluate the method at detecting Copy Number
Variations and observe that FPOP has a computational cost that is competitive
with that of Binary Segmentation. Comment: 20 pages.
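The pruning idea can be sketched with a simpler relative of FPOP's functional pruning: inequality-based pruning of the candidate set in the penalised optimal partitioning recursion (PELT-style). This is an illustrative sketch under the quadratic loss, not the paper's FPOP or SNIP implementation; the penalty `beta` and all names are assumptions.

```python
# Illustrative PELT-style sketch: penalised optimal partitioning with
# inequality-based pruning of candidate last-change-points. This is a simpler
# relative of the functional pruning in FPOP, shown here for the quadratic loss.
import numpy as np

def pelt(y, beta):
    """Minimise sum of segment costs + beta per segment; returns (cost, changepoints)."""
    n = len(y)
    y = np.asarray(y, dtype=float)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def seg_cost(a, b):
        m = b - a
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / m

    F = np.full(n + 1, np.inf)   # F[t] = optimal penalised cost of y[0:t]
    F[0] = -beta
    cands = [0]                  # surviving candidate last change-points
    last = [0] * (n + 1)
    for t in range(1, n + 1):
        vals = [F[a] + seg_cost(a, t) + beta for a in cands]
        i = int(np.argmin(vals))
        F[t] = vals[i]
        last[t] = cands[i]
        # prune candidates that can never be optimal at any future time
        cands = [a for a, v in zip(cands, vals) if v - beta <= F[t]]
        cands.append(t)
    # backtrack the optimal change-point positions
    cps, t = [], n
    while t > 0:
        t = last[t]
        if t > 0:
            cps.append(t)
    return F[n], sorted(cps)
```

The pruning step is what keeps the candidate set small when there are many changepoints; FPOP achieves a stronger effect by pruning in the space of segment means rather than via this inequality.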
SegAnnot: an R package for fast segmentation of annotated piecewise constant signals
We describe and propose an implementation of a dynamic programming algorithm for the segmentation of annotated piecewise constant signals. The algorithm is exact in the sense that it recovers the best possible segmentation, w.r.t. the quadratic loss, that agrees with the annotations.
New efficient algorithms for multiple change-point detection with kernels
Several statistical approaches based on reproducing kernels have been
proposed to detect abrupt changes arising in the full distribution of the
observations and not only in the mean or variance. Some of these approaches
enjoy good statistical properties (oracle inequality, etc.). Nonetheless,
they have a high computational cost both in terms of time and memory. This
makes their application difficult even for small and medium sample sizes. This computational issue is addressed by first describing a new
efficient and exact algorithm for kernel multiple change-point detection with
an improved worst-case complexity that is quadratic in time and linear in
space. It allows dealing with medium-size signals.
Second, a faster, approximate algorithm is described. It is based on a
low-rank approximation to the Gram matrix and is linear in time and space. This
approximation algorithm can be applied to large-scale signals.
These exact and approximation algorithms have been implemented in R
and C for various kernels. The computational and statistical
performances of these new algorithms have been assessed through empirical
experiments. The runtime of the new algorithms is observed to be faster than
that of other considered procedures. Finally, simulations confirmed the higher
statistical accuracy of kernel-based approaches to detect changes that are not
only in the mean. These simulations also illustrate the flexibility of
kernel-based approaches to analyze complex biological profiles made of DNA copy
number and allele B frequencies. An R package implementing the approach will be
made available on GitHub.
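The segment cost that such kernel methods minimise can be sketched directly from a Gram matrix: for a segment of length m, the cost is the trace of the segment's Gram sub-matrix minus the sub-matrix's total sum divided by m. The Gaussian kernel choice and the function names below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the kernel segment cost used in kernel change-point
# detection: cost(a, b) = trace(K[a:b, a:b]) - sum(K[a:b, a:b]) / (b - a).
# A homogeneous segment has cost near zero; a heterogeneous one does not.
import numpy as np

def gaussian_gram(y, bandwidth=1.0):
    """Gram matrix of a 1-D signal under a Gaussian kernel (illustrative choice)."""
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    d2 = (y - y.T) ** 2
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_seg_cost(K, a, b):
    """Kernel cost of the segment covering indices a..b-1, given Gram matrix K."""
    sub = K[a:b, a:b]
    m = b - a
    return float(np.trace(sub) - sub.sum() / m)
```

Computing this naively for every candidate segment is what makes the problem expensive in time and memory; the algorithms in the paper organise the computation to reach quadratic time and linear space exactly, or linear time and space via the low-rank approximation.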
Changepoint Detection in the Presence of Outliers
Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is difficult for applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable -- as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function. We demonstrate the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices
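The key property argued for above, boundedness of the loss, can be illustrated with a simple truncated squared error: beyond a threshold, an observation's contribution to the cost stops growing, so an arbitrarily extreme outlier cannot force a spurious changepoint. The threshold value and the names below are illustrative assumptions, not the paper's exact biweight formulation.

```python
# Sketch of a bounded loss: truncating the squared error at a threshold K
# caps the influence of any single observation, unlike the quadratic loss.
import numpy as np

def bounded_loss(residual, K=9.0):
    """min(r^2, K): quadratic near zero, constant beyond |r| = sqrt(K)."""
    r = np.asarray(residual, dtype=float)
    return np.minimum(r ** 2, K)

def robust_segment_cost(y, theta, K=9.0):
    """Cost of fitting the constant `theta` to segment `y` under the bounded loss."""
    return float(bounded_loss(np.asarray(y, dtype=float) - theta, K).sum())
```

With this cost, a single outlier of any magnitude adds at most K to a segment's cost, which is why a penalised-cost search using it does not need to place extra changepoints around outliers.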