Performance evaluation of DNA copy number segmentation methods
A number of bioinformatic or biostatistical methods are available for
analyzing DNA copy number profiles measured from microarray or sequencing
technologies. In the absence of rich enough gold standard data sets, the
performance of these methods is generally assessed using unrealistic simulation
studies, or based on small real data analyses. We have designed and implemented
a framework to generate realistic DNA copy number profiles of cancer samples
with known truth. These profiles are generated by resampling real SNP
microarray data from genomic regions with known copy-number state. The original
real data have been extracted from dilution series of tumor cell lines with
matched blood samples at several concentrations. Therefore, the signal-to-noise
ratio of the generated profiles can be controlled through the (known)
percentage of tumor cells in the sample. In this paper, we describe this
framework and illustrate some of the benefits of the proposed data generation
approach on a practical use case: a comparison study between methods for
segmenting DNA copy number profiles from SNP microarrays. This study indicates
that no single method is uniformly better than all others. It also helps
identify pros and cons of the compared methods as a function of
biologically informative parameters, such as the fraction of tumor cells in the
sample and the proportion of heterozygous markers. Availability: R package
jointSeg: http://r-forge.r-project.org/R/?group_id=156
SegAnnot: an R package for fast segmentation of annotated piecewise constant signals
We describe and propose an implementation of a dynamic programming algorithm for the segmentation of annotated piecewise constant signals. The algorithm is exact in the sense that it recovers the best possible segmentation w.r.t. the quadratic loss that agrees with the annotations
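As an illustration of segmentation by dynamic programming under the quadratic loss, here is a minimal Python sketch. It covers only the unconstrained case: it ignores annotations entirely and is not the SegAnnot algorithm. The function name `segment_quadratic` and its arguments are hypothetical and chosen for this example.

```python
import numpy as np

def segment_quadratic(y, n_segments):
    """Best split of y into n_segments pieces under squared error.

    Unconstrained illustration only: annotations are not handled.
    Returns (start indices of segments 2..K, total cost).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give the cost of any segment y[i:j] in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Squared error of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    last = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j], last[k, j] = c, i

    # Backtrack the start index of each segment, from last to first.
    starts, j = [], n
    for k in range(n_segments, 0, -1):
        j = int(last[k, j])
        starts.append(j)
    # Drop the leading 0, which is the start of the first segment.
    return sorted(starts)[1:], float(best[n_segments, n])
```

For example, `segment_quadratic([0, 0, 0, 5, 5, 5, 1, 1], 3)` places the changes at indices 3 and 6. This sketch runs in O(K n^2) time, which is fine for short signals only.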
New efficient algorithms for multiple change-point detection with kernels
Several statistical approaches based on reproducing kernels have been
proposed to detect abrupt changes arising in the full distribution of the
observations and not only in the mean or variance. Some of these approaches
enjoy good statistical properties (oracle inequality, ...). Nonetheless,
they have a high computational cost both in terms of time and memory. This
makes their application difficult even for small and medium sample sizes. This computational issue is addressed by first describing a new
efficient and exact algorithm for kernel multiple change-point detection with
an improved worst-case complexity that is quadratic in time and linear in
space. It allows dealing with medium-size signals.
Second, a faster but approximate algorithm is described. It is based on a
low-rank approximation to the Gram matrix. It is linear in time and space. This
approximation algorithm can be applied to large-scale signals.
These exact and approximation algorithms have been implemented in R
and C for various kernels. The computational and statistical
performances of these new algorithms have been assessed through empirical
experiments. The runtime of the new algorithms is observed to be faster than
that of other considered procedures. Finally, simulations confirmed the higher
statistical accuracy of kernel-based approaches to detect changes that are not
only in the mean. These simulations also illustrate the flexibility of
kernel-based approaches to analyze complex biological profiles made of DNA copy
number and allele B frequencies. An R package implementing the approach will be
made available on GitHub
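To make the role of the Gram matrix concrete, the following Python sketch computes the usual kernel least-squares cost of a segment and uses it to locate a single change-point by brute force. It is a naive illustration, not the efficient exact or low-rank algorithms described in the abstract. The Gaussian kernel, the `bandwidth` parameter and all function names are assumptions made for this example.

```python
import numpy as np

def gaussian_gram(x, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (kernel choice is an assumption)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def kernel_cost(gram, s, e):
    """Kernel cost of segment [s, e): sum of k(x_i, x_i) minus the
    average of all pairwise kernel values within the segment."""
    block = gram[s:e, s:e]
    return np.trace(block) - block.sum() / (e - s)

def best_single_change(x, bandwidth=1.0):
    """Index t minimising cost(0, t) + cost(t, n), by exhaustive search."""
    gram = gaussian_gram(x, bandwidth)
    n = gram.shape[0]
    costs = [kernel_cost(gram, 0, t) + kernel_cost(gram, t, n)
             for t in range(1, n)]
    return 1 + int(np.argmin(costs))
```

Storing the full Gram matrix takes quadratic memory, which is precisely the bottleneck the abstract says its algorithms avoid.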
Changepoint Detection in the Presence of Outliers
Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data often needs to be pre-processed to remove outliers, though this is difficult for applications where the data needs to be analysed online. We present an approach to changepoint detection that is robust to the presence of outliers. The idea is to adapt existing penalised cost approaches for detecting changes so that they use loss functions that are less sensitive to outliers. We argue that loss functions that are bounded, such as the classical biweight loss, are particularly suitable -- as we show that only bounded loss functions are robust to arbitrarily extreme outliers. We present an efficient dynamic programming algorithm that can find the optimal segmentation under our penalised cost criteria. Importantly, this algorithm can be used in settings where the data needs to be analysed online. We show that we can consistently estimate the number of changepoints, and accurately estimate their locations, using the biweight loss function. We demonstrate the usefulness of our approach for applications such as analysing well-log data, detecting copy number variation, and detecting tampering of wireless devices
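The effect of a bounded loss can be seen in a small Python sketch. It assumes the common form of the biweight loss, namely the squared error capped at K^2. The threshold name `K` and both function names are choices made for this illustration, not taken from the abstract.

```python
import numpy as np

def squared_cost(y, theta):
    """Sum of squared errors of y around the segment parameter theta."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - theta) ** 2))

def biweight_cost(y, theta, K):
    """Sum of squared errors, each capped at K**2 (assumed biweight form)."""
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.minimum((y - theta) ** 2, K ** 2)))

y = [0.0, 0.0, 0.0, 100.0]        # three ordinary points and one outlier
print(squared_cost(y, 0.0))       # the outlier dominates the cost
print(biweight_cost(y, 0.0, 3.0)) # the outlier contributes at most K**2
```

Because a single point can add at most K^2 to the cost, fitting an extra segment to an isolated outlier is only worthwhile when the penalty per changepoint is small relative to K^2.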
Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data
Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of a penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two different patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accurate peak detection for a single pattern, our supervised method achieves state-of-the-art accuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN
Online Multivariate Changepoint Detection: Leveraging Links With Computational Geometry
The increasing volume of data streams poses significant computational
challenges for detecting changepoints online. Likelihood-based methods are
effective, but their straightforward implementation becomes impractical online.
We develop two online algorithms that exactly calculate the likelihood ratio
test for a single changepoint in p-dimensional data streams by leveraging
fascinating connections with computational geometry. Our first algorithm is
straightforward and empirically quasi-linear. The second is more complex but
provably quasi-linear in the number of data points.
Through simulations, we illustrate that they are fast and allow us to process
millions of points within a matter of minutes. Comment: 31 pages, 15 figures
Fast Online Changepoint Detection via Functional Pruning CUSUM statistics
Many modern applications of online changepoint detection require the ability to process high-frequency observations, sometimes with limited available computational resources. Online algorithms for detecting a change in mean often involve using a moving window, or specifying the expected size of change. Such choices affect which changes the algorithms have most power to detect. We introduce an algorithm, Functional Online CuSUM (FOCuS), which is equivalent to running these earlier methods simultaneously for all sizes of window, or all possible values for the size of change. Our theoretical results give tight bounds on the expected computational cost per iteration of FOCuS, with this being logarithmic in the number of observations. We show how FOCuS can be applied to a number of different change in mean scenarios, and demonstrate its practical utility through its state-of-the-art performance at detecting anomalous behaviour in computer server data
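The idea of running a window-based test for every window size at once can be written down naively, as in the Python sketch below. This is a brute-force baseline costing O(t) work at step t, not the FOCuS algorithm, whose pruning is what brings the per-iteration cost down. It assumes unit-variance Gaussian noise and a known pre-change mean of zero. The statistic (squared window sum divided by twice the window width) and the function name are choices made for this illustration.

```python
import numpy as np

def naive_all_windows_detector(stream, threshold):
    """Scan every window ending at the current time; brute force, not FOCuS.

    Assumes unit variance and a pre-change mean of 0. Returns (time, statistic)
    at the first time the statistic exceeds threshold, else (None, None).
    """
    prefix = [0.0]  # prefix[t] = sum of the first t observations
    for t, x in enumerate(stream, start=1):
        prefix.append(prefix[-1] + float(x))
        p = np.asarray(prefix)
        window_sums = p[-1] - p[:-1]   # sums of the last t, t-1, ..., 1 points
        widths = t - np.arange(t)      # matching window widths
        stat = float(np.max(window_sums ** 2 / (2.0 * widths)))
        if stat > threshold:
            return t, stat
    return None, None
```

On a stream of fifty zeros followed by threes, with a threshold of 10, this sketch raises an alarm three observations after the change. Each step rescans all past prefix sums, so the total cost grows quadratically with the stream length.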
gfpop: an R Package for Univariate Graph-Constrained Change-point Detection
In a world with data that change rapidly and abruptly, it is important to
detect those changes accurately. In this paper we describe an R package
implementing an algorithm recently proposed by Hocking et al. [2017] for
penalised maximum likelihood inference of constrained multiple change-point
models. This algorithm can be used to pinpoint the precise locations of abrupt
changes in large data sequences. There are many application domains for such
models, such as medicine, neuroscience or genomics. Often, practitioners have
prior knowledge about the changes they are looking for. For example in genomic
data, biologists sometimes expect peaks: up changes followed by down changes.
Taking advantage of such prior information can substantially improve the
accuracy with which we can detect and estimate changes. Hocking et al. [2017]
described a graph framework to encode many examples of such prior information
and a generic algorithm to infer the optimal model parameters, but implemented
the algorithm for just a single scenario. We present the gfpop package that
implements the algorithm in a generic manner in R/C++. gfpop works for a
user-defined graph that can encode the prior information on the types of change
and implements several loss functions (Gauss, Poisson, Binomial, Biweight and
Huber). We then illustrate the use of gfpop on isotonic simulations and several
applications in biology. For a number of graphs the algorithm runs in a matter
of seconds or minutes for 10^5 datapoints