A statistical approach for array CGH data analysis
BACKGROUND: Microarray-CGH experiments are used to detect and map chromosomal imbalances by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile. RESULTS: We demonstrate that existing methods for estimating the number of segments are not well adapted to array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performance of this method is discussed based on simulations and publicly available data sets. We then discuss the choice of model for array CGH data and show that the model with a homogeneous variance is well suited to this context. CONCLUSIONS: Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome
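To make the segmentation model in this abstract concrete, here is a minimal sketch of least-squares segmentation by dynamic programming. For a piecewise-constant Gaussian mean with homogeneous variance, this is equivalent to maximum likelihood for a fixed number of segments. The function name `segment_into_k` and its return format are illustrative assumptions, not the authors' code, and the adaptive criterion for choosing the number of segments is not reproduced here.

```python
import math


def segment_into_k(signal, k_max):
    """Best least-squares segmentation of `signal` into 1..k_max segments.

    Hypothetical helper: returns {k: (residual_sum_of_squares, breakpoints)},
    where each breakpoint is the index at which a new segment starts.
    """
    n = len(signal)
    # Prefix sums give the cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        # Sum of squared residuals of signal[a:b] around its own mean.
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    # loss[k][t]: minimal cost of splitting signal[:t] into k segments.
    loss = [[math.inf] * (n + 1) for _ in range(k_max + 1)]
    arg = [[0] * (n + 1) for _ in range(k_max + 1)]
    loss[0][0] = 0.0
    for k in range(1, k_max + 1):
        for t in range(k, n + 1):
            for s in range(k - 1, t):
                c = loss[k - 1][s] + cost(s, t)
                if c < loss[k][t]:
                    loss[k][t], arg[k][t] = c, s

    result = {}
    for k in range(1, min(k_max, n) + 1):
        # Backtrack the start of each segment, from the last to the first.
        starts, t = [], n
        for j in range(k, 0, -1):
            t = arg[j][t]
            starts.append(t)
        result[k] = (loss[k][n], sorted(starts)[1:])  # drop the leading 0
    return result
```

The residual sum of squares always decreases as the number of segments grows, so it cannot choose that number by itself. This is why a penalised or adaptive model-selection criterion is needed on top of the dynamic program.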
Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome
Tiling arrays enable large-scale exploration of the genome thanks to
probes that cover the whole genome at very high density, up to 2,000,000
probes. Biological questions usually addressed are either the expression
difference between two conditions or the detection of transcribed regions. In
this work we propose to consider simultaneously both questions as an
unsupervised classification problem by modeling the joint distribution of the
two conditions. In contrast to previous methods, we account for all available
information on the probes as well as biological knowledge like annotation and
spatial dependence between probes. Since probes are not biologically relevant
units we propose a classification rule for non-connected regions covered by
several probes. Applications to transcriptomic and ChIP-chip data of
Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the
importance of precise modeling and of region classification
Joint segmentation of many aCGH profiles using fast group LARS
Array-Based Comparative Genomic Hybridization (aCGH) is a method used to
search for genomic regions with copy number variations. For a given aCGH
profile, one challenge is to accurately segment it into regions of constant
copy number. Subjects sharing the same disease status, for example a type of
cancer, often have aCGH profiles with similar copy number variations, due to
duplications and deletions relevant to that particular disease. We introduce a
constrained optimization algorithm that jointly segments aCGH profiles of many
subjects. It simultaneously penalizes the amount of freedom the set of profiles
has to jump from one level of constant copy number to another, at genomic
locations known as breakpoints. We show that breakpoints shared by many
different profiles tend to be found first by the algorithm, even in the
presence of significant amounts of noise. The algorithm can be formulated as a
group LARS problem. We propose an extremely fast way to find the solution path,
i.e., a sequence of shared breakpoints in order of importance. For no extra
cost the algorithm smoothes all of the aCGH profiles into piecewise-constant
regions of equal copy number, giving low-dimensional versions of the original
data. These can be shown for all profiles on a single graph, allowing for
intuitive visual interpretation. Simulations and an implementation of the
algorithm on bladder cancer aCGH profiles are provided
Optimal detection of changepoints with a linear computational cost
We consider the problem of detecting multiple changepoints in large data
sets. Our focus is on applications where the number of changepoints will
increase as we collect more data: for example in genetics as we analyse larger
regions of the genome, or in finance as we observe time-series over longer
periods. We consider the common approach of detecting changepoints through
minimising a cost function over possible numbers and locations of changepoints.
This includes several established procedures for detecting changepoints,
such as penalised likelihood and minimum description length. We introduce a new
method for finding the minimum of such cost functions and hence the optimal
number and location of changepoints that has a computational cost which, under
mild conditions, is linear in the number of observations. This compares
favourably with existing methods for the same problem whose computational cost
can be quadratic or even cubic. In simulation studies we show that our new
method can be orders of magnitude faster than these alternative exact methods.
We also compare with the Binary Segmentation algorithm for identifying
changepoints, showing that the exactness of our approach can lead to
substantial improvements in the accuracy of the inferred segmentation of the
data.
Comment: 25 pages, 4 figures. To appear in Journal of the American Statistical
Association
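As an illustration of the penalised-cost approach described in this abstract, the following is a minimal sketch of changepoint detection in the mean with a pruning step in the spirit of the paper. It assumes a squared-error segment cost and a user-supplied penalty per changepoint. The name `pelt_mean` is a hypothetical choice, and this is not the authors' reference implementation.

```python
def pelt_mean(data, penalty):
    """Changepoints in the mean, minimising squared error plus a penalty.

    Hypothetical helper: returns the sorted indices at which new segments start.
    """
    n = len(data)
    # Prefix sums give the cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for x in data:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        # Sum of squared residuals of data[a:b] around its own mean.
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalised cost of data[:t]
    best[0] = -penalty       # so that the first segment carries no penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment of data[:t]
    candidates = [0]
    for t in range(1, n + 1):
        values = [best[s] + cost(s, t) + penalty for s in candidates]
        i = min(range(len(values)), key=values.__getitem__)
        best[t], last[t] = values[i], candidates[i]
        # Pruning: a start s with best[s] + cost(s, t) > best[t] can never be
        # optimal for any later t, so it is dropped from the candidate set.
        candidates = [s for s, v in zip(candidates, values)
                      if v - penalty <= best[t]]
        candidates.append(t)

    # Backtrack through the recorded segment starts.
    changepoints, t = [], n
    while last[t] > 0:
        changepoints.append(last[t])
        t = last[t]
    return sorted(changepoints)
```

Without the pruning line this is the standard quadratic-time dynamic program. Pruning keeps the candidate list short when changepoints occur regularly, which is the situation in which the abstract claims a cost that is linear in the number of observations.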
On-the-fly Approximation of Multivariate Total Variation Minimization
In the context of change-point detection, addressed by Total Variation
minimization strategies, an efficient on-the-fly algorithm has been designed
leading to exact solutions for univariate data. In this contribution, an
extension of such an on-the-fly strategy to multivariate data is investigated.
The proposed algorithm relies on the local validation of the Karush-Kuhn-Tucker
conditions on the dual problem. Showing that the non-local nature of the
multivariate setting precludes obtaining an exact on-the-fly solution, we
devise an on-the-fly algorithm delivering an approximate solution, whose
quality is controlled by a practitioner-tunable parameter, acting as a
trade-off between quality and computational cost. Performance assessment shows
that high-quality solutions are obtained on-the-fly while benefiting from
computational costs several orders of magnitude lower than standard iterative
procedures. The proposed algorithm thus provides practitioners with an
efficient on-the-fly procedure for multivariate change-point detection
RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions
Summary: Several methods have been proposed to detect copy number changes and recurrent regions of copy number variation from aCGH, but few methods return probabilities of alteration explicitly, which are the direct answer to the question “is this probe/region altered?” RJaCGH fits a Non-Homogeneous Hidden Markov model to the aCGH data using Markov Chain Monte Carlo with Reversible Jump, and returns the probability that each probe is gained or lost. Using these probabilities, recurrent regions (over sets of individuals) of copy number alteration can be found