A statistical approach for array CGH data analysis
BACKGROUND: Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools. Segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile. RESULTS: We demonstrate that existing methods for estimating the number of segments are not well adapted to array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performance of this method is discussed based on simulations and publicly available data sets. We then discuss the choice of model for array CGH data and show that the model with a homogeneous variance is adapted to this context. CONCLUSIONS: Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome.
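The segmentation-plus-model-selection recipe described above can be sketched in a few lines. The following is a minimal illustration, not the authors' method: dynamic programming finds, for each candidate number of segments K, the mean-change segmentation minimizing the within-segment residual sum of squares under a homogeneous-variance Gaussian model, and a generic BIC-style penalty (the constant `2*log(n)` and `k_max` are invented defaults) stands in for the paper's adaptive criterion.

```python
import numpy as np

def segment_profile(y, k_max=5, penalty=None):
    """Least-squares mean-change segmentation with a penalized choice of the
    number of segments. The penalty is a generic BIC-style stand-in for the
    paper's adaptive criterion."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])
    cs2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def rss(i, j):  # residual sum of squares of segment y[i:j]
        s, m = cs[j] - cs[i], j - i
        return cs2[j] - cs2[i] - s * s / m

    INF = float("inf")
    # dp[k][j]: best cost of splitting y[:j] into k segments
    dp = [[INF] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    dp[0][0] = 0.0
    for k in range(1, k_max + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1][i] + rss(i, j)
                if c < dp[k][j]:
                    dp[k][j], back[k][j] = c, i
    # penalized Gaussian log-likelihood criterion for the number of segments
    if penalty is None:
        penalty = 2.0 * np.log(n)
    best_k = min(range(1, k_max + 1),
                 key=lambda k: n * np.log(dp[k][n] / n + 1e-12) + penalty * k)
    # backtrack the breakpoints of the chosen segmentation
    segs, j = [], n
    for k in range(best_k, 0, -1):
        i = back[k][j]
        segs.append((i, j))
        j = i
    return best_k, segs[::-1]
```

In the paper's setting one would replace the penalty with the proposed adaptive criterion; the dynamic-programming layer is unchanged.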
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19
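The "statistical data type and likelihood discovery" capability can be illustrated, in heavily simplified form, by scoring candidate likelihoods per feature and keeping the best scorer. This toy (`discover_likelihood` is a name invented here) ignores ABDA's sum-product-network machinery and Bayesian inference entirely; the two-candidate menu is purely illustrative.

```python
import numpy as np

def discover_likelihood(col):
    """Toy per-feature likelihood discovery: fit each candidate likelihood
    by maximum likelihood and return the name of the best scorer. ABDA does
    this jointly with latent structure inside a sum-product network, which
    this sketch does not attempt."""
    col = np.asarray(col, dtype=float)
    scores = {}
    # Gaussian candidate: ML mean and standard deviation
    mu, sd = col.mean(), col.std() + 1e-12
    scores["gaussian"] = np.sum(-0.5 * np.log(2 * np.pi * sd ** 2)
                                - (col - mu) ** 2 / (2 * sd ** 2))
    # Exponential candidate: only valid for non-negative data
    if np.all(col >= 0):
        lam = 1.0 / (col.mean() + 1e-12)
        scores["exponential"] = np.sum(np.log(lam) - lam * col)
    return max(scores, key=scores.get)
```

A symmetric cluster of values favors the Gaussian, while strongly right-skewed non-negative values favor the exponential.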
Detection of Polyps via Shape and Appearance Modeling
Presented at the MICCAI 2008 Workshop on Computational and Visualization Challenges in the New Era of Virtual Colonoscopy, September 6, 2008, New York, USA. This paper describes a CAD system for the detection of colorectal polyps in CT. It is based on stochastic shape and appearance modeling of structures of the colon and rectum; in contrast to the data-driven approaches more commonly found in the literature, it derives predictive stochastic models for the features used for classification. The method makes extensive use of medical domain knowledge in the design of the models and in the setting of their parameters. The proposed approach was successfully tested on challenging datasets acquired under a protocol with little colonic preparation; such a protocol reduces patient discomfort and potentially improves compliance.
Use of pre-transformation to cope with outlying values in important candidate genes
Outlying values in predictors often strongly affect the results of statistical analyses in high-dimensional settings. Although they frequently occur with most high-throughput techniques, the problem is often ignored in the literature. We suggest using a very simple transformation, proposed before in a different context by Royston and Sauerbrei, as an intermediary step between array normalization and high-level statistical analysis. This straightforward univariate transformation identifies extreme values and considerably reduces the influence of outlying values in all further steps of statistical analysis, without eliminating the incriminated observation or feature. The use of the transformation and its effects are demonstrated for diverse univariate and multivariate statistical analyses using nine publicly available microarray data sets.
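The general idea of an outlier-taming pre-transformation can be sketched as follows. Note that this is NOT the Royston-Sauerbrei transform the abstract refers to; `tame_outliers` and its `width` parameter are invented here purely to illustrate the principle of shrinking extreme values smoothly, without dropping any observation or feature.

```python
import numpy as np

def tame_outliers(x, width=4.0):
    """Illustrative univariate pre-transformation (not the Royston-Sauerbrei
    transform): robustly center and scale a feature, then pass it through a
    smooth bounded map so extreme values are pulled in while the ordering of
    all values is preserved."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # robust scale estimate
    if mad == 0:
        mad = 1.0                              # degenerate case: no spread
    z = (x - med) / mad
    # tanh squashes |z| > width toward +/- width; near 0 it is ~identity
    return width * np.tanh(z / width)
```

Applied feature-by-feature between normalization and high-level analysis, a map like this bounds the leverage any single extreme value can exert downstream.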
Data Improving in Time Series Using ARX and ANN Models
Anomalous data can negatively impact energy forecasting by causing model parameters to be incorrectly estimated. This paper presents two approaches for the detection and imputation of anomalies in time series data. Autoregressive with exogenous inputs (ARX) and artificial neural network (ANN) models are used to extract the characteristics of time series. Anomalies are detected by performing hypothesis testing on the extrema of the residuals, and the anomalous data points are imputed using the ARX and ANN models. Because the anomalies affect the model coefficients, the data cleaning process is performed iteratively: the models are re-learned on “cleaner” data after an anomaly is imputed, and the anomalous data are re-imputed at each iteration using the updated ARX and ANN models. The ARX and ANN data cleaning models are evaluated on natural gas time series data. This paper demonstrates that the proposed approaches are able to identify and impute anomalous data points. Forecasting models learned on the unclean data and the cleaned data are tested on an uncleaned out-of-sample dataset. The forecasting model learned on the cleaned data outperforms the model learned on the unclean data, with a 1.67% improvement in the mean absolute percentage error and a 32.8% improvement in the root mean squared error. Existing challenges include correctly identifying specific types of anomalies such as negative flows.
Approximating Cross-validatory Predictive P-values with Integrated IS for Disease Mapping Models
An important statistical task in disease mapping problems is to identify outlier/divergent regions with unusually high or low residual risk of disease. Leave-one-out cross-validatory (LOOCV) model assessment is a gold standard for computing the predictive p-values that can flag such outliers. However, actual LOOCV is time-consuming because one needs to re-simulate a Markov chain for each posterior distribution in which an observation is held out as a test case. This paper introduces a new method, called iIS, for approximating LOOCV with only Markov chain samples simulated from a posterior based on the full data set. iIS is based on importance sampling (IS): it integrates the p-value and the likelihood of the test observation with respect to the distribution of the latent variable, without reference to the actual observation. The predictive p-values computed with iIS can be proved to be equivalent to the LOOCV predictive p-values, following the general theory for IS. We compare iIS with three other existing methods in the literature on a lip cancer dataset collected in Scotland. Our empirical results show that iIS provides predictive p-values that are almost identical to the actual LOOCV predictive p-values and outperforms the three existing methods, including the recently proposed ghosting method by Marshall and Spiegelhalter (2007).
Comment: 21 pages
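The importance-sampling backbone of this idea can be sketched for a single Poisson count. The version below is the plain-IS estimator (one of the baselines iIS improves upon), not iIS itself: weighting full-data posterior draws by 1/p(y_i | λ) retargets them at the leave-one-out posterior, and the weighted average of tail probabilities approximates the LOOCV predictive p-value. iIS additionally integrates both the weight and the tail probability over the latent variable, a model-specific step omitted here.

```python
import math
import numpy as np

def pois_pmf(y, lam):
    """Poisson pmf computed on the log scale; y may be a scalar or array."""
    y = np.atleast_1d(y).astype(int)
    lg = np.array([math.lgamma(k + 1) for k in y])
    return np.exp(-lam + y * np.log(lam) - lg)

def loo_pvalue_is(y_i, lam_draws):
    """Plain importance-sampling approximation to the LOOCV predictive
    p-value P(Y_rep >= y_i | y_{-i}) for one Poisson observation, given
    draws of that observation's rate from the FULL-data posterior."""
    lam_draws = np.asarray(lam_draws, dtype=float)
    w = 1.0 / pois_pmf(y_i, lam_draws)                  # IS weights
    tail = np.array([1.0 - pois_pmf(np.arange(y_i), l).sum()
                     for l in lam_draws])               # P(Y >= y_i | lam)
    return float(np.sum(w * tail) / np.sum(w))
```

With degenerate "draws" all equal to a single rate, the estimator reduces to the exact Poisson tail probability, which gives a convenient sanity check.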
Mixture Modeling and Outlier Detection in Microarray Data Analysis
Microarray technology has become a dynamic tool in gene expression analysis
because it allows for the simultaneous measurement of thousands of gene expressions.
Uniqueness in experimental units and microarray data platforms, coupled with how
gene expressions are obtained, makes the field open to interesting research questions.
In this dissertation, we present our investigations of two independent studies related
to microarray data analysis.
First, we study a recent platform in biology and bioinformatics that compares
the quality of genetic information from exfoliated colonocytes in fecal matter with
genetic material from mucosa cells within the colon. Using the intraclass correlation
coefficient (ICC) as a measure of reproducibility, we assess the reliability of density
estimation obtained from preliminary analysis of fecal and mucosa data sets. Numerical findings clearly show that the distribution is composed of two components.
For measurements between 0 and 1, it is natural to assume that the data points are
from a beta-mixture distribution. We explore whether ICC values should be modeled
with a beta mixture or transformed first and fitted with a normal mixture. We find that
using a mixture of normals on the inverse-probit-transformed scale is less sensitive to model mis-specification; otherwise a biased conclusion could be reached. By
using the normal mixture approach to compare the ICC distributions of fecal and
mucosa samples, we observe the quality of reproducible genes in fecal array data to
be comparable with that in mucosa arrays.
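The transform-then-mix approach of the first study can be sketched as follows: map ICC values from (0, 1) to the real line with the probit (inverse normal CDF), then fit a two-component normal mixture by EM. The median-split initialization and fixed iteration count are illustrative choices, not the dissertation's.

```python
import numpy as np
from statistics import NormalDist

def fit_two_normals(icc, n_iter=200):
    """Probit-transform ICC values, then fit a two-component normal mixture
    by EM. Returns mixing weights, component means, and standard deviations
    on the transformed scale."""
    z = np.array([NormalDist().inv_cdf(v) for v in icc])
    # initialize by splitting at the median
    lo, hi = z[z <= np.median(z)], z[z > np.median(z)]
    mu = np.array([lo.mean(), hi.mean()])
    sd = np.array([lo.std() + 1e-3, hi.std() + 1e-3])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        dens = np.stack([pi[k] / (sd[k] * np.sqrt(2 * np.pi)) *
                         np.exp(-(z - mu[k]) ** 2 / (2 * sd[k] ** 2))
                         for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: update weights, means, standard deviations
        nk = resp.sum(axis=1)
        pi = nk / len(z)
        mu = (resp * z).sum(axis=1) / nk
        sd = np.sqrt((resp * (z - mu[:, None]) ** 2).sum(axis=1) / nk) + 1e-6
    return pi, mu, sd
```

Comparing the fitted mixtures for fecal and mucosa ICC distributions on this common transformed scale is what allows the reproducibility comparison described above.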
For microarray data, within-gene variance estimation is often challenging due
to the high frequency of low replication studies. Several methodologies have been
developed to strengthen variance terms by borrowing information across genes. However, even with such accommodations, variance estimates may still be inflated by the presence of
outliers. For our second study, we propose a robust modification of optimal shrinkage variance estimation to improve outlier detection. In order to increase power, we
suggest grouping standardized data so that information shared across genes is similar
in distribution. Simulation studies and analysis of real colon cancer microarray data
reveal that our methodology provides a technique that is insensitive to outliers, free of distributional assumptions, effective for small sample sizes, and data-adaptive.
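The flavor of robust, shrinkage-based variance estimation for the second study can be sketched as follows. This is an invented illustration, not the dissertation's estimator: the optimal shrinkage weight derived there is replaced by a fixed placeholder `alpha`, and robustness comes from a MAD-based per-gene scale.

```python
import numpy as np

def robust_shrunk_vars(X, alpha=0.5):
    """Illustrative robust shrinkage variance estimation for a genes-by-
    replicates matrix: per-gene scale from the MAD, so a single outlying
    replicate cannot inflate the estimate, shrunk toward the median scale
    across genes to borrow strength in low-replicate data. The fixed alpha
    stands in for an optimally derived shrinkage weight."""
    X = np.asarray(X, dtype=float)                        # genes x replicates
    med = np.median(X, axis=1, keepdims=True)
    mad = 1.4826 * np.median(np.abs(X - med), axis=1)     # robust per-gene sd
    v_gene = mad ** 2
    v_pool = np.median(v_gene)                            # cross-gene target
    return alpha * v_pool + (1 - alpha) * v_gene
```

Because the per-gene scale is median-based, a gene with one wild replicate receives essentially the same variance estimate as an otherwise identical gene without it.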