14,062 research outputs found

    A statistical approach for array CGH data analysis

    BACKGROUND: Microarray-CGH experiments are used to detect and map chromosomal imbalances by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools, and segmentation methods are a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile. RESULTS: We demonstrate that existing methods for estimating the number of segments are not well adapted to array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performance of this method is discussed based on simulations and publicly available data sets. Then we discuss the choice of modeling for array CGH data and show that the model with a homogeneous variance is adapted to this context. CONCLUSIONS: Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome.
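
    The segmentation idea in this abstract (a Gaussian profile whose mean changes abruptly while the variance stays homogeneous) can be illustrated with a minimal sketch. It assumes the number of segments is already known and minimizes the within-segment sum of squares by dynamic programming; it does not reproduce the adaptive criterion the authors propose for choosing that number. The function name `segment_mean` and the toy profile are hypothetical.

```python
def segment_mean(signal, n_segments):
    """Least-squares split of `signal` into `n_segments` contiguous pieces.

    Returns (bounds, sse), where bounds is a list of half-open (start, end)
    index pairs and sse is the total within-segment sum of squares.
    Assumes a fixed, known number of segments (no model selection).
    """
    n = len(signal)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")

    # Prefix sums give the cost of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        # Sum of squared deviations of signal[i:j] from its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting signal[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    back[k][j] = i

    # Walk the back-pointers from the end to recover the breakpoints.
    bounds, j = [], n
    for k in range(n_segments, 0, -1):
        i = back[k][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds, best[n_segments][n]


# Toy log-ratio profile with one gained region in the middle.
profile = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.1]
bounds, sse = segment_mean(profile, 3)
```

    The triple loop costs O(K n^2) for K segments and n probes, which is workable for a single chromosome; choosing K is the harder problem the abstract addresses.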

    Automatic Bayesian Density Analysis

    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for exploratory data analysis are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data. Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

    Detection of Polyps via Shape and Appearance Modeling

    Presented at the MICCAI 2008 Workshop on Computational and Visualization Challenges in the New Era of Virtual Colonoscopy, September 6, 2008, New York, USA. This paper describes a CAD system for the detection of colorectal polyps in CT. It is based on stochastic shape and appearance modeling of structures of the colon and rectum. In contrast to the data-driven approaches more commonly found in the literature, it derives predictive stochastic models for the features used for classification. The method makes extensive use of medical domain knowledge in the design of the models and in the setting of their parameters. The proposed approach was successfully tested on challenging datasets acquired under a protocol with little colonic preparation; such a protocol reduces patient discomfort and potentially improves compliance.

    Use of pre-transformation to cope with outlying values in important candidate genes

    Outlying values in predictors often strongly affect the results of statistical analyses in high-dimensional settings. Although they frequently occur with most high-throughput techniques, the problem is often ignored in the literature. We suggest using a very simple transformation, proposed before in a different context by Royston and Sauerbrei, as an intermediary step between array normalization and high-level statistical analysis. This straightforward univariate transformation identifies extreme values and considerably reduces the influence of outlying values in all further steps of statistical analysis without eliminating the incriminated observation or feature. The use of the transformation and its effects are demonstrated for diverse univariate and multivariate statistical analyses using nine publicly available microarray data sets.

    Data Improving in Time Series Using ARX and ANN Models

    Anomalous data can negatively impact energy forecasting by causing model parameters to be incorrectly estimated. This paper presents two approaches for the detection and imputation of anomalies in time series data. Autoregressive with exogenous inputs (ARX) and artificial neural network (ANN) models are used to extract the characteristics of time series. Anomalies are detected by performing hypothesis testing on the extrema of the residuals, and the anomalous data points are imputed using the ARX and ANN models. Because the anomalies affect the model coefficients, the data cleaning process is performed iteratively. The models are re-learned on "cleaner" data after an anomaly is imputed. The anomalous data are re-imputed at each iteration using the updated ARX and ANN models. The ARX and ANN data cleaning models are evaluated on natural gas time series data. This paper demonstrates that the proposed approaches are able to identify and impute anomalous data points. Forecasting models learned on the unclean data and the cleaned data are tested on an uncleaned out-of-sample dataset. The forecasting model learned on the cleaned data outperforms the model learned on the unclean data, with a 1.67% improvement in the mean absolute percentage error and a 32.8% improvement in the root mean squared error. Existing challenges include correctly identifying specific types of anomalies such as negative flows.
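
    The iterative detect, impute and re-learn loop described in this abstract can be sketched roughly as follows. This is an illustration under strong simplifying assumptions, not the paper's method: a straight-line regression on one exogenous input stands in for the ARX and ANN models, and a robust z-score cut-off (median and MAD) stands in for the hypothesis test on residual extrema. The names `fit_line`, `clean_series`, `z_cut` and `tol` are hypothetical.

```python
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def clean_series(x, y, z_cut=3.5, tol=1e-9, max_iter=100):
    """Iteratively flag and impute anomalous values of y given input x.

    Each pass re-fits the model on the current (partly imputed) series,
    re-imputes every flagged point with the updated fit, then checks the
    most extreme remaining residual. Stops when no new point is flagged
    and the imputed values have stopped moving.
    """
    y = list(y)
    flagged = []
    for _ in range(max_iter):
        a, b = fit_line(x, y)

        # Re-impute previously flagged points with the updated model.
        change = 0.0
        for t in flagged:
            new = a + b * x[t]
            change = max(change, abs(new - y[t]))
            y[t] = new

        rest = [t for t in range(len(y)) if t not in flagged]
        if len(rest) < 3:
            break
        resid = {t: y[t] - (a + b * x[t]) for t in rest}
        med = statistics.median(resid.values())
        mad = statistics.median(abs(r - med) for r in resid.values())
        scale = 1.4826 * mad  # MAD rescaled to match a Gaussian std dev

        # Candidate anomaly: the residual farthest from the median.
        worst = max(rest, key=lambda t: abs(resid[t] - med))
        if scale > 0 and abs(resid[worst] - med) / scale > z_cut:
            flagged.append(worst)
            y[worst] = a + b * x[worst]
        elif change < tol:
            break
    return y, sorted(flagged)


# Hypothetical example: a linear relation with one corrupted reading.
driver = [float(i) for i in range(10)]
flow = [2.0 + 3.0 * v for v in driver]
flow[5] = 100.0
cleaned, anomalies = clean_series(driver, flow)
```

    Because the first fit is distorted by the anomaly itself, the first imputed value is biased; re-imputing after each re-fit is what lets it settle toward the value the clean data imply, which mirrors the motivation given in the abstract for iterating.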

    Approximating Cross-validatory Predictive P-values with Integrated IS for Disease Mapping Models

    An important statistical task in disease mapping problems is to identify outlier/divergent regions with unusually high or low residual risk of disease. Leave-one-out cross-validatory (LOOCV) model assessment is a gold standard for computing predictive p-values that can flag such outliers. However, actual LOOCV is time-consuming because one needs to re-simulate a Markov chain for each posterior distribution in which an observation is held out as a test case. This paper introduces a new method, called iIS, for approximating LOOCV with only Markov chain samples simulated from a posterior based on a full data set. iIS is based on importance sampling (IS). iIS integrates the p-value and the likelihood of the test observation with respect to the distribution of the latent variable without reference to the actual observation. The predictive p-values computed with iIS can be proved to be equivalent to the LOOCV predictive p-values, following the general theory for IS. We compare iIS and three other existing methods in the literature with a lip cancer dataset collected in Scotland. Our empirical results show that iIS provides predictive p-values that are almost identical to the actual LOOCV predictive p-values and outperforms the three existing methods, including the recently proposed ghosting method by Marshall and Spiegelhalter (2007). Comment: 21 pages
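
    For orientation, the plain importance-sampling approximation to a leave-one-out predictive p-value, which iIS builds on, can be sketched as below. This is the basic IS estimator, not iIS itself: it does not integrate over the latent variable. It assumes a Poisson observation model, defines the p-value as the upper tail probability P(Y_rep >= y_obs), and takes `mu_samples` to be full-data posterior draws of the Poisson mean for the held-out region. All names are hypothetical.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def poisson_upper_tail(k, mu):
    """P(Y >= k) for Y ~ Poisson(mu)."""
    return 1.0 - sum(poisson_pmf(j, mu) for j in range(k))


def is_loo_pvalue(y_obs, mu_samples):
    """Plain IS estimate of the leave-one-out predictive p-value.

    Draws from the full-data posterior are re-weighted by the inverse
    likelihood of the held-out observation, which turns them into a
    weighted sample from the posterior without that observation.
    """
    weights = [1.0 / poisson_pmf(y_obs, mu) for mu in mu_samples]
    tails = [poisson_upper_tail(y_obs, mu) for mu in mu_samples]
    return sum(w * t for w, t in zip(weights, tails)) / sum(weights)


# Two made-up posterior draws of the region's Poisson mean.
p = is_loo_pvalue(1, [1.0, 2.0])
```

    Inverse-likelihood weights can have very high variance when the held-out observation is influential, which is one reason refinements of plain IS are of interest in this setting.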

    Mixture Modeling and Outlier Detection in Microarray Data Analysis

    Microarray technology has become a dynamic tool in gene expression analysis because it allows for the simultaneous measurement of thousands of gene expressions. Uniqueness in experimental units and microarray data platforms, coupled with how gene expressions are obtained, make the field open for interesting research questions. In this dissertation, we present our investigations of two independent studies related to microarray data analysis. First, we study a recent platform in biology and bioinformatics that compares the quality of genetic information from exfoliated colonocytes in fecal matter with genetic material from mucosa cells within the colon. Using the intraclass correlation coefficient (ICC) as a measure of reproducibility, we assess the reliability of density estimation obtained from preliminary analysis of fecal and mucosa data sets. Numerical findings clearly show that the distribution is comprised of two components. For measurements between 0 and 1, it is natural to assume that the data points are from a beta-mixture distribution. We explore whether ICC values should be modeled with a beta mixture or transformed first and fit with a normal mixture. We find that the use of mixture of normals in the inverse-probit transformed scale is less sensitive toward model mis-specification; otherwise a biased conclusion could be reached. By using the normal mixture approach to compare the ICC distributions of fecal and mucosa samples, we observe the quality of reproducible genes in fecal array data to be comparable with that in mucosa arrays. For microarray data, within-gene variance estimation is often challenging due to the high frequency of low replication studies. Several methodologies have been developed to strengthen variance terms by borrowing information across genes. However, even with such accommodations, variance may be initiated by the presence of outliers.
    For our second study, we propose a robust modification of optimal shrinkage variance estimation to improve outlier detection. In order to increase power, we suggest grouping standardized data so that information shared across genes is similar in distribution. Simulation studies and analysis of real colon cancer microarray data reveal that our methodology provides a technique which is insensitive to outliers, free of distributional assumptions, effective for small sample size, and data adaptive.