
    Statistical Methods for Analysis of Multi-Sample Copy Number Variants and ChIP-seq Data

    This dissertation addresses statistical problems in multi-sample copy number variant (CNV) analysis and in the analysis of differential enrichment of histone modifications (HMs) between two or more biological conditions based on chromatin immunoprecipitation sequencing (ChIP-seq) data. The first part of the dissertation develops methods for identifying copy number variants that are associated with trait values. We develop a novel method, CNVtest, to directly identify trait-associated CNVs without the need to identify sample-specific CNVs first. Asymptotic theory shows that CNVtest controls the Type I error asymptotically and identifies the true trait-associated CNVs with high probability. The performance of the method is demonstrated through simulations and through an application identifying CNVs associated with population differentiation. The second part of the dissertation develops methods for detecting genes with differential enrichment of histone modifications between two or more experimental conditions based on ChIP-seq data. We apply several nonparametric methods to identify genes with differential enrichment; the methods can be applied to ChIP-seq histone modification data even without replicates. They are based on nonparametric hypothesis testing designed to capture spatial differences in protein-enrichment profiles. The key to our approaches is the use of null genes or input ChIP-seq data to choose biologically relevant null values for the tests. We demonstrate the method using ChIP-seq data from a comparative epigenomic profiling of adipogenesis of murine adipose stromal cells. Our method detects many genes with differential H3K27ac levels at gene promoter regions between proliferating preadipocytes and mature adipocytes in murine 3T3-L1 cells. The test statistics correlate well with, and are predictive of, gene expression changes, indicating that the identified differentially enriched regions are indeed biologically meaningful. We further extend these tests to time-course ChIP-seq experiments by evaluating the maximum and the mean of adjacent pair-wise statistics to detect differentially enriched genes across several time points. We compare and evaluate different nonparametric tests for differential enrichment analysis and observe that the kernel-smoothing methods control Type I errors better, although the rankings of genes with differentially enriched regions are comparable across test statistics.
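    As an illustration of the kernel-smoothing flavour of such tests (a hypothetical statistic and calibration, not the dissertation's exact procedure), one can kernel-smooth two read-count profiles, take the maximum absolute difference of the smoothed curves, and calibrate that statistic against values computed on null genes:

    ```python
    import numpy as np

    def smoothed_diff_stat(x, y, bandwidth=5.0):
        """Maximum absolute difference between Gaussian-kernel-smoothed
        versions of two read-count profiles over the same region
        (hypothetical statistic capturing spatial profile differences)."""
        pos = np.arange(len(x))
        w = np.exp(-0.5 * ((pos[:, None] - pos[None, :]) / bandwidth) ** 2)
        w /= w.sum(axis=1, keepdims=True)  # normalise smoothing weights
        return np.max(np.abs(w @ x - w @ y))

    def empirical_pvalue(stat, null_stats):
        """Calibrate a gene's statistic against statistics computed on
        null genes (or input ChIP-seq data), giving a biologically
        relevant null distribution without replicates."""
        null_stats = np.asarray(null_stats)
        return (1 + np.sum(null_stats >= stat)) / (1 + len(null_stats))
    ```

    For the time-course extension, the same statistic would be computed for each adjacent pair of time points and summarised by its maximum or mean across pairs.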

    Models and Methods for Automated Background Density Estimation in Hyperspectral Anomaly Detection

    Detecting targets with unknown spectral signatures in hyperspectral imagery has proven to be of great interest in several applications. Because no knowledge of the targets of interest is assumed, the task is performed by searching the image for anomalous pixels, i.e. pixels deviating from a statistical model of the background. The hyperspectral literature distinguishes two main approaches to Anomaly Detection (AD), leading to different ways of modeling the background: global and local. Global AD algorithms are designed to locate small, rare objects that are anomalous with respect to the global background, identified with a large portion of the image. In local AD strategies, by contrast, pixels whose spectral features differ significantly from a local neighborhood surrounding the observed pixel are detected as anomalies. In this thesis, a new scheme is proposed for detecting both global and local anomalies. Specifically, a simplified Likelihood Ratio Test (LRT) decision strategy is derived that involves thresholding the background log-likelihood and thus only requires specification of the background Probability Density Function (PDF). Within this framework, the use of parametric, semi-parametric (in particular finite mixture), and non-parametric models is investigated for background PDF estimation. Although such approaches are well known and widely employed in multivariate data analysis, they have seldom been applied to estimate the hyperspectral background PDF, mostly because of the difficulty of reliably learning the model parameters without operator intervention, which is highly desirable in practical AD tasks.
This work represents the first attempt to jointly examine such methods in order to assess and discuss the most critical issues in their employment for PDF estimation of the hyperspectral background, with specific reference to the detection of anomalous objects in a scene. Semi- and non-parametric estimators have been successfully employed to estimate the image background PDF for detecting global anomalies by means of ad hoc learning procedures. In particular, strategies developed within a Bayesian framework have been considered for automatically estimating the parameters of mixture models, together with one of the best-known non-parametric techniques, the fixed kernel density estimator (FKDE). In the latter, the performance and modeling ability depend on scale parameters called bandwidths. It has been shown that bandwidths fixed across the entire feature space, as in the FKDE, are not effective when the sample data exhibit different local peculiarities across the data domain, as generally occurs in practice. Therefore, several possibilities are investigated for improving the FKDE background PDF estimate by allowing the bandwidths to vary over the estimation domain, adapting the amount of smoothing to the local density of the data so as to follow the background structure of hyperspectral images more reliably and accurately. The use of such variable bandwidth kernel density estimators (VKDE) is also proposed for estimating the background PDF within the considered AD scheme for detecting local anomalies. This choice aims to cope with non-Gaussian backgrounds, improving classical local AD algorithms based on parametric and non-parametric background models. The locally data-adaptive non-parametric model is chosen because it combines the potential of non-parametric PDF estimators to model data without specific distributional assumptions with the benefits of bandwidths that vary across the data domain. The ability of the proposed AD scheme under different background PDF models and learning methods is evaluated experimentally on real hyperspectral images containing objects that are anomalous with respect to the background.
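    The core of the decision strategy, thresholding the background log-likelihood, can be sketched as follows. This is not the thesis code: a plain multivariate Gaussian stands in for the background PDF model (mixture or kernel estimates would substitute for `gaussian_loglik`), and the threshold is assumed given rather than derived:

    ```python
    import numpy as np

    def gaussian_loglik(X, mean, cov):
        """Log-density of each row of X under a multivariate Gaussian
        background model (simplest parametric choice, for illustration)."""
        d = X.shape[1]
        diff = X - mean
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis distances
        return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

    def detect_anomalies(pixels, threshold):
        """Global AD: flag pixels whose log-likelihood under the
        background model falls below a threshold.  The background
        parameters are here estimated from the whole image, so rare
        anomalies barely perturb them."""
        mean = pixels.mean(axis=0)
        cov = np.cov(pixels, rowvar=False)
        loglik = gaussian_loglik(pixels, mean, cov)
        return loglik < threshold
    ```

    The same thresholding scheme applies unchanged for local AD; only the log-likelihood model changes, being fitted to a neighbourhood around each pixel.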

    Deriving probabilistic short-range forecasts from a deterministic high-resolution model

    In order to take full advantage of short-range forecasts from deterministic high-resolution NWP models, the direct model output must be addressed in a probabilistic framework. A promising approach is mesoscale ensemble prediction. However, its operational use is still hampered by conceptual deficiencies and large computational costs. This study tackles two relevant issues: (1) the representation of model-related forecast uncertainty in mesoscale ensemble prediction systems and (2) the development of post-processing procedures that retrieve additional probabilistic information from a single model simulation. Special emphasis is laid on mesoscale forecast uncertainty of summer precipitation and 2m-temperature in Europe. The source of forecast guidance is the deterministic high-resolution model Lokal-Modell (LM) of the German Weather Service. The study provides insight into the effect and usefulness of stochastic parametrisation schemes in the representation of short-range forecast uncertainty. A stochastic parametrisation scheme is implemented into the LM in an attempt to simulate the stochastic effect of sub-grid scale processes. Experimental ensembles show that the scheme has a substantial effect on the forecast of precipitation amount. However, objective verification reveals that the ensemble does not attain better forecast goodness than a single LM simulation. Urgent issues for future research are identified. In the context of statistical post-processing, two schemes are designed: the neighbourhood method and wavelet smoothing. Both approaches fall under the framework of estimating a large array of statistical parameters on the basis of a single realisation of each parameter. The neighbourhood method is based on the notion of spatio-temporal ergodicity, including explicit corrections for enhanced predictability from topographic forcing.
The neighbourhood method derives estimates of quantiles, exceedance probabilities and expected values at each grid point of the LM. If the post-processed precipitation forecast is formulated in terms of probabilities or quantiles, it attains clear superiority over the raw model output. Wavelet smoothing originates from the field of image denoising and includes concepts of multiresolution analysis and non-parametric regression. In this study, the method is used to produce estimates of the expected value, but it may easily be extended to the additional estimation of exceedance probabilities. Wavelet smoothing is not only computationally more efficient than the neighbourhood method, but automatically adapts the amount of spatial smoothing to local properties of the underlying data. The method apparently detects deterministically predictable temperature patterns on the basis of statistical guidance only.
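    The neighbourhood idea can be sketched as follows, under the ergodicity assumption that grid points in a spatial window around a location behave like pseudo-ensemble members for that location. This is a simplified illustration: the actual method also uses temporal neighbours and topographic predictability corrections not shown here.

    ```python
    import numpy as np

    def neighbourhood_stats(field, i, j, radius=2, threshold=1.0):
        """Derive quantiles, an exceedance probability and an expected
        value at grid point (i, j) of a single deterministic forecast
        field, treating the surrounding window as an ensemble."""
        rows = slice(max(i - radius, 0), i + radius + 1)
        cols = slice(max(j - radius, 0), j + radius + 1)
        members = field[rows, cols].ravel()  # pseudo-ensemble members
        return {
            'q10': np.quantile(members, 0.10),
            'q90': np.quantile(members, 0.90),
            'p_exceed': np.mean(members > threshold),  # P(value > threshold)
            'mean': members.mean(),                    # expected value
        }
    ```

    Applying this at every grid point turns one deterministic field into maps of quantiles and exceedance probabilities, which is the probabilistic product the abstract describes.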

    Nonparametric Methods for the Comparison of ROC Curves with Application to Biomedicine

    The main goal of this project is to use nonparametric methods to obtain new procedures concerning the ROC curve with significant applications in the biomedical field. Firstly, tests for the comparison of ROC curves without covariates will be studied. Next, new tests will be designed and studied for the comparison of ROC curves in the presence of a one-dimensional covariate. Finally, those tests will be adapted to the case in which the covariate is multidimensional. It will also be of great interest to study whether it is necessary to include those covariates in an ROC curve analysis. The new procedures will be developed in R and later applied to real data provided by the Hospital Universitario de Santiago de Compostela.
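    As an illustration of the kind of nonparametric ROC machinery involved (not the project's own procedures, which will be developed in R), the area under the empirical ROC curve equals the Mann-Whitney statistic, and two paired markers can be compared with a stratified bootstrap of the AUC difference:

    ```python
    import numpy as np

    def auc(scores, labels):
        """Empirical AUC via the Mann-Whitney statistic: the fraction of
        (diseased, healthy) pairs ranked correctly, ties counting half."""
        pos = scores[labels == 1]
        neg = scores[labels == 0]
        diff = pos[:, None] - neg[None, :]
        return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

    def bootstrap_auc_diff(s1, s2, labels, n_boot=200, seed=0):
        """Stratified bootstrap of the difference in AUC between two
        markers measured on the same subjects; returns the mean
        difference and a percentile 95% interval."""
        rng = np.random.default_rng(seed)
        pos_idx = np.flatnonzero(labels == 1)
        neg_idx = np.flatnonzero(labels == 0)
        diffs = []
        for _ in range(n_boot):
            # resample within each class so both classes stay represented
            idx = np.concatenate([rng.choice(pos_idx, len(pos_idx)),
                                  rng.choice(neg_idx, len(neg_idx))])
            lab = labels[idx]
            diffs.append(auc(s1[idx], lab) - auc(s2[idx], lab))
        diffs = np.asarray(diffs)
        return diffs.mean(), np.quantile(diffs, [0.025, 0.975])
    ```

    An interval excluding zero suggests the two ROC curves differ; covariate-specific versions condition these quantities on the covariate value.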

    Training and assessing classification rules with unbalanced data

    The problem of modeling binary responses from cross-sectional data has been addressed with a number of satisfying solutions that draw on both parametric and nonparametric methods. However, there exist many real situations where one of the two responses (usually the more interesting one for the analysis) is rare. It has been widely reported that this class imbalance heavily compromises the learning process, because the model tends to focus on the prevalent class and to ignore the rare events. Moreover, not only is the estimation of the classification model affected by a skewed class distribution, but the evaluation of its accuracy is also jeopardized, because the scarcity of data leads to poor estimates of the model's accuracy. In this work, the effects of class imbalance on model training and model assessment are discussed, and a unified, systematic framework for dealing with both problems is proposed, based on a smoothed bootstrap re-sampling technique.
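    The smoothed bootstrap idea can be sketched as follows (an illustrative ROSE-style oversampler, not the authors' exact scheme): resample the rare class with replacement and perturb each draw with kernel noise, so synthetic examples fall near, but not exactly on, observed ones.

    ```python
    import numpy as np

    def smoothed_bootstrap(X, n_samples, bandwidth=0.1, seed=0):
        """Smoothed bootstrap sample: rows drawn with replacement plus
        Gaussian kernel noise (bandwidth chosen by the user here;
        data-driven rules exist but are not shown)."""
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(X), n_samples)
        noise = rng.normal(0.0, bandwidth, size=(n_samples, X.shape[1]))
        return X[idx] + noise

    def rebalance(X, y, bandwidth=0.1, seed=0):
        """Oversample the rare class via the smoothed bootstrap until
        both classes have equal size."""
        classes, counts = np.unique(y, return_counts=True)
        rare = classes[np.argmin(counts)]
        deficit = counts.max() - counts.min()
        extra = smoothed_bootstrap(X[y == rare], deficit, bandwidth, seed)
        return np.vstack([X, extra]), np.concatenate([y, np.full(deficit, rare)])
    ```

    The same resampling device supports the assessment side: smoothed bootstrap replicates of the rare class give more stable accuracy estimates than the few raw observations alone.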