13 research outputs found

    Robust Sparse Estimation of Multiresponse Regression and Inverse Covariance Matrix via the L2 distance

    ABSTRACT We propose a robust framework to jointly perform two key modeling tasks involving high-dimensional data: (i) learning a sparse functional mapping from multiple predictors to multiple responses while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for their predictors. Traditional likelihood-based estimators lack resilience to outliers and model misspecification, an issue exacerbated by high-dimensional noisy data. We propose instead to minimize a regularized L2-distance criterion, motivated by the minimum distance functionals used in nonparametric methods for their excellent robustness properties. The proposed estimates can be obtained efficiently via a sequential quadratic programming algorithm. We provide theoretical justification, including estimation consistency, for the proposed estimator. Additionally, we shed light on the robustness of our estimator through its linearization, which yields a combination of a weighted lasso and a graphical lasso, with the sample weights providing an intuitive explanation of the robustness. We demonstrate the merits of our framework through simulation studies and analyses of real financial and genetics data.
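The linearization mentioned in the abstract reduces to a weighted lasso. As an illustration only (not the authors' code), a minimal proximal-gradient sketch of a weighted lasso might look like the following, where the per-sample `weights` would, in the abstract's setting, come from the L2-distance linearization and down-weight outlying samples:

```python
import numpy as np

def weighted_lasso(X, y, weights, lam=0.1, step=None, n_iter=500):
    """Proximal gradient (ISTA) for a weighted lasso:
        minimize 0.5 * sum_i w_i * (y_i - x_i^T b)^2 + lam * ||b||_1.
    Small weights w_i mute the influence of outlying samples, which is
    the intuition behind the robustness described in the abstract."""
    n, p = X.shape
    w = weights
    if step is None:
        # 1 / Lipschitz constant of the weighted least-squares gradient
        step = 1.0 / np.linalg.norm((X * w[:, None]).T @ X, 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (w * (X @ b - y))            # gradient of weighted LS loss
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return b
```

With all weights equal to one this is an ordinary lasso; robust schemes replace the uniform weights with data-driven ones.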

    SUPERVISED LEARNING FOR COMPLEX DATA

    Supervised learning problems are commonly seen in a wide range of scientific fields such as medicine and neuroscience. Given data with predictors and responses, an important goal of supervised learning is to find the underlying relationship between predictors and responses for future prediction. In this dissertation, we propose three new supervised learning approaches for the analysis of complex data. The first two projects focus on block-wise missing multi-modal data, which contain samples with different modalities. In the first project, we study regression problems with multiple responses. We propose a new penalized method to predict multiple correlated responses jointly, using not only the information from block-wise missing predictors but also the correlation information among responses. In the second project, we study regression problems with censored outcomes. We propose a penalized Buckley-James method that can simultaneously handle block-wise missing covariates and censored outcomes. In the third project, we analyze data streams under reproducing kernel Hilbert spaces. Specifically, we develop a new supervised learning method to learn the underlying model with limited storage space, where the model may be non-stationary. We use a shrinkage parameter and a data sparsity constraint to balance the bias-variance tradeoff, and use random feature approximation to control the storage space. Doctor of Philosophy
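As a rough illustration of the storage-bounded streaming idea in the third project (not the dissertation's actual method), a random-Fourier-feature regressor keeps memory fixed at O(D^2) regardless of stream length; the `shrink` forgetting factor here is an assumed stand-in for the shrinkage parameter the abstract mentions:

```python
import numpy as np

class RFFRegressor:
    """Streaming kernel ridge regression via random Fourier features.
    D fixed features approximate an RBF kernel, so memory is O(D^2)
    no matter how long the data stream is; shrink < 1 forgets old
    data, which helps when the underlying model is non-stationary."""
    def __init__(self, dim, D=100, gamma=1.0, lam=1e-2, shrink=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Spectral sampling for the kernel exp(-gamma * ||x - x'||^2)
        self.W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, D))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        self.A = lam * np.eye(D)   # running feature second-moment + ridge
        self.c = np.zeros(D)       # running feature/target correlation
        self.shrink = shrink
        self.D = D

    def _phi(self, x):
        return np.sqrt(2.0 / self.D) * np.cos(x @ self.W + self.b)

    def partial_fit(self, x, y):
        z = self._phi(x)
        self.A = self.shrink * self.A + np.outer(z, z)
        self.c = self.shrink * self.c + y * z

    def predict(self, X):
        w = np.linalg.solve(self.A, self.c)
        return self._phi(X) @ w
```

Each `partial_fit` touches only the fixed-size sufficient statistics `A` and `c`, so no raw samples need to be stored.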

    Interpretable Machine Learning and Deep Learning Frameworks for Predictive Analytics and Biomarker Discovery from Multimodal Imaging Genetics Data

    In healthcare, neuroimaging studies and genetics research are generating torrents of data to understand the hereditary components of neurological and psychiatric disorders. Neuroimaging techniques like functional Magnetic Resonance Imaging probe into the neural functioning of the disorder. In parallel, genome sequencing technologies explore the genetic underpinning. Integrating these complementary viewpoints in a single framework improves diagnosis and provides biological insights about the disorders. However, imaging-genetic data lies in a high-dimensional space with complex interaction and unknown causal factors. Our research aims to integrate multimodal imaging-genetics data to predict neuropsychiatric disorders while providing biological insights. We first propose a novel generative-discriminative framework that integrates imaging and genetics data for simultaneous biomarker identification and disease classification. The generative module extracts representation patterns from the data, while the discriminative module uses the representation vectors for diagnosis. Our experimental analyses show that the discriminative module guides our framework, leading to improved disease diagnosis and biomarker identification. Next, we extend the linear multivariate approach to capture the complex non-linear interaction between the data modalities using an autoencoder framework coupled with a classifier. Unlike traditional encoder-decoder models, our encoder module jointly identifies predictive imaging and genetic biomarkers using Bayesian feature selection. Our third work uses the Bayesian approach to find genetic variants that causally affect a trait. However, correct identification of the causal variants is challenging due to the correlation structure shared across variants. Our model combines a hierarchical Bayesian model with a deep learning-based inference procedure. 
We show that this combination provides greater inferential power to handle noise and spurious interactions of the genomic region. Finally, we solve the problem of encoding whole-genome genotype data in an imaging-genetics framework. Traditionally, imaging-genetics models sub-select genetic features to ensure model stability. Our approach departs from conventional Artificial Neural Networks and introduces biologically regularized graph convolution networks to encode the whole-genome genotype data. We show that this encoding strategy helps us track the convergence of genetic risk while preventing overfitting. In an exploratory analysis, we use this model to investigate the underlying biological processes associated with autism spectrum disorder and schizophrenia.

    Computational and Statistical Aspects of High-Dimensional Structured Estimation

    University of Minnesota Ph.D. dissertation. May 2018. Major: Computer Science. Advisor: Arindam Banerjee. 1 computer file (PDF); xiii, 256 pages. Modern statistical learning often faces high-dimensional data, for which the number of features that should be considered is very large. Owing to constraints encountered in data collection, such as cost and time, however, the available samples in certain domains are of small size compared with the feature sets. In this scenario, statistical estimation becomes much more challenging than in the large-sample regime. Since the information revealed by small samples is inadequate for finding the optimal model parameters, the estimator may end up with incorrect models that appear to fit the observed data but fail to generalize to unseen ones. Owing to prior knowledge about the underlying parameters, additional structures can be imposed to effectively reduce the parameter space, in which it is easier to identify the true parameter with limited data. This simple idea has inspired the study of high-dimensional statistics since its inception. Over the last two decades, sparsity has been one of the most popular structures to exploit when estimating a high-dimensional parameter; it assumes that the number of nonzero elements in the parameter vector/matrix is much smaller than its ambient dimension. For simple scenarios such as linear models, L1-norm-based convex estimators like the Lasso and the Dantzig selector have been widely used to find the true parameter with a reasonable amount of computation and provably small error. Recent years have also seen a variety of structures proposed beyond sparsity, e.g., group sparsity and matrix low-rankness, which are demonstrated to be useful in many applications. On the other hand, the aforementioned estimators can be extended to leverage new types of structures by finding appropriate convex surrogates like the L1 norm for sparsity. 
Despite their success on individual structures, current developments towards a unified understanding of various structures are still incomplete in both computational and statistical aspects. Moreover, due to the nature of the model or the parameter structure, the associated estimator can be inherently non-convex, which may need additional care when we consider such unification of different structures. In this thesis, we aim to make progress towards a unified framework for estimation with general structures, by studying the high-dimensional structured linear model and other semi-parametric and non-convex extensions. In particular, we introduce the generalized Dantzig selector (GDS), which extends the original Dantzig selector for sparse linear models. On the computational side, we develop an efficient optimization algorithm to compute the GDS. On the statistical side, we establish recovery guarantees for the GDS using certain geometric measures, and then demonstrate that those geometric measures can be bounded using simple information about the structures. These results on the GDS have been extended to the matrix setting as well. Apart from the linear model, we also investigate one of its semi-parametric extensions, the single-index model (SIM). To estimate the true parameter, we incorporate its structure into two types of simple estimators, whose estimation error can be established using similar geometric measures. We also design a new semi-parametric model called the sparse linear isotonic model (SLIM), for which we provide an efficient estimation algorithm along with its statistical guarantees. Lastly, we consider non-convex estimation for structured multi-response linear models. We propose an alternating estimation procedure to estimate the parameters. Despite the non-convexity, we show that the statistical guarantees for general structures can also be summarized by the geometric measures.
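The original (sparse, linear-model) Dantzig selector that the GDS generalizes can be written as a linear program. A minimal SciPy sketch (an illustration, not the thesis code) splits the parameter into positive and negative parts to linearize the L1 objective:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Dantzig selector as a linear program:
        minimize ||b||_1  subject to  ||X^T (y - X b)||_inf <= lam.
    Writing b = u - v with u, v >= 0 makes the objective linear."""
    n, p = X.shape
    G = X.T @ X
    r = X.T @ y
    # Constraints:  G(u - v) <= r + lam   and   -G(u - v) <= lam - r
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([r + lam, lam - r])
    c = np.ones(2 * p)                      # sum(u) + sum(v) = ||b||_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v
```

The tuning parameter `lam` caps the infinity norm of the correlation between the residual and the predictors, typically chosen on the order of the noise level times sqrt(n log p).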

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference


    A Novel Approach to Detecting Epistasis using Random Sampling Regularisation

    Epistasis analysis complements the ‘common disease, common variant’ hypothesis by highlighting the potential for connected networks of genetic variants to collaborate in producing a phenotypic expression. Epistasis is commonly analysed either pairwise (variant vs variant) or through higher-order interactions of unlimited arity. Either way, the number of tests grows far beyond that of a standard Genome-Wide Association Study (GWAS), where the False Discovery Rate (FDR) is already an issue; multiplying the number of tests up to a factorial rate compounds the FDR problem. Further, epistasis introduces its own limits of computational complexity and intensity depending on the analysis performed; in the most intensive case, an exhaustive multivariate analysis has a time complexity of O(n!). Proposed in this paper is a novel methodology for the detection of epistasis using interpretable methods and best practice to outline interactions through filtering processes. Random Sampling Regularisation randomly splits the data into sample sets and conducts a voting system across splits to regularise the significance and reliability of biological markers (SNPs). Preliminary results are promising, outlining a concise detection of interactions. For the classification of breast cancer patients, results indicated eight candidate risk interactions among five variants and a single candidate variant with a high protective association.
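A hypothetical sketch of the voting idea described in this abstract, with a simple per-SNP correlation score standing in for whatever association test the paper actually uses (the function name and scoring rule are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def random_sampling_regularisation(X, y, n_splits=50, frac=0.5, top_k=5, seed=0):
    """Repeatedly subsample the cohort, rank SNPs by a per-split
    association score, and count how often each SNP lands in the
    top-k. Markers with a stable vote frequency across random splits
    are treated as reliable; unstable ones are filtered out."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    votes = np.zeros(p, dtype=int)
    for _ in range(n_splits):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xs, ys = X[idx], y[idx] - y[idx].mean()
        score = np.abs((Xs - Xs.mean(0)).T @ ys)  # |covariance| per SNP
        votes[np.argsort(score)[-top_k:]] += 1    # top-k SNPs get a vote
    return votes / n_splits                       # vote frequency per SNP
```

Because each split sees a different random half of the samples, a SNP can only accumulate a high vote frequency if its association is stable rather than driven by a few samples, which is the regularising effect the abstract describes.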