
    Targeted Fused Ridge Estimation of Inverse Covariance Matrices from Multiple High-Dimensional Data Classes

    We consider the problem of jointly estimating multiple inverse covariance matrices from high-dimensional data consisting of distinct classes. An $\ell_2$-penalized maximum likelihood approach is employed. The suggested approach is flexible and generic, incorporating several other $\ell_2$-penalized estimators as special cases. In addition, the approach allows specification of target matrices through which prior knowledge may be incorporated and which can stabilize the estimation procedure in high-dimensional settings. The result is a targeted fused ridge estimator that is of use when the precision matrices of the constituent classes are believed to chiefly share the same structure while potentially differing in a number of locations of interest. It has many applications in (multi)factorial study designs. We focus on the graphical interpretation of precision matrices, with the proposed estimator then serving as a basis for integrative or meta-analytic Gaussian graphical modeling. Situations are considered in which the classes are defined by data sets and subtypes of diseases. The performance of the proposed estimator in the graphical modeling setting is assessed through extensive simulation experiments. Its practical usability is illustrated by the differential network modeling of 12 large-scale gene expression data sets of diffuse large B-cell lymphoma subtypes. The estimator and its related procedures are incorporated into the R package rags2ridges.
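    As a concrete illustration of the single-class building block, the sketch below implements a targeted ridge precision estimator of the closed-form type common in the $\ell_2$-penalized precision literature; the fused multi-class estimator of the paper is more involved, and the function name and conventions here are illustrative assumptions rather than the rags2ridges API.

```python
import numpy as np
from scipy.linalg import sqrtm

def targeted_ridge_precision(S, T, lam):
    """Targeted ridge (l2-penalized ML) estimate of a precision matrix.

    S   : (p, p) sample covariance matrix of one class
    T   : (p, p) positive definite target matrix encoding prior knowledge
    lam : ridge penalty, lam > 0 (larger pulls the estimate toward T)
    """
    p = S.shape[0]
    D = S - lam * T
    # Closed form of the targeted-ridge type; np.real discards tiny
    # imaginary round-off from the matrix square root.
    inner = np.real(sqrtm(lam * np.eye(p) + 0.25 * (D @ D))) + 0.5 * D
    return np.linalg.inv(inner)
```

    In a fused analysis, one such estimate per class would additionally be pulled toward the other classes' estimates by a fusion penalty.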

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; such violations can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
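    To make "spurious correlation" concrete, the short sketch below (with hypothetical sizes) shows that when the number of features far exceeds the sample size, some pure-noise features appear strongly correlated with an entirely independent response.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5000                       # few samples, many features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)            # response independent of every feature

# Sample correlation of y with each column of X.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n

# With p >> n, the largest |correlation| is far from zero by chance alone.
print(f"max |corr| over {p} independent features: {np.abs(corr).max():.2f}")
```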

    Model selection techniques for sparse weight-based principal component analysis

    Many studies make use of multiple types of data that are collected for the same set of samples, resulting in so-called multiblock data (e.g., multiomics studies). A popular analysis framework is sparse principal component analysis (PCA) of the concatenated data. The sparseness in the component weights of these models is usually induced by penalties. A crucial factor in the use of such penalized methods is a proper tuning of the regularization parameters used to give more or less weight to the penalties. In this paper, we examine several model selection procedures to tune these regularization parameters for sparse PCA. The model selection procedures include cross-validation, the Bayesian information criterion (BIC), the index of sparseness, and the convex hull procedure. Furthermore, to account for the multiblock structure, we present a sparse PCA algorithm with a group least absolute shrinkage and selection operator (LASSO) penalty added to it, to either select or cancel out blocks of data in an automated way. The tuning of the group LASSO parameter is also studied for the proposed model selection procedures. We conclude that when the component weights are to be interpreted, cross-validation with the one-standard-error rule is preferred; alternatively, if the interest lies in obtaining component scores using a very limited set of variables, the convex hull, BIC, and index of sparseness are all suitable.
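    The sketch below shows a generic one-component weight-based sparse PCA update with a LASSO soft-threshold; it is a minimal stand-in for the penalized algorithms discussed above, not the paper's method. The group LASSO variant would instead threshold each data block's sub-vector of weights by its $\ell_2$ norm.

```python
import numpy as np

def sparse_pca_weights(X, lam, n_iter=200, tol=1e-8):
    """One-component sparse PCA via alternating updates with an l1
    soft-threshold on the weight vector (a generic sketch)."""
    w = np.linalg.svd(X, full_matrices=False)[2][0]   # init: first right singular vector
    for _ in range(n_iter):
        t = X @ w                                     # component scores
        w_new = X.T @ t                               # unpenalized weight update
        w_new = np.sign(w_new) * np.maximum(np.abs(w_new) - lam, 0.0)
        nrm = np.linalg.norm(w_new)
        if nrm == 0.0:
            return w_new                              # lam too large: all weights zeroed
        w_new /= nrm
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

    Tuning lam by cross-validation with the one-standard-error rule, as the paper recommends for interpretable weights, amounts to picking the sparsest lam whose CV error is within one standard error of the minimum.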

    Heterogeneity-aware Clustered Distributed Learning for Multi-source Data Analysis

    In diverse fields ranging from finance to omics, it is increasingly common that data are distributed across multiple individual sources (referred to as "clients" in some studies). Integrating raw data, although powerful, is often not feasible, for example, when there are privacy protection considerations. Distributed learning techniques have been developed to integrate summary statistics as opposed to raw data. Many of the existing distributed learning studies stringently assume that all the clients share the same model. To accommodate data heterogeneity, some federated learning methods allow for client-specific models. In this article, we consider the scenario in which clients form clusters, those in the same cluster share the same model, and different clusters have different models. Considering this clustering structure can lead to a better understanding of the "interconnections" among clients and reduce the number of parameters. To this end, we develop a novel penalization approach: group penalization is imposed for regularized estimation and selection of important variables, and fusion penalization is imposed to automatically cluster clients. An effective ADMM algorithm is developed, and the estimation, selection, and clustering consistency properties are established under mild conditions. Simulation and data analysis further demonstrate the practical utility and superiority of the proposed approach.
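    A minimal sketch of the kind of criterion described above, assuming K clients with linear models: a client-wise loss plus a group penalty for variable selection and a pairwise fusion penalty that drives similar clients' coefficients to coincide. The names and exact form are assumptions; the paper optimizes its own criterion with ADMM.

```python
import numpy as np
from itertools import combinations

def clustered_objective(B, X_list, y_list, lam_group, lam_fuse):
    """B is (K, p): one coefficient row per client.

    loss  : average squared error per client
    group : l2 norm of each variable's coefficients across clients
            (selects variables jointly)
    fuse  : pairwise differences between client models; zero differences
            merge clients into clusters
    """
    K, _ = B.shape
    loss = sum(np.mean((y_list[k] - X_list[k] @ B[k]) ** 2) / 2 for k in range(K))
    group = lam_group * np.sum(np.linalg.norm(B, axis=0))
    fuse = lam_fuse * sum(np.linalg.norm(B[k] - B[l])
                          for k, l in combinations(range(K), 2))
    return loss + group + fuse
```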

    Adaptive and Robust Multi-task Learning

    We study the multi-task learning problem that aims to simultaneously analyze multiple datasets collected from different sources and learn one model for each of them. We propose a family of adaptive methods that automatically utilize possible similarities among those tasks while carefully handling their differences. We derive sharp statistical guarantees for the methods and prove their robustness against outlier tasks. Numerical experiments on synthetic and real datasets demonstrate the efficacy of our new methods.
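    As a caricature of the adaptivity and robustness described above (not the paper's procedure), the sketch below borrows strength only across tasks whose single-task estimates are close to a robust center, leaving outlier tasks untouched; the threshold tau is a hypothetical tuning parameter.

```python
import numpy as np

def adaptive_multitask(beta_hats, tau):
    """beta_hats: (K, p) array of single-task estimates."""
    center = np.median(beta_hats, axis=0)        # robust cross-task center
    fused = []
    for b in beta_hats:
        if np.linalg.norm(b - center) <= tau:    # similar task: shrink toward center
            fused.append(0.5 * b + 0.5 * center)
        else:                                    # outlier task: keep its own estimate
            fused.append(b)
    return np.array(fused)
```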

    Bayesian Integrative Analysis of Omics Data

    Technological innovations have produced large multi-modal datasets that include multi-platform genomic data, pathway data, proteomic data, imaging data, and clinical data. Integrative analysis of such data sets has the potential to reveal important biological and clinical insights into complex diseases like cancer. This dissertation focuses on establishing Bayesian methodology for integrative analysis of radiogenomics and for pathway driver detection in cancer applications. We first present Radio-iBAG, which uses Bayesian approaches to analyze radiological imaging and multi-platform genomic data: a multi-scale Bayesian hierarchical model that simultaneously identifies genomic and radiomic (i.e., radiology-based imaging) markers, along with the latent associations between the two modalities, and detects the overall prognostic relevance of the combined markers. Our method is motivated by and applied to The Cancer Genome Atlas glioblastoma multiforme data set, where it identifies important magnetic resonance imaging features and the associated genomic platforms that are also significantly related to patient survival times. We then present pathDrive, which aims to detect key genetic and epigenetic upstream drivers that influence pathway activity. The method is applied to colorectal cancer and its four molecular subtypes. For each pathway that significantly differentiates subgroups, we detect important genomic drivers that can be viewed as “switches” for the pathway's activity. Finally, we extend the analysis by developing proteomics-based pathway driver analysis for multiple cancer types, in which we simultaneously detect genomic upstream factors that influence a specific pathway for each cancer type within a cancer group. With a Bayesian hierarchical model, we detect signals by borrowing strength from common cancer types to rare cancer types while simultaneously estimating their selection similarity. Simulation studies demonstrate that our method provides many advantages, including increased power and lower false discovery rates. We then apply the method to the analysis of multiple cancer groups, where we detect key genomic upstream drivers with proper biological interpretation. The framework and methodologies established in this dissertation support further investigation in the field of integrative analysis of omics data and provide more comprehensive insight into biological mechanisms and processes and into cancer development and progression.
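    The borrowing-of-strength idea can be caricatured with a two-level normal hierarchy, far simpler than the dissertation's models: per-cancer-type effect estimates are shrunk toward a shared mean, so imprecise estimates from rare types lean on the common ones. All names and the model itself are illustrative assumptions.

```python
import numpy as np

def hierarchical_shrinkage(effects, se, tau2):
    """Posterior means under effects[i] ~ N(mu, tau2) with sampling
    variance se[i]**2; noisier estimates are shrunk harder toward mu."""
    prec = 1.0 / (tau2 + se ** 2)
    mu = np.average(effects, weights=prec)      # estimate of the shared mean
    w = tau2 / (tau2 + se ** 2)                 # reliability of each estimate
    return mu + w * (effects - mu)
```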

    Robust penalized regression for complex high-dimensional data

    Robust high-dimensional data analysis has become an important and challenging task in complex Big Data analysis due to high dimensionality and data contamination. One of the most popular procedures is robust penalized regression. In this dissertation, we address three typical robust ultra-high-dimensional regression problems via penalized regression approaches. The first problem concerns the linear model in the presence of outliers, dealing with outlier detection, variable selection, and parameter estimation simultaneously. The second problem concerns robust high-dimensional mean regression with irregular settings such as data contamination, data asymmetry, and heteroscedasticity. The third problem concerns robust bi-level variable selection for the linear regression model with grouping structures in the covariates. In Chapter 1, we introduce the background and challenges through an overview of penalized least squares methods and robust regression techniques. In Chapter 2, we propose a novel approach in a penalized weighted least squares framework to perform simultaneous variable selection and outlier detection. We provide a unified link between the proposed framework and robust M-estimation in general settings. We also establish non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and weight vectors. In Chapter 3, we establish a framework of robust estimators in high-dimensional regression models using Penalized Robust Approximated quadratic M-estimation (PRAM). This framework allows general settings, such as random errors that lack symmetry and homogeneity, or covariates that are not sub-Gaussian. Theoretically, we show that, in the ultra-high-dimensional setting, the PRAM estimator has local estimation consistency at the minimax rate enjoyed by the LS-Lasso and possesses the local oracle property under certain mild conditions. In Chapter 4, we extend the study of Chapter 3 to robust high-dimensional data analysis with structured sparsity. In particular, we propose a framework of high-dimensional M-estimators for bi-level variable selection. This framework encourages bi-level sparsity through a computationally efficient two-stage procedure. It produces strongly robust parameter estimators if certain nonconvex redescending loss functions are applied. In theory, we provide sufficient conditions under which our proposed two-stage penalized M-estimator possesses simultaneous local estimation consistency and bi-level variable selection consistency if a certain nonconvex penalty function is used at the group level. The performance of the proposed estimators is demonstrated in both simulation studies and real examples. In Chapter 5, we provide some discussion and directions for future work.
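    Chapter 2's idea of simultaneous estimation and outlier detection can be sketched with a mean-shift formulation: each observation gets its own shift parameter, and an $\ell_1$ penalty zeroes the shifts of clean observations while flagging outliers. This is a generic sketch of that style of estimator, not the dissertation's penalized weighted least squares framework.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mean_shift_outlier_regression(X, y, lam, n_iter=100):
    """Alternate between refitting beta on the shift-corrected response
    and soft-thresholding residuals into per-observation shifts gamma;
    observations with gamma != 0 are flagged as outliers."""
    n, p = X.shape
    beta, gamma = np.zeros(p), np.zeros(n)
    X_pinv = np.linalg.pinv(X)
    for _ in range(n_iter):
        beta = X_pinv @ (y - gamma)
        gamma = soft(y - X @ beta, lam)
    return beta, gamma != 0
```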

    Formal Models of the Network Co-occurrence Underlying Mental Operations

    Systems neuroscience has identified a set of canonical large-scale networks in humans. These have predominantly been characterized by resting-state analyses of the task-unconstrained, mind-wandering brain. Their explicit relationship to defined task performance is largely unknown and remains challenging to establish. The present work contributes a multivariate statistical learning approach that can extract the major brain networks and quantify their configuration during various psychological tasks. The method is validated in two extensive datasets (n = 500 and n = 81) by model-based generation of synthetic activity maps from recombination of shared network topographies. As a use case, we formally revisited the poorly understood difference between neural activity underlying idling versus goal-directed behavior. We demonstrate that task-specific neural activity patterns can be explained by plausible combinations of resting-state networks. The possibility of decomposing a mental task into the relative contributions of major brain networks, the "network co-occurrence architecture" of a given task, opens an alternative access to the neural substrates of human cognition.
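    The decomposition idea, reduced to its simplest form: express one task activity map as a combination of canonical network topographies and read off relative contributions. The sketch below uses plain non-negative least squares as a stand-in for the paper's statistical learning approach; inputs and names are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def network_cooccurrence(task_map, networks):
    """task_map : (n_voxels,) activity map for one task
    networks   : (n_voxels, n_networks) canonical network topographies
    Returns each network's relative (non-negative) contribution."""
    weights, _ = nnls(networks, task_map)
    total = weights.sum()
    return weights / total if total > 0 else weights
```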