
    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of the recent work has focused on understanding the computational complexity of proposed methods for "Big Data"; sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last applies to exa-scale data dimensions. We illustrate this high dimensional framework on the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
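    As a rough illustration of the sample-starved regime described above, the sketch below screens a sample correlation matrix computed from far fewer replicates than variables. The choices of n, p, and the correlation threshold are hypothetical; this is a toy setup showing why thresholds must be calibrated when p grows with n fixed, not the paper's framework.

```python
# Correlation screening in the sample-starved regime (n fixed, p large).
# The threshold rule is a simple illustration, not the paper's exact
# phase-transition threshold.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2000                      # few samples, many variables
X = rng.standard_normal((n, p))      # null data: no true correlations

# Sample correlation matrix of the p variables from only n replicates.
R = np.corrcoef(X, rowvar=False)

# Screen for variable pairs whose correlation magnitude exceeds a threshold.
# With n fixed and p growing, spuriously large correlations appear unless the
# threshold is pushed toward 1 -- the sample-complexity issue the paper studies.
rho = 0.6
iu = np.triu_indices(p, k=1)
hits = np.count_nonzero(np.abs(R[iu]) > rho)
print(f"pairs with |correlation| > {rho}: {hits} (all spurious here)")
```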

    Differential Privacy Applications to Bayesian and Linear Mixed Model Estimation

    We consider a particular maximum likelihood estimator (MLE) and a computationally intensive Bayesian method for differentially private estimation of the linear mixed-effects model (LMM) with normal random errors. The LMM is important because it is used in small area estimation and detailed industry tabulations that present significant challenges for confidentiality protection of the underlying data. The differentially private MLE performs well compared to the regular MLE, and deteriorates as the protection increases, for a problem in which the small-area variation is at the county level. More dimensions of random effects are needed to adequately represent the time dimension of the data, and for these cases the differentially private MLE cannot be computed. The direct Bayesian approach for the same model uses an informative, but reasonably diffuse, prior to compute the posterior predictive distribution for the random effects. The differential privacy of this approach is estimated by direct computation of the relevant odds ratios after deleting influential observations according to various criteria.
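    To illustrate the general privacy/accuracy trade-off the abstract describes, here is a minimal output-perturbation sketch using Laplace noise. The sensitivity bound, privacy budgets, and point estimates are hypothetical, and this is not the paper's differentially private MLE for the LMM.

```python
# Generic output perturbation: add Laplace noise, calibrated by an assumed
# sensitivity bound and privacy budget epsilon, to a point estimate.
import numpy as np

def dp_release(estimate, sensitivity, epsilon, rng):
    """Release `estimate` with Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return estimate + rng.laplace(loc=0.0, scale=scale, size=np.shape(estimate))

rng = np.random.default_rng(1)
beta_hat = np.array([2.0, -0.5, 1.3])   # e.g. fixed-effect estimates (illustrative)
sensitivity = 0.1                       # assumed bound on one record's influence

for eps in (10.0, 1.0, 0.1):            # protection strengthens as epsilon shrinks
    noisy = dp_release(beta_hat, sensitivity, eps, rng)
    print(f"epsilon={eps:5.1f}  released estimate: {np.round(noisy, 3)}")
```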

    Space-Time Sampling for Network Observability

    Designing sparse sampling strategies is an important component of resilient estimation and control in networked systems: reduced sampling requirements make network design more cost-effective and less fragile to where and when samples are collected. We show under what conditions a coarse set of samples from a network contains the same amount of information as a finer one. Our goal is to estimate the initial condition of linear time-invariant networks from a set of noisy measurements. The observability condition is reformulated as a frame condition, in which one can easily trace the location and time stamp of each sample. We compare the estimation quality of various sampling strategies using estimation measures that depend on the spectrum of the corresponding frame operators. Using properties of the minimal polynomial of the state matrix, deterministic and randomized methods are suggested to construct observability frames. Intrinsic tradeoffs assert that collecting samples from fewer subsystems dictates taking more samples (on average) per subsystem. Three scalable algorithms are developed to generate sparse space-time sampling strategies with explicit error bounds.
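    A minimal sketch of the frame view of observability mentioned above: each space-time sample contributes one row of a matrix whose rank and spectrum determine whether, and how well, the initial condition can be recovered. The network, sampling pattern, and sizes below are illustrative assumptions, not the paper's construction.

```python
# Checking the frame (observability) condition for a space-time sampling
# pattern on a linear time-invariant network x(t+1) = A x(t).
import numpy as np

rng = np.random.default_rng(2)
N = 6                                                # number of scalar subsystems
A = np.eye(N) + 0.1 * rng.standard_normal((N, N))    # network state matrix

# Space-time samples: (node i, time t) pairs; each sample observes e_i^T A^t x0.
samples = [(0, 0), (2, 1), (4, 2), (1, 3), (3, 4), (5, 5)]

# Stack the corresponding rows into the frame / observability matrix.
F = np.vstack([np.linalg.matrix_power(A, t)[i, :] for i, t in samples])

# The pattern recovers x0 iff F has full column rank; the spectrum of the
# frame operator F^T F governs estimation quality under measurement noise.
gram_eigs = np.linalg.eigvalsh(F.T @ F)
print("rank:", np.linalg.matrix_rank(F), "of", N)
print("frame operator spectrum:", np.round(gram_eigs, 4))
```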

    Effective Genetic Risk Prediction Using Mixed Models

    To date, efforts to produce high-quality polygenic risk scores from genome-wide studies of common disease have focused on estimating and aggregating the effects of multiple SNPs. Here we propose a novel statistical approach for genetic risk prediction, based on random and mixed effects models. Our approach (termed GeRSI) circumvents the need to estimate the effect sizes of numerous SNPs by treating these effects as random, producing predictions that are consistently superior to the current state of the art, as we demonstrate in extensive simulations. When applying GeRSI to seven phenotypes from the WTCCC study, we confirm that the use of random effects is most beneficial for diseases that are known to be highly polygenic: hypertension (HT) and bipolar disorder (BD). For HT, there are no significant associations in the WTCCC data; the best existing model yields an AUC of 54%, while GeRSI improves it to 59%. For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked in the top 10% of BD risk predictions, using GeRSI substantially increases the BD relative risk from 1.4 to 2.5.
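    The following sketch illustrates the general idea of random-effects (BLUP-style) risk prediction from a genetic relationship matrix on simulated genotypes. The simulation parameters and the ridge-type predictor are assumptions for illustration only; they do not reproduce the GeRSI procedure or the WTCCC results.

```python
# BLUP-style risk prediction: treat SNP effects as random and predict via the
# genetic relationship matrix instead of estimating each effect separately.
import numpy as np

rng = np.random.default_rng(3)
n, p, h2 = 400, 2000, 0.5                      # samples, SNPs, heritability (assumed)
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
G = (G - G.mean(0)) / G.std(0)                 # standardized genotypes
beta = rng.normal(0, np.sqrt(h2 / p), p)       # small random SNP effects
liability = G @ beta + rng.normal(0, np.sqrt(1 - h2), n)
y = (liability > np.quantile(liability, 0.5)).astype(float)   # case/control labels

K = G @ G.T / p                                # genetic relationship matrix
train, test = np.arange(300), np.arange(300, n)

# Kernel ridge / BLUP prediction of held-out genetic values from training labels.
lam = (1 - h2) / h2
alpha = np.linalg.solve(K[np.ix_(train, train)] + lam * np.eye(len(train)),
                        y[train] - y[train].mean())
score = K[np.ix_(test, train)] @ alpha

# AUC of the risk score on held-out individuals (rank-based estimate).
pos, neg = score[y[test] == 1], score[y[test] == 0]
auc = (pos[:, None] > neg[None, :]).mean()
print(f"held-out AUC: {auc:.3f}")
```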