
    Sharp minimax tests for large covariance matrices and adaptation

    We consider the problem of detecting correlations in a $p$-dimensional Gaussian vector when we observe $n$ independent, identically distributed random vectors, for $n$ and $p$ large. We assume that the covariance matrix varies in some ellipsoid with parameter $\alpha > 1/2$ and total energy bounded by $L > 0$. We propose a test procedure based on a U-statistic of order 2 which is weighted in an optimal way. The weights are the solution of an optimization problem; they are constant on each diagonal and non-null only for the first $T$ diagonals, where $T = o(p)$. We show that this test statistic is asymptotically Gaussian distributed under the null hypothesis, and also under the alternative hypothesis for matrices close to the detection boundary. We prove upper bounds for the total error probability of our test procedure, for $\alpha > 1/2$ and under the assumption $T = o(p)$, which implies that $n = o(p^{2\alpha})$. We illustrate the behavior of our test procedure via a numerical study. Moreover, we prove lower bounds for the maximal type II error and the total error probabilities. Thus we obtain the asymptotic and the sharp asymptotically minimax separation rate $\tilde{\varphi} = (C(\alpha, L)\, n^2 p)^{-\alpha/(4\alpha+1)}$, for $\alpha > 3/2$ and for $\alpha > 1$ together with the additional assumption $p = o(n^{4\alpha-1})$, respectively. We deduce rate-asymptotic minimax results for testing the inverse of the covariance matrix. We construct a test procedure that is adaptive with respect to the parameter $\alpha$ and show that it attains the rate $\tilde{\psi} = \left( n^2 p / \ln\ln(n\sqrt{p}) \right)^{-\alpha/(4\alpha+1)}$.
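    The core object here is a second-order U-statistic weighted along the diagonals of the covariance. As a rough illustration (not the paper's estimator: the optimal weights solve an optimization problem not reproduced here, and the helper names are ours), the sketch below computes, for each lag $d \le T$, the unbiased U-statistic estimate of $\sum_i \sigma_{i,i+d}^2$ for centered data, then combines the lags with user-supplied diagonal weights.

```python
import numpy as np

def diagonal_u_statistics(X, T):
    """For each lag d = 1..T, the order-2 U-statistic
    (1/(n(n-1))) * sum_{k != l} sum_i X[k,i] X[k,i+d] X[l,i] X[l,i+d],
    an unbiased estimate of sum_i sigma_{i,i+d}^2 for centered data X (n x p)."""
    n, p = X.shape
    u = np.empty(T)
    for d in range(1, T + 1):
        Y = X[:, : p - d] * X[:, d:]     # Y[k, i] = X[k, i] * X[k, i + d]
        s = Y.sum(axis=0)                # sum over the n samples, per column i
        # sum_{k != l} Y[k, i] Y[l, i] = (sum_k Y[k, i])^2 - sum_k Y[k, i]^2
        u[d - 1] = ((s ** 2).sum() - (Y ** 2).sum()) / (n * (n - 1))
    return u

def test_statistic(X, weights):
    """Weighted combination over the first T = len(weights) diagonals;
    `weights` stands in for the paper's optimal diagonal weights."""
    return np.dot(weights, diagonal_u_statistics(X, len(weights)))
```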

    Nonparametric estimation in models for unobservable heterogeneity

    Nonparametric models that allow for data with unobservable heterogeneity are studied. The first publication introduces new estimators for conditional mixture models and derives their asymptotic properties. The second publication considers estimation of a function from noisy observations of its Radon transform in a Gaussian white noise model.

    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far smaller than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of methods proposed for "Big Data"; sample complexity, however, has received relatively less attention, especially in the setting where the sample size $n$ is fixed and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high-dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last applies to exa-scale data dimensions. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
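    To make the sample-starved regime concrete, here is a small simulation (ours, not the paper's): with the sample size $n$ held fixed, the largest spurious pairwise sample correlation among $p$ independent variables climbs toward 1 as $p$ grows, which is exactly why fixed-$n$, large-$p$ correlation mining needs its own theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_correlation(n, p, trials=50):
    """Average (over trials) of the largest absolute off-diagonal sample
    correlation among p *independent* Gaussian variables, n samples each."""
    vals = []
    for _ in range(trials):
        R = np.corrcoef(rng.standard_normal((n, p)), rowvar=False)
        np.fill_diagonal(R, 0.0)               # ignore the trivial diagonal
        vals.append(np.abs(R).max())
    return float(np.mean(vals))

# n fixed at 10 while p grows: spurious correlations approach 1.
for p in (10, 100, 1000):
    print(f"p = {p:5d}  mean max |corr| = {mean_max_correlation(10, p):.3f}")
```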

    Compressed Sensing Beyond the IID and Static Domains: Theory, Algorithms and Applications

    Sparsity is a ubiquitous feature of many real-world signals, such as natural images and neural spiking activity. Conventional compressed sensing exploits sparsity to recover low-dimensional signal structures in high ambient dimensions from few measurements, provided i.i.d. measurements are available. Real-world scenarios, however, typically exhibit non-i.i.d., dynamic structures and are confined by physical constraints, which prevents the theoretical guarantees of compressed sensing from applying and limits its applications. In this thesis we develop new theory, algorithms, and applications for non-i.i.d. and dynamic compressed sensing under such constraints. In the first part of this thesis we derive new optimal sampling-complexity tradeoffs for two processes commonly used to model dependent temporal structures: autoregressive processes and self-exciting generalized linear models. Using these results, we successfully recovered the temporal dependencies in neural activity, financial data, and traffic data. Next, we develop a new framework for studying temporal dynamics by introducing compressible state-space models, which simultaneously exploit spatial and temporal sparsity. We develop a fast algorithm for optimal inference on such models and prove its optimal recovery guarantees. Our algorithm shows significant improvement in detecting sparse events in biological applications such as spindle detection and calcium deconvolution. Finally, we develop a sparse Poisson image-reconstruction technique and the first compressive two-photon microscope, which uses lines of excitation across the sample at multiple angles. We recovered diffraction-limited images from relatively few incoherently multiplexed measurements, at a rate of 1.5 billion voxels per second.
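    For a flavor of the first part, here is a minimal sketch (our illustration, with made-up sizes and regularization, not the thesis's algorithm): the coefficient vector of a sparse AR(k) model is recovered from a single dependent trajectory by l1-regularized least squares.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

k, T = 64, 500                                 # ambient AR order, series length
theta = np.zeros(k)
support = rng.choice(k, size=3, replace=False)
theta[support] = 0.25                          # sparse and stable (sum |theta| < 1)

x = np.zeros(T + k)
for t in range(k, T + k):                      # simulate the AR(k) process
    x[t] = theta @ x[t - k : t][::-1] + 0.5 * rng.standard_normal()

# Regression form: predict x_t from its k lagged values.
X = np.column_stack([x[k - 1 - j : T + k - 1 - j] for j in range(k)])
y = x[k:]
theta_hat = Lasso(alpha=0.02, fit_intercept=False).fit(X, y).coef_

# True support vs. support recovered above a small threshold.
print(sorted(support), np.flatnonzero(np.abs(theta_hat) > 0.1))
```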

    Problems In High-Dimensional Statistics And Applications In Genomics, Metabolomics And Microbiomics

    With rapid technological advances in data collection and processing, massive, complex datasets are now widely available in diverse research fields such as genomics, metabolomics, and microbiomics. The analysis of large datasets with complex structures poses significant challenges and calls for new theory and methodology. In this dissertation, we address several high-dimensional statistical problems and develop novel statistical theory and methods for analyzing datasets generated by such data-driven interdisciplinary research. In the first part of the dissertation (Chapters 1 and 2), motivated by the ubiquitous availability of high-dimensional datasets with binary outcomes and the need for powerful methods to analyze them, we develop novel bias-correction techniques for inferring low-dimensional components or functionals of high-dimensional objects, and propose computationally efficient procedures for parameter estimation, global and simultaneous hypothesis testing, and confidence intervals in high-dimensional logistic regression. The theoretical properties of the proposed methods, including their minimax optimality, are carefully studied. We show empirically the effectiveness and stability of our methods in extracting useful information from high-dimensional noisy datasets. By applying our methods to a real metabolomic dataset, we unveil the associations between fecal metabolites and pediatric Crohn’s disease as well as the effects of dietary treatment on those associations (Chapter 1); by analyzing a real genetic dataset, we obtain novel insights about the shared genetic architecture of ten pediatric autoimmune diseases (Chapter 2). In the second part of the dissertation (Chapters 3 and 4), motivated by important questions in large-scale human microbiome and metagenomic research, as well as other applications, we propose a novel permuted monotone matrix model and develop new principles, theory, and methods for inferring the underlying model parameters. In particular, we focus on two interrelated problems: optimal permutation recovery from noisy observations (Chapter 3) and extreme-value estimation in permuted low-rank monotone matrices (Chapter 4), and propose an efficient spectral approach to both. The proposed methods are rigorously justified by statistical theory, including convergence rates and minimax optimality. Numerical experiments on simulated and synthetic microbiome metagenomic data demonstrate the superiority of the proposed methods over alternatives. The methods are applied to two real datasets to compare the growth rates of gut bacteria between inflammatory bowel disease patients and normal controls.
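    The spectral idea behind the permutation-recovery part can be sketched in a few lines (an illustration under our own toy model, not the dissertation's exact estimator): the rows of a matrix with monotone columns are shuffled and observed with noise, and sorting the rows by their coordinates in the leading left singular vector recovers the hidden order.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 50, 30
base = np.sort(rng.uniform(0.0, 3.0, size=(n, p)), axis=0)  # monotone columns
perm = rng.permutation(n)                                    # hidden row shuffle
Y = base[perm] + 0.1 * rng.standard_normal((n, p))           # noisy observation

u = np.linalg.svd(Y, full_matrices=False)[0][:, 0]  # leading left singular vector
if u @ Y.sum(axis=1) < 0:                           # resolve the sign ambiguity
    u = -u
ranks_hat = np.argsort(np.argsort(u))               # estimated rank of each row

# Spearman-type agreement between estimated and true row ranks (near 1).
print(np.corrcoef(ranks_hat, perm)[0, 1])
```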