Sharp minimax tests for large covariance matrices and adaptation
We consider the problem of detecting correlations in a high-dimensional Gaussian vector from a large number of independent, identically distributed observations. We assume that the covariance matrix varies in an ellipsoid of given smoothness with bounded total energy. We propose a test procedure based on a U-statistic of order 2 which is weighted in an optimal way. The weights are the solution of an optimization problem; they are constant on each diagonal and non-null only for a small initial set of diagonals. We show that this test statistic is asymptotically Gaussian distributed under the null hypothesis, and also under the alternative hypothesis for matrices close to the detection boundary. We prove upper bounds on the total error probability of our test procedure under mild growth conditions relating the dimension and the sample size, and we illustrate the behavior of the procedure in a numerical study. Moreover, we prove lower bounds on the maximal type II error and on the total error probability. We thus obtain the asymptotically minimax separation rate and, under an additional assumption, its sharp asymptotic form. We deduce asymptotic minimax rate results for testing the inverse of the covariance matrix. Finally, we construct a test procedure that is adaptive with respect to the ellipsoid parameter and determine the rate that it attains.
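The diagonal-wise weighting described above can be sketched in code. The following is a minimal illustrative implementation, not the authors' procedure: the function name, input shapes, and the exact form of the statistic are assumptions made for exposition.

```python
import numpy as np

def weighted_u_statistic(X, weights):
    """Order-2 U-statistic for correlation detection (illustrative sketch).

    X       : (n, p) array of i.i.d. centered observations.
    weights : length-T array; weights[t-1] multiplies diagonal t of the
              covariance matrix, and weights beyond the first T diagonals
              are implicitly zero, as in the abstract.

    Returns an unbiased estimate of sum_t w_t * sum_a sigma_{a,a+t}^2,
    averaging over all pairs of distinct observations.
    """
    n, p = X.shape
    total = 0.0
    for t, w in enumerate(weights, start=1):
        # per-observation unbiased estimates of diagonal t: X_i[a] * X_i[a+t]
        D = X[:, :p - t] * X[:, t:]              # shape (n, p - t)
        s = D.sum(axis=0)                        # sum over observations
        # sum over pairs i != j of <D_i, D_j> = |sum_i D_i|^2 - sum_i |D_i|^2
        pair_sum = s @ s - np.einsum('ij,ij->', D, D)
        total += w * pair_sum / (n * (n - 1))
    return total
```

Under the null hypothesis of an identity covariance, the off-diagonal products have mean zero, so the statistic concentrates around zero; large values indicate correlation.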
Nonparametric estimation in models for unobservable heterogeneity
Nonparametric models that allow for data with unobservable heterogeneity are studied. The first publication introduces new estimators for conditional mixture models and establishes their asymptotic properties. The second publication considers estimation of a function from noisy observations of its Radon transform in a Gaussian white noise model.
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity
however has received far less attention, especially in the setting where
the sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche but only the latter regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where the object of interest is the matrix of pairwise
and partial correlations among the variables. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
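In the sample-starved regime, correlation mining reduces to screening a p × p sample correlation matrix, built from only n ≪ p replicates, for unusually large entries. A minimal sketch (the function name and threshold are illustrative, not from the paper):

```python
import numpy as np

def correlation_screen(X, rho):
    """Screen for variable pairs whose sample correlation exceeds rho.

    X : (n, p) data matrix with n samples and p variables, n << p.
    Returns a list of (i, j, r_ij) for i < j with |r_ij| >= rho.
    """
    n, p = X.shape
    Z = X - X.mean(axis=0)                # center each variable
    norms = np.linalg.norm(Z, axis=0)
    norms[norms == 0] = 1.0               # guard against constant columns
    U = Z / norms                         # unit-norm columns
    R = U.T @ U                           # sample correlation matrix (p x p)
    hits = []
    for i in range(p):
        for j in range(i + 1, p):
            if abs(R[i, j]) >= rho:
                hits.append((i, j, R[i, j]))
    return hits
```

The high-dimensional theory discussed in the abstract governs how rho must scale with n and p so that such discoveries are not spurious.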
Compressed Sensing Beyond the IID and Static Domains: Theory, Algorithms and Applications
Sparsity is a ubiquitous feature of many real-world signals, such as natural images and neural spiking activity. Conventional compressed sensing exploits sparsity to recover low-dimensional signal structures in high ambient dimensions from few measurements, provided that i.i.d. measurements are available. Real-world scenarios, however, typically exhibit non-i.i.d. and dynamic structures and are confined by physical constraints, which prevents the theoretical guarantees of compressed sensing from applying and limits its applications. In this thesis we develop new theory, algorithms, and applications for non-i.i.d. and dynamic compressed sensing under such constraints.
In the first part of this thesis we derive new optimal sampling-complexity tradeoffs for two processes commonly used to model dependent temporal structure: autoregressive processes and self-exciting generalized linear models. Applying these results, we successfully recovered temporal dependencies in neural activity, financial data, and traffic data.
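For the autoregressive case, the flavor of sparse recovery with dependent samples can be sketched as an ℓ1-penalized least-squares fit of AR coefficients. The function below is an illustrative assumption, not the thesis's estimator; it uses plain ISTA (proximal gradient descent):

```python
import numpy as np

def sparse_ar_fit(x, order, lam, n_iter=500):
    """Fit sparse AR(order) coefficients by l1-penalized least squares.

    Model assumption: x[t] ~ sum_k theta[k] * x[t - 1 - k] + noise.
    Solved with ISTA; lam is the l1 penalty weight.
    """
    T = len(x)
    # lagged design matrix: the row for time t holds (x[t-1], ..., x[t-order])
    A = np.column_stack([x[order - 1 - k: T - 1 - k] for k in range(order)])
    b = x[order:]
    L = max(np.linalg.norm(A, 2) ** 2, 1e-12)   # Lipschitz constant of gradient
    theta = np.zeros(order)
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - b)
        z = theta - grad / L
        # soft-thresholding: the proximal operator of the l1 penalty
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return theta
```

The point of the thesis's theory is that, despite the dependence between rows of the lagged design matrix, such estimators still enjoy sampling-complexity guarantees.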
Next, we develop a new framework for studying temporal dynamics by introducing compressible state-space models, which simultaneously utilize spatial and temporal sparsity. We develop a fast algorithm for optimal inference on such models and prove its optimal recovery guarantees. Our algorithm shows significant improvement in detecting sparse events in biological applications such as spindle detection and calcium deconvolution.
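The idea of combining temporal structure with sparsity can be illustrated on a toy scalar state-space model with sparse innovations. This is a simplification for exposition, not the thesis's model or its fast algorithm:

```python
import numpy as np

def sparse_innovation_smoother(y, lam, n_iter=1000):
    """Toy inference sketch for a compressible state-space model.

    Assumed model: x_t = x_{t-1} + w_t with sparse innovations w_t,
    and noisy observations y_t of x_t.  Estimates w by ISTA on
        min_w  0.5 * ||y - cumsum(w)||^2 + lam * ||w||_1
    and returns the smoothed state x = cumsum(w).
    """
    T = len(y)
    A = np.tril(np.ones((T, T)))       # cumulative-sum operator
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    w = np.zeros(T)
    for _ in range(n_iter):
        grad = A.T @ (A @ w - y)
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return np.cumsum(w)
```

Penalizing the innovations rather than the states is what captures "sparse events" such as abrupt jumps in an otherwise flat trajectory.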
Finally, we develop a sparse Poisson image reconstruction technique and the first compressive two-photon microscope, which uses lines of excitation across the sample at multiple angles. We recovered diffraction-limited images from relatively few incoherently multiplexed measurements, at a rate of 1.5 billion voxels per second.
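For intuition, reconstruction from multiplexed Poisson measurements can be sketched with the classical multiplicative EM (Richardson–Lucy) update for y ~ Poisson(Ax). The sparsity penalty and the actual optics are omitted, and all names here are illustrative:

```python
import numpy as np

def poisson_em_reconstruct(A, y, n_iter=500):
    """Multiplicative EM update for the Poisson linear model y ~ Poisson(Ax).

    A : (m, p) nonnegative measurement matrix (e.g. multiplexed line scans).
    y : (m,) nonnegative photon counts.
    Returns a nonnegative estimate of x; each iteration multiplies x by the
    back-projected ratio of observed to predicted counts.
    """
    m, p = A.shape
    x = np.ones(p)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # guard against empty columns
    for _ in range(n_iter):
        pred = A @ x
        pred[pred == 0] = 1e-12          # avoid division by zero
        x = x * (A.T @ (y / pred)) / col_sums
    return x
```

The multiplicative form automatically preserves nonnegativity, which is essential for photon-count data.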
Problems In High-Dimensional Statistics And Applications In Genomics, Metabolomics And Microbiomics
With rapid technological advancements in data collection and processing, massive and complex large-scale datasets are now widely available in diverse research fields such as genomics, metabolomics and microbiomics. The analysis of large datasets with complex structures poses significant challenges and calls for new theory and methodology. In this dissertation, we address several high-dimensional statistical problems and develop novel statistical theory and methods for analyzing datasets generated from such data-driven interdisciplinary research.
In the first part of the dissertation (Chapter 1 and Chapter 2), motivated by the ubiquitous availability of high-dimensional datasets with binary outcomes and the need for powerful methods to analyze them, we develop novel bias-correction techniques for inferring low-dimensional components or functionals of high-dimensional objects, and propose computationally efficient procedures for parameter estimation, global and simultaneous hypothesis testing, and confidence intervals in high-dimensional logistic regression. The theoretical properties of the proposed methods, including their minimax optimality, are carefully studied. We show empirically the effectiveness and stability of our methods in extracting useful information from high-dimensional noisy datasets. By applying our methods to a real metabolomic dataset, we unveil the associations between fecal metabolites and pediatric Crohn’s disease, as well as the effects of dietary treatment on these associations (Chapter 1); by analyzing a real genetic dataset, we obtain novel insights into the shared genetic architecture of ten pediatric autoimmune diseases (Chapter 2). In the second part of the dissertation (Chapter 3 and Chapter 4), motivated by important questions in large-scale human microbiome and metagenomic research, as well as other applications, we propose a novel permuted monotone matrix model and establish new principles, theory and methods for inferring the underlying model parameters. In particular, we focus on two interrelated problems, namely optimal permutation recovery from noisy observations (Chapter 3) and extreme-value estimation in permuted low-rank monotone matrices (Chapter 4), and propose an efficient spectral approach to these problems. The proposed methods are rigorously justified by statistical theory, including their convergence rates and minimax optimality.
Numerical experiments on simulated and synthetic microbiome metagenomic data are presented to show the superiority of the proposed methods over the alternatives. The methods are applied to two real datasets to compare the growth rates of gut bacteria between inflammatory bowel disease patients and normal controls.
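The spectral approach to permutation recovery can be sketched as follows: project each row of the noisy matrix onto the leading right singular vector and sort the projections. This is an illustrative simplification; the dissertation's estimator and its guarantees are more refined.

```python
import numpy as np

def spectral_row_ordering(Y):
    """Estimate the row ordering of a permuted, approximately rank-one
    monotone matrix by a spectral projection.

    Y : (n, p) noisy observation of a row-permuted monotone matrix.
    Returns indices that sort the rows along the estimated monotone
    direction (up to a global reversal, due to the SVD sign ambiguity).
    """
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    scores = Y @ Vt[0]        # projection onto leading right singular vector
    return np.argsort(scores)
```

Because the leading singular vector is determined only up to sign, the recovered ordering may be globally reversed; in applications the direction is fixed by the known monotone trend.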