Block-diagonal covariance selection for high-dimensional Gaussian graphical models
Gaussian graphical models are widely utilized to infer and visualize networks
of dependencies between continuous variables. However, inferring the graph is
difficult when the sample size is small compared to the number of variables. To
reduce the number of parameters to estimate in the model, we propose a
non-asymptotic model selection procedure supported by strong theoretical
guarantees based on an oracle inequality and a minimax lower bound. The
covariance matrix of the model is approximated by a block-diagonal matrix. The
structure of this matrix is detected by thresholding the sample covariance
matrix, where the threshold is selected using the slope heuristic. Based on the
block-diagonal structure of the covariance matrix, the estimation problem is
divided into several independent problems: subsequently, the network of
dependencies between variables is inferred using the graphical lasso algorithm
in each block. The performance of the procedure is illustrated on simulated
data. An application to a real gene expression dataset with a limited sample
size is also presented: the dimension reduction allows attention to be
objectively focused on interactions among smaller subsets of genes, leading to
a more parsimonious and interpretable modular network. Comment: Accepted in JAS
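As a rough illustration of the block-detection step described in the abstract above, the sketch below thresholds the absolute sample covariance and reads the blocks off as connected components of the resulting graph. The function name detect_blocks and the fixed threshold argument are assumptions made for illustration; the abstract states that the threshold is selected with the slope heuristic, which is not reproduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def detect_blocks(X, threshold):
    """Group variables into blocks by thresholding the sample covariance.

    X is an (n_samples, n_variables) array. `threshold` is a user-supplied
    cutoff (hypothetical here; the paper chooses it via the slope heuristic).
    Returns a list of index arrays, one per detected block.
    """
    S = np.cov(X, rowvar=False)            # sample covariance, variables in columns
    adjacency = np.abs(S) > threshold      # keep only strong covariances as edges
    np.fill_diagonal(adjacency, True)      # every variable belongs to its own block
    n_blocks, labels = connected_components(
        csr_matrix(adjacency.astype(int)), directed=False
    )
    return [np.flatnonzero(labels == k) for k in range(n_blocks)]
```

Each returned index set could then be handed to a graphical lasso solver on its own, since under the block-diagonal approximation the blocks are treated as independent problems.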
Inverse Covariance Estimation for High-Dimensional Data in Linear Time and Space: Spectral Methods for Riccati and Sparse Models
We propose maximum likelihood estimation for learning Gaussian graphical
models with a Gaussian (ℓ_2^2) prior on the parameters. This is in contrast
to the commonly used Laplace (ℓ_1) prior for encouraging sparseness. We show
that our optimization problem leads to a Riccati matrix equation, which has a
closed form solution. We propose an efficient algorithm that performs a
singular value decomposition of the training data. Our algorithm is
O(NT^2)-time and O(NT)-space for N variables and T samples. Our method is
tailored to high-dimensional problems (N ≫ T), in which sparseness promoting
methods become intractable. Furthermore, instead of obtaining a single solution
for a specific regularization parameter, our algorithm finds the whole solution
path. We show that the method has logarithmic sample complexity under the
spiked covariance model. We also propose sparsification of the dense solution
with provable performance guarantees. We provide techniques for using our
learnt models, such as removing unimportant variables, computing likelihoods
and conditional distributions. Finally, we show promising results in several
gene expression datasets. Comment: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty
in Artificial Intelligence (UAI2013)
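To make the closed-form idea in the abstract above concrete, here is a minimal sketch under an assumed objective: maximize log det Ω − tr(ΣΩ) − (ρ/2)‖Ω‖_F^2 over the precision matrix Ω, with Σ the sample covariance. The abstract does not spell out the objective, so this form is an assumption. Its stationarity condition is the Riccati-type equation ρΩ^2 + ΣΩ − I = 0, which can be solved eigenvalue by eigenvalue from an SVD of the centered data. The function names below are hypothetical, and this is not claimed to be the authors' exact algorithm.

```python
import numpy as np


def riccati_precision_factors(X, rho):
    """Low-rank representation of the solution of rho*O@O + S@O - I = 0.

    X is a (T, N) data matrix (T samples, N variables) and S is its sample
    covariance. Returns (omega0, V, delta) such that
        Omega = omega0 * I + V @ diag(delta) @ V.T,
    so only the N x T matrix V is stored, never a dense N x N matrix.
    """
    T, _ = X.shape
    Xc = X - X.mean(axis=0)                        # center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    d = s**2 / T                                   # nonzero eigenvalues of S
    # Positive root of rho*w**2 + d*w - 1 = 0 for each eigenvalue d.
    omega = (np.sqrt(d**2 + 4.0 * rho) - d) / (2.0 * rho)
    omega0 = 1.0 / np.sqrt(rho)                    # root for the zero eigenvalues
    return omega0, Vt.T, omega - omega0


def apply_precision(omega0, V, delta, v):
    """Compute Omega @ v from the factors without forming Omega."""
    return omega0 * v + V @ (delta * (V.T @ v))
```

Keeping Ω in factored form is what makes the storage proportional to N·T in this sketch; a dense N x N matrix is only needed if it is formed explicitly, as one might do for checking small cases.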
TIGER: A Tuning-Insensitive Approach for Optimally Estimating Gaussian Graphical Models
We propose a new procedure for estimating high dimensional Gaussian graphical
models. Our approach is asymptotically tuning-free and non-asymptotically
tuning-insensitive: it requires very little effort to choose the tuning parameter
in finite sample settings. Computationally, our procedure is significantly
faster than existing methods due to its tuning-insensitive property.
Theoretically, the obtained estimator is simultaneously minimax optimal for
precision matrix estimation under different norms. Empirically, we illustrate
the advantages of our method using thorough simulated and real examples. The R
package bigmatrix implementing the proposed methods is available on the
Comprehensive R Archive Network: http://cran.r-project.org/
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity,
however, has received relatively little attention, especially in the setting where
the sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last regime applies to exa-scale data
dimensions. We illustrate this high dimensional framework for the problem of
correlation mining, where the matrices of pairwise and partial correlations
among the variables are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.