Restricted maximum likelihood estimation of covariances in sparse linear models
This paper discusses the restricted maximum likelihood (REML) approach for the estimation of covariance matrices in linear stochastic models, as implemented in the current version of the VCE package for covariance component estimation in large animal breeding models. The main features are: 1) the representation of the equations in an augmented form that simplifies the implementation; 2) the parametrization of the covariance matrices by means of their Cholesky factors, thus automatically ensuring their positive definiteness; 3) explicit formulas for the gradients of the REML function for the case of large and sparse model equations with a large number of unknown covariance components and possibly incomplete data, using the sparse inverse to obtain the gradients cheaply; 4) use of model equations that make separate formation of the inverse of the numerator relationship matrix unnecessary. Many large-scale breeding problems were solved with the new implementation, among them an example with more than 250 000 normal equations and 55 covariance components, taking 41 h CPU time on a Hewlett Packard 755.
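The Cholesky parametrization in feature 2) can be illustrated with a minimal sketch. This is not the VCE implementation; the function name `covariance_from_params` and the example values are hypothetical. It only shows that any parameter vector with nonzero diagonal entries maps to a symmetric positive definite matrix, so an optimizer can work on unconstrained parameters.

```python
import numpy as np

def covariance_from_params(theta, dim):
    """Map a parameter vector to a covariance matrix Sigma = L @ L.T.

    theta fills the lower triangle of L row by row (hypothetical layout).
    """
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = theta
    return L @ L.T

dim = 3
# Illustrative values; the diagonal entries of L are 1.0, 2.0 and 1.5.
theta = np.array([1.0, 0.5, 2.0, -0.3, 0.7, 1.5])
sigma = covariance_from_params(theta, dim)

# Sigma is symmetric, and its eigenvalues are positive because L is nonsingular.
eigenvalues = np.linalg.eigvalsh(sigma)
print(eigenvalues)
```

Because positive definiteness holds by construction, no explicit constraint has to be enforced during the likelihood maximization.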
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much of recent work has focused on understanding the
computational complexity of proposed methods for "Big Data." Sample complexity
however has received relatively less attention, especially in the setting when
the sample size is fixed, and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche but only the last regime applies to exa-scale data
dimension. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that are of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
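To see why the sample-starved regime is delicate, the following sketch (not taken from the paper) draws independent Gaussian variables with a fixed sample size and reports the largest absolute sample correlation as the number of variables grows. The sample size of 10, the dimensions, and the helper name `max_spurious_correlation` are illustrative assumptions.

```python
import numpy as np

def max_spurious_correlation(data):
    """Largest absolute off-diagonal sample correlation among the columns."""
    corr = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    return np.abs(corr).max()

rng = np.random.default_rng(0)
n_samples = 10  # fixed sample size (assumed for illustration)
data = rng.normal(size=(n_samples, 1000))  # all variables truly independent

# Nested subsets of variables: the maximum can only grow with the dimension.
max_corr_by_dim = {
    p: max_spurious_correlation(data[:, :p]) for p in (10, 100, 1000)
}
for p, value in max_corr_by_dim.items():
    print(p, round(float(value), 3))
```

Although every true correlation is zero, the largest sample correlation moves toward 1 as the dimension grows with the sample size held fixed, which is the kind of effect a sample-complexity analysis has to account for.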
Sparse Inverse Covariance Estimation for Chordal Structures
In this paper, we consider the Graphical Lasso (GL), a popular optimization
problem for learning the sparse representations of high-dimensional datasets,
which is well-known to be computationally expensive for large-scale problems.
Recently, we have shown that the sparsity pattern of the optimal solution of GL
is equivalent to the one obtained from simply thresholding the sample
covariance matrix, for sparse graphs under different conditions. We have also
derived a closed-form solution that is optimal when the thresholded sample
covariance matrix has an acyclic structure. As a major generalization of the
previous result, in this paper we derive a closed-form solution for the GL for
graphs with chordal structures. We show that the GL and thresholding
equivalence conditions can be significantly simplified and are expected to hold
for high-dimensional problems if the thresholded sample covariance matrix has a
chordal structure. We then show that the GL and thresholding equivalence is
enough to reduce the GL to a maximum determinant matrix completion problem and
derive a recursive closed-form solution for the GL when the thresholded sample
covariance matrix has a chordal structure. For large-scale problems with up to
450 million variables, the proposed method can solve the GL problem in less
than 2 minutes, while the state-of-the-art methods converge in more than 2
hours.
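The thresholding step mentioned in this abstract can be sketched as follows. This covers only the construction of the sparsity pattern from a sample covariance matrix; it does not implement the closed-form solution or check the equivalence conditions. The matrix values, the threshold `lam`, and the function name `thresholded_support` are hypothetical.

```python
import numpy as np

def thresholded_support(sample_cov, lam):
    """Boolean adjacency of off-diagonal entries with |S_ij| > lam."""
    support = np.abs(sample_cov) > lam
    np.fill_diagonal(support, False)
    return support

# Illustrative sample covariance matrix (assumed values).
sample_cov = np.array([
    [1.00, 0.60, 0.05, 0.02],
    [0.60, 1.00, 0.45, 0.03],
    [0.05, 0.45, 1.00, 0.50],
    [0.02, 0.03, 0.50, 1.00],
])

support = thresholded_support(sample_cov, lam=0.1)
# List each undirected edge once, using the upper triangle.
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(np.triu(support)))]
print(edges)
```

In this toy case the surviving edges form a path on four nodes, which is acyclic and therefore also chordal.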
Markov models for fMRI correlation structure: is brain functional connectivity small world, or decomposable into networks?
Correlations in the signal observed via functional Magnetic Resonance Imaging
(fMRI) are expected to reveal the interactions in the underlying neural
populations through hemodynamic response. In particular, they highlight
distributed sets of mutually correlated regions that correspond to brain
networks related to different cognitive functions. Yet graph-theoretical
studies of neural connections give a different picture: that of a highly
integrated system with small-world properties: local clustering but with short
pathways across the complete structure. We examine the conditional independence
properties of the fMRI signal, i.e. its Markov structure, to find realistic
assumptions on the connectivity structure that are required to explain the
observed functional connectivity. In particular we seek a decomposition of the
Markov structure into segregated functional networks using decomposable graphs:
a set of strongly-connected and partially overlapping cliques. We introduce a
new method to efficiently extract such cliques on a large, strongly-connected
graph. We compare methods learning different graph structures from functional
connectivity by testing the goodness of fit of the model they learn on new
data. We find that summarizing the structure as strongly-connected networks can
give a good description only for very large and overlapping networks. These
results highlight that Markov models are good tools to identify the structure
of brain connectivity from fMRI signals, but for this purpose they must reflect
the small-world properties of the underlying neural systems.
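The link between Markov structure and conditional independence can be illustrated with a small sketch, assuming Gaussian signals (an assumption made here for illustration, not a statement of the paper's method). Under that assumption, a zero entry in the inverse covariance matrix means the two variables are conditionally independent given the rest. The three-node chain and the function name `partial_correlations` are hypothetical.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlations from the inverse covariance (precision) matrix."""
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    partial = -precision / np.outer(scale, scale)
    np.fill_diagonal(partial, 1.0)
    return partial

# A three-node chain 0 - 1 - 2: nodes 0 and 2 interact only through node 1.
precision_true = np.array([
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
])
cov = np.linalg.inv(precision_true)
partial = partial_correlations(cov)

print(cov[0, 2])      # marginal covariance between nodes 0 and 2 is nonzero
print(partial[0, 2])  # partial correlation is numerically zero
```

Nodes 0 and 2 are marginally correlated yet conditionally independent given node 1, which is the distinction between raw correlations and the Markov structure.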
Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation
Across a variety of scientific disciplines, sparse inverse covariance
estimation is a popular tool for capturing the underlying dependency
relationships in multivariate data. Unfortunately, most estimators are not
scalable enough to handle the sizes of modern high-dimensional data sets (often
on the order of terabytes), and assume Gaussian samples. To address these
deficiencies, we introduce HP-CONCORD, a highly scalable optimization method
for estimating a sparse inverse covariance matrix based on a regularized
pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal
gradient method uses a novel communication-avoiding linear algebra algorithm
and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving
parallel scalability on problems with up to ~819 billion parameters (1.28
million dimensions); even on a single node, HP-CONCORD demonstrates
scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to
estimate the underlying dependency structure of the brain from fMRI data, and
use the result to identify functional regions automatically. The results show
good agreement with a clustering from the neuroscience literature.
Comment: Main paper: 15 pages, appendix: 24 pages
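A single proximal gradient step with an l1 penalty can be sketched as below. This is a generic single-machine illustration, not the HP-CONCORD algorithm: the gradient is supplied as a made-up array, the penalty is applied to every entry for simplicity, and the names `soft_threshold` and `proximal_gradient_step` are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient_step(omega, grad, step, lam):
    """Gradient step on the smooth loss, then the l1 proximal operator."""
    return soft_threshold(omega - step * grad, step * lam)

# Illustrative iterate and gradient of some smooth loss (assumed values).
omega = np.array([[1.0, 0.3], [0.3, 1.0]])
grad = np.array([[0.0, 0.2], [0.2, 0.0]])

updated = proximal_gradient_step(omega, grad, step=0.5, lam=0.5)
print(updated)
```

The off-diagonal entries fall below the threshold and are set exactly to zero, which is how this class of methods produces sparse estimates.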