High-dimensional Sparse Inverse Covariance Estimation using Greedy Methods
In this paper we consider the task of estimating the non-zero pattern of the
sparse inverse covariance matrix of a zero-mean Gaussian random vector from a
set of iid samples. Note that this is also equivalent to recovering the
underlying graph structure of a sparse Gaussian Markov Random Field (GMRF). We
present two novel greedy approaches to solving this problem. The first
estimates the non-zero covariates of the overall inverse covariance matrix
using a series of global forward and backward greedy steps. The second
estimates the neighborhood of each node in the graph separately, again using
greedy forward and backward steps, and combines the intermediate neighborhoods
to form an overall estimate. The principal contribution of this paper is a
rigorous analysis of the sparsistency, or consistency in recovering the
sparsity pattern of the inverse covariance matrix. Surprisingly, we show that
both the local and global greedy methods learn the full structure of the model
with high probability given just $O(d\log(p))$ samples, which is a
\emph{significant} improvement over state of the art $\ell_1$-regularized
Gaussian MLE (Graphical Lasso) that requires $O(d^2\log(p))$ samples. Moreover,
the restricted eigenvalue and smoothness conditions imposed by our greedy
methods are much weaker than the strong irrepresentable conditions required by
the $\ell_1$-regularization based methods. We corroborate our results with
extensive simulations and examples, comparing our local and global greedy
methods to the $\ell_1$-regularized Gaussian MLE as well as the Neighborhood
Greedy method to that of nodewise $\ell_1$-regularized linear regression
(Neighborhood Lasso).
Comment: Accepted to AI STAT 2012 for Oral Presentation
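The local (neighborhood) greedy idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the least-squares loss, the stopping threshold `eps`, the backward factor `nu`, the OR rule for combining neighborhoods, and the function names `greedy_neighborhood` and `greedy_graph` are all assumptions made for illustration.

```python
import numpy as np

def _rss(X, y, support):
    """Residual sum of squares from regressing y on the columns in `support`."""
    if not support:
        return float(y @ y)
    A = X[:, sorted(support)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def greedy_neighborhood(X, node, eps=0.05, nu=0.5, max_steps=50):
    """Estimate the neighbors of `node` with forward and backward greedy steps.

    eps: minimum loss decrease required to accept a forward step (assumed).
    nu:  a backward step is taken if it raises the loss by at most nu times
         the gain of the last forward step (assumed).
    """
    n, p = X.shape
    y = X[:, node]
    support = set()
    loss = _rss(X, y, support) / n
    for _ in range(max_steps):
        # Forward step: add the variable that lowers the loss the most.
        best_j, best_loss = None, loss
        for j in range(p):
            if j == node or j in support:
                continue
            cand = _rss(X, y, support | {j}) / n
            if cand < best_loss:
                best_j, best_loss = j, cand
        gain = loss - best_loss
        if best_j is None or gain < eps:
            break
        support.add(best_j)
        loss = best_loss
        # Backward steps: drop variables whose removal costs little.
        while len(support) > 1:
            worst_j, worst_loss = min(
                ((j, _rss(X, y, support - {j}) / n) for j in support),
                key=lambda t: t[1],
            )
            if worst_loss - loss > nu * gain:
                break
            support.remove(worst_j)
            loss = worst_loss
    return support

def greedy_graph(X, **kwargs):
    """Combine per-node neighborhoods into an edge set (OR rule, assumed)."""
    p = X.shape[1]
    edges = set()
    for i in range(p):
        for j in greedy_neighborhood(X, i, **kwargs):
            edges.add((min(i, j), max(i, j)))
    return edges
```

Calling `greedy_graph(X)` on an `n x p` array of zero-mean samples returns a set of index pairs `(i, j)` with `i < j`. The thresholds would need tuning to the scale of the data.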
Regression modeling on stratified data with the lasso
We consider the estimation of regression models on strata defined using a
categorical covariate, in order to identify interactions between this
categorical covariate and the other predictors. A basic approach requires the
choice of a reference stratum. We show that the performance of a penalized
version of this approach depends on this arbitrary choice. We propose a refined
approach that bypasses this arbitrary choice, at almost no additional
computational cost. Regarding model selection consistency, our proposal mimics
the strategy based on an optimal and covariate-specific choice for the
reference stratum. Results from an empirical study confirm that our proposal
generally outperforms the basic approach in the identification and description
of the interactions. An illustration on gene expression data is provided.
Comment: 23 pages, 5 figures
A sparse conditional Gaussian graphical model for analysis of genetical genomics data
Genetical genomics experiments have now been routinely conducted to measure
both the genetic markers and gene expression data on the same subjects. The
gene expression levels are often treated as quantitative traits and are subject
to standard genetic analysis in order to identify the gene expression
quantitative loci (eQTL). However, the genetic architecture for many gene
expressions may be complex, and poorly estimated genetic architecture may
compromise the inferences of the dependency structures of the genes at the
transcriptional level. In this paper we introduce a sparse conditional Gaussian
graphical model for studying the conditional independent relationships among a
set of gene expressions adjusting for possible genetic effects where the gene
expressions are modeled with seemingly unrelated regressions. We present an
efficient coordinate descent algorithm to obtain the penalized estimation of
both the regression coefficients and the sparse concentration matrix. The
corresponding graph can be used to determine the conditional independence among
a group of genes while adjusting for shared genetic effects. Simulation
experiments and asymptotic convergence rates and sparsistency are used to
justify our proposed methods. By sparsistency, we mean the property that all
parameters that are zero are actually estimated as zero with probability
tending to one. We apply our methods to the analysis of a yeast eQTL data set
and demonstrate that the conditional Gaussian graphical model leads to a more
interpretable gene network than a standard Gaussian graphical model based on
gene expression data alone.
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/11-AOAS494 by the
Institute of Mathematical Statistics (http://www.imstat.org)
On Graphical Models via Univariate Exponential Family Distributions
Undirected graphical models, or Markov networks, are a popular class of
statistical models, used in a wide variety of applications. Popular instances
of this class include Gaussian graphical models and Ising models. In many
settings, however, it might not be clear which subclass of graphical models to
use, particularly for non-Gaussian and non-categorical data. In this paper, we
consider a general sub-class of graphical models where the node-wise
conditional distributions arise from exponential families. This allows us to
derive multivariate graphical model distributions from univariate exponential
family distributions, such as the Poisson, negative binomial, and exponential
distributions. Our key contributions include a class of M-estimators to fit
these graphical model distributions; and rigorous statistical analysis showing
that these M-estimators recover the true graphical model structure exactly,
with high probability. We provide examples of genomic and proteomic networks
learned via instances of our class of graphical models derived from Poisson and
exponential distributions.
Comment: Journal of Machine Learning Research
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
A challenging problem in estimating high-dimensional graphical models is to
choose the regularization parameter in a data-dependent way. The standard
techniques include $K$-fold cross-validation ($K$-CV), Akaike information
criterion (AIC), and Bayesian information criterion (BIC). Though these methods
work well for low-dimensional problems, they are not suitable in high
dimensional settings. In this paper, we present StARS: a new stability-based
method for choosing the regularization parameter in high dimensional inference
for undirected graphs. The method has a clear interpretation: we use the least
amount of regularization that simultaneously makes a graph sparse and
replicable under random sampling. This interpretation requires essentially no
conditions. Under mild conditions, we show that StARS is partially sparsistent
in terms of graph estimation: i.e. with high probability, all the true edges
will be included in the selected model even when the graph size diverges with
the sample size. Empirically, the performance of StARS is compared with the
state-of-the-art model selection procedures, including $K$-CV, AIC, and BIC, on
both synthetic data and a real microarray dataset. StARS outperforms all these
competing procedures.
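The selection rule described in this abstract (use the least regularization that still gives a sparse, replicable graph under subsampling) can be sketched as follows. This is a hedged illustration, not the authors' code. The instability measure `2 * freq * (1 - freq)` averaged over node pairs, the cutoff `beta=0.05`, the default subsample size, and the names `stars_select` and `threshold_graph` are assumptions. `threshold_graph` is only a stand-in estimator that thresholds partial correlations; it is not the graphical lasso.

```python
import numpy as np

def threshold_graph(X, lam):
    """Stand-in graph estimator (assumed): keep an edge when the absolute
    sample partial correlation exceeds lam. Larger lam gives a sparser graph."""
    prec = np.linalg.pinv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)
    adj = np.abs(pcorr) > lam
    np.fill_diagonal(adj, False)
    return adj

def stars_select(X, lambdas, estimator=threshold_graph, n_subsamples=20,
                 beta=0.05, subsample_size=None, seed=0):
    """Pick the smallest regularization level whose edge instability across
    random subsamples stays at or below beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = subsample_size or min(n - 1, int(10 * np.sqrt(n)))
    # Order from most to least regularization.
    lambdas = np.sort(np.asarray(lambdas, dtype=float))[::-1]
    # Reuse the same subsamples for every regularization level.
    subsamples = [rng.choice(n, size=b, replace=False)
                  for _ in range(n_subsamples)]
    iu = np.triu_indices(p, k=1)
    instability = []
    for lam in lambdas:
        # Fraction of subsamples in which each candidate edge is selected.
        freq = np.mean([estimator(X[idx], lam)[iu] for idx in subsamples],
                       axis=0)
        instability.append(float(np.mean(2.0 * freq * (1.0 - freq))))
    # Make instability non-decreasing as regularization is relaxed.
    monotone = np.maximum.accumulate(instability)
    ok = np.flatnonzero(monotone <= beta)
    chosen = lambdas[ok[-1]] if ok.size else lambdas[0]
    return float(chosen), lambdas, monotone
```

The function returns the chosen level together with the sorted grid and the monotonized instability curve, so the curve can be inspected before trusting the choice. Any estimator with the signature `estimator(X, lam)` returning a boolean adjacency matrix could be passed in place of the stand-in.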