The Reduced PC-Algorithm: Improved Causal Structure Learning in Large Random Networks
We consider the task of estimating a high-dimensional directed acyclic graph,
given observations from a linear structural equation model with arbitrary noise
distribution. By exploiting properties of common random graphs, we develop a
new algorithm that requires conditioning only on small sets of variables. The
proposed algorithm, which is essentially a modified version of the
PC-Algorithm, offers significant gains in both computational complexity and
estimation accuracy. In particular, it results in more efficient and accurate
estimation in large networks containing hub nodes, which are common in
biological systems. We prove the consistency of the proposed algorithm, and
show that it also requires a less stringent faithfulness assumption than the
PC-Algorithm. Simulations in low and high-dimensional settings are used to
illustrate these findings. An application to gene expression data suggests that
the proposed algorithm can identify a greater number of clinically relevant
genes than current methods.
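The workhorse that the PC-Algorithm and its reduced variant apply repeatedly is a conditional independence test between two variables given a conditioning set. A minimal sketch of the standard Gaussian version, based on Fisher's z-transform of the sample partial correlation, is below; the function name and the exact test are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy import stats

def gauss_ci_test(data, i, j, S, alpha=0.01):
    """Test X_i independent of X_j given X_S via Fisher's z-transform of the
    sample partial correlation (assumes roughly Gaussian data).
    Returns True when independence is NOT rejected."""
    n = data.shape[0]
    sub = np.corrcoef(data[:, [i, j] + list(S)], rowvar=False)
    prec = np.linalg.pinv(sub)              # invert the correlation submatrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.9999999, 0.9999999)
    z = np.arctanh(r)                       # Fisher z-transform
    stat = np.sqrt(n - len(S) - 3) * abs(z)
    pval = 2 * (1 - stats.norm.cdf(stat))
    return pval > alpha
```

Restricting calls of this test to small conditioning sets S is exactly where the computational and statistical gains described above come from, since the test loses power as S grows.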
Structure Learning in Graphical Modeling
A graphical model is a statistical model that is associated with a graph whose
nodes correspond to variables of interest. The edges of the graph reflect
allowed conditional dependencies among the variables. Graphical models admit
computationally convenient factorization properties and have long been a
valuable tool for tractable modeling of multivariate distributions. More
recently, applications such as reconstructing gene regulatory networks from
gene expression data have driven major advances in structure learning, that is,
estimating the graph underlying a model. We review some of these advances and
discuss methods such as the graphical lasso and neighborhood selection for
undirected graphical models (or Markov random fields), and the PC algorithm and
score-based search methods for directed graphical models (or Bayesian
networks). We further review extensions that account for effects of latent
variables and heterogeneous data sources.
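For the undirected case, the graphical lasso mentioned above can be run in a few lines; scikit-learn's GraphicalLasso is used here as one readily available implementation (a sketch, not code from the review):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simulate a 3-node chain X1 - X2 - X3: the true precision matrix is
# tridiagonal, so X1 and X3 are conditionally independent given X2.
rng = np.random.default_rng(1)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
x3 = 0.8 * x2 + rng.normal(size=2000)
X = np.column_stack([x1, x2, x3])

# The l1 penalty (alpha) shrinks entries of the estimated precision
# matrix, suppressing the absent X1 - X3 edge.
prec = GraphicalLasso(alpha=0.1).fit(X).precision_
```

The zero pattern of the estimated precision matrix is read off as the edge set of the Markov random field, which is the structure-learning step the review discusses.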
High-dimensional consistency in score-based and hybrid structure learning
Main approaches for learning Bayesian networks can be classified as
constraint-based, score-based or hybrid methods. Although high-dimensional
consistency results are available for constraint-based methods like the PC
algorithm, such results have not been proved for score-based or hybrid methods,
and most hybrid methods have not even been shown to be consistent in the
classical setting where the number of variables remains fixed and the sample
size tends to infinity. In this paper, we show that consistency of hybrid
methods based on greedy equivalence search (GES) can be achieved in the
classical setting with adaptive restrictions on the search space that depend on
the current state of the algorithm. Moreover, we prove consistency of GES and
adaptively restricted GES (ARGES) in several sparse high-dimensional settings.
ARGES scales well to sparse graphs with thousands of variables and our
simulation study indicates that both GES and ARGES generally outperform the PC
algorithm.
Comment: 37 pages, 5 figures, 41-page supplement (available as an ancillary file)
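A score-based search such as GES greedily edits edges so as to improve a decomposable score. A minimal Gaussian BIC scorer, the usual score in this setting, can be sketched as follows (an illustration only, not the GES or ARGES implementation):

```python
import numpy as np

def gaussian_bic(data, parents):
    """Decomposable Gaussian BIC score used by score-based searches:
    for each node, the log-likelihood of its least-squares regression
    on its parents, minus a complexity penalty per parameter."""
    n, p = data.shape
    score = 0.0
    for j in range(p):
        pa = parents[j]
        resid = data[:, j]
        if pa:
            coef, *_ = np.linalg.lstsq(data[:, pa], data[:, j], rcond=None)
            resid = data[:, j] - data[:, pa] @ coef
        var = np.mean(resid ** 2)
        ll = -0.5 * n * (np.log(2 * np.pi * var) + 1)
        score += ll - 0.5 * np.log(n) * (len(pa) + 1)   # BIC penalty
    return score
```

Because the score decomposes over nodes, a greedy move (adding, removing, or reversing one edge) only requires rescoring the affected node, which is what makes searches like GES, and the adaptively restricted ARGES, scale to thousands of variables.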
Causal Structure Learning
Graphical models can represent a multivariate distribution in a convenient
and accessible form as a graph. Causal models can be viewed as a special class
of graphical models that not only represent the distribution of the observed
system but also the distributions under external interventions. They hence
enable predictions under hypothetical interventions, which is important for
decision making. The challenging task of learning causal models from data
always relies on some underlying assumptions. We discuss several recently
proposed structure learning algorithms and their assumptions, and compare their
empirical performance under various scenarios.
Comment: to appear in 'Annual Review of Statistics and Its Application', 30 pages
High dimensional sparse covariance estimation via directed acyclic graphs
We present a graph-based technique for estimating sparse covariance matrices
and their inverses from high-dimensional data. The method is based on learning
a directed acyclic graph (DAG) and estimating parameters of a multivariate
Gaussian distribution based on a DAG. For inferring the underlying DAG we use
the PC-algorithm and for estimating the DAG-based covariance matrix and its
inverse, we use a Cholesky decomposition approach which provides a positive
(semi-)definite sparse estimate. We present a consistency result in the
high-dimensional framework and we compare our method with the Glasso for
simulated and real data.
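The key step after the DAG is inferred can be sketched directly: regress each node on its parents and assemble the implied covariance, which is positive (semi-)definite by construction. The helper below is an illustrative reconstruction under a known topological order, not the paper's exact estimator:

```python
import numpy as np

def dag_covariance(data, parents):
    """Covariance implied by a fitted linear Gaussian DAG: per-node least
    squares on the parents, then Sigma = A @ diag(omega) @ A.T with
    A = inv(I - B.T). This mirrors a Cholesky-type factorization when the
    nodes are topologically ordered, so Sigma is positive (semi-)definite."""
    n, p = data.shape
    B = np.zeros((p, p))        # B[k, j] = weight of edge k -> j
    omega = np.zeros(p)         # residual (noise) variances
    for j in range(p):
        pa = parents[j]
        if pa:
            coef, *_ = np.linalg.lstsq(data[:, pa], data[:, j], rcond=None)
            B[pa, j] = coef
            omega[j] = np.mean((data[:, j] - data[:, pa] @ coef) ** 2)
        else:
            omega[j] = np.var(data[:, j])
    A = np.linalg.inv(np.eye(p) - B.T)
    return A @ np.diag(omega) @ A.T
```

Sparsity of the DAG carries over to sparsity of the Cholesky-type factor, which is what makes the resulting covariance and inverse-covariance estimates sparse.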
PC algorithm for Gaussian copula graphical models
The PC algorithm uses conditional independence tests for model selection in
graphical modeling with acyclic directed graphs. In Gaussian models, tests of
conditional independence are typically based on Pearson correlations, and
high-dimensional consistency results have been obtained for the PC algorithm in
this setting. We prove that high-dimensional consistency carries over to the
broader class of Gaussian copula or `nonparanormal' models when using
rank-based measures of correlation. For graphs with bounded degree, our result
is as strong as prior Gaussian results. In simulations, the `Rank PC' algorithm
works as well as the `Pearson PC' algorithm for normal data and considerably
better for non-normal Gaussian copula data, all the while incurring a
negligible increase of computation time. Simulations with contaminated data
show that rank correlations can also perform better than other robust estimates
considered in previous work when the underlying distribution does not belong to
the nonparanormal family.
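One standard rank-based correlation estimate for Gaussian copula data, Spearman's rho followed by a sine transform, can be sketched as below; this is offered as an illustration of the kind of rank correlation involved, not as the paper's exact procedure:

```python
import numpy as np
from scipy import stats

def nonparanormal_corr(x, y):
    """Rank-based correlation for Gaussian copula data: Spearman's rho
    followed by the sine transform r = 2*sin(pi*rho_s/6), which
    consistently recovers the latent Pearson correlation under the
    nonparanormal model, unaffected by monotone margin transforms."""
    rho_s, _ = stats.spearmanr(x, y)
    return 2 * np.sin(np.pi * rho_s / 6)
```

Because ranks are invariant to monotone transformations of each margin, plugging such estimates into the PC algorithm's partial-correlation tests is what extends the Gaussian consistency theory to the copula setting.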
Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm
We consider variable selection in high-dimensional linear models where the
number of covariates greatly exceeds the sample size. We introduce the new
concept of partial faithfulness and use it to infer associations between the
covariates and the response. Under partial faithfulness, we develop a
simplified version of the PC algorithm (Spirtes et al., 2000), the PC-simple
algorithm, which is computationally feasible even with thousands of covariates
and provides consistent variable selection under conditions on the random
design matrix that are of a different nature than coherence conditions for
penalty-based approaches like the Lasso. Simulations and application to real
data show that our method is competitive compared to penalty-based approaches.
We provide an efficient implementation of the algorithm in the R-package pcalg.
Comment: 20 pages, 3 figures
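The screening idea behind PC-simple can be sketched in a few lines: a covariate is dropped as soon as its partial correlation with the response, given some small subset of the other active covariates, is insignificant. The sketch below (in Python rather than the paper's R implementation in pcalg, and with illustrative helper names) conveys the structure:

```python
import numpy as np
from itertools import combinations
from scipy import stats

def partial_corr_pval(Z, i, j, S):
    """p-value for zero partial correlation of columns i, j given S,
    via Fisher's z-transform (Gaussian assumption)."""
    n = Z.shape[0]
    sub = np.corrcoef(Z[:, [i, j] + S], rowvar=False)
    prec = np.linalg.pinv(sub)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.9999999, 0.9999999)
    stat = np.sqrt(n - len(S) - 3) * abs(np.arctanh(r))
    return 2 * (1 - stats.norm.cdf(stat))

def pc_simple(X, y, alpha=0.01, max_order=1):
    """PC-simple-style screening sketch: iteratively remove covariates
    whose partial correlation with the response, given subsets of the
    remaining active covariates (up to max_order), is insignificant."""
    n, p = X.shape
    Z = np.column_stack([X, y])          # response is column p
    active = set(range(p))
    for order in range(max_order + 1):
        for j in sorted(active):
            others = [k for k in active if k != j]
            for S in combinations(others, order):
                if partial_corr_pval(Z, j, p, list(S)) > alpha:
                    active.discard(j)
                    break
    return sorted(active)
```

Because the active set shrinks as the conditioning order grows, only low-order partial correlations among surviving covariates are ever computed, which is what keeps the method feasible with thousands of covariates.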
Robust causal structure learning with some hidden variables
We introduce a new method to estimate the Markov equivalence class of a
directed acyclic graph (DAG) in the presence of hidden variables, in settings
where the underlying DAG among the observed variables is sparse, and there are
a few hidden variables that have a direct effect on many of the observed ones.
Building on the so-called low rank plus sparse framework, we suggest a
two-stage approach which first removes the effect of the hidden variables, and
then estimates the Markov equivalence class of the underlying DAG under the
assumption that there are no remaining hidden variables. This approach is
consistent in certain high-dimensional regimes and performs favourably when
compared to the state of the art, both in terms of graphical structure recovery
and total causal effect estimation.
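One simple instantiation of the first stage of such a low-rank-plus-sparse approach is to project out a few leading principal components, treating them as proxies for pervasive hidden confounders; this PCA-based sketch is an assumption-laden illustration of the idea, not the paper's estimator:

```python
import numpy as np

def remove_latent_effects(X, n_hidden):
    """Stage-one sketch: remove the top n_hidden principal directions of
    the centered data, on the premise that a few hidden variables with
    broad effects dominate the leading low-rank part of the covariance.
    Structure learning would then be run on the returned residuals."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s[:n_hidden] = 0.0                  # zero out the leading directions
    return U @ np.diag(s) @ Vt
```

After this deconfounding step, the second stage estimates the Markov equivalence class of the DAG on the residuals as if no hidden variables remained.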
Concave Penalized Estimation of Sparse Gaussian Bayesian Networks
We develop a penalized likelihood estimation framework to estimate the
structure of Gaussian Bayesian networks from observational data. In contrast to
recent methods which accelerate the learning problem by restricting the search
space, our main contribution is a fast algorithm for score-based structure
learning which does not restrict the search space in any way and works on
high-dimensional datasets with thousands of variables. Our use of concave
regularization, as opposed to the more popular ℓ0 (e.g. BIC) penalty, is
new. Moreover, we provide theoretical guarantees which generalize existing
asymptotic results when the underlying distribution is Gaussian. Most notably,
our framework does not require the existence of a so-called faithful DAG
representation, and as a result the theory must handle the inherent
nonidentifiability of the estimation problem in a novel way. Finally, as a
matter of independent interest, we provide a comprehensive comparison of our
approach to several standard structure learning methods using open-source
packages developed for the R language. Based on these experiments, we show that
our algorithm is significantly faster than other competing methods while
obtaining higher sensitivity with comparable false discovery rates for
high-dimensional data. In particular, the total runtime for our method to
generate a solution path of 20 estimates for DAGs with 8000 nodes is around one
hour.
Comment: 57 pages
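Concave penalties trade the constant bias of the l1 penalty for flatness at large coefficients. The minimax concave penalty (MCP) is one common concrete choice, shown here purely as an illustration, since the abstract does not name the exact penalty used:

```python
import numpy as np

def mcp(t, lam=1.0, gamma=2.0):
    """Minimax concave penalty (MCP), a standard concave regularizer:
    tapers the l1 penalty lam*|t| quadratically until it flattens to the
    constant gamma*lam**2/2 for |t| >= gamma*lam, so large coefficients
    incur no additional shrinkage."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)
```

Flatness at large values is what removes the systematic shrinkage bias of l1-type (lasso) penalties while keeping the selection behavior near zero.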
Estimating and Controlling the False Discovery Rate for the PC Algorithm Using Edge-Specific P-Values
The PC algorithm allows investigators to estimate a complete partially
directed acyclic graph (CPDAG) from a finite dataset, but few groups have
investigated strategies for estimating and controlling the false discovery rate
(FDR) of the edges in the CPDAG. In this paper, we introduce PC with p-values
(PC-p), a fast algorithm which robustly computes edge-specific p-values and
then estimates and controls the FDR across the edges. PC-p specifically uses
the p-values returned by many conditional independence tests to upper bound the
p-values of more complex edge-specific hypothesis tests. The algorithm then
estimates and controls the FDR using the bounded p-values and the
Benjamini-Yekutieli FDR procedure. Modifications to the original PC algorithm
also help PC-p accurately compute the upper bounds despite non-zero Type II
error rates. Experiments show that PC-p yields more accurate FDR estimation and
control across the edges in a variety of CPDAGs compared to alternative
methods.
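The Benjamini-Yekutieli step-up procedure used in the final stage is standard and short enough to sketch; it controls the FDR at level q under arbitrary dependence among the p-values, which matters here because edge-specific p-values from the same dataset are dependent:

```python
import numpy as np

def benjamini_yekutieli(pvals, q=0.05):
    """Benjamini-Yekutieli step-up procedure: FDR control at level q
    under arbitrary dependence. Returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    c_m = np.sum(1.0 / np.arange(1, m + 1))     # harmonic-sum correction
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest passing rank (step-up)
        reject[order[:k + 1]] = True
    return reject
```

Applied to the upper-bounded edge-specific p-values described above, the rejected hypotheses correspond to the edges retained in the estimated CPDAG.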