Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) and columns (features). By
employing p-values of conditional independence tests and meta-analysis
techniques, PFBP manages to rely only on computations local to a partition
while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size and linear scalability with respect to the number of
features and processing cores, while it dominates other competitive algorithms
in its class.
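The meta-analysis and early-decision machinery can be sketched as follows. This is a minimal illustration, not the paper's implementation: per-partition p-values are combined with Fisher's method (one standard meta-analysis choice), and Early Dropping appears as a simple alpha filter. All function names are hypothetical.

```python
import math

def chi2_sf_even_dof(x, k):
    """Survival function of a chi-squared variable with 2*k degrees of
    freedom; even degrees of freedom admit the closed form
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Fisher's method: under the null, -2 * sum(log p_i) is chi-squared
    with 2*len(p_values) dof, so local per-partition p-values combine
    into a single global p-value."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_dof(stat, len(p_values))

def early_drop(candidate_pvalues, alpha=0.05):
    """Early Dropping sketch: discard features whose combined p-value
    suggests conditional independence from the target given the
    currently selected set; only the rest survive to later iterations."""
    return {f: p for f, p in candidate_pvalues.items() if p <= alpha}
```

A single uninformative partition (p = 0.5) combines to 0.5, while two strongly significant partitions reinforce each other, which is why communication can be limited to exchanging local p-values.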
Bayesian Approach to Linear Bayesian Networks
This study proposes the first Bayesian approach for learning high-dimensional
linear Bayesian networks. The proposed approach iteratively estimates each
element of the topological ordering, proceeding from the back, together with
its parent set, using the inverse of a partial covariance matrix. The proposed method successfully
recovers the underlying structure when Bayesian regularization for the inverse
covariance matrix with unequal shrinkage is applied. Specifically, it shows
that sample sizes depending only on the number of nodes and the maximum degree
of the moralized graph are sufficient for the proposed algorithm to learn
linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error
distributions, respectively. The theoretical findings are supported by extensive
simulation studies and real data analysis. Furthermore, the proposed
method is demonstrated to outperform state-of-the-art frequentist approaches,
such as the BHLSM, LISTEN, and TD algorithms, on synthetic data.
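A rough frequentist sketch of the backward, precision-matrix-based strategy, under the simplifying assumption of equal error variances (where the terminal vertex minimizes the diagonal of the precision matrix, as in LISTEN-style estimators). The paper's Bayesian shrinkage estimator and its guarantees are not reproduced here; all names are hypothetical.

```python
import numpy as np

def backward_order(cov, tol=1e-8):
    """Recover a topological ordering back-to-front from a covariance
    matrix of a linear Bayesian network, assuming equal error variances:
    the terminal vertex is the smallest diagonal entry of the precision
    matrix, and its parents are read off the corresponding row."""
    nodes = list(range(cov.shape[0]))
    order, parents = [], {}
    cov = cov.copy()
    while nodes:
        prec = np.linalg.inv(cov)
        t = int(np.argmin(np.diag(prec)))        # terminal (childless) vertex
        node = nodes[t]
        # regression coefficients of the terminal node on the rest
        beta = -prec[t] / prec[t, t]
        parents[node] = [nodes[j] for j in range(len(nodes))
                         if j != t and abs(beta[j]) > tol]
        order.append(node)
        keep = [j for j in range(len(nodes)) if j != t]
        cov = cov[np.ix_(keep, keep)]            # marginal cov of the rest
        nodes = [nodes[j] for j in keep]
    order.reverse()                              # sinks were collected first
    return order, parents
```

On the two-node chain X1 = 0.8 X0 + e (unit error variances), the population covariance [[1, 0.8], [0.8, 1.64]] yields ordering [0, 1] with node 1's parent set {0}.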
A Comparison of Algorithms for Learning Hidden Variables in Normal Graphs
A Bayesian factor graph reduced to normal form consists of the
interconnection of diverter units (or equal constraint units) and
Single-Input/Single-Output (SISO) blocks. In this framework localized
adaptation rules are explicitly derived from a constrained maximum likelihood
(ML) formulation and from a minimum KL-divergence criterion using KKT
conditions. The learning algorithms are compared with two other updating
equations based on a Viterbi-like and on a variational approximation
respectively. The performance of the various algorithms is verified on synthetic
data sets for various architectures. The objective of this paper is to provide
the programmer with explicit algorithms for rapid deployment of Bayesian graphs
in the applications. Comment: Submitted for journal publication.
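For concreteness, the message update at a diverter (equal-constraint) node follows the standard sum-product rule: the outgoing message on each edge is the normalized elementwise product of the incoming messages on all other edges. A minimal sketch for discrete messages (the function name is hypothetical):

```python
import numpy as np

def diverter_outgoing(incoming):
    """Sum-product update at an equal-constraint (diverter) node of a
    normal-form factor graph: each outgoing message is the normalized
    elementwise product of the incoming messages on the other edges."""
    incoming = [np.asarray(m, dtype=float) for m in incoming]
    out = []
    for i in range(len(incoming)):
        prod = np.ones_like(incoming[i])
        for j, m in enumerate(incoming):
            if j != i:
                prod *= m            # combine evidence from the other edges
        out.append(prod / prod.sum())  # normalize to a distribution
    return out
```

The SISO blocks of the framework would transform such messages through their conditional-probability matrices; the diverter rule above is the glue that fuses evidence arriving on the replicated variable's edges.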
Application of new probabilistic graphical models in the genetic regulatory networks studies
This paper introduces two new probabilistic graphical models for
reconstruction of genetic regulatory networks using DNA microarray data. One is
an Independence Graph (IG) model with either a forward or a backward search
algorithm and the other one is a Gaussian Network (GN) model with a novel
greedy search method. The performances of both models were evaluated on four
MAPK pathways in yeast and three simulated data sets. Generally, an IG model
provides a sparse graph but a GN model produces a dense graph where more
information about gene-gene interactions is preserved. Additionally, we found
two key limitations in the prediction of genetic regulatory networks from DNA
microarray data: first, the available sample size may be insufficient; second,
the complexity of network structures may not be captured without additional
data at the protein level. These limitations are present in all prediction
methods that use only DNA microarray data. Comment: 38 pages, 3 figures.
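The Independence Graph idea can be sketched via partial correlations obtained from the inverse sample covariance: an edge between two genes is kept when their partial correlation, given all other genes, exceeds a threshold. This is a simplified stand-in for the forward/backward search described above; the name and threshold are illustrative.

```python
import numpy as np

def independence_graph(data, threshold=0.1):
    """Sketch of an Independence Graph estimate: partial correlations
    come from the scaled inverse sample covariance, and an edge (i, j)
    survives when |partial_corr(i, j | rest)| exceeds the threshold."""
    cov = np.cov(data, rowvar=False)
    prec = np.linalg.inv(cov)
    d = np.sqrt(np.diag(prec))
    partial = -prec / np.outer(d, d)   # off-diagonal = partial correlations
    p = len(d)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(partial[i, j]) > threshold]
```

On simulated chain-structured expression data X0 -> X1 -> X2, only the adjacent pairs survive, illustrating the sparse graphs the IG model tends to produce.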
Variable selection for BART: An application to gene regulation
We consider the task of discovering gene regulatory networks, which are
defined as sets of genes and the corresponding transcription factors which
regulate their expression levels. This can be viewed as a variable selection
problem, potentially with high dimensionality. Variable selection is especially
challenging in high-dimensional settings, where it is difficult to detect
subtle individual effects and interactions between predictors. Bayesian
Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a
novel nonparametric alternative to parametric regression approaches, such as
the lasso or stepwise regression, especially when the number of relevant
predictors is sparse relative to the total number of available predictors and
the fundamental relationships are nonlinear. We develop a principled
permutation-based inferential approach for determining when the effect of a
selected predictor is likely to be real. Going further, we adapt the BART
procedure to incorporate informed prior information about variable importance.
We present simulations demonstrating that our method compares favorably to
existing parametric and nonparametric procedures in a variety of data settings.
To demonstrate the potential of our approach in a biological context, we apply
it to the task of inferring the gene regulatory network in yeast (Saccharomyces
cerevisiae). We find that our BART-based procedure is best able to recover the
subset of covariates with the largest signal compared to other variable
selection methods. The methods developed in this work are readily available in
the R package bartMachine. Comment: Published at
http://dx.doi.org/10.1214/14-AOAS755 in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
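The permutation-based inferential idea can be sketched independently of BART: refit a variable-importance measure on permuted responses to build a per-variable null distribution, then keep predictors whose observed importance exceeds a high null quantile. Here absolute correlation is a deliberately simple stand-in for BART's variable-inclusion proportions; all names are hypothetical.

```python
import numpy as np

def permutation_select(X, y, importance, n_perm=200, alpha=0.05, seed=0):
    """Permutation null for variable importance: permuting y breaks any
    X-y association, so the permuted scores estimate each variable's
    importance under the null; keep variables beating the (1 - alpha)
    quantile of their own null distribution."""
    rng = np.random.default_rng(seed)
    observed = importance(X, y)
    null = np.stack([importance(X, rng.permutation(y))
                     for _ in range(n_perm)])
    cutoff = np.quantile(null, 1.0 - alpha, axis=0)  # per-variable cutoff
    return np.flatnonzero(observed > cutoff)

def abs_correlation(X, y):
    """Stand-in importance measure (the paper uses BART
    variable-inclusion proportions instead)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc))
```

With y driven by a single column of X, the procedure recovers that predictor while the per-variable null quantiles screen out the noise columns.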