19,899 research outputs found

Massively-Parallel Feature Selection for Big Data

Author: Borboudakis Giorgos
Christophides Vassilis
Katsogridakis Pavlos
Pratikakis Polyvios
Tsamardinos Ioannis
Publication date: 23/08/2017

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of $p$-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
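The idea of combining partition-local p-values through meta-analysis can be illustrated with Fisher's combined probability test, one standard meta-analysis technique. The sketch below is only an illustration under that assumption: the function name `fisher_combine` is hypothetical, and the abstract does not state which combination rule PFBP actually uses.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch).

    Each p-value is assumed to come from the same conditional independence
    test run on a different row partition of the data.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is -2 * sum(log p_i); under the null hypothesis it
    # follows a chi-square distribution with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x / 2) * sum_{i < k} (x / 2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

Only the local p-values need to be communicated between partitions in such a scheme, which is consistent with the abstract's emphasis on partition-local computation and low communication cost.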

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Bayesian Approach to Linear Bayesian Networks

Author: Hwang Seyong
Lee Kyoungjae
Oh Sunmin
Park Gunwoong
Publication date: 27/11/2023

This study proposes the first Bayesian approach for learning high-dimensional linear Bayesian networks. The proposed approach iteratively estimates each element of the topological ordering from backward and its parent using the inverse of a partial covariance matrix. The proposed method successfully recovers the underlying structure when Bayesian regularization for the inverse covariance matrix with unequal shrinkage is applied. Specifically, it shows that the number of samples $n = \Omega(d_M^2 \log p)$ and $n = \Omega(d_M^2 p^{2/m})$ are sufficient for the proposed algorithm to learn linear Bayesian networks with sub-Gaussian and 4m-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d_M$ is the maximum degree of the moralized graph. The theoretical findings are supported by extensive simulation studies including real data analysis. Furthermore, the proposed method is demonstrated to outperform state-of-the-art frequentist approaches, such as the BHLSM, LISTEN, and TD algorithms in synthetic data.
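A minimal sketch of recovering a topological ordering "from backward" with an inverse covariance matrix is given below. It rests on assumptions not taken from the abstract: a linear structural equation model with equal error variances, an exact (population) covariance matrix, and a plain matrix inverse in place of the Bayesian regularized estimator. The names `invert` and `backward_ordering` are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] is reduced to [I | A^-1].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def backward_ordering(cov):
    """Return a topological ordering (sources first) from a covariance matrix.

    Assumption: linear SEM with equal error variances. Then a sink of the
    remaining subgraph has the smallest diagonal entry of the precision
    matrix, so sinks can be peeled off one at a time.
    """
    remaining = list(range(len(cov)))
    reversed_order = []
    while remaining:
        # Covariance restricted to the nodes not yet placed in the ordering.
        sub = [[cov[i][j] for j in remaining] for i in remaining]
        precision = invert(sub)
        k = min(range(len(remaining)), key=lambda i: precision[i][i])
        reversed_order.append(remaining.pop(k))
    return reversed_order[::-1]
```

For example, for the chain 2 → 0 → 1 with edge weights 0.8 and 0.5 and unit error variances, the covariance matrix is `[[1.64, 0.82, 0.8], [0.82, 1.41, 0.4], [0.8, 0.4, 1.0]]`, and the sketch peels off node 1, then node 0, then node 2. Parent estimation, which the abstract also mentions, is not covered here.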

arXiv.org e-Print Archive

A Comparison of Algorithms for Learning Hidden Variables in Normal Graphs

Author: Palmieri Francesco A. N.
Publication date: 01/01/2013

A Bayesian factor graph reduced to normal form consists of the interconnection of diverter units (or equal constraint units) and Single-Input/Single-Output (SISO) blocks. In this framework, localized adaptation rules are explicitly derived from a constrained maximum likelihood (ML) formulation and from a minimum KL-divergence criterion using KKT conditions. The learning algorithms are compared with two other updating equations based on a Viterbi-like and on a variational approximation, respectively. The performance of the various algorithms is verified on synthetic data sets for various architectures. The objective of this paper is to provide the programmer with explicit algorithms for rapid deployment of Bayesian graphs in applications. Comment: Submitted for journal publication

arXiv.org e-Print Archive

CiteSeerX

Archivio Istituzionale della Ricerca - Università degli Studi della Campania "Luigi Vanvitelli"

Application of new probabilistic graphical models in the genetic regulatory networks studies

Author: Anderson
Bar-Joseph
Chiang
Chickering
Cox
de la Fuente
Edwards
Friedman
Futcher
Geiger
Hartemink
Jan Delabie
Jong
Junbai Wang
Kikuchi
Lee
Leo Wang-Kit Cheung
Li
Meek
Qian
Rangel
Roberts
Rung
Segal
Somogyi
Spirtes
Steffen
Toh
Troyanskaya
Wang
Wu
Yeung
Yu
Zhang
Zhou
Publication venue: 'Elsevier BV'
Publication date: 31/12/2005

This paper introduces two new probabilistic graphical models for reconstruction of genetic regulatory networks using DNA microarray data. One is an Independence Graph (IG) model with either a forward or a backward search algorithm, and the other is a Gaussian Network (GN) model with a novel greedy search method. The performance of both models was evaluated on four MAPK pathways in yeast and three simulated data sets. Generally, an IG model provides a sparse graph, whereas a GN model produces a dense graph in which more information about gene-gene interactions is preserved. Additionally, we found two key limitations in the prediction of genetic regulatory networks using DNA microarray data: the first is the sufficiency of the sample size, and the second is that the complexity of network structures may not be captured without additional data at the protein level. These limitations are present in all prediction methods that use only DNA microarray data. Comment: 38 pages, 3 figures

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Crossref

Variable selection for BART: An application to gene regulation

Author: Bleich Justin
George Edward I.
Jensen Shane T.
Kapelner Adam
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2014

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine. Comment: Published at http://dx.doi.org/10.1214/14-AOAS755 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
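The general shape of a permutation-based selection rule can be sketched as follows. This is not the BART procedure from the abstract: absolute Pearson correlation is swapped in as a stand-in importance score so that the example stays self-contained, and the names `abs_corr` and `permutation_select`, the max-over-predictors threshold, and the default parameters are all assumptions made for illustration.

```python
import math
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_select(columns, y, n_perm=200, alpha=0.05, seed=0):
    """Select predictors whose score exceeds a permutation-null threshold.

    columns: list of predictor columns (each a list of floats).
    y: response values. Permuting y breaks any predictor-response link,
    so scores computed on permuted responses approximate the null.
    """
    rng = random.Random(seed)
    observed = [abs_corr(col, y) for col in columns]
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Record the largest score any predictor attains under the null.
        null_max.append(max(abs_corr(col, shuffled) for col in columns))
    null_max.sort()
    # Empirical (1 - alpha) quantile of the null maxima.
    idx = min(n_perm - 1, int(math.ceil((1.0 - alpha) * n_perm)) - 1)
    threshold = null_max[idx]
    return [j for j, score in enumerate(observed) if score > threshold]
```

Comparing every observed score against the null distribution of the maximum score is a deliberately conservative choice, since it controls false selections across all predictors at once; less strict per-predictor thresholds are equally possible.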

arXiv.org e-Print Archive

CiteSeerX

ScholarlyCommons@Penn