68,313 research outputs found

    Learning Bayesian Networks with the bnlearn R Package

    Get PDF
    bnlearn is an R package which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package.Comment: 22 pages, 4 picture

    Parallelization of the PC Algorithm

    Get PDF
    This paper describes a parallel version of the PC algorithm for learning the structure of a Bayesian network from data. The PC algorithm is a constraint-based algorithm consisting of fi ve steps where the first step is to perform a set of (conditional) independence tests while the remaining four steps relate to identifying the structure of the Bayesian network using the results of the (conditional) independence tests. In this paper, we describe a new approach to parallelization of the (conditional) independence testing as experiments illustrate that this is by far the most time consuming step. The proposed parallel PC algorithm is evaluated on data sets generated at random from five different real- world Bayesian networks. The results demonstrate that signi cant time performance improvements are possible using the proposed algorithm

    Learning Bayesian Networks with the bnlearn R Package

    Get PDF
    bnlearn is an R package (R Development Core Team 2010) which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package (Tierney et al. 2008) to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package (Gentry et al. 2010).

    Bayesian networks to explain the effect of label information on product perception

    Get PDF
    Interdisciplinary approaches in food research require new methods in data analysis that are able to deal with complexity and facilitate the communication among model users. Four parallel full factorial within-subject designs were performed to examine the relative contribution to consumer product evaluation of intrinsic product properties and information given on packaging. Detailed experimental designs and results obtained from analyses of variance were published [1]. The data was analyzed again with the machine learning modelling technique Bayesian networks. The objective of the current paper is to explain basic features of this technique and its advantages over the standard statistical approach regarding handling of complexity and communication of results. With analysis of variance, visualization and interpretation of main effects and interactions effects becomes difficult in complex systems. The Bayesian network model offers the possibility to formally incorporate (domain) experts knowledge. By combining empirical data with the pre-defined network structure, new relationships can be learned, thus generating an update of current knowledge. Probabilistic inference in Bayesian networks allows instant and global use of the model; its graphical representation makes it easy to visualize and communicate the results. Making use of the most of data from one single experiment, as well as combining data of independent experiments makes Bayesian networks for analysing these and similarly complex and rich data set

    Parallelization of the PC Algorithm

    Get PDF
    Abstract. This paper describes a parallel version of the PC algorithm for learning the structure of a Bayesian network from data. The PC algorithm is a constraint-based algorithm consisting of five steps where the first step is to perform a set of (conditional) independence tests while the remaining four steps relate to identifying the structure of the Bayesian network using the results of the (conditional) independence tests. In this paper, we describe a new approach to parallelization of the (conditional) independence testing as experiments illustrate that this is by far the most time consuming step. The proposed parallel PC algorithm is evaluated on data sets generated at random from five different realworld Bayesian networks. The results demonstrate that significant time performance improvements are possible using the proposed algorithm

    A Parallel Algorithm for Exact Bayesian Structure Discovery in Bayesian Networks

    Full text link
    Exact Bayesian structure discovery in Bayesian networks requires exponential time and space. Using dynamic programming (DP), the fastest known sequential algorithm computes the exact posterior probabilities of structural features in O(2(d+1)n2n)O(2(d+1)n2^n) time and space, if the number of nodes (variables) in the Bayesian network is nn and the in-degree (the number of parents) per node is bounded by a constant dd. Here we present a parallel algorithm capable of computing the exact posterior probabilities for all n(n−1)n(n-1) edges with optimal parallel space efficiency and nearly optimal parallel time efficiency. That is, if p=2kp=2^k processors are used, the run-time reduces to O(5(d+1)n2n−k+k(n−k)d)O(5(d+1)n2^{n-k}+k(n-k)^d) and the space usage becomes O(n2n−k)O(n2^{n-k}) per processor. Our algorithm is based the observation that the subproblems in the sequential DP algorithm constitute a nn-DD hypercube. We take a delicate way to coordinate the computation of correlated DP procedures such that large amount of data exchange is suppressed. Further, we develop parallel techniques for two variants of the well-known \emph{zeta transform}, which have applications outside the context of Bayesian networks. We demonstrate the capability of our algorithm on datasets with up to 33 variables and its scalability on up to 2048 processors. We apply our algorithm to a biological data set for discovering the yeast pheromone response pathways.Comment: 32 pages, 12 figure

    Scalable Exact Parent Sets Identification in Bayesian Networks Learning with Apache Spark

    Full text link
    In Machine Learning, the parent set identification problem is to find a set of random variables that best explain selected variable given the data and some predefined scoring function. This problem is a critical component to structure learning of Bayesian networks and Markov blankets discovery, and thus has many practical applications, ranging from fraud detection to clinical decision support. In this paper, we introduce a new distributed memory approach to the exact parent sets assignment problem. To achieve scalability, we derive theoretical bounds to constraint the search space when MDL scoring function is used, and we reorganize the underlying dynamic programming such that the computational density is increased and fine-grain synchronization is eliminated. We then design efficient realization of our approach in the Apache Spark platform. Through experimental results, we demonstrate that the method maintains strong scalability on a 500-core standalone Spark cluster, and it can be used to efficiently process data sets with 70 variables, far beyond the reach of the currently available solutions
    • …
    corecore