5 research outputs found

    A Parallel Algorithm for Exact Bayesian Structure Discovery in Bayesian Networks

    Exact Bayesian structure discovery in Bayesian networks requires exponential time and space. Using dynamic programming (DP), the fastest known sequential algorithm computes the exact posterior probabilities of structural features in O(2(d+1)n2^n) time and space, where n is the number of nodes (variables) in the Bayesian network and the in-degree (number of parents) per node is bounded by a constant d. Here we present a parallel algorithm capable of computing the exact posterior probabilities for all n(n-1) edges with optimal parallel space efficiency and nearly optimal parallel time efficiency. That is, if p = 2^k processors are used, the run-time reduces to O(5(d+1)n2^{n-k} + k(n-k)^d) and the space usage becomes O(n2^{n-k}) per processor. Our algorithm is based on the observation that the subproblems in the sequential DP algorithm constitute an n-dimensional hypercube. We carefully coordinate the computation of correlated DP procedures so that large amounts of data exchange are suppressed. Further, we develop parallel techniques for two variants of the well-known zeta transform, which have applications outside the context of Bayesian networks. We demonstrate the capability of our algorithm on datasets with up to 33 variables and its scalability on up to 2048 processors. We apply our algorithm to a biological data set for discovering the yeast pheromone response pathways.
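
    For context, the zeta transform mentioned above computes, for a function f on subsets of an n-element ground set, the subset sums F(S) = sum over T subset of S of f(T). Below is a minimal sketch of the classical sequential O(n*2^n) dynamic program with subsets encoded as bitmasks; the paper's contribution is parallelizing variants of this transform, which this sketch does not attempt.

```python
# Classical sequential zeta transform over the subset lattice:
# F(S) = sum of f(T) over all T subset of S, in O(n * 2^n) additions.
# Subsets of {0, ..., n-1} are encoded as bitmasks 0 .. 2^n - 1.

def zeta_transform(f, n):
    F = list(f)  # copy so the input table is left untouched
    for i in range(n):              # fold in one ground-set element at a time
        bit = 1 << i
        for S in range(1 << n):
            if S & bit:             # if element i is in S, add the value of
                F[S] += F[S ^ bit]  # the same subset with element i removed
    return F

# Tiny check for n = 2, f indexed by masks {}, {0}, {1}, {0,1}:
print(zeta_transform([1, 2, 3, 4], 2))  # [1, 3, 4, 10]; F({0,1}) = 1+2+3+4
```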

    A Review of Inference Algorithms for Hybrid Bayesian Networks

    Hybrid Bayesian networks have received increasing attention in recent years. The difference with respect to standard Bayesian networks is that they can host discrete and continuous variables simultaneously, which extends the applicability of the Bayesian network framework in general. However, this extra feature also comes at a cost: inference in these types of models is computationally more challenging, and the underlying models and updating procedures may not even support closed-form solutions. In this paper we provide an overview of the main trends and principled approaches for performing inference in hybrid Bayesian networks. The methods covered in the paper are organized and discussed according to their methodological basis. We consider how the methods have been extended and adapted to also include (hybrid) dynamic Bayesian networks, and we end with an overview of established software systems supporting inference in these types of models.
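
    To make the "discrete and continuous variables simultaneously" point concrete, here is a minimal hypothetical example of the simplest hybrid case: a conditional linear Gaussian fragment with a discrete parent and a continuous child, where the child's marginal is still a closed-form Gaussian mixture. All numbers are made up for illustration; the paper surveys far more general settings where no closed form exists.

```python
import math

# Hypothetical hybrid fragment: discrete parent A, continuous child X with
# X | A = a  ~  N(mu_a, sigma_a^2). Marginalizing out A gives a Gaussian
# mixture, one of the few hybrid cases with a closed-form marginal.
p_A   = {"low": 0.3, "high": 0.7}   # P(A)
mu    = {"low": 0.0, "high": 5.0}   # conditional means
sigma = {"low": 1.0, "high": 2.0}   # conditional standard deviations

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def marginal_x(x):
    """p(x) = sum over a of P(A=a) * N(x; mu_a, sigma_a^2)."""
    return sum(p_A[a] * normal_pdf(x, mu[a], sigma[a]) for a in p_A)

print(marginal_x(4.0))  # density of X at 4.0 under the two-component mixture
```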

    Parallel Algorithms for Bayesian Networks Structure Learning with Applications in Systems Biology

    The expression levels of thousands to tens of thousands of genes in a living cell are controlled by internal and external cues which act in a combinatorial manner that can be modeled as a network. High-throughput technologies, such as DNA microarrays and next-generation sequencing, allow for the measurement of gene expression levels on a whole-genome scale. In recent years, a wealth of microarray data probing gene expression under various biological conditions has been accumulated in public repositories, which facilitates uncovering the underlying transcriptional networks (gene networks). Due to the high data dimensionality and inherent complexity of gene interactions, this task inevitably requires automated computational approaches. Various models have been proposed for learning gene networks, with Bayesian networks (BNs) showing promise for the task. However, BN structure learning is an NP-hard problem, and both exact and heuristic methods are computationally intensive with limited ability to produce large networks. To address these issues, we developed a set of parallel algorithms. First, we present a communication-efficient parallel algorithm for exact BN structure learning, which is work-optimal provided that 2^n > p log(p), where n is the total number of variables and p is the number of processors. This algorithm has space complexity within a factor of 1.41 of optimal. Our empirical results demonstrate near-perfect scaling on up to 2,048 processors. We further extend this work to the case of bounded node in-degree, where a limit d on the number of parents per variable is imposed. We characterize the algorithm's run-time behavior as a function of d, establishing the range [n/3 - log(mn), ceil(n/2)) of values for d where it affects performance. Consequently, two plateau regions are identified: for d < n/3 - log(mn), where the run-time complexity remains the same as for d = 1, and for d >= ceil(n/2), where the run-time complexity remains the same as for d = n-1. Finally, we present a parallel heuristic approach for large-scale BN learning. This approach aims to combine the precision of exact learning with the scalability of heuristic methods. Our empirical results demonstrate good scaling on various high-performance platforms. The quality of the networks learned by both the exact and heuristic methods is evaluated using synthetically generated expression data. The biological relevance of the networks learned by the exact algorithm is assessed by applying it to the carotenoid biosynthesis pathway in Arabidopsis thaliana.
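
    The role of the in-degree bound d can be illustrated with a back-of-the-envelope count (this is not the dissertation's algorithm, only the combinatorial quantity that d controls): each variable has sum over k = 0..d of C(n-1, k) candidate parent sets, a sum that saturates well before d reaches n-1, consistent with the upper plateau at d >= ceil(n/2) described above.

```python
from math import comb

def candidate_parent_sets(n, d):
    """Parent sets of size at most d drawn from the other n - 1 variables."""
    return sum(comb(n - 1, k) for k in range(min(d, n - 1) + 1))

n = 20
for d in (1, 5, 10, 19):
    print(d, candidate_parent_sets(n, d))
# At d = ceil(n/2) = 10 the sum already exceeds half of 2^(n-1) = 524288,
# so raising d further at most doubles the number of candidate parent sets.
```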

    Structure Discovery in Bayesian Networks: Algorithms and Applications

    Bayesian networks are a class of probabilistic graphical models that have been widely used in various tasks for probabilistic inference and causal modeling. A Bayesian network provides a compact, flexible, and interpretable representation of a joint probability distribution. When the network structure is unknown but observational data are at hand, one can try to learn the network structure from the data. This is called structure discovery. Structure discovery in Bayesian networks encompasses several interesting problem variants. In the optimal Bayesian network learning problem (we call this structure learning), one aims to find a Bayesian network that best explains the data and then utilizes this optimal Bayesian network for predictions or inferences. In other variants, we are interested in finding the local structural features that are highly probable (we call this structure discovery). Both structure learning and structure discovery are considered very hard because existing approaches to these problems require highly intensive computations. In this dissertation, we develop algorithms to achieve more accurate, efficient, and scalable structure discovery in Bayesian networks and demonstrate these algorithms in applications to systems biology and educational data mining. Specifically, this study is conducted in five directions. First, we propose a novel heuristic algorithm for Bayesian network structure learning that takes advantage of the idea of curriculum learning and learns Bayesian network structures by stages. We prove theoretical advantages of our algorithm and also empirically show that it outperforms the state-of-the-art heuristic approach in learning Bayesian network structures. Secondly, we develop an algorithm to efficiently enumerate the k-best equivalence classes of Bayesian networks, where Bayesian networks in the same equivalence class are equally expressive in terms of representing probability distributions. We demonstrate our algorithm in the task of Bayesian model averaging. Our approach goes beyond the maximum-a-posteriori (MAP) model by listing the most likely network structures and their relative likelihoods, and therefore has important applications in causal structure discovery. Thirdly, we study how parallelism can be used to tackle the exponential time and space complexity of exact Bayesian structure discovery. We consider the problem of computing the exact posterior probabilities of directed edges in Bayesian networks. We present a parallel algorithm capable of computing the exact posterior probabilities of all possible directed edges with optimal parallel space efficiency and nearly optimal parallel time efficiency. We apply our algorithm to a biological data set for discovering the yeast pheromone response pathways. Fourthly, we develop novel algorithms for computing the exact posterior probabilities of ancestor relations in Bayesian networks. The existing algorithm assumes an order-modular prior over Bayesian networks that does not respect Markov equivalence; our algorithm allows a uniform prior and respects Markov equivalence. We apply our algorithm to a biological data set for discovering protein signaling pathways. Finally, we introduce Combined student Modeling and prerequisite Discovery (COMMAND), a novel algorithm for jointly inferring a prerequisite graph and a student model from student performance data. COMMAND learns the skill prerequisite relations as a Bayesian network, which is capable of modeling the global prerequisite structure and capturing the conditional independence between skills. Our experiments on simulations and real student data suggest that COMMAND is better than prior methods in the literature. COMMAND is useful for designing intelligent tutoring systems that assess student knowledge or offer remediation interventions to students.
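
    As a concrete illustration of the model-averaging use of the k-best enumeration (second contribution above), posterior probabilities of a structural feature can be approximated by renormalizing over the k best structures. The sketch below is a generic version with hypothetical scores and a hypothetical feature, not code from the dissertation:

```python
import math

# (structure id, log posterior score log P(G, D), feature present in G?)
# Scores and the feature ("edge X -> Y") are hypothetical placeholders.
k_best = [("G1", -100.2, True), ("G2", -101.0, False), ("G3", -103.5, True)]

def feature_posterior(models):
    """P(feature | D) approximated over the k-best set: the score-weighted
    fraction of models containing the feature (log-sum-exp for stability)."""
    m = max(score for _, score, _ in models)
    total = sum(math.exp(score - m) for _, score, _ in models)
    hits = sum(math.exp(score - m) for _, score, has in models if has)
    return hits / total

print(feature_posterior(k_best))  # posterior mass on the feature, ~0.70
```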

    ALGORITHMS FOR CONSTRAINT-BASED LEARNING OF BAYESIAN NETWORK STRUCTURES WITH LARGE NUMBERS OF VARIABLES

    Bayesian networks (BNs) are highly practical and successful tools for modeling probabilistic knowledge. They can be constructed by an expert, learned from data, or built by a combination of the two. A popular approach to learning the structure of a BN is the constraint-based search (CBS) approach, with the PC algorithm being a prominent example. In recent years, we have been experiencing a data deluge; we have access to more data, big and small, than ever before. The exponential nature of BN algorithms, however, hinders large-scale analysis. Developments in parallel and distributed computing have made the computational power required for large-scale data processing widely available, yielding opportunities for developing parallel and distributed algorithms for BN learning and inference. In this dissertation, (1) I propose two MapReduce versions of the PC algorithm, aimed at solving an increasingly common case: data that is not necessarily massive in the number of records, but more and more so in the number of variables. (2) When the number of data records is small, the PC algorithm experiences problems in independence testing. Empirically, I explore a contradiction in the literature on how to resolve the case of insufficient data when testing the independence of two variables: declare independence or declare dependence. (3) When BNs learned from data become complex in terms of graph density, they may require more parameters than we can feasibly store. I propose and evaluate five approaches to pruning a BN structure to guarantee that it will be tractable for storage and inference. I follow this up by proposing three approaches to improving the classification accuracy of a BN by modifying its structure.
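
    For orientation, here is a minimal sketch of the skeleton phase of the PC algorithm, the part whose independence tests a MapReduce formulation would distribute. This is the textbook sequential form, not the dissertation's MapReduce version; the ci_test argument is a hypothetical stand-in for a real conditional independence test on data (e.g. a chi-squared or G-test).

```python
from itertools import combinations

def pc_skeleton(variables, ci_test):
    """Textbook PC skeleton phase: start from the complete undirected graph
    and delete the edge X - Y once some conditioning set S, drawn from the
    current neighbors of X, renders X and Y conditionally independent.
    Conditioning-set size grows one level at a time."""
    adj = {v: set(variables) - {v} for v in variables}
    level = 0
    while any(len(adj[x] - {y}) >= level for x in variables for y in adj[x]):
        for x in variables:
            for y in list(adj[x]):            # copy: we mutate adj[x] below
                for S in combinations(adj[x] - {y}, level):
                    if ci_test(x, y, set(S)): # X independent of Y given S?
                        adj[x].discard(y)     # remove the edge in both
                        adj[y].discard(x)     # directions
                        break
        level += 1
    return adj
```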