1,207 research outputs found

    Advances in Learning Bayesian Networks of Bounded Treewidth

    This work presents novel algorithms for learning Bayesian network structures with bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed-integer linear programming formulations for structure learning and treewidth computation. The approximate method consists of uniformly sampling $k$-trees (maximal graphs of treewidth $k$), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that $k$-tree. Some properties of these methods are discussed and proven. The approaches are empirically compared to each other and to a state-of-the-art method for learning bounded treewidth structures on a collection of public data sets with up to 100 variables. The experiments show that our exact algorithm outperforms the state of the art, and that the approximate approach is fairly accurate. Comment: 23 pages, 2 figures, 3 tables
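The sampling idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the construction below grows a random $k$-tree by attaching each new node to a randomly chosen existing $k$-clique, which does not sample uniformly, and the function names (`random_k_tree`, `moral_graph`, `fits_k_tree`) are hypothetical. It only shows what it means for a structure's moral graph to be a subgraph of a $k$-tree.

```python
import itertools
import random


def random_k_tree(n, k, seed=None):
    """Return the edge set of a random k-tree on nodes 0..n-1.

    Assumption: simple incremental construction, NOT uniform sampling.
    """
    if k < 1 or n < k + 1:
        raise ValueError("need k >= 1 and n >= k + 1")
    rng = random.Random(seed)
    # Start from a (k+1)-clique on the first k + 1 nodes.
    base = list(range(k + 1))
    edges = {frozenset(p) for p in itertools.combinations(base, 2)}
    # Every k-subset of the initial clique is a k-clique a new node may join.
    k_cliques = [tuple(c) for c in itertools.combinations(base, k)]
    for v in range(k + 1, n):
        host = rng.choice(k_cliques)
        edges.update(frozenset((v, u)) for u in host)
        # v plus any (k-1)-subset of host is a newly created k-clique.
        for sub in itertools.combinations(host, k - 1):
            k_cliques.append(sub + (v,))
    return edges


def moral_graph(parents):
    """Moralize a DAG given as {child: iterable of parents}."""
    edges = set()
    for child, pa in parents.items():
        pa = list(pa)
        # Keep every parent-child edge, dropping direction.
        edges.update(frozenset((p, child)) for p in pa)
        # "Marry" every pair of parents that share a child.
        edges.update(frozenset(q) for q in itertools.combinations(pa, 2))
    return edges


def fits_k_tree(parents, k_tree_edges):
    """True if the DAG's moral graph is a subgraph of the k-tree."""
    return moral_graph(parents) <= k_tree_edges
```

For example, the v-structure `{2: [0, 1]}` moralizes to a triangle, so it fits inside a 2-tree on three nodes but not inside any 1-tree (an ordinary tree).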

    Inferring Regulatory Networks by Combining Perturbation Screens and Steady State Gene Expression Profiles

    Reconstructing transcriptional regulatory networks is an important task in functional genomics. Data obtained from experiments that perturb genes by knockouts or RNA interference contain useful information for addressing this reconstruction problem. However, such data can be limited in size and/or are expensive to acquire. On the other hand, observational data of the organism in steady state (e.g. wild-type) are more readily available, but their informational content is inadequate for the task at hand. We develop a computational approach to appropriately utilize both data sources for estimating a regulatory network. The proposed approach is based on a three-step algorithm to estimate the underlying directed but cyclic network, which uses as input both perturbation screens and steady state gene expression data. In the first step, the algorithm determines causal orderings of the genes that are consistent with the perturbation data, by combining an exhaustive search method with a fast heuristic that in turn couples a Monte Carlo technique with a fast search algorithm. In the second step, for each obtained causal ordering, a regulatory network is estimated using a penalized-likelihood-based method, while in the third step a consensus network is constructed from the highest-scoring ones. Extensive computational experiments show that the algorithm performs well in reconstructing the underlying network and clearly outperforms competing approaches that rely only on a single data source. Further, it is established that the algorithm produces a consistent estimate of the regulatory network. Comment: 24 pages, 4 figures, 6 tables
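The third step described above, building a consensus network from the highest-scoring candidates, might look roughly like the following sketch. The abstract does not say how the consensus is formed, so the edge-voting rule, the `min_support` threshold, the convention that a higher score is better, and the function name `consensus_network` are all assumptions made for illustration.

```python
from collections import Counter


def consensus_network(scored_networks, top_m, min_support=0.5):
    """Keep edges that appear in enough of the top-scoring networks.

    scored_networks: list of (score, edges) pairs, where edges is an
    iterable of (regulator, target) tuples. Assumes higher score = better.
    """
    # Rank candidates by score and keep the top_m best.
    ranked = sorted(scored_networks, key=lambda item: item[0], reverse=True)
    ranked = ranked[:top_m]
    if not ranked:
        return set()
    # Count, for each directed edge, how many kept networks contain it.
    counts = Counter(edge for _, edges in ranked for edge in set(edges))
    # Retain edges supported by at least min_support of the kept networks.
    return {e for e, c in counts.items() if c / len(ranked) >= min_support}
```

Raising `min_support` toward 1.0 yields a sparser, more conservative consensus; lowering it keeps edges found in only a few candidate networks.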

    Sparse Linear Identifiable Multivariate Modeling

    In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component delta-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs as well as or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Comment: 45 pages, 17 figures

    Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints

    Unsupervised estimation of latent variable models is a fundamental problem central to numerous applications of machine learning and statistics. This work presents a principled approach for estimating broad classes of such models, including probabilistic topic models and latent linear Bayesian networks, using only second-order observed moments. The sufficient conditions for identifiability of these models are primarily based on weak expansion constraints on the topic-word matrix, for topic models, and on the directed acyclic graph, for Bayesian networks. Because no assumptions are made on the distribution among the latent variables, the approach can handle arbitrary correlations among the topics or latent factors. In addition, a tractable learning method via $\ell_1$ optimization is proposed and studied in numerical experiments. Comment: 38 pages, 6 figures, 2 tables, applications in topic models and Bayesian networks are studied. Simulation section is added

    Bayesian network structure learning using characteristic properties of permutation representations with applications to prostate cancer treatment.

    Over the last decades, Bayesian Networks (BNs) have become an increasingly popular technique to model data in the presence of uncertainty. BNs are probabilistic models that represent relationships between variables by means of a node structure and a set of parameters. Efficiently learning the structure that models a particular dataset is an NP-hard task that requires substantial computational effort to be successful. Although there exist many families of techniques for this purpose, this thesis focuses on the study and improvement of search and score methods such as Evolutionary Algorithms (EAs). In the domain of BN structure learning, previous work has investigated the use of permutations to represent variable orderings within EAs. In this thesis, the characteristic properties of permutation representations are analysed and used in order to enhance BN structure learning. The thesis assesses well-established algorithms to provide a detailed analysis of the difficulty of learning BN structures using permutation representations. Using selected benchmarks, rugged and plateaued fitness landscapes are identified that result in a loss of population diversity throughout the search. The thesis proposes two approaches to handle the loss of diversity. First, the benefits of introducing the Island Model (IM) paradigm are studied, showing that diversity loss can be significantly reduced. Second, a novel agent-based metaheuristic is presented in which evolution is based on the use of several mutation operators and the definition of a distance metric in permutation spaces. The latter approach shows that diversity can be maintained throughout the search while efficiently exploring the solution space. In addition, the use of IM is investigated in the context of distributed data, a common property of real-world problems. Experiments show that privacy can be preserved while learning BNs of high quality.
Finally, using UK-wide data related to prostate cancer patients, the thesis assesses the general suitability of BNs alongside the proposed learning approaches for medical data modeling. Following comparisons with tools currently used in clinical settings and with alternative classifiers, it is shown that BNs can improve the predictive power of prostate cancer staging tools, a major concern in the field of urology.
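To make the notion of a distance metric in permutation spaces concrete, here is a small sketch. The abstract does not name the metric or the mutation operators the thesis uses; the Kendall tau distance and the adjacent-swap mutation below are common choices picked purely for illustration, and both function names are hypothetical.

```python
import random


def kendall_tau_distance(a, b):
    """Count the pairs of nodes that orderings a and b rank in opposite order."""
    # Position of each node in b, so a can be rewritten in b's coordinates.
    pos = {node: i for i, node in enumerate(b)}
    seq = [pos[node] for node in a]
    # Each inversion in seq is one discordant pair between a and b.
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def adjacent_swap(order, rng):
    """Mutation operator: swap two neighbouring nodes in a copy of the ordering."""
    i = rng.randrange(len(order) - 1)
    child = list(order)
    child[i], child[i + 1] = child[i + 1], child[i]
    return child
```

An adjacent swap always moves an ordering exactly one Kendall tau step away from its parent, so a population's average pairwise distance gives one simple way to monitor diversity during the search.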

    Problem dependent metaheuristic performance in Bayesian network structure learning.

    Bayesian network (BN) structure learning from data has been an active research area in the machine learning field in recent decades. Much of the research has considered BN structure learning as an optimization problem. However, finding an optimal BN from data is NP-hard. This fact has driven the use of heuristic algorithms for solving this kind of problem. A major recent focus in BN structure learning is on search and score algorithms. In these algorithms, a scoring function is introduced and a heuristic search algorithm is used to evaluate each network with respect to the training data. The optimal network is produced according to the best score evaluated. This thesis investigates a range of search and score algorithms to understand the relationship between technique performance and structure features of the problems. The main contributions of this thesis include (a) two novel Ant Colony Optimization based search and score algorithms for BN structure learning; (b) node juxtaposition distribution for studying the relationship between the best node ordering and the optimal BN structure; (c) fitness landscape analysis for investigating the different performances of both the chain score function and the CH score function; (d) a classifier method constructed by utilizing the receiver operating characteristic curve with the results of the fitness landscape analysis; and finally (e) a selective off-line hyperheuristic algorithm built for unseen BN structure learning with search and score algorithms. In this thesis, we also construct a new algorithm for producing BN benchmark structures and apply our novel approaches to a range of benchmark problems and real-world problems.