822 research outputs found
Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms
Three classes of algorithms to learn the structure of Bayesian networks from
data are common in the literature: constraint-based algorithms, which use
conditional independence tests to learn the dependence structure of the data;
score-based algorithms, which use goodness-of-fit scores as objective functions
to maximise; and hybrid algorithms that combine both approaches.
Constraint-based and score-based algorithms have been shown to learn the same
structures when conditional independence and goodness of fit are both assessed
using entropy and the topological ordering of the network is known (Cowell,
2001).
In this paper, we investigate how these three classes of algorithms perform
outside the assumptions above in terms of speed and accuracy of network
reconstruction for both discrete and Gaussian Bayesian networks. We approach
this question by recognising that structure learning is defined by the
combination of a statistical criterion and an algorithm that determines how the
criterion is applied to the data. Removing the confounding effect of different
choices for the statistical criterion, we find using both simulated and
real-world complex data that constraint-based algorithms are often less
accurate than score-based algorithms, but are seldom faster (even at large
sample sizes); and that hybrid algorithms are neither faster nor more accurate
than constraint-based algorithms. This suggests that commonly held beliefs on
structure learning in the literature are strongly influenced by the choice of
particular statistical criteria rather than just by the properties of the
algorithms themselves.Comment: 27 pages, 8 figure
Learning Large-Scale Bayesian Networks with the sparsebn Package
Learning graphical models from data is an important problem with wide
applications, ranging from genomics to the social sciences. Nowadays datasets
often have upwards of thousands---sometimes tens or hundreds of thousands---of
variables and far fewer samples. To meet this challenge, we have developed a
new R package called sparsebn for learning the structure of large, sparse
graphical models with a focus on Bayesian networks. While there are many
existing software packages for this task, this package focuses on the unique
setting of learning large networks from high-dimensional data, possibly with
interventions. As such, the methods provided place a premium on scalability and
consistency in a high-dimensional setting. Furthermore, in the presence of
interventions, the methods implemented here achieve the goal of learning a
causal network from data. Additionally, the sparsebn package is fully
compatible with existing software packages for network analysis.Comment: To appear in the Journal of Statistical Software, 39 pages, 7 figure
Mining Marked Nodes in Large Graphs
abstract: With the rise of the Big Data Era, an exponential amount of network data is being generated at an unprecedented rate across a wide-range of high impact micro and macro areas of research---from protein interaction to social networks. The critical challenge is translating this large scale network data into actionable information.
A key task in the data translation is the analysis of network connectivity via marked nodes---the primary focus of our research. We have developed a framework for analyzing network connectivity via marked nodes in large scale graphs, utilizing novel algorithms in three interrelated areas: (1) analysis of a single seed node via it’s ego-centric network (AttriPart algorithm); (2) pathway identification between two seed nodes (K-Simple Shortest Paths Multithreaded and Search Reduced (KSSPR) algorithm); and (3) tree detection, defining the interaction between three or more seed nodes (Shortest Path MST algorithm).
In an effort to address both fundamental and applied research issues, we have developed the LocalForcasting algorithm to explore how network connectivity analysis can be applied to local community evolution and recommender systems. The goal is to apply the LocalForecasting algorithm to various domains---e.g., friend suggestions in social networks or future collaboration in co-authorship networks. This algorithm utilizes link prediction in combination with the AttriPart algorithm to predict future connections in local graph partitions.
Results show that our proposed AttriPart algorithm finds up to 1.6x denser local partitions, while running approximately 43x faster than traditional local partitioning techniques (PageRank-Nibble). In addition, our LocalForecasting algorithm demonstrates a significant improvement in the number of nodes and edges correctly predicted over baseline methods. Furthermore, results for the KSSPR algorithm demonstrate a speed-up of up to 2.5x the standard k-simple shortest paths algorithm.Dissertation/ThesisMasters Thesis Computer Science 201
Low Rank Directed Acyclic Graphs and Causal Structure Learning
Despite several important advances in recent years, learning causal
structures represented by directed acyclic graphs (DAGs) remains a challenging
task in high dimensional settings when the graphs to be learned are not sparse.
In particular, the recent formulation of structure learning as a continuous
optimization problem proved to have considerable advantages over the
traditional combinatorial formulation, but the performance of the resulting
algorithms is still wanting when the target graph is relatively large and
dense. In this paper we propose a novel approach to mitigate this problem, by
exploiting a low rank assumption regarding the (weighted) adjacency matrix of a
DAG causal model. We establish several useful results relating interpretable
graphical conditions to the low rank assumption, and show how to adapt existing
methods for causal structure learning to take advantage of this assumption. We
also provide empirical evidence for the utility of our low rank algorithms,
especially on graphs that are not sparse. Not only do they outperform
state-of-the-art algorithms when the low rank condition is satisfied, the
performance on randomly generated scale-free graphs is also very competitive
even though the true ranks may not be as low as is assumed
- …