258 research outputs found

    Understanding the Scalability of Bayesian Network Inference Using Clique Tree Growth Curves

    One of the main approaches to performing computation in Bayesian networks (BNs) is clique tree clustering and propagation. The clique tree approach consists of propagation in a clique tree compiled from a Bayesian network, and while it was introduced in the 1980s, there is still a lack of understanding of how clique tree computation time depends on variations in BN size and structure. In this article, we improve this understanding by developing an approach to characterizing clique tree growth as a function of parameters that can be computed in polynomial time from BNs, specifically: (i) the ratio of the number of a BN's non-root nodes to the number of root nodes, and (ii) the expected number of moral edges in its moral graph. Analytically, we partition the set of cliques in a clique tree into different sets, and introduce a growth curve for the total size of each set. For the special case of bipartite BNs, there are two sets and two growth curves, a mixed clique growth curve and a root clique growth curve. In experiments on random bipartite BNs generated using the BPART algorithm, we systematically increase the out-degree of the root nodes by increasing the number of leaf nodes. Surprisingly, root clique growth is well approximated by Gompertz growth curves, an S-shaped family of curves that has previously been used to describe growth processes in biology, medicine, and neuroscience. We believe that this research improves the understanding of the scaling behavior of clique tree clustering for a certain class of Bayesian networks; presents an aid for trade-off studies of clique tree clustering using growth curves; and ultimately provides a foundation for benchmarking and developing improved BN inference and machine learning algorithms.
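
    The Gompertz family referenced above has the closed form f(x) = a * exp(-b * exp(-c * x)), where a is the asymptote, b shifts the curve along the x-axis, and c sets the growth rate. The following minimal sketch fits such a curve with SciPy on synthetic data; the article's BPART-generated networks and measurements are not reproduced here, so all numbers below are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def gompertz(x, a, b, c):
    """Gompertz growth curve: a is the asymptote, b shifts the curve
    along the x-axis, and c controls the growth rate."""
    return a * np.exp(-b * np.exp(-c * x))

# Hypothetical data: total root clique size versus number of leaf nodes.
leaves = np.arange(1, 21, dtype=float)
sizes = gompertz(leaves, a=5000.0, b=8.0, c=0.4)
sizes += np.random.default_rng(0).normal(0.0, 50.0, leaves.size)  # noise

params, _ = curve_fit(gompertz, leaves, sizes, p0=[4000.0, 5.0, 0.5])
print("fitted (a, b, c):", params)
```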

    An extended depth-first search algorithm for optimal triangulation of Bayesian networks

    The junction tree algorithm is currently the most popular algorithm for exact inference on Bayesian networks. To improve the time complexity of the junction tree algorithm, we need to find a triangulation with the optimal total table size. For this purpose, Ottosen and Vomlel have proposed a depth-first search (DFS) algorithm, together with several techniques to improve it, including dynamic clique maintenance and coalescing map pruning. Nevertheless, the efficiency and scalability of that algorithm leave much room for improvement. First, the dynamic clique maintenance can recompute some cliques unnecessarily. Second, in the worst case, the DFS algorithm explores the search space of all elimination orders, which has size n!, where n is the number of variables in the Bayesian network. To mitigate these problems, we propose an extended depth-first search (EDFS) algorithm. The new EDFS algorithm introduces the following two techniques as improvements to the DFS algorithm: (1) a new dynamic clique maintenance algorithm that computes only those cliques that contain a new edge, and (2) a new pruning rule, called pivot clique pruning. The new dynamic clique maintenance algorithm explores a smaller search space and runs faster than the Ottosen and Vomlel approach; this improvement decreases the overhead cost of the DFS algorithm. The pivot clique pruning reduces the size of the search space by a factor of O(n²). Our empirical results show that our proposed algorithm finds an optimal triangulation markedly faster than the state-of-the-art algorithm does.
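
    To make the two objects in play concrete, the sketch below, a toy illustration rather than the paper's algorithm, triangulates a small moral graph by vertex elimination for every one of the n! elimination orders and scores each triangulation by total table size, i.e., the sum over maximal cliques of the product of the state counts of their variables. Names such as `triangulate` are ours, and brute force is of course only feasible on tiny graphs.

```python
import itertools

def triangulate(adj, order):
    """Eliminate vertices in the given order, adding fill-in edges, and
    collect the elimination cliques; the maximal ones are exactly the
    maximal cliques of the triangulated (filled) graph."""
    adj = {v: set(nb) for v, nb in adj.items()}
    cliques = []
    for v in order:
        nbrs = adj[v]
        cliques.append(nbrs | {v})
        for a, b in itertools.combinations(nbrs, 2):
            adj[a].add(b)
            adj[b].add(a)
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return [c for c in cliques if not any(c < d for d in cliques)]

def total_table_size(cliques, states):
    """Total table size: sum over maximal cliques of the product of
    the state counts of the variables in the clique."""
    total = 0
    for clique in cliques:
        size = 1
        for v in clique:
            size *= states[v]
        total += size
    return total

# Toy moral graph: a 4-cycle a-b-c-d over binary variables.
graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
states = {v: 2 for v in graph}

best = min(
    (total_table_size(triangulate(graph, order), states), order)
    for order in itertools.permutations(graph)
)
print(best)  # smallest total table size and one optimal elimination order
```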

    Proposal of an Optimal Triangulation Algorithm for Accelerating Probabilistic Inference in Bayesian Networks

    Bayesian networks are widely used probabilistic graphical models that provide a compact representation of joint probability distributions over a set of variables. A common inference task in Bayesian networks is to compute the posterior marginal distributions of the unobserved variables given some evidence variables that have already been observed. However, the inference problem is known to be NP-hard, and this complexity limits the usage of Bayesian networks. Many attempts to improve inference algorithms have been made in the past two decades. Currently, the junction tree algorithm is among the most prominent exact inference algorithms. To perform efficient inference on a Bayesian network using the junction tree algorithm, it is necessary to find a triangulation of the moral graph of the Bayesian network such that the total table size is small. In this context, the total table size is used to measure the computational complexity of the junction tree inference algorithm. This thesis focuses on exact algorithms for finding a triangulation that minimizes the total table size for a given Bayesian network. For optimal triangulation, Ottosen and Vomlel have proposed a depth-first search (DFS) algorithm, together with several techniques to improve it, including dynamic clique maintenance and coalescing map pruning. Nevertheless, the efficiency and scalability of their algorithm leave much room for improvement. First, the dynamic clique maintenance allows the recomputation of some cliques. Second, for a Bayesian network with n variables, the DFS algorithm runs in O*(n!) time because it explores the search space of all elimination orders. To mitigate these problems, an extended depth-first search (EDFS) algorithm is proposed in this thesis. The new EDFS algorithm introduces two techniques: (1) a new dynamic clique maintenance algorithm that computes only those cliques that contain a new edge, and (2) a new pruning rule, called pivot clique pruning. The new dynamic clique maintenance algorithm explores a smaller search space and runs faster than the Ottosen and Vomlel approach; this improvement decreases the overhead cost of the DFS algorithm. The pivot clique pruning reduces the size of the search space by a factor of O(n²). Our empirical results show that our proposed algorithm finds an optimal triangulation markedly faster than the state-of-the-art algorithm does.
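
    A minimal sketch of the idea behind the new dynamic clique maintenance, not the thesis implementation: after inserting a fill edge (a, b), every newly created maximal clique must contain both endpoints, so it suffices to enumerate cliques inside the common neighborhood of a and b rather than in the whole graph. The use of networkx here is our assumption for illustration only.

```python
import networkx as nx

def new_cliques_after_edge(G, a, b):
    """Insert edge (a, b) and return the maximal cliques it creates.
    Any new maximal clique contains both a and b, hence lies inside
    the common neighborhood of a and b plus the endpoints themselves,
    so only that (usually small) induced subgraph is searched."""
    G.add_edge(a, b)
    region = (set(G[a]) & set(G[b])) | {a, b}
    sub = G.subgraph(region)
    return [set(c) for c in nx.find_cliques(sub) if {a, b} <= set(c)]

G = nx.cycle_graph(5)                    # 5-cycle, no triangles yet
print(new_cliques_after_edge(G, 0, 2))   # [{0, 1, 2}]
```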

    Online Spectral Clustering on Network Streams

    Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the analysis of this new type of problem imposes additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements is a non-trivial effort. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most existing spectral methods on evolving networks are off-line methods that use standard eigensystem solvers, such as the Lanczos method, and therefore need to recompute solutions from scratch at each time point. The second challenge is the parallelization of the algorithms. Parallelizing such algorithms is non-trivial, since standard eigensolvers are iterative and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, existing methods have multiple limitations, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we propose a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy. We derived our spectrum approximation techniques, GEPT and EEPT, through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We equip our clustering method with a new metric to track accumulated approximation errors and measure short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to conditions of unexpected drastic noise. In addition, we discuss our preliminary work on approximate graph mining with evolutionary processes, non-stationary Bayesian network structure learning from non-stationary time series data, and Bayesian network structure learning with text priors imposed by non-parametric hierarchical topic modeling.
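
    The incremental updates described above rest on classical matrix perturbation theory. As a hedged illustration (GEPT and EEPT themselves are not reproduced here), the sketch below applies the standard first-order perturbation formulas for a symmetric matrix A + ΔA, reusing the eigenpairs of A to approximate those of the perturbed matrix without re-solving the eigensystem from scratch.

```python
import numpy as np

def perturb_eigenpairs(eigvals, eigvecs, delta):
    """First-order perturbation update for a symmetric matrix: given
    the eigenpairs of A, approximate those of A + delta. Eigenvalue
    shift: v_i' delta v_i; eigenvector correction: sum over j != i of
    (v_j' delta v_i) / (lambda_i - lambda_j) * v_j."""
    n = len(eigvals)
    new_vals = eigvals + np.einsum("ij,jk,ki->i", eigvecs.T, delta, eigvecs)
    new_vecs = eigvecs.copy()
    for i in range(n):
        for j in range(n):
            if j != i:
                gap = eigvals[i] - eigvals[j]
                if abs(gap) > 1e-12:
                    coef = eigvecs[:, j] @ delta @ eigvecs[:, i] / gap
                    new_vecs[:, i] = new_vecs[:, i] + coef * eigvecs[:, j]
    new_vecs /= np.linalg.norm(new_vecs, axis=0)  # re-normalize columns
    return new_vals, new_vecs

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5)); A = (A + A.T) / 2
dA = 1e-3 * rng.normal(size=(5, 5)); dA = (dA + dA.T) / 2
vals, vecs = np.linalg.eigh(A)
approx_vals, _ = perturb_eigenpairs(vals, vecs, dA)
# Error against a full re-solve is second order in the perturbation.
print(np.max(np.abs(approx_vals - np.linalg.eigvalsh(A + dA))))
```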

    Finding Optimal Tree Decompositions

    The task of organizing a given graph into a structure called a tree decomposition is relevant in multiple areas of computer science. In particular, many NP-hard problems can be solved in polynomial time if a suitable tree decomposition of a graph describing the problem instance is given as part of the input. This motivates the task of finding as good a tree decomposition as possible or, ideally, an optimal tree decomposition. This thesis is about finding optimal tree decompositions of graphs with respect to several notions of optimality, each of which measures the quality of a tree decomposition in the context of an application. In particular, we consider a total of seven problems that are formulated as finding optimal tree decompositions: treewidth, minimum fill-in, generalized and fractional hypertreewidth, total table size, phylogenetic character compatibility, and treelength. For each of these problems we consider the BT algorithm of BouchittĂ© and Todinca as the method of finding optimal tree decompositions. The BT algorithm is well known on the theoretical side, but to our knowledge it was first implemented only recently, for the 2nd Parameterized Algorithms and Computational Experiments Challenge (PACE 2017). The author’s implementation of the BT algorithm took second place in the minimum fill-in track of PACE 2017. In this thesis we review and extend the BT algorithm and our implementation. In particular, we improve the efficiency of the algorithm in terms of both theory and practice. We also implement the algorithm for each of the seven problems considered, introducing a novel adaptation of the algorithm for the maximum compatibility problem of phylogenetic characters. Our implementation outperforms alternative state-of-the-art approaches in terms of the number of test instances solved on well-known benchmarks for minimum fill-in, generalized hypertreewidth, fractional hypertreewidth, total table size, and the maximum compatibility problem of phylogenetic characters. Furthermore, to our knowledge, the implementation is the first exact approach for the treelength problem.
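
    For concreteness, the sketch below, ours rather than part of the BT implementation, checks the three defining properties of a tree decomposition (vertex coverage, edge coverage, and connectedness of the bags containing each vertex) and reports its width, the largest bag size minus one.

```python
def width(bags):
    """Width of a tree decomposition: largest bag size minus one."""
    return max(len(b) for b in bags.values()) - 1

def is_tree_decomposition(graph, bags, tree_edges):
    """Verify the three tree-decomposition properties for a graph
    given as an adjacency dict, bags as a dict node -> vertex set,
    and tree_edges as pairs of bag nodes."""
    nodes = set(graph)
    # 1. Every vertex of the graph appears in some bag.
    if nodes - set().union(*bags.values()):
        return False
    # 2. Every edge of the graph is covered by some bag.
    for u, nbrs in graph.items():
        for v in nbrs:
            if not any({u, v} <= b for b in bags.values()):
                return False
    # 3. The bags containing each vertex form a connected subtree.
    for v in nodes:
        holding = {i for i, b in bags.items() if v in b}
        seen = {next(iter(holding))}
        frontier = list(seen)
        while frontier:
            i = frontier.pop()
            for a, b in tree_edges:
                j = b if a == i else a if b == i else None
                if j in holding and j not in seen:
                    seen.add(j)
                    frontier.append(j)
        if seen != holding:
            return False
    return True

# 4-cycle a-b-c-d; an optimal decomposition has width 2.
graph = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
bags = {0: {"a", "b", "c"}, 1: {"a", "c", "d"}}
print(is_tree_decomposition(graph, bags, [(0, 1)]), width(bags))  # True 2
```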

    Generalised temporal network inference

    Network inference is becoming increasingly central in the analysis of complex phenomena, as it yields understandable models of the interactions between entities. Among the many possible graphical models, Markov Random Fields are widely used, since they are tied to a probability distribution assumption that can model a variety of different data. The inference of such models can be guided by two priors: sparsity and non-stationarity. In other words, only a few connections are necessary to explain the phenomenon under observation, and, as the phenomenon evolves, the underlying connections that explain it may change accordingly. This thesis contains two general methods for the inference of temporal graphical models that rely deeply on the concept of temporal consistency, i.e., the underlying structure of the system is similar (i.e., consistent) at time points that model the same behaviour (i.e., are dependent). The first contribution is a model that is flexible in terms of probability assumption, temporal consistency, and dependency. The second contribution studies the previously introduced model in the presence of partially unobserved Gaussian data. Indeed, the presence of unobserved data must be tackled explicitly in order to avoid introducing misrepresentations into the inferred graphical model. All extensions are coupled with fast and non-trivial minimisation algorithms that are extensively validated on synthetic and real-world data. These algorithms and experiments are implemented in a large and well-designed Python library that comprises many tools for the modelling of multivariate data. Lastly, all the presented models have many hyper-parameters that need to be tuned on data. In this regard, we analyse different model selection strategies, showing that a stability-based approach performs best in the presence of multiple networks and multiple hyper-parameters.
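
    The thesis's joint estimator and its library are not reproduced here. As a rough sketch of the setting only, the following estimates a sparse precision matrix independently per time window with scikit-learn's GraphicalLasso and then measures the structural drift between consecutive windows, which a temporal-consistency prior would instead penalize inside a joint objective; all data and parameters are hypothetical.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical setup: T windows of samples over the same variables.
rng = np.random.default_rng(0)
T, n_samples, n_vars = 4, 200, 6
windows = [rng.normal(size=(n_samples, n_vars)) for _ in range(T)]

# Independently estimated sparse precision matrix per window.
precisions = []
for X in windows:
    model = GraphicalLasso(alpha=0.1).fit(X)
    precisions.append(model.precision_)

# Temporal consistency: how much does the inferred structure drift
# between consecutive time points?
for t in range(1, T):
    drift = np.linalg.norm(precisions[t] - precisions[t - 1], ord="fro")
    print(f"||Theta_{t} - Theta_{t-1}||_F = {drift:.3f}")
```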

    Fast and accurate supertrees: towards large scale phylogenies

    Phylogenetics is the study of evolutionary relationships between biological entities; phylogenetic trees (phylogenies) are a visualization of these evolutionary relationships. Accurate approaches to reconstruct phylogenies from sequence data usually result in NP-hard optimization problems, hence local search heuristics have to be applied in practice. These methods are highly accurate and fast enough as long as the input data is not too large. Divide-and-conquer techniques are a promising approach to boost the scalability and accuracy of those local search heuristics on very large datasets. A divide-and-conquer method breaks down a large phylogenetic problem into smaller sub-problems that are computationally easier to solve. The sub-problems (overlapping trees) are then combined using a supertree method. Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there are methods that encode the source trees in a matrix, and the supertree is constructed by applying a local search heuristic that optimizes the respective objective function. The most widely used supertree methods rely on such local search heuristics. However, to really improve the scalability of accurate tree reconstruction by divide-and-conquer approaches, accurate polynomial-time methods are needed for the supertree reconstruction step. In this work, we present approaches for accurate polynomial-time supertree reconstruction, in particular Bad Clade Deletion (BCD), a novel heuristic supertree algorithm with polynomial running time. BCD uses minimum cuts to greedily delete a locally minimal number of columns from a matrix representation to make it compatible. Different from local search heuristics, it is guaranteed to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees if one exists. BCD can take support values of the source trees into account without an increase in complexity. We show how reliable clades can be used to restrict the search space for BCD and how those clades can be collected from the input data using the Greedy Strict Consensus Merger. Finally, we introduce a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of BCD with the beam search extension is still polynomial. We present an exact and a randomized subroutine to generate suboptimal partial solutions. In a thorough evaluation on several simulated and biological datasets against a representative set of supertree methods, we found that BCD is more accurate than the most accurate supertree methods when using support values and search space restriction on simulated data. Simultaneously, BCD is faster than any other evaluated method. The beam search approach improved the accuracy of BCD on all evaluated datasets, at the cost of speed. We found that BCD supertrees can boost maximum likelihood tree reconstruction when used as starting trees. Further, BCD could handle large-scale datasets on which local search heuristics did not converge in reasonable time. Due to its combination of speed, accuracy, and the ability to reconstruct the parent tree if one exists, BCD is a promising approach to enable outstanding scalability of divide-and-conquer approaches.
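
    A small sketch of the compatibility notion BCD operates on (the minimum-cut column selection itself is not reproduced): in a matrix representation with taxa as rows and clades from the input trees as columns, two binary characters admit a common directed perfect phylogeny exactly when their taxa sets are disjoint or nested.

```python
def compatible(col_i, col_j):
    """Pairwise compatibility of binary characters in a rooted matrix
    representation: the taxa sets of the two columns must be disjoint
    or nested for a directed perfect phylogeny to exist."""
    a = {t for t, s in enumerate(col_i) if s == 1}
    b = {t for t, s in enumerate(col_j) if s == 1}
    return a.isdisjoint(b) or a <= b or b <= a

def conflicting_pairs(matrix):
    """All pairs of columns that cannot coexist in one directed
    perfect phylogeny; BCD resolves such conflicts by deleting
    columns chosen via minimum cuts (not reproduced here)."""
    cols = list(zip(*matrix))
    return [(i, j)
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if not compatible(cols[i], cols[j])]

# Rows are taxa, columns are clades from the input trees.
matrix = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
print(conflicting_pairs(matrix))  # [(0, 2)]: overlap without nesting
```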

    Automatic Segmentation of Cells of Different Types in Fluorescence Microscopy Images

    Recognition of different cell compartments, types of cells, and their interactions is a critical aspect of quantitative cell biology. It provides valuable insight into cellular and subcellular interactions and into the mechanisms of biological processes such as cancer cell dissemination, organ development, and wound healing. Quantitative analysis of cell images is also the mainstay of numerous clinical diagnostic and grading procedures, for example in cancer, immunological, infectious, heart, and lung disease. Automating the quantification of cellular biological samples requires segmenting different cellular and sub-cellular structures in microscopy images. However, automating this problem has proven to be non-trivial and requires solving multi-class image segmentation tasks that are challenging owing to the high similarity of objects from different classes and to irregularly shaped structures. This thesis focuses on the development and application of probabilistic graphical models to multi-class cell segmentation. Graphical models can improve segmentation accuracy through their ability to exploit prior knowledge and model inter-class dependencies. Directed acyclic graphs, such as trees, have been widely used to model top-down statistical dependencies as a prior for improved image segmentation. However, trees can capture only a few inter-class constraints. To overcome this limitation, this thesis proposes polytree graphical models, which capture label proximity relations more naturally than tree-based approaches. Polytrees can effectively impose prior knowledge on the inclusion of different classes by capturing both same-level and across-level dependencies. A novel recursive mechanism based on two-pass message passing is developed to efficiently calculate closed-form posteriors of graph nodes on polytrees. Furthermore, since an accurate and sufficiently large ground truth is not always available for training segmentation algorithms, a weakly supervised framework is developed that employs polytrees for multi-class segmentation and reduces the need for training by modeling prior knowledge during segmentation. Generating a hierarchical graph over the superpixels of the image, labels of nodes are inferred through a novel efficient message-passing algorithm, and the model parameters are optimized with Expectation Maximization (EM). Results of evaluation on the segmentation of simulated data and multiple publicly available fluorescence microscopy datasets indicate that the proposed method outperforms the state of the art. The proposed method has also been assessed in predicting possible segmentation errors and has been shown to outperform trees. This can pave the way to calculating uncertainty measures on the resulting segmentation and to guiding subsequent segmentation refinement, which can be useful in the development of an interactive segmentation framework.
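
    As a minimal illustration of two-pass message passing (the thesis's closed-form polytree posteriors are not reproduced), the sketch below runs the collect pass of sum-product inference on a tiny tree of binary label nodes; polytree inference generalizes this to nodes with several parents.

```python
import numpy as np

# Tree: node 0 is the root, nodes 1 and 2 are its children.
children = {0: [1, 2], 1: [], 2: []}
prior = np.array([0.6, 0.4])                  # P(x_root)
cpt = np.array([[0.8, 0.2], [0.3, 0.7]])      # P(x_child | x_parent)
evidence = {1: np.array([1.0, 0.0])}          # node 1 observed in state 0

def upward(v):
    """Collect pass: likelihood message from v's subtree toward the
    root, marginalizing out each child's states."""
    lam = evidence.get(v, np.ones(2))
    for c in children[v]:
        lam = lam * (cpt @ upward(c))
    return lam

def posterior_root():
    """Root posterior after the collect pass; a distribute pass
    (omitted) would propagate beliefs back down to the leaves."""
    belief = prior * upward(0)
    return belief / belief.sum()

print(posterior_root())  # [0.8 0.2]
```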
    • 

    corecore