    The Weight Function in the Subtree Kernel is Decisive

    Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is inherently difficult. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish on a two-class stochastic model that the performance of the subtree kernel improves when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data rather than fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees that is particularly suitable for tuning parameters. We show through eight real data classification problems the great efficiency of our approach, in particular for small datasets, which also confirms the importance of the weight function. Finally, a visualization tool of the significant features is derived. Comment: 36 pages
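
As a rough illustration of the idea in this abstract, the sketch below computes a subtree kernel for unlabeled, unordered trees: it counts the complete subtrees (a node together with all its descendants) shared by two trees and weights each match by a function of the subtree height. The tuple encoding of trees, the function names, and the height-based weight are assumptions made for this example, not the paper's framework or its learned weight function.

```python
from collections import Counter


def count_subtrees(tree, counts):
    """Record every complete subtree of `tree` in `counts`.

    A tree is a tuple of child trees; a leaf is the empty tuple ().
    Returns (canonical_string, height) for the subtree rooted here.
    Sorting the children's canonical strings makes the encoding
    independent of child order, i.e. trees are treated as unordered.
    """
    child_info = [count_subtrees(child, counts) for child in tree]
    canon = "(" + "".join(sorted(c for c, _ in child_info)) + ")"
    height = 1 + max((h for _, h in child_info), default=-1)  # leaves: 0
    counts[(canon, height)] += 1
    return canon, height


def subtree_kernel(t1, t2, weight):
    """Sum of weight(height) * N_s(t1) * N_s(t2) over shared subtrees s."""
    c1, c2 = Counter(), Counter()
    count_subtrees(t1, c1)
    count_subtrees(t2, c2)
    return sum(weight(key[1]) * n * c2[key]
               for key, n in c1.items() if key in c2)


# Two small example trees (hypothetical data).
t1 = ((), ((), ()))
t2 = (((), ()), ((), ()))

uniform = subtree_kernel(t1, t2, lambda h: 1.0)
no_leaves = subtree_kernel(t1, t2, lambda h: 0.0 if h == 0 else 1.0)
print(uniform, no_leaves)
```

With a constant weight, the twelve leaf-to-leaf matches dominate the kernel value; setting the leaf weight to zero leaves only the two matches between larger subtrees, which is the regime the abstract reports as beneficial.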

    Random enriched trees with applications to random graphs

    We establish limit theorems that describe the asymptotic local and global geometric behaviour of random enriched trees considered up to symmetry. We apply these general results to random unlabelled weighted rooted graphs and uniform random unlabelled $k$-trees that are rooted at a $k$-clique of distinguishable vertices. For both models we establish a Gromov--Hausdorff scaling limit, a Benjamini--Schramm limit, and a local weak limit that describes the asymptotic shape near the fixed root.

    EvoMiner: Frequent Subtree Mining in Phylogenetic Databases

    The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like level-wise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest common ancestor based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speed-ups of up to 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode, to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees when compared to maximum agreement subtrees and majority rule trees --- two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.

    Managing and analyzing phylogenetic databases

    The ever-growing availability of phylogenomic data makes it increasingly possible to study and analyze phylogenetic relationships across a wide range of species. Indeed, current phylogenetic analyses are now producing enormous collections of trees that vary greatly in size. Our proposed research addresses the challenges posed by storing, querying, and analyzing such phylogenetic databases. Our first contribution is the further development of STBase, a phylogenetic tree database consisting of a billion trees whose leaf sets range from four to 20000. STBase applies techniques from different areas of computer science for efficient tree storage and retrieval. It also introduces new ideas that are specific to tree databases. STBase provides a unique opportunity to explore innovative ways to analyze the results from queries on large sets of phylogenetic trees. We propose new ways of extracting consensus information from a collection of phylogenetic trees. Specifically, this involves extending the maximum agreement subtree problem. We greatly improve upon an existing approach based on frequent subtrees and propose two new approaches based on agreement subtrees and frequent subtrees, respectively. The final part of our proposed work deals with the problem of simplifying multi-labeled trees and handling rogue taxa. We propose a novel technique to extract conflict-free information from multi-labeled trees as a much smaller single-labeled tree. We show that the inherent problem in identifying rogue taxa is NP-hard and give fixed-parameter tractable and integer linear programming solutions.

    Fixed-parameter tractable canonization and isomorphism test for graphs of bounded treewidth

    We give a fixed-parameter tractable algorithm that, given a parameter $k$ and two graphs $G_1, G_2$, either concludes that one of these graphs has treewidth at least $k$, or determines whether $G_1$ and $G_2$ are isomorphic. The running time of the algorithm on an $n$-vertex graph is $2^{O(k^5\log k)}\cdot n^5$, and this is the first fixed-parameter algorithm for Graph Isomorphism parameterized by treewidth. Our algorithm in fact solves the more general canonization problem. Namely, we design a procedure working in $2^{O(k^5\log k)}\cdot n^5$ time that, for a given graph $G$ on $n$ vertices, either concludes that the treewidth of $G$ is at least $k$, or: * finds in an isomorphism-invariant way a graph $\mathfrak{c}(G)$ that is isomorphic to $G$; * finds an isomorphism-invariant construction term --- an algebraic expression that encodes $G$ together with a tree decomposition of $G$ of width $O(k^4)$. Hence, the isomorphism test reduces to verifying whether the computed isomorphic copies or the construction terms for $G_1$ and $G_2$ are equal. Comment: Full version of a paper presented at FOCS 201

    A Survey on Graph Kernels

    Graph kernels have become an established and widely used technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years. We describe and categorize graph kernels based on properties inherent to their design, such as the nature of their extracted graph features, their method of computation and their applicability to problems in practice. In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that simple baselines become competitive after this transformation on some datasets. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a data-driven categorization of kernels as a result. Finally, based on our experimental results, we derive a practitioner's guide to kernel-based graph classification.
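
The transformation mentioned in this abstract can be sketched as follows. Any kernel $k$ induces a squared distance $d^2(x, y) = k(x, x) + k(y, y) - 2k(x, y)$, and a Gaussian RBF kernel on top of it is $\exp(-\gamma\, d^2(x, y))$. The function name, the `gamma` parameterization, and the toy Gram matrix below are assumptions for illustration; the survey's exact setup may differ.

```python
import math


def rbf_from_gram(gram, gamma):
    """Apply a Gaussian RBF kernel to the metric induced by a base kernel.

    `gram` is a square list-of-lists Gram matrix of the base kernel.
    The induced squared distance is k(x,x) + k(y,y) - 2*k(x,y); it is
    clamped at 0 to guard against small negative rounding errors.
    """
    n = len(gram)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            sq_dist = max(gram[i][i] + gram[j][j] - 2.0 * gram[i][j], 0.0)
            row.append(math.exp(-gamma * sq_dist))
        out.append(row)
    return out


# Toy Gram matrix of a base kernel on two graphs (hypothetical values).
gram = [[2.0, 1.0],
        [1.0, 3.0]]
rbf = rbf_from_gram(gram, gamma=0.5)
print(rbf)
```

The result keeps ones on the diagonal, and every off-diagonal entry depends only on the induced distance between the two objects.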

    A Survey of Alternating Permutations

    This survey of alternating permutations and Euler numbers includes refinements of Euler numbers, other occurrences of Euler numbers, longest alternating subsequences, umbral enumeration of classes of alternating permutations, and the cd-index of the symmetric group. Comment: 32 pages, 7 figures
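
For readers unfamiliar with the objects named in this abstract, the sketch below shows the basic link between them: the number of alternating permutations of $\{1, \dots, n\}$ (here taken as $a_1 > a_2 < a_3 > \cdots$) is the $n$-th Euler number. It computes these numbers with the Entringer (Seidel) triangle recurrence and checks them against brute-force enumeration. The function names are illustrative and not taken from the survey.

```python
from itertools import permutations


def euler_zigzag(n_max):
    """Return Euler (zigzag) numbers E_0..E_n_max via the Entringer triangle.

    Recurrence: E(n, 0) = 0 for n > 0 and E(n, k) = E(n, k-1) + E(n-1, n-k);
    the n-th Euler number is E(n, n).
    """
    row = [1]          # row 0 of the triangle
    result = [1]
    for n in range(1, n_max + 1):
        new = [0]
        for k in range(1, n + 1):
            new.append(new[k - 1] + row[n - k])
        row = new
        result.append(row[-1])
    return result


def is_alternating(p):
    """True if p[0] > p[1] < p[2] > p[3] < ... holds."""
    return all((p[i] > p[i + 1]) if i % 2 == 0 else (p[i] < p[i + 1])
               for i in range(len(p) - 1))


def count_alternating(n):
    """Brute-force count of alternating permutations of 1..n."""
    return sum(1 for p in permutations(range(1, n + 1)) if is_alternating(p))


numbers = euler_zigzag(7)
print(numbers)
```

The brute-force counter is only practical for small $n$, but it gives an independent check of the recurrence.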