
    The Weight Function in the Subtree Kernel is Decisive

    Get PDF
    Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish, on a two-class stochastic model, that the performance of the subtree kernel is improved when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data rather than fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees that is particularly suitable for tuning parameters. Through eight real-data classification problems we show the great efficiency of our approach, in particular for small datasets, which also demonstrates the high importance of the weight function. Finally, a visualization tool for the significant features is derived. Comment: 36 pages
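
    For intuition, the kernel referred to above can be written, up to notation, as K(T1, T2) = sum over complete rooted subtrees t of w(t) * N_t(T1) * N_t(T2), where N_t(T) counts the occurrences of t in T. The sketch below is a minimal illustration under stated assumptions: trees are nested (label, children) tuples with string labels, subtrees are matched through a sorted canonical string (so trees are treated as unordered), and the default height-based weight, which sends leaves to zero, merely echoes the theoretical result mentioned in the abstract. It is neither the authors' DAG-based implementation nor their learned weight function.

        from collections import Counter

        def collect(tree, acc):
            """Record every complete rooted subtree of `tree` in the Counter `acc`,
            keyed by (canonical string, height); return the root's (string, height)."""
            label, children = tree
            sigs = [collect(c, acc) for c in children]
            height = 1 + max([h for _, h in sigs], default=-1)
            # Sorting the child signatures makes the encoding order-invariant (unordered trees).
            sig = label + "(" + ",".join(sorted(s for s, _ in sigs)) + ")"
            acc[(sig, height)] += 1
            return sig, height

        def subtree_kernel(t1, t2, weight=lambda height: 0.0 if height == 0 else 0.5 ** height):
            """K(T1, T2) = sum over shared complete subtrees of weight(height) * count1 * count2."""
            c1, c2 = Counter(), Counter()
            collect(t1, c1)
            collect(t2, c2)
            return sum(weight(h) * c1[(s, h)] * c2[(s, h)]
                       for (s, h) in c1.keys() & c2.keys())

        # Two isomorphic unordered trees written with a different child order:
        t1 = ("A", (("B", ()), ("C", (("B", ()),))))
        t2 = ("A", (("C", (("B", ()),)), ("B", ())))
        print(subtree_kernel(t1, t2))  # 0.75 with the default height-based weight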

    Revisiting Tree Isomorphism: AHU Algorithm with Prime Numbers

    Full text link
    The AHU algorithm has been the state of the art since the 1970s for deciding in linear time whether two unordered rooted trees are isomorphic. However, it has been criticized (by Campbell and Radford) for the way it is written, which requires several (re)readings to be understood and does not facilitate its analysis. In this paper, we propose an alternative version of the AHU algorithm that addresses this issue: it is designed to be clearer to understand and implement, with the same theoretical complexity, and is equally fast in practice. Whereas the key to the linearity of the original algorithm lay in the careful sorting of lists of integers, we replace this step by the multiplication of lists of prime numbers, and prove that this substitution causes no loss in the final complexity of the new algorithm.
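
    To make the prime-number device concrete, here is a hedged sketch applied to unlabeled unordered rooted trees encoded as nested tuples: each isomorphism class of subtrees receives a fresh prime, and a node's class is determined by the product of its children's primes, since a product of primes identifies a multiset uniquely. This only illustrates the principle; it is not the paper's linear-time implementation (Python big-integer products are not constant-time operations).

        from itertools import count

        def primes():
            """Naive incremental prime generator (sufficient for an illustration)."""
            found = []
            for n in count(2):
                if all(n % p for p in found):
                    found.append(n)
                    yield n

        def ahu_prime_code(tree, table, gen):
            """Canonical integer code of an unlabeled unordered rooted tree given as a
            nested tuple of children. Two trees get the same code iff they are isomorphic."""
            # The product of the children's primes identifies their multiset uniquely,
            # so it can play the role of the sorted list in the original AHU algorithm.
            product = 1
            for child in tree:
                product *= ahu_prime_code(child, table, gen)
            if product not in table:
                table[product] = next(gen)   # one fresh prime per isomorphism class
            return table[product]

        def isomorphic(t1, t2):
            table, gen = {}, primes()        # shared so that the codes are comparable
            return ahu_prime_code(t1, table, gen) == ahu_prime_code(t2, table, gen)

        t1 = ((), ((),), ())   # root with children: leaf, node(leaf), leaf
        t2 = (((),), (), ())   # same multiset of children, different order
        print(isomorphic(t1, t2))  # True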

    The Weight Function in the Subtree Kernel is Decisive

    Get PDF
    Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. We establish, on a two-class stochastic model, that the performance of the subtree kernel is improved when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data rather than fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees that is particularly suitable for tuning parameters. Through two real-data classification problems we show the great efficiency of our approach, in particular with respect to the ones considered in the literature, which also demonstrates the high importance of the weight function. Finally, a visualization tool for the significant features is derived. Comment: 28 pages

    Detection of Common Subtrees with Identical Label Distribution

    Full text link
    Frequent pattern mining is a relevant method for analysing structured data such as sequences, trees, or graphs. It consists in identifying characteristic substructures of a dataset. This paper deals with a new type of pattern for tree data: common subtrees with identical label distribution. Their detection is far from obvious since the underlying isomorphism problem is graph-isomorphism-complete. An elaborate search algorithm is developed and analysed from both theoretical and numerical perspectives. Based on this, the enumeration of patterns is performed through a new lossless compression scheme for trees, called DAG-RW, whose complexity is investigated as well. The method shows very good properties, both in terms of computation times and in the analysis of real datasets from the literature. Compared to other substructures like topological subtrees and labelled subtrees, for which the isomorphism problem is linear, the patterns found provide a more parsimonious representation of the data. Comment: 40 pages
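
    As a rough illustration of the kind of signature such a search can start from, the sketch below buckets the complete subtrees of a small corpus by their unlabeled shape together with their label multiset. Since the abstract stresses that the underlying isomorphism problem is graph-isomorphism-complete, equal buckets should be read as candidate patterns only; this is an assumed pre-filter for the example, not the paper's search algorithm or its DAG-RW scheme.

        from collections import Counter, defaultdict

        def shape_and_labels(tree, out):
            """Tree = (label, (children...)). Return the (unlabeled shape, label Counter)
            of the root subtree and record every complete subtree in `out`."""
            label, children = tree
            child_results = [shape_and_labels(c, out) for c in children]
            shape = "(" + "".join(sorted(s for s, _ in child_results)) + ")"
            labels = Counter({label: 1})
            for _, c in child_results:
                labels.update(c)
            out[(shape, tuple(sorted(labels.items())))].append(tree)
            return shape, labels

        def candidate_patterns(trees):
            """Bucket complete subtrees sharing an unlabeled shape and a label distribution."""
            buckets = defaultdict(list)
            for t in trees:
                shape_and_labels(t, buckets)
            return {key: subtrees for key, subtrees in buckets.items() if len(subtrees) > 1}

        t1 = ("r", (("a", ()), ("b", ())))
        t2 = ("r", (("b", ()), ("a", ())))
        print(len(candidate_patterns([t1, t2])))  # 3 buckets: the two roots and each pair of matching leaves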

    Enumeration of Unordered Forests

    Full text link
    Reverse search is a convenient method for enumerating structured objects; it can be used both to address theoretical issues and to solve data mining problems. This method has already been successfully developed to handle unordered trees. While the literature proposes solutions for enumerating single trees, we study in this article a more general problem: the enumeration of sets of trees, i.e., forests. By compressing each forest into a Directed Acyclic Graph (DAG), we develop a reverse-search-like method to enumerate the DAGs compressing forests. Remarkably, we prove that these DAGs are in bijection with row-Fishburn matrices, a well-studied class of combinatorial objects. In a second step, we build on our forest enumeration to provide algorithms for tackling two related problems: (i) the enumeration of "subforests" of a forest, and (ii) the frequent "subforest" mining problem. All the methods presented in this article enumerate each item uniquely, up to isomorphism.
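
    For readers unfamiliar with the paradigm, the following is a generic reverse-search skeleton in the sense of Avis and Fukuda, together with a toy instance (enumerating the subsets of a finite set). It only illustrates the traversal of an implicit enumeration tree obtained by inverting a parent function; the paper's parent function on DAG-compressed forests is not reproduced here.

        def reverse_search(root, children):
            """Depth-first traversal of the implicit enumeration tree defined by `children`
            (the inverse of a parent/reduction map); each object is output exactly once
            provided `children` really defines a tree rooted at `root`."""
            stack = [root]
            while stack:
                obj = stack.pop()
                yield obj
                stack.extend(children(obj))

        # Toy instance: every non-empty subset's parent is the subset minus its maximum,
        # so children are obtained by adding an element larger than the current maximum.
        def subset_children(n):
            def children(s):
                start = max(s) + 1 if s else 0
                return [s | {i} for i in range(start, n)]
            return children

        all_subsets = list(reverse_search(frozenset(), subset_children(3)))
        print(len(all_subsets))  # 8: each subset of {0, 1, 2} is enumerated exactly once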

    Sur la similarité des arbres : l'intérêt des méthodes d'énumération et de compression (On the similarity of trees: the value of enumeration and compression methods)

    No full text
    Tree data appear naturally in many scientific domains. Their intrinsically non-Euclidean nature and the phenomenon of combinatorial explosion make their analysis delicate. In this thesis, we focus on three approaches to comparing trees, notably through the prism of a lossless compression technique that turns trees into directed acyclic graphs. First, concerning tree isomorphism, we consider an extension of the classical definition to labeled trees, which requires that the trees be identical up to label rewriting. This problem is as hard as graph isomorphism, and we have developed an algorithm that drastically reduces the size of the solution search space, which is then explored with a backtracking strategy. When two trees are different, we may try to find common substructures. While this question has already been addressed for subtrees, we are interested in a broader problem, namely finding sets of subtrees that appear simultaneously. This leads us to consider forest enumeration, for which we propose a reverse search algorithm that constructs an enumeration tree whose branching factor is linear. Finally, from a list of common substructures, one can build a convolution kernel that can be used to tackle classification problems. We consider the subtree kernel from the literature and build an algorithm that explicitly enumerates subtrees (unlike the original method). In particular, our approach allows us to parameterize the kernel more finely, significantly improving its classification abilities.
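
    The lossless tree-to-DAG compression that runs through the thesis can be pictured with the following minimal sketch: isomorphic complete subtrees are detected bottom-up and stored once, so the tree becomes a DAG of shared nodes. The (label, children) tuple encoding and the dictionary-based output are assumptions made for the illustration, and sorting the child ids treats trees as unordered (an ordered variant would simply keep the original child order); this is not the thesis's exact construction.

        def compress_to_dag(tree):
            """Return (root_id, nodes) where nodes[i] = (label, tuple of child ids) and
            isomorphic complete subtrees are represented by a single shared id."""
            index = {}   # canonical key -> node id
            nodes = []   # node id -> (label, child ids)

            def visit(t):
                label, children = t
                child_ids = tuple(sorted(visit(c) for c in children))  # sorted: unordered trees
                key = (label, child_ids)
                if key not in index:
                    index[key] = len(nodes)
                    nodes.append(key)
                return index[key]

            return visit(tree), nodes

        # The two 'C(B)' subtrees and the three 'B' leaves below collapse to shared nodes.
        t = ("A", (("C", (("B", ()),)), ("C", (("B", ()),)), ("B", ())))
        root, nodes = compress_to_dag(t)
        print(len(nodes))  # 3 distinct subtrees: B, C(B), and the root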
