The Weight Function in the Subtree Kernel is Decisive
Tree data are ubiquitous because they model a large variety of situations,
e.g., the architecture of plants, the secondary structure of RNA, or the
hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data
is difficult per se. In this paper, we focus on the subtree kernel that is a
convolution kernel for tree data introduced by Vishwanathan and Smola in the
early 2000s. More precisely, we investigate the influence of the weight
function from a theoretical perspective and in real data applications. We
establish, on a two-class stochastic model, that the performance of the subtree
kernel is improved when the weight of leaves vanishes, which motivates the
definition of a new weight function, learned from the data and not fixed by the
user as usually done. To this end, we define a unified framework for computing
the subtree kernel from ordered or unordered trees, which is particularly
suitable for tuning parameters. We show through eight real data classification
problems the great efficiency of our approach, in particular for small
datasets, which also underlines the high importance of the weight function.
Finally, a visualization tool of the significant features is derived.
Comment: 36 pages
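The subtree kernel described above can be sketched as a weighted sum over the complete subtrees shared by two trees. The toy Python below is an assumed form for illustration, not the authors' implementation: it counts each subtree type via a canonical signature and weights matches by a user-supplied function of the subtree height, so a weight vanishing at height 0 reproduces the "weight of leaves vanishes" setting.

```python
from collections import Counter

def subtree_kernel(t1, t2, weight):
    """Weighted subtree kernel sketch for unordered trees.

    A tree is a tuple of its children (a leaf is the empty tuple).
    K(t1, t2) = sum over shared subtree types s of
                weight(height(s)) * count(s, t1) * count(s, t2).
    """
    bag1, bag2 = Counter(), Counter()
    height_of = {}  # canonical signature -> height of that subtree type

    def collect(tree, bag):
        # Canonical signature: sorted signatures of the children, so that
        # reordering children does not change the signature.
        infos = sorted(collect(child, bag) for child in tree)
        sig = "(" + "".join(s for s, _ in infos) + ")"
        h = 1 + max((h for _, h in infos), default=-1)  # leaves have height 0
        bag[sig] += 1
        height_of[sig] = h
        return sig, h

    collect(t1, bag1)
    collect(t2, bag2)
    return sum(weight(height_of[s]) * bag1[s] * bag2[s]
               for s in bag1.keys() & bag2.keys())

# With a constant weight, every shared leaf and subtree counts;
# with weight(h) = h, leaves contribute nothing, as advocated above.
k_const = subtree_kernel(((), ()), ((),), lambda h: 1.0)          # 2.0
k_leafless = subtree_kernel(((), ()), ((),), lambda h: float(h))  # 0.0
```

The learned weight function of the paper would replace the fixed `weight` argument; the sketch only shows where the weight enters the computation.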
Revisiting Tree Isomorphism: AHU Algorithm with Prime Numbers
The AHU algorithm has been the state of the art since the 1970s for
determining in linear time whether two unordered rooted trees are isomorphic or
not. However, it has been criticized (by Campbell and Radford) for the way it
is written, which requires several (re)readings to be understood, and does not
facilitate its analysis. In this paper, we propose an alternative version of
the AHU algorithm, which addresses this issue by being designed to be clearer
to understand and implement, with the same theoretical complexity and equally
fast in practice. Whereas the key to the linearity of the original algorithm
lay in the careful sorting of lists of integers, we replace this step with the
multiplication of lists of prime numbers, and prove that this substitution
causes no loss in the complexity of the new algorithm.
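The prime-number idea can be sketched as follows: give every distinct subtree shape an integer identifier, assign each identifier its own prime, and characterise a node by the product of its children's primes. Since multiplication is commutative and factorisation is unique, no sorting of children is needed. This toy Python version is an illustration under assumed conventions, using big-integer products rather than the paper's linear-time bookkeeping:

```python
from itertools import count

def prime_stream():
    """Naive incremental prime generator; fine for small trees."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

class TreeCertifier:
    """Two unordered rooted trees get equal certificates iff isomorphic."""

    def __init__(self):
        self._primes = prime_stream()
        self._prime_of_id = []  # prime assigned to each known subtree shape
        self._id_of_key = {}    # product of children primes -> shape id

    def certify(self, tree):
        # A tree is a tuple of its children; a leaf is the empty tuple.
        # The key is order-independent because multiplication commutes,
        # and injective on children multisets by unique factorisation.
        key = 1
        for child in tree:
            key *= self._prime_of_id[self.certify(child)]
        if key not in self._id_of_key:
            self._id_of_key[key] = len(self._prime_of_id)
            self._prime_of_id.append(next(self._primes))
        return self._id_of_key[key]

certifier = TreeCertifier()
# The same two subtrees attached in either order: isomorphic.
same = certifier.certify(((), ((), ()))) == certifier.certify((((), ()), ()))
```

Here `same` is `True`: swapping the children leaves the product, and hence the certificate, unchanged.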
Detection of Common Subtrees with Identical Label Distribution
Frequent pattern mining is a relevant method to analyse structured data, like
sequences, trees or graphs. It consists in identifying characteristic
substructures of a dataset. This paper deals with a new type of patterns for
tree data: common subtrees with identical label distribution. Their detection
is far from obvious since the underlying isomorphism problem is graph
isomorphism complete. An elaborate search algorithm is developed and analysed
from both theoretical and numerical perspectives. Based on this, the
enumeration of patterns is performed through a new lossless compression scheme
for trees, called DAG-RW, whose complexity is investigated as well. The method
shows very good properties, both in terms of computation times and analysis of
real datasets from the literature. Compared to other substructures like
topological subtrees and labelled subtrees for which the isomorphism problem is
linear, the patterns found provide a more parsimonious representation of the
data.
Comment: 40 pages
Enumeration of Unordered Forests
Reverse search is a convenient method for enumerating structured objects,
which can be used both to address theoretical issues and to solve data mining
problems. This method has already been successfully developed to handle
unordered trees. While the literature proposes solutions to enumerate
singletons of trees, we study in this article a more general problem: the
enumeration of sets of trees, that is, forests. By compressing each forest
into a Directed Acyclic Graph (DAG), we develop a reverse-search-like method
to enumerate the DAGs compressing forests. Remarkably, we prove that these
DAGs are in bijection with the row-Fishburn matrices, a well-studied class of
combinatorial objects. In a second step, we adapt our forest enumeration to
provide algorithms for tackling two related problems: (i) the enumeration of
"subforests" of a forest, and (ii) the frequent "subforest" mining problem.
All the methods presented in this article enumerate each item uniquely, up to
isomorphism.
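The DAG compression used above merges all isomorphic subtrees of a forest into a single shared node. The following minimal sketch (illustrative only, not the article's algorithm) does this for unordered, unlabeled trees:

```python
def compress_to_dag(forest):
    """Losslessly compress a forest by sharing isomorphic unordered subtrees.

    Each tree is a tuple of its children (a leaf is the empty tuple).
    Returns (roots, nodes): nodes[i] is the sorted tuple of child node ids of
    DAG node i, and roots gives the node id of each tree of the forest.
    """
    nodes = []   # nodes[i] = tuple of children ids of DAG node i
    id_of = {}   # children-id tuple -> node id (one node per subtree shape)

    def build(tree):
        # Sorting the children ids makes the key order-independent.
        key = tuple(sorted(build(child) for child in tree))
        if key not in id_of:
            id_of[key] = len(nodes)
            nodes.append(key)
        return id_of[key]

    return [build(tree) for tree in forest], nodes

# A forest with 5 tree nodes but only 3 distinct subtree shapes
# (leaf, root-with-two-leaves, root-with-one-leaf) gives a 3-node DAG.
roots, nodes = compress_to_dag([((), ()), ((),)])  # roots == [1, 2]
```

The compression is lossless: each original tree can be re-expanded by following children ids from its root.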
Sur la similarité des arbres : l'intérêt des méthodes d'énumération et de compression (On tree similarity: the value of enumeration and compression methods)
Tree data appear naturally in many scientific domains. Their intrinsically non-Euclidean nature and the combinatorial explosion phenomenon make their analysis delicate. In this thesis, we focus on three approaches to compare trees, notably through the prism of a lossless compression technique of trees into directed acyclic graphs. First, concerning tree isomorphism, we consider an extension of the classical definition to labeled trees, which requires that trees be identical up to label rewriting. This problem is as hard as graph isomorphism, and we have developed an algorithm that drastically reduces the size of the solution search space, which is then explored with a backtracking strategy. When two trees are different, we may try to find common substructures. While this question has already been addressed for subtrees, we are interested in a larger problem, namely finding sets of subtrees appearing simultaneously. This leads us to consider forest enumeration, for which we propose a reverse search algorithm that constructs an enumeration tree whose branching factor is linear. Finally, from a list of common substructures, one can build a convolution kernel allowing us to tackle classification problems. We consider the subtree kernel from the literature, and build an algorithm that explicitly enumerates subtrees (unlike the original method). In particular, our approach allows us to parameterize the kernel more finely, significantly improving its classification abilities.
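The labeled-tree isomorphism problem mentioned above (trees identical up to a bijective rewriting of labels) can be illustrated with a small backtracking search that extends a partial label bijection while matching children. This is only an exponential toy version of the idea; the thesis first shrinks the search space before backtracking, and the tree encoding here (a `(label, children)` pair) is an assumption made for the illustration.

```python
from itertools import permutations

def iso_up_to_relabeling(t1, t2, bij=None):
    """Return a label bijection making t1 and t2 identical, or None.

    A labeled unordered tree is a pair (label, children), where children is
    a tuple of trees. `bij` is the partial mapping from t1-labels to
    t2-labels built so far; it is extended only when consistent and injective.
    """
    bij = dict(bij or {})
    (a, kids1), (b, kids2) = t1, t2
    if len(kids1) != len(kids2):
        return None
    # Reject if `a` is already mapped elsewhere or `b` is already taken.
    if bij.get(a, b) != b or any(v == b and k != a for k, v in bij.items()):
        return None
    bij[a] = b
    if not kids1:
        return bij
    # Backtrack over all pairings of the children.
    for perm in permutations(kids2):
        cur = bij
        for c1, c2 in zip(kids1, perm):
            cur = iso_up_to_relabeling(c1, c2, cur)
            if cur is None:
                break
        else:
            return cur
    return None

leaf = lambda l: (l, ())
# Rewriting x -> a and y -> b turns the first tree into the second.
found = iso_up_to_relabeling(('x', (leaf('y'), leaf('y'))),
                             ('a', (leaf('b'), leaf('b'))))
```

Here `found` is `{'x': 'a', 'y': 'b'}`; when no consistent bijection exists, the function returns `None`.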