The Weight Function in the Subtree Kernel is Decisive
Tree data are ubiquitous because they model a large variety of situations,
e.g., the architecture of plants, the secondary structure of RNA, or the
hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data
is difficult per se. In this paper, we focus on the subtree kernel that is a
convolution kernel for tree data introduced by Vishwanathan and Smola in the
early 2000s. More precisely, we investigate the influence of the weight
function from a theoretical perspective and in real data applications. We
establish, on a two-class stochastic model, that the performance of the subtree
kernel is improved when the weight of leaves vanishes, which motivates the
definition of a new weight function, learned from the data and not fixed by the
user as usually done. To this end, we define a unified framework for computing
the subtree kernel from ordered or unordered trees, which is particularly
suitable for tuning parameters. We show through eight real data classification
problems the great efficiency of our approach, in particular for small
datasets, which also underlines the high importance of the weight function.
Finally, a visualization tool of the significant features is derived.
Comment: 36 pages
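The subtree kernel described above can be sketched as a weighted sum over the complete subtrees shared by two trees. The toy Python below is an assumed form for illustration, not the authors' implementation: it counts each subtree type via a canonical signature and weights matches by a user-supplied function of the subtree height, so a weight vanishing at height 0 reproduces the "weight of leaves vanishes" setting.

```python
from collections import Counter

def subtree_kernel(t1, t2, weight):
    """Weighted subtree kernel sketch for unordered trees.

    A tree is a tuple of its children (a leaf is the empty tuple).
    K(t1, t2) = sum over shared subtree types s of
                weight(height(s)) * count(s, t1) * count(s, t2).
    """
    bag1, bag2 = Counter(), Counter()
    height_of = {}  # canonical signature -> height of that subtree type

    def collect(tree, bag):
        # Canonical signature: sorted signatures of the children, so that
        # reordering children does not change the signature.
        infos = sorted(collect(child, bag) for child in tree)
        sig = "(" + "".join(s for s, _ in infos) + ")"
        h = 1 + max((h for _, h in infos), default=-1)  # leaves have height 0
        bag[sig] += 1
        height_of[sig] = h
        return sig, h

    collect(t1, bag1)
    collect(t2, bag2)
    return sum(weight(height_of[s]) * bag1[s] * bag2[s]
               for s in bag1.keys() & bag2.keys())

# With a constant weight, every shared leaf and subtree counts;
# with weight(h) = h, leaves contribute nothing, as advocated above.
k_const = subtree_kernel(((), ()), ((),), lambda h: 1.0)          # 2.0
k_leafless = subtree_kernel(((), ()), ((),), lambda h: float(h))  # 0.0
```

The learned weight function of the paper would replace the fixed `weight` argument; the sketch only shows where the weight enters the computation.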
Revisiting Tree Isomorphism: AHU Algorithm with Prime Numbers
The AHU algorithm has been the state of the art since the 1970s for
determining in linear time whether two unordered rooted trees are isomorphic or
not. However, it has been criticized (by Campbell and Radford) for the way it
is written, which requires several (re)readings to be understood, and does not
facilitate its analysis. In this paper, we propose an alternative version of
the AHU algorithm, which addresses this issue by being designed to be clearer
to understand and implement, with the same theoretical complexity and equally
fast in practice. Whereas the key to the linearity of the original algorithm
lay in the careful sorting of lists of integers, we replace this step with the
multiplication of lists of prime numbers, and prove that this substitution
causes no loss in the complexity of the new algorithm.
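The prime-number idea can be sketched as follows: give every distinct subtree shape an integer identifier, assign each identifier its own prime, and characterise a node by the product of its children's primes. Since multiplication is commutative and factorisation is unique, no sorting of children is needed. This toy Python version is an illustration under assumed conventions, using big-integer products rather than the paper's linear-time bookkeeping:

```python
from itertools import count

def prime_stream():
    """Naive incremental prime generator; fine for small trees."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

class TreeCertifier:
    """Two unordered rooted trees get equal certificates iff isomorphic."""

    def __init__(self):
        self._primes = prime_stream()
        self._prime_of_id = []  # prime assigned to each known subtree shape
        self._id_of_key = {}    # product of children primes -> shape id

    def certify(self, tree):
        # A tree is a tuple of its children; a leaf is the empty tuple.
        # The key is order-independent because multiplication commutes,
        # and injective on children multisets by unique factorisation.
        key = 1
        for child in tree:
            key *= self._prime_of_id[self.certify(child)]
        if key not in self._id_of_key:
            self._id_of_key[key] = len(self._prime_of_id)
            self._prime_of_id.append(next(self._primes))
        return self._id_of_key[key]

certifier = TreeCertifier()
# The same two subtrees attached in either order: isomorphic.
same = certifier.certify(((), ((), ()))) == certifier.certify((((), ()), ()))
```

Here `same` is `True`: swapping the children leaves the product, and hence the certificate, unchanged.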
Detection of Common Subtrees with Identical Label Distribution
Frequent pattern mining is a relevant method to analyse structured data, like
sequences, trees or graphs. It consists in identifying characteristic
substructures of a dataset. This paper deals with a new type of patterns for
tree data: common subtrees with identical label distribution. Their detection
is far from obvious since the underlying isomorphism problem is graph
isomorphism complete. An elaborate search algorithm is developed and analysed
from both theoretical and numerical perspectives. Based on this, the
enumeration of patterns is performed through a new lossless compression scheme
for trees, called DAG-RW, whose complexity is investigated as well. The method
shows very good properties, both in terms of computation times and analysis of
real datasets from the literature. Compared to other substructures like
topological subtrees and labelled subtrees for which the isomorphism problem is
linear, the patterns found provide a more parsimonious representation of the
data.
Comment: 40 pages
Enumeration of Unordered Forests
Reverse search is a convenient method for enumerating structured objects,
which can be used both to address theoretical issues and to solve data mining
problems. This method has already been successfully developed to handle
unordered trees. While the literature proposes solutions to enumerate
singletons of trees, we study in this article a more general problem: the
enumeration of sets of trees, that is, forests. By compressing each forest
into a Directed Acyclic Graph (DAG), we develop a reverse-search-like method
to enumerate the DAGs compressing forests. Remarkably, we prove that these
DAGs are in bijection with the row-Fishburn matrices, a well-studied class of
combinatorial objects. In a second step, we adapt our forest enumeration to
provide algorithms for tackling two related problems: (i) the enumeration of
"subforests" of a forest, and (ii) the frequent "subforest" mining problem.
All the methods presented in this article enumerate each item uniquely, up to
isomorphism.
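The DAG compression used above merges all isomorphic subtrees of a forest into a single shared node. The following minimal sketch (illustrative only, not the article's algorithm) does this for unordered, unlabeled trees:

```python
def compress_to_dag(forest):
    """Losslessly compress a forest by sharing isomorphic unordered subtrees.

    Each tree is a tuple of its children (a leaf is the empty tuple).
    Returns (roots, nodes): nodes[i] is the sorted tuple of child node ids of
    DAG node i, and roots gives the node id of each tree of the forest.
    """
    nodes = []   # nodes[i] = tuple of children ids of DAG node i
    id_of = {}   # children-id tuple -> node id (one node per subtree shape)

    def build(tree):
        # Sorting the children ids makes the key order-independent.
        key = tuple(sorted(build(child) for child in tree))
        if key not in id_of:
            id_of[key] = len(nodes)
            nodes.append(key)
        return id_of[key]

    return [build(tree) for tree in forest], nodes

# A forest with 5 tree nodes but only 3 distinct subtree shapes
# (leaf, root-with-two-leaves, root-with-one-leaf) gives a 3-node DAG.
roots, nodes = compress_to_dag([((), ()), ((),)])  # roots == [1, 2]
```

The compression is lossless: each original tree can be re-expanded by following children ids from its root.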
Sur la similarité des arbres : l'intérêt des méthodes d'énumération et de compression (On tree similarity: the value of enumeration and compression methods)
Tree data appear naturally in many scientific domains. Their intrinsically non-Euclidean nature and the combinatorial explosion phenomenon make their analysis delicate. In this thesis, we focus on three approaches to compare trees, notably through the prism of a lossless compression technique of trees into directed acyclic graphs. First, concerning tree isomorphism, we consider an extension of the classical definition to labeled trees, which requires that trees be identical up to label rewriting. This problem is as hard as graph isomorphism, and we have developed an algorithm that drastically reduces the size of the solution search space, which is then explored with a backtracking strategy. When two trees are different, we may try to find common substructures. While this question has already been addressed for subtrees, we are interested in a larger problem, namely finding sets of subtrees appearing simultaneously. This leads us to consider forest enumeration, for which we propose a reverse search algorithm that constructs an enumeration tree whose branching factor is linear. Finally, from a list of common substructures, one can build a convolution kernel allowing us to tackle classification problems. We consider the subtree kernel from the literature, and build an algorithm that explicitly enumerates subtrees (unlike the original method). In particular, our approach allows us to parameterize the kernel more finely, significantly improving its classification abilities.
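The labeled-tree isomorphism problem mentioned above (trees identical up to a bijective rewriting of labels) can be illustrated with a small backtracking search that extends a partial label bijection while matching children. This is only an exponential toy version of the idea; the thesis first shrinks the search space before backtracking, and the tree encoding here (a `(label, children)` pair) is an assumption made for the illustration.

```python
from itertools import permutations

def iso_up_to_relabeling(t1, t2, bij=None):
    """Return a label bijection making t1 and t2 identical, or None.

    A labeled unordered tree is a pair (label, children), where children is
    a tuple of trees. `bij` is the partial mapping from t1-labels to
    t2-labels built so far; it is extended only when consistent and injective.
    """
    bij = dict(bij or {})
    (a, kids1), (b, kids2) = t1, t2
    if len(kids1) != len(kids2):
        return None
    # Reject if `a` is already mapped elsewhere or `b` is already taken.
    if bij.get(a, b) != b or any(v == b and k != a for k, v in bij.items()):
        return None
    bij[a] = b
    if not kids1:
        return bij
    # Backtrack over all pairings of the children.
    for perm in permutations(kids2):
        cur = bij
        for c1, c2 in zip(kids1, perm):
            cur = iso_up_to_relabeling(c1, c2, cur)
            if cur is None:
                break
        else:
            return cur
    return None

leaf = lambda l: (l, ())
# Rewriting x -> a and y -> b turns the first tree into the second.
found = iso_up_to_relabeling(('x', (leaf('y'), leaf('y'))),
                             ('a', (leaf('b'), leaf('b'))))
```

Here `found` is `{'x': 'a', 'y': 'b'}`; when no consistent bijection exists, the function returns `None`.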