1,171 research outputs found

    Revisiting the tree edit distance and its backtracing: A tutorial

    Full text link
    Almost 30 years ago, Zhang and Shasha (1989) published a seminal paper describing an efficient dynamic programming algorithm computing the tree edit distance, that is, the minimum number of node deletions, insertions, and replacements that are necessary to transform one tree into another. Since then, the tree edit distance has been widely applied, for example in biology and intelligent tutoring systems. However, the original paper of Zhang and Shasha can be challenging to read for newcomers and it does not describe how to efficiently infer the optimal edit script. In this contribution, we provide a comprehensive tutorial to the tree edit distance algorithm of Zhang and Shasha. We further prove metric properties of the tree edit distance, and describe efficient algorithms to infer the cheapest edit script, as well as a summary of all cheapest edit scripts between two trees.Comment: Supplementary material for the ICML 2018 paper: Tree Edit Distance Learning via Adaptive Symbol Embedding

    Automatic Wrapper Adaptation by Tree Edit Distance Matching

    Get PDF
    Information distributed through the Web keeps growing faster day by day,\ud and for this reason, several techniques for extracting Web data have been suggested\ud during last years. Often, extraction tasks are performed through so called wrappers,\ud procedures extracting information from Web pages, e.g. implementing logic-based\ud techniques. Many fields of application today require a strong degree of robustness\ud of wrappers, in order not to compromise assets of information or reliability of data\ud extracted.\ud Unfortunately, wrappers may fail in the task of extracting data from a Web page, if\ud its structure changes, sometimes even slightly, thus requiring the exploiting of new\ud techniques to be automatically held so as to adapt the wrapper to the new structure\ud of the page, in case of failure. In this work we present a novel approach of automatic wrapper adaptation based on the measurement of similarity of trees through\ud improved tree edit distance matching techniques

    Tree Edit Distance Learning via Adaptive Symbol Embeddings

    Full text link
    Metric learning has the aim to improve classification accuracy by learning a distance measure which brings data points from the same class closer together and pushes data points from different classes further apart. Recent research has demonstrated that metric learning approaches can also be applied to trees, such as molecular structures, abstract syntax trees of computer programs, or syntax trees of natural language, by learning the cost function of an edit distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree. However, learning such costs directly may yield an edit distance which violates metric axioms, is challenging to interpret, and may not generalize well. In this contribution, we propose a novel metric learning approach for trees which we call embedding edit distance learning (BEDL) and which learns an edit distance indirectly by embedding the tree nodes as vectors, such that the Euclidean distance between those vectors supports class discrimination. We learn such embeddings by reducing the distance to prototypical trees from the same class and increasing the distance to prototypical trees from different classes. In our experiments, we show that BEDL improves upon the state-of-the-art in metric learning for trees on six benchmark data sets, ranging from computer science over biomedical data to a natural-language processing data set containing over 300,000 nodes.Comment: Paper at the International Conference of Machine Learning (2018), 2018-07-10 to 2018-07-15 in Stockholm, Swede

    Learning Stochastic Tree Edit Distance

    No full text
    pages 42-53International audienceTrees provide a suited structural representation to deal with complex tasks such as web information extraction, RNA secondary structure prediction, or conversion of tree structured documents. In this context, many applications require the calculation of similarities between tree pairs. The most studied distance is likely the tree edit distance for which improvements in terms of complexity have been achieved during the last decade. However, this classic edit distance usually uses a priori fixed edit costs which are often difficult to tune, that leaves little room for tackling complex problems. In this paper, we focus on the learning of a stochastic tree edit distance. We use an adaptation of the expectation-maximization algorithm for learning the primitive edit costs. We carried out several series of experiments that confirm the interest to learn a tree edit distance rather than a priori imposing edit costs

    Tree edit distance as a baseline approach for paraphrase representation

    Get PDF
    Finding an adequate paraphrase representation formalism is a challenging issue in Natural Language Processing. In this paper, we analyse the performance of Tree Edit Distance as a paraphrase representation baseline. Our experiments using Edit Distance Textual Entailment Suite show that, as Tree Edit Distance consists of a purely syntactic approach, paraphrase alternations not based on structural reorganizations do not find an adequate representation. They also show that there is much scope for better modelling of the way trees are aligned

    An O(n^3)-Time Algorithm for Tree Edit Distance

    Full text link
    The {\em edit distance} between two ordered trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this paper, we present a worst-case O(n3)O(n^3)-time algorithm for this problem, improving the previous best O(n3logn)O(n^3\log n)-time algorithm~\cite{Klein}. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems (which is interesting in its own right), together with a deeper understanding of the previous algorithms for the problem. We also prove the optimality of our algorithm among the family of \emph{decomposition strategy} algorithms--which also includes the previous fastest algorithms--by tightening the known lower bound of Ω(n2log2n)\Omega(n^2\log^2 n)~\cite{Touzet} to Ω(n3)\Omega(n^3), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds of Θ(nm2(1+lognm))\Theta(n m^2 (1 + \log \frac{n}{m})) when the two trees have different sizes mm and~nn, where m<nm < n.Comment: 10 pages, 5 figures, 5 .tex files where TED.tex is the main on

    Subcubic algorithm for (Unweighted) Unrooted Tree Edit Distance

    Full text link
    The tree edit distance problem is a natural generalization of the classic string edit distance problem. Given two ordered, edge-labeled trees T1T_1 and T2T_2, the edit distance between T1T_1 and T2T_2 is defined as the minimum total cost of operations that transform T1T_1 into T2T_2. In one operation, we can contract an edge, split a vertex into two or change the label of an edge. For the weighted version of the problem, where the cost of each operation depends on the type of the operation and the label on the edge involved, O(n3)\mathcal{O}(n^3) time algorithms are known for both rooted and unrooted trees. The existence of a truly subcubic O(n3ϵ)\mathcal{O}(n^{3-\epsilon}) time algorithm is unlikely, as it would imply a truly subcubic algorithm for the APSP problem. However, recently Mao (FOCS'21) showed that if we assume that each operation has a unit cost, then the tree edit distance between two rooted trees can be computed in truly subcubic time. In this paper, we show how to adapt Mao's algorithm to make it work for unrooted trees and we show an O~(n(7ω+15)/(2ω+6))O(n2.9417)\widetilde{\mathcal{O}}(n^{(7\omega + 15)/(2\omega + 6)}) \leq \mathcal{O}(n^{2.9417}) time algorithm for the unweighted tree edit distance between two unrooted trees, where ω2.373\omega \leq 2.373 is the matrix multiplication exponent. It is the first known subcubic algorithm for unrooted trees. The main idea behind our algorithm is the fact that to compute the tree edit distance between two unrooted trees, it is enough to compute the tree edit distance between an arbitrary rooting of the first tree and every rooting of the second tree.Comment: 20 page

    Comparing similar ordered trees in linear-time

    Get PDF
    AbstractWe describe a linear-time algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping that uses at most k insertions or deletions can then be constructed in O(nk3) where n is the size of the trees. The approach is inspired by the Zhang–Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space based on the tree edit graph
    corecore