27 research outputs found

    Biochemical network matching and composition

    This paper looks at biochemical network matching and composition.

    Tree mining application to matching of heterogeneous knowledge

    Matching heterogeneous knowledge sources is increasingly important in areas such as scientific knowledge management, e-commerce, enterprise application integration, and many emerging Semantic Web applications. Because these fields aim to share and reuse knowledge, knowledge from different organizations in the same domain often has to be matched. We propose a knowledge matching method based on our previously developed tree mining algorithms for extracting frequently occurring subtrees from a tree-structured database such as XML. Using the method, the common structure among the different representations can be extracted automatically. Our focus is on knowledge matching at the structural level, and we use a set of example XML schema documents from the same domain to evaluate the method. We discuss some important issues that arise when applying tree mining algorithms to detect common document structures. The experiments demonstrate the usefulness of the approach.
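
    As a rough illustration of extracting structure shared by several XML documents, the sketch below counts root-to-node label paths by document support. This is a much simpler stand-in for frequent subtree mining, not the authors' algorithm; the names `label_paths`, `frequent_paths` and `min_support` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def label_paths(element, prefix=()):
    """Yield the tag path from the root to every element in the tree."""
    path = prefix + (element.tag,)
    yield path
    for child in element:
        yield from label_paths(child, path)


def frequent_paths(documents, min_support):
    """Return the label paths that occur in at least min_support documents."""
    support = Counter()
    for source in documents:
        # A set, so each document contributes at most once per path.
        support.update(set(label_paths(ET.fromstring(source))))
    return {path for path, count in support.items() if count >= min_support}


docs = [
    "<order><customer><name/></customer><item/></order>",
    "<order><customer><name/><email/></customer></order>",
]
common = frequent_paths(docs, min_support=2)
```

    Paths only capture ancestor chains, so sibling relationships that a real subtree miner would preserve are lost here.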

    Comparing similar ordered trees in linear-time

    We describe a linear-time algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping that uses at most k insertions or deletions can then be constructed in O(nk^3) time, where n is the size of the trees. The approach is inspired by the Zhang–Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space based on the tree edit graph.
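
    For readers unfamiliar with tree edit distance, here is a minimal sketch of the textbook rightmost-root recursion with unit costs, memoized over pairs of forests. It is neither the Zhang–Shasha algorithm nor the pruned method the abstract describes, and it is only practical for small trees; the `(label, children)` tuple encoding is an assumption made for the example.

```python
from functools import lru_cache

# A tree is a (label, children) pair, where children is a tuple of trees.
# A forest is a tuple of trees.


def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return forest_size(g)  # insert every node of g
    if not g:
        return forest_size(f)  # delete every node of f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    return min(
        # Delete the rightmost root of f; its children take its place.
        forest_distance(f[:-1] + f_kids, g) + 1,
        # Insert the rightmost root of g.
        forest_distance(f, g[:-1] + g_kids) + 1,
        # Map the two rightmost roots onto each other, relabeling if needed.
        forest_distance(f[:-1], g[:-1])
        + forest_distance(f_kids, g_kids)
        + (f_label != g_label),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


leaf = lambda label: (label, ())
t1 = ("a", (leaf("b"), leaf("c")))
t2 = ("a", (leaf("b"), leaf("d")))            # one relabel away from t1
t3 = ("a", (("x", (leaf("b"), leaf("c"))),))  # one insertion away from t1
```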

    Mechanism for Change Detection in HTML Web Pages as XML Documents

    Change detection of web pages is an important aspect of web monitoring. Automated web monitoring can be used to collect specific information, for example to detect public announcements, news posts and price changes. If we store the HTML code of a page, we can compare the current and previous code when we revisit the page and so find the changes between them. HTML code can be compared using ordinary text comparison, but this risks losing information about the structure of the page. HTML code is treelike in structure, and preserving this property is desirable when finding changes. In this work we describe a mechanism that can be applied to collected HTML pages to find their changes by transforming the HTML pages into XML documents and comparing the resulting XML trees. We give a general list of the components needed for this task, describe our implementation, which uses NutchWAX, NekoHTML, XMLUnit, Jena and MongoDB, and show the results of applying the program to a dataset. We analyse the results of measurements collected when running our program on 1.1 million HTML pages. To our knowledge this mechanism has not been tested in previous works. We show that the mechanism is usable on real-world data.
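
    The idea of comparing two page versions as trees rather than as text can be sketched as follows. This is not the authors' Java pipeline (NekoHTML, XMLUnit and the other components are not used here); it assumes both versions are already well-formed XML and compares children position by position with Python's standard `xml.etree.ElementTree`. The function names and the path format are hypothetical.

```python
import xml.etree.ElementTree as ET


def diff_elements(old, new, path="", index=0):
    """Yield (path, description) pairs for differences between two elements."""
    here = f"{path}/{old.tag}[{index}]"
    if old.tag != new.tag:
        yield (here, f"tag changed to {new.tag}")
        return
    if (old.text or "").strip() != (new.text or "").strip():
        yield (here, "text changed")
    if old.attrib != new.attrib:
        yield (here, "attributes changed")
    old_kids, new_kids = list(old), list(new)
    # Compare children pairwise by position, then report the leftovers.
    for i, (a, b) in enumerate(zip(old_kids, new_kids)):
        yield from diff_elements(a, b, here, i)
    for j in range(len(new_kids), len(old_kids)):
        yield (f"{here}/{old_kids[j].tag}[{j}]", "element removed")
    for j in range(len(old_kids), len(new_kids)):
        yield (f"{here}/{new_kids[j].tag}[{j}]", "element added")


def diff_xml(old_source, new_source):
    return list(diff_elements(ET.fromstring(old_source), ET.fromstring(new_source)))


old_page = "<html><body><p>price: 10</p></body></html>"
new_page = "<html><body><p>price: 12</p><p>new</p></body></html>"
changes = diff_xml(old_page, new_page)
```

    Positional comparison is deliberately naive: inserting an element at the front of a list shifts every later sibling and reports them all as changed, which a real tree-diff tool would avoid. Tail text after child elements is also ignored.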

    Archiving scientific data

    We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps, whereby an element appearing in multiple versions of the database is stored only once along with a compact description of the versions in which it appears. The basic idea of timestamping was discovered by Driscoll et al. in the context of persistent data structures, where one wishes to track the sequence of changes made to a data structure. We extend this idea to develop an archiving tool for XML data that is capable of providing meaningful change descriptions and can also efficiently support a variety of basic functions concerning the evolution of data, such as retrieval of any specific version from the archive and querying the temporal history of any element. This is in contrast to diff-based approaches, where such operations may require undoing a large number of changes or significant reasoning with the deltas. Surprisingly, our archiving technique does not incur any significant space overhead when contrasted with other approaches. Our experimental results support this and also show that the compacted archive file interacts well with other compression techniques. Finally, another useful property of our approach is that the resulting archive is also in XML and hence can directly leverage existing XML tools.
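
    The timestamping idea can be illustrated with a deliberately flattened sketch: each keyed value is stored once together with the set of versions in which it appears. The paper's technique works on hierarchical XML with key structure and a compact version encoding; the flat `{key: value}` snapshots and the `KeyedArchive` class below are assumptions made to keep the example short.

```python
class KeyedArchive:
    """Stores each (key, value) pair once, with the versions it appears in."""

    def __init__(self):
        self.latest = 0
        self.entries = {}  # key -> list of [value, set_of_versions]

    def add_version(self, snapshot):
        """Merge a {key: value} snapshot and return its version number."""
        self.latest += 1
        for key, value in snapshot.items():
            records = self.entries.setdefault(key, [])
            for record in records:
                if record[0] == value:
                    record[1].add(self.latest)  # value already stored: extend its timestamp
                    break
            else:
                records.append([value, {self.latest}])
        return self.latest

    def retrieve(self, version):
        """Rebuild one version without replaying any deltas."""
        return {
            key: value
            for key, records in self.entries.items()
            for value, versions in records
            if version in versions
        }

    def history(self, key):
        """Temporal history of one element: each value with its versions."""
        return [(value, sorted(versions)) for value, versions in self.entries.get(key, [])]


archive = KeyedArchive()
archive.add_version({"P1": "kinase", "P2": "ligase"})
archive.add_version({"P1": "kinase", "P2": "hydrolase", "P3": "transferase"})
```

    Both retrieval and history queries read the stored timestamps directly, which is the contrast with diff-based archives that the abstract draws.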

    Edit Distance between Unrooted Trees in Cubic Time

    Edit distance between trees is a natural generalization of the classical edit distance between strings, in which the allowed elementary operations are contraction, uncontraction and relabeling of an edge. Demaine et al. [ACM Trans. on Algorithms, 6(1), 2009] showed how to compute the edit distance between rooted trees on n nodes in O(n^3) time. However, generalizing their method to unrooted trees seems quite problematic, and the most efficient known solution remains the earlier O(n^3 log n) time algorithm by Klein [ESA 1998]. Given the lack of progress on improving this complexity, it might appear that unrooted trees are simply more difficult than rooted trees. We show that this is, in fact, not the case, and that the edit distance between unrooted trees on n nodes can be computed in O(n^3) time. A significantly faster solution is unlikely to exist, as Bringmann et al. [SODA 2018] proved that the complexity of computing the edit distance between rooted trees cannot be decreased to O(n^{3-epsilon}) unless some popular conjecture fails, and the lower bound easily extends to unrooted trees. We also show that for two unrooted trees of size m and n, where m <= n, our algorithm can be modified to run in O(nm^2(1+log(n/m))) time. This, again, matches the complexity achieved by Demaine et al. for rooted trees, who also showed that this is optimal if we restrict ourselves to the so-called decomposition algorithms.

    Subcubic Algorithm for (Unweighted) Unrooted Tree Edit Distance

    Decomposition algorithms for the tree edit distance problem

    We study the behavior of dynamic programming methods for the tree edit distance problem, such as [P. Klein, Computing the edit-distance between unrooted ordered trees, in: Proceedings of 6th European Symposium on Algorithms, 1998, p. 91–102; K. Zhang, D. Shasha, SIAM J. Comput. 18 (6) (1989) 1245–1262]. We show that those two algorithms may be described as decomposition strategies. We introduce the general framework of cover strategies, and we provide an exact characterization of the complexity of cover strategies. This analysis allows us to define a new tree edit distance algorithm that is optimal for cover strategies.

    Subcubic algorithm for (Unweighted) Unrooted Tree Edit Distance

    The tree edit distance problem is a natural generalization of the classic string edit distance problem. Given two ordered, edge-labeled trees $T_1$ and $T_2$, the edit distance between $T_1$ and $T_2$ is defined as the minimum total cost of operations that transform $T_1$ into $T_2$. In one operation, we can contract an edge, split a vertex into two or change the label of an edge. For the weighted version of the problem, where the cost of each operation depends on the type of the operation and the label on the edge involved, $\mathcal{O}(n^3)$ time algorithms are known for both rooted and unrooted trees. The existence of a truly subcubic $\mathcal{O}(n^{3-\epsilon})$ time algorithm is unlikely, as it would imply a truly subcubic algorithm for the APSP problem. However, recently Mao (FOCS'21) showed that if we assume that each operation has a unit cost, then the tree edit distance between two rooted trees can be computed in truly subcubic time. In this paper, we show how to adapt Mao's algorithm to make it work for unrooted trees and we show an $\widetilde{\mathcal{O}}(n^{(7\omega + 15)/(2\omega + 6)}) \leq \mathcal{O}(n^{2.9417})$ time algorithm for the unweighted tree edit distance between two unrooted trees, where $\omega \leq 2.373$ is the matrix multiplication exponent. It is the first known subcubic algorithm for unrooted trees. The main idea behind our algorithm is the fact that to compute the tree edit distance between two unrooted trees, it is enough to compute the tree edit distance between an arbitrary rooting of the first tree and every rooting of the second tree. Comment: 20 pages.
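
    The stated exponent can be checked with a few lines of arithmetic. The sketch below simply evaluates $(7\omega + 15)/(2\omega + 6)$; the function name `subcubic_exponent` is hypothetical.

```python
def subcubic_exponent(omega):
    """Exponent (7*omega + 15) / (2*omega + 6) from the running-time bound."""
    return (7 * omega + 15) / (2 * omega + 6)


exponent = subcubic_exponent(2.373)  # bound on omega quoted in the abstract
trivial = subcubic_exponent(3)       # schoolbook matrix multiplication
best_case = subcubic_exponent(2)     # if omega were 2
```

    With $\omega = 2.373$ the value is just below 2.9417, matching the abstract. The expression increases with $\omega$ and equals exactly 3 at $\omega = 3$, so the bound is subcubic precisely because $\omega < 3$; even with $\omega = 2$ the exponent would only drop to 2.9.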