411 research outputs found

    Indexing XML Documents Using Tree Paths Automaton

    An XML document can be viewed as a tree in a natural way. Processing tree data structures usually requires a pushdown automaton as a model of computation. It is therefore interesting that a finite automaton can be used to solve the XML index problem. In this paper, we attempt to support a significant fragment of XPath queries, which may use any combination of the child (i.e., /) and descendant-or-self (i.e., //) axes. A systematic approach to the construction of such an XML index, a finite automaton called the Tree Paths Automaton, is presented. Given an XML tree model T, the tree is first preprocessed into its linear fragments, called string paths. Since only path queries are considered, the branching structure of the XML tree model can be omitted. Smaller Tree Paths Automata are built for the individual string paths and are afterwards combined to form the index. The searching phase uses the index, reads an input query Q of size m, and computes the list of positions of all occurrences of Q in the tree T. The searching is performed in time O(m) and does not depend on the size of the XML document. Although the number of possible queries is clearly exponential in the number of nodes of the XML tree model, the size of the index seems to be, according to our experimental results, usually only about 2.5 times larger than the size of the original document.
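
    The decomposition into string paths and the meaning of a / and // path query can be illustrated with a small reference evaluator. This is only a sketch under stated assumptions, not the Tree Paths Automaton itself: the tuple-based tree encoding, the preorder numbering of positions, and the function names `string_paths`, `parse_query`, `match_path` and `evaluate` are all hypothetical, and the evaluator scans every string path instead of answering in O(m).

```python
import re

def string_paths(tree):
    """Return every root-to-leaf path as a list of (label, preorder_id).

    A tree node is assumed to be a pair (label, [children]).
    """
    paths, counter = [], [0]

    def walk(node, prefix):
        label, children = node
        counter[0] += 1                      # preorder position of this node
        prefix = prefix + [(label, counter[0])]
        if not children:
            paths.append(prefix)             # a leaf closes one string path
        for child in children:
            walk(child, prefix)

    walk(tree, [])
    return paths

def parse_query(query):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", query)

def match_path(path, steps):
    """Preorder ids of target nodes of the query on one string path."""
    if not steps:
        return set()
    current = {-1}                           # virtual position above the root
    for axis, label in steps:
        if not current:
            break
        if axis == "/":
            candidates = {i + 1 for i in current}
        else:                                # '//' reaches any deeper position
            candidates = set(range(min(current) + 1, len(path)))
        current = {j for j in candidates
                   if j < len(path) and path[j][0] == label}
    return {path[j][1] for j in current}

def evaluate(tree, query):
    """Union of the matches over all string paths, sorted by position."""
    steps = parse_query(query)
    hits = set()
    for path in string_paths(tree):
        hits |= match_path(path, steps)
    return sorted(hits)

# Hypothetical example tree: a(b(c, b(c)), c)
TREE = ("a", [("b", [("c", []), ("b", [("c", [])])]), ("c", [])])
```

    Because a linear path query can only match along a single root-to-node path, evaluating it on each string path separately and taking the union loses nothing, which is why the branching structure can be dropped.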

    Indexing XML Documents

    The theory of text indexing is very well researched, which does not hold for the indexing of other data structures, such as trees. In this thesis we review existing techniques for indexing texts and trees and study state-of-the-art methods for indexing XML documents. We show that automata can be used effectively for the purpose of indexing XML documents. A new and simple method for indexing XML documents using a deterministic finite automaton is introduced. The presented method supports a significant fragment of XPath queries, which may use any combination of the child (/) and descendant-or-self (//) axes. We also propose two further indexing techniques based on finite automata, aimed at evaluating path queries that use only the / axis or only the // axis. Given a subject XML document D and its corresponding XML tree model T with n nodes, the tree is preprocessed and the index is constructed. The searching phase uses the index, reads an input query Q of size m, and computes the list of positions of all occurrences of target nodes of Q in T. All the proposed automata perform the searching in time O(m), independent of n. Although the automaton that supports all linear XPath queries using only the // axis evaluates O(2^n) distinct queries, the number of states of the deterministic automaton is O(h^k), where h is the height of T and k is the number of its leaf nodes. Moreover, we discuss that when a common XML document is indexed, the number of states of the deterministic finite automaton is at most O(h·2^k).
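
    For a single root-to-leaf path, a query that uses only the // axis matches exactly when its labels form a subsequence of the path labels, so a deterministic subsequence automaton answers it in O(m) steps. The sketch below builds that textbook automaton for one path; it is an assumption-laden illustration and not the thesis's construction, which additionally combines the paths of the whole tree (the source of the O(h^k) state bound). The names `build_subsequence_automaton` and `run_query` are hypothetical.

```python
def build_subsequence_automaton(labels):
    """delta[i][c] = smallest depth j > i whose label is c (depths are 1-based).

    State i means "the match so far ends at depth i"; state 0 is the start.
    """
    n = len(labels)
    delta = [dict() for _ in range(n + 1)]
    nearest = {}                         # nearest occurrence of each label below
    for depth in range(n, 0, -1):
        nearest[labels[depth - 1]] = depth
        delta[depth - 1] = dict(nearest)
    return delta

def run_query(delta, query_labels):
    """Follow one transition per query label: O(m), independent of path length.

    Returns the depth of the earliest match of the last label, or None.
    """
    state = 0
    for label in query_labels:
        state = delta[state].get(label)
        if state is None:
            return None
    return state

# Hypothetical string path a/b/b/c
AUTOMATON = build_subsequence_automaton(["a", "b", "b", "c"])
```

    Here `run_query(AUTOMATON, ["a", "c"])` corresponds to the query //a//c on that one path. The table costs one dictionary per depth, which is acceptable for a single path but is exactly what a combined index has to keep under control.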

    Fast and Tiny Structural Self-Indexes for XML

    XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees have been used before as a synopsis for structural XPath queries. Here a fully fledged index over such grammars is presented. The index allows arbitrary tree algorithms to be executed with a slow-down that is comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index (because no decompression occurs). For example, for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also allows XML results (including texts) to be serialized faster than in previous systems, by a factor of about 2 to 3. This is due to efficient copy handling of grammar repetitions, and because materialization is avoided entirely. In order to compare with twig join implementations, we implemented a materializer which writes out pre-order numbers of result nodes, and show its competitiveness. Comment: 13 pages

    DescribeX: A Framework for Exploring and Querying XML Web Collections

    This thesis introduces DescribeX, a powerful framework that is capable of describing arbitrarily complex XML summaries of web collections, providing support for more efficient evaluation of XPath workloads. DescribeX permits the declarative description of document structure using all axes and language constructs in XPath, and generalizes many of the XML indexing and summarization approaches in the literature. DescribeX supports the construction of heterogeneous summaries where different document elements sharing a common structure can be declaratively defined and refined by means of path regular expressions on axes, or axis path regular expressions (AxPREs). DescribeX can significantly help in understanding both the structure of complex, heterogeneous XML collections and the behaviour of XPath queries evaluated on them. Experimental results demonstrate the scalability of DescribeX summary refinements and stabilizations (the key enablers for tailoring summaries) with multi-gigabyte web collections. A comparative study suggests that using a DescribeX summary created from a given workload can produce query evaluation times orders of magnitude better than using existing summaries. DescribeX's lightweight approach of combining summaries with a file-at-a-time XPath processor can be a very competitive alternative, in terms of performance, to conventional fully fledged XML query engines that provide DB-like functionality such as security, transaction processing, and native storage. Comment: PhD thesis, University of Toronto, 2008, 163 pages

    CONTEXT-BASED AUTOSUGGEST ON GRAPH DATA

    Autosuggest is an important feature in any search application. Currently, most applications suggest only a single term, based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions, because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the search text box as the context to autosuggest their next incomplete keyword. This context-based approach uses the relationships between entities in the graph data that the user is searching and would therefore provide more meaningful suggestions.
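
    The idea of ranking completions by their relationship to the keywords already chosen can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Smart Solr Suggester plugin and not any Solr API: the adjacency-dictionary graph, the scoring rule (number of context keywords a candidate is linked to), and the names `suggest` and `GRAPH` are all hypothetical.

```python
def suggest(graph, context, prefix, limit=5):
    """Rank completions of `prefix` by how many context keywords link to them.

    `graph` maps an entity to the set of entities it is related to.
    """
    scores = {}
    for keyword in context:
        for neighbour in graph.get(keyword, ()):
            if neighbour.startswith(prefix) and neighbour not in context:
                scores[neighbour] = scores.get(neighbour, 0) + 1
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:limit]

# Hypothetical relationship graph between search entities
GRAPH = {
    "python": {"pandas", "pytest", "django"},
    "data": {"pandas", "parquet", "postgres"},
}
```

    With the context `["python", "data"]` and the prefix `"pa"`, the candidate linked to both keywords is ranked ahead of the one linked to only one of them, which a pure frequency-based suggester would not distinguish.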

    Fast and Compact Regular Expression Matching

    We study four problems in string matching, namely regular expression matching, approximate regular expression matching, string edit distance, and subsequence indexing, on a standard word RAM model of computation that allows logarithmic-sized words to be manipulated in constant time. We show how to improve the space and/or remove a dependency on the alphabet size for each problem, either by using an improved tabulation technique of an existing algorithm or by combining known algorithms in a new way.
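
    As a point of reference for the string edit distance problem named above, the classic dynamic program is sketched below. It is the textbook baseline with unit costs and O(|a|·|b|) time, kept to two rows of the table; it does not implement the tabulation-based improvements the abstract refers to, and the function name `edit_distance` is an assumption.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

    Only the previous row is needed to compute the current one, so the space is linear in the length of `b` even though the time stays quadratic.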

    Automata Approach to XML Data Indexing: Selecting Unknown Nodes

    As part of the "Automata Approach to XML Data Indexing" project, this thesis studies existing automata-based methods for constructing indexes of XML documents and extends them to handle a larger fragment of XPath queries. The presented methods allow us to construct XML data indexes that support the evaluation of all XPath queries using any combination of the child (/) and descendant-or-self (//) axes and the asterisk (*) and nodename node tests. Given an XML document D and its corresponding XML tree model T with n nodes, the tree is preprocessed and the index for the document D is constructed. The searching phase of each of the constructed indexes for a query Q takes time O(m), where m is the size of the query Q, and does not depend on the size n of the indexed XML document. Moreover, the space and time complexities of each of the proposed indexes are discussed, and all the introduced algorithms are implemented and tested on real-life datasets.