8 research outputs found
Fast and Tiny Structural Self-Indexes for XML
XML document markup is highly repetitive and therefore well compressible
using dictionary-based methods such as DAGs or grammars. In the context of
selectivity estimation, grammar-compressed trees were used before as synopsis
for structural XPath queries. Here a fully-fledged index over such grammars is
presented. The index allows to execute arbitrary tree algorithms with a
slow-down that is comparable to the space improvement. More interestingly,
certain algorithms execute much faster over the index (because no decompression
occurs). E.g., for structural XPath count queries, evaluating over the index is
faster than previous XPath implementations, often by two orders of magnitude.
The index also allows to serialize XML results (including texts) faster than
previous systems, by a factor of ca. 2-3. This is due to efficient copy
handling of grammar repetitions, and because materialization is totally
avoided. In order to compare with twig join implementations, we implemented a
materializer which writes out pre-order numbers of result nodes, and show its
competitiveness.Comment: 13 page
I/O efficient bisimulation partitioning on very large directed acyclic graphs
In this paper we introduce the first efficient external-memory algorithm to
compute the bisimilarity equivalence classes of a directed acyclic graph (DAG).
DAGs are commonly used to model data in a wide variety of practical
applications, ranging from XML documents and data provenance models, to web
taxonomies and scientific workflows. In the study of efficient reasoning over
massive graphs, the notion of node bisimilarity plays a central role. For
example, grouping together bisimilar nodes in an XML data set is the first step
in many sophisticated approaches to building indexing data structures for
efficient XPath query evaluation. To date, however, only internal-memory
bisimulation algorithms have been investigated. As the size of real-world DAG
data sets often exceeds available main memory, storage in external memory
becomes necessary. Hence, there is a practical need for an efficient approach
to computing bisimulation in external memory.
Our general algorithm has a worst-case IO-complexity of O(Sort(|N| + |E|)),
where |N| and |E| are the numbers of nodes and edges, resp., in the data graph
and Sort(n) is the number of accesses to external memory needed to sort an
input of size n. We also study specializations of this algorithm to common
variations of bisimulation for tree-structured XML data sets. We empirically
verify efficient performance of the algorithms on graphs and XML documents
having billions of nodes and edges, and find that the algorithms can process
such graphs efficiently even when very limited internal memory is available.
The proposed algorithms are simple enough for practical implementation and use,
and open the door for further study of external-memory bisimulation algorithms.
To this end, the full open-source C++ implementation has been made freely
available
Efficient external-memory bisimulation on DAGs
ABSTRACT In this paper we introduce the first efficient external-memory algorithm to compute the bisimilarity equivalence classes of a directed acyclic graph (DAG). DAGs are commonly used to model data in a wide variety of practical applications, ranging from XML documents and data provenance models, to web taxonomies and scientific workflows. In the study of efficient reasoning over massive graphs, the notion of node bisimilarity plays a central role. For example, grouping together bisimilar nodes in an XML data set is the first step in many sophisticated approaches to building indexing data structures for efficient XPath query evaluation. To date, however, only internal-memory bisimulation algorithms have been investigated. As the size of real-world DAG data sets often exceeds available main memory, storage in external memory becomes necessary. Hence, there is a practical need for an efficient approach to computing bisimulation in external memory. Our general algorithm has a worst-case IO-complexity of O(Sort(|N | + |E|)), where |N | and |E| are the numbers of nodes and edges, resp., in the data graph and Sort(n) is the number of accesses to external memory needed to sort an input of size n. We also study specializations of this algorithm to common variations of bisimulation for treestructured XML data sets. We empirically verify efficient performance of the algorithms on graphs and XML documents having billions of nodes and edges, and find that the algorithms can process such graphs efficiently even when very limited internal memory is available. The proposed algorithms are simple enough for practical implementation and use, and open the door for further study of externalmemory bisimulation algorithms. To this end, the full opensource C++ implementation has been made freely available
A Methodology for Coupling Fragments of XPath with Structural Indexes for XML Documents
Supporting efficient access to XML data using XPath [3] continues to be an important research problem [6, 12]. XPath queries are used to specify nodelabeled trees which match portions of the hierarchical XML data. In XPath query evaluation, indices similar to those used in relational database system
A Methodology for Coupling Fragments of XPath with Structural Indexes for XML Documents
Abstract. Recent studies have proposed structural summary techniques for pathquery evaluation on semi-structured data sources. One major line of this research has been the introduction of the DataGuide, 1-index, 2-index, and A(k) indices, and subsequent investigations and generalizations. Another recent study has considered structural characterizations of fragments of XPath, the standard path navigation language for XML documents. In this paper we provide a methodology on XPath query processing that couples these two areas of research on structural indices and query languages. To illustrate this methodology, we apply it to couple an upward-only XPath fragment with the A(k) and P (k) structural indices. With an eye towards applying this result to XPath query processing, we (1) show how upward-only XPath expressions can be evaluated directly on the corresponding indices; (2) develop a labeling scheme for A(k) and P(k) partition blocks, using algebraic expressions; and (3) leverage these results to develop generic techniques for making effective use of A(k) and P (k) indices for more general, frequently occurring XPath expressions.
DescribeX: A Framework for Exploring and Querying XML Web Collections
This thesis introduces DescribeX, a powerful framework that is capable of
describing arbitrarily complex XML summaries of web collections, providing
support for more efficient evaluation of XPath workloads. DescribeX permits the
declarative description of document structure using all axes and language
constructs in XPath, and generalizes many of the XML indexing and summarization
approaches in the literature. DescribeX supports the construction of
heterogeneous summaries where different document elements sharing a common
structure can be declaratively defined and refined by means of path regular
expressions on axes, or axis path regular expression (AxPREs). DescribeX can
significantly help in the understanding of both the structure of complex,
heterogeneous XML collections and the behaviour of XPath queries evaluated on
them.
Experimental results demonstrate the scalability of DescribeX summary
refinements and stabilizations (the key enablers for tailoring summaries) with
multi-gigabyte web collections. A comparative study suggests that using a
DescribeX summary created from a given workload can produce query evaluation
times orders of magnitude better than using existing summaries. DescribeX's
light-weight approach of combining summaries with a file-at-a-time XPath
processor can be a very competitive alternative, in terms of performance, to
conventional fully-fledged XML query engines that provide DB-like functionality
such as security, transaction processing, and native storage.Comment: PhD thesis, University of Toronto, 2008, 163 page