Search CORE

976 research outputs found

Fast and Tiny Structural Self-Indexes for XML

Author: Maneth Sebastian
Sebastian Tom
Publication venue
Publication date: 27/12/2010
Field of study

XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees were used before as synopsis for structural XPath queries. Here a fully-fledged index over such grammars is presented. The index allows to execute arbitrary tree algorithms with a slow-down that is comparable to the space improvement. More interestingly, certain algorithms execute much faster over the index (because no decompression occurs). E.g., for structural XPath count queries, evaluating over the index is faster than previous XPath implementations, often by two orders of magnitude. The index also allows to serialize XML results (including texts) faster than previous systems, by a factor of ca. 2-3. This is due to efficient copy handling of grammar repetitions, and because materialization is totally avoided. In order to compare with twig join implementations, we implemented a materializer which writes out pre-order numbers of result nodes, and show its competitiveness.Comment: 13 page

arXiv.org e-Print Archive

HAL - Lille 3

INRIA a CCSD electronic archive server

XML Compression via DAGs

Author: Bousquet-Melou Mireille
Lohrey Markus
Maneth Sebastian
Noeth Eric
Publication venue
Publication date: 01/01/2013
Field of study

Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags, than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotical behavior.Comment: A short version of this paper appeared in the Proceedings of ICDT 201

arXiv.org e-Print Archive

CiteSeerX

Linear Compressed Pattern Matching for Polynomial Rewriting (Extended Abstract)

Author: Schmidt-Schauss Manfred
Publication venue: 'Open Publishing Association'
Publication date: 01/02/2013
Field of study

This paper is an extended abstract of an analysis of term rewriting where the terms in the rewrite rules as well as the term to be rewritten are compressed by a singleton tree grammar (STG). This form of compression is more general than node sharing or representing terms as dags since also partial trees (contexts) can be shared in the compression. In the first part efficient but complex algorithms for detecting applicability of a rewrite rule under STG-compression are constructed and analyzed. The second part applies these results to term rewriting sequences. The main result for submatching is that finding a redex of a left-linear rule can be performed in polynomial time under STG-compression. The main implications for rewriting and (single-position or parallel) rewriting steps are: (i) under STG-compression, n rewriting steps can be performed in nondeterministic polynomial time. (ii) under STG-compression and for left-linear rewrite rules a sequence of n rewriting steps can be performed in polynomial time, and (iii) for compressed rewrite rules where the left hand sides are either DAG-compressed or ground and STG-compressed, and an STG-compressed target term, n rewriting steps can be performed in polynomial time.Comment: In Proceedings TERMGRAPH 2013, arXiv:1302.599

arXiv.org e-Print Archive

Directory of Open Access Journals

XQuery Streaming by Forest Transducers

Author: Hakuta Shizuya
Iwasaki Hideya
Maneth Sebastian
Nakano Keisuke
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/12/2013
Field of study

Streaming of XML transformations is a challenging task and only very few systems support streaming. Research approaches generally define custom fragments of XQuery and XPath that are amenable to streaming, and then design custom algorithms for each fragment. These languages have several shortcomings. Here we take a more principles approach to the problem of streaming XQuery-based transformations. We start with an elegant transducer model for which many static analysis problems are well-understood: the Macro Forest Transducer (MFT). We show that a large fragment of XQuery can be translated into MFTs --- indeed, a fragment of XQuery, that can express important features that are missing from other XQuery stream engines, such as GCX: our fragment of XQuery supports XPath predicates and let-statements. We then rely on a streaming execution engine for MFTs, one which uses a well-founded set of optimizations from functional programming, such as strictness analysis and deforestation. Our prototype achieves time and memory efficiency comparable to the fastest known engine for XQuery streaming, GCX. This is surprising because our engine relies on the OCaml built in garbage collector and does not use any specialized buffer management, while GCX's efficiency is due to clever and explicit buffer management.Comment: Full version of the paper in the Proceedings of the 30th IEEE International Conference on Data Engineering (ICDE 2014

arXiv.org e-Print Archive

CiteSeerX

Constructing Small Tree Grammars and Small Circuits for Formulas

Author: Hucke Danny
Lohrey Markus
Noeth Eric
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014)
Publication date: 01/01/2014
Field of study

It is shown that every tree of size n over a fixed set of sigma different ranked symbols can be decomposed into O(n/log_sigma(n)) = O((n * log(sigma))/ log(n)) many hierarchically defined pieces. Formally, such a hierarchical decomposition has the form of a straight-line linear context-free tree grammar of size O(n/log_sigma(n)), which can be used as a compressed representation of the input tree. This generalizes an analogous result for strings. Previous grammar-based tree compressors were not analyzed for the worst-case size of the computed grammar, except for the top dag of Bille et al., for which only the weaker upper bound of O(n/log^{0.19}(n)) for unranked and unlabelled trees has been derived. The main result is used to show that every arithmetical formula of size n, in which only m <= n different variables occur, can be transformed (in time O(n * log(n)) into an arithmetical circuit of size O((n * log(m))/log(n)) and depth O(log(n)). This refines a classical result of Brent, according to which an arithmetical formula of size n can be transformed into a logarithmic depth circuit of size O(n)

CiteSeerX

Dagstuhl Research Online Publication Server

Re-pair for Trees

Author: Mennicke Roy
Publication venue
Publication date: 20/10/2017
Field of study

We introduce a new linear time compression algorithm, called 'Repair for Trees', which compresses ordered trees over a ranked alphabet using linear straight-line context-free tree grammars. Such grammars generalize straight-line context-free string grammars and allow basic tree operations, like traversal along edges, to be executed without prior decompression. Our algorithm can be considered as a generalization of the 'Re-pair' algorithm developed by N. Jesper Larsson and Alistair Moffat in 2000. The latter algorithm is a dictionary-based compression algorithm for strings. We also introduce a succinct coding which is specialized in further compressing the grammars generated by our algorithm. Thisis accomplished without loosing the ability do directly execute queries on this compressed representation of the input tree. Finally, we compare the grammars and output files generated by a prototype of the Re-pair for Trees algorithm with those of similar compression algorithms. The obtained results show that that our algorithm outperforms its competitors in terms of compression ratio, runtime and memory usage

Qucosa - Publikationsserver der Universität Leipzig

Regular Matching and Inclusion on Compressed Tree Patterns with Constrained Context Variables

Author: Boneva Iovka
Niehren Joachim
Sakho Momar
Publication venue: 'Elsevier BV'
Publication date: 24/02/2021
Field of study

International audienceWe study the complexity of regular matching and inclusion for compressed tree patterns with context variables subject to regular constraints. Context variables with regular constraints permit to properly generalize on unranked tree patterns with hedge variables. Regular inclusion on unranked tree patterns is relevant to certain query answering on Xml streams with references. We show that regular matching and inclusion with regular constraints can be reduced in polynomial time to the corresponding problem without regular constraints

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Rennes 1