Search CORE

118 research outputs found

XML Compression via DAGs

Author: Bousquet-Melou Mireille
Lohrey Markus
Maneth Sebastian
Noeth Eric
Publication venue
Publication date: 01/01/2013
Field of study

Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags, than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotical behavior.Comment: A short version of this paper appeared in the Proceedings of ICDT 201

arXiv.org e-Print Archive

CiteSeerX

Dictionary-Based Tree Compression (Invited Talk)

Author: Maneth Sebastian
Publication venue
Publication date: 01/01/2012
Field of study

Trees are a ubiquitous data structure in computer science. LISP, for instance, was designed to manipulate nested lists, that is, ordered unranked trees. Already at that time, DAGs were used to detect common subexpression, a process known as "hash consing." In a DAG every distinct subtree is represented only once (but can be referenced many times) and hence it constitutes a dictionary-based compression method for ordered trees. In our compression scenario we distinguish two kinds of ordered trees: binary and unranked. The latter appear naturally as representation of XML document structures. We survey these dictionary-based compression methods for ordered trees: (1) DAGs, (2) hybrid DAGs, (3) straight-line context-free tree grammars ("SLT grammars"). We compare the minimal DAG of an unranked tree with the minimal DAG of its binary tree encoding. The latter is obtained by identifying first children of the unranked tree with left children of the binary tree, and next-siblings with the right children. For XML document trees, unranked DAGs are usually smaller than encoded binary DAGs. We show that this holds for arbitrary unranked trees, on average. We also present the "hybrid DAG"; its size lower-bounds those of the binary and unranked DAGs. Finding a smallest SLT grammar for a given tree is NP-complete. We discuss two linear-time approximation algorithms: BPLEX and TreeRePair. For typical XML document trees, TreeRePair produces SLT grammars that are only one fourth of the size of the minimal DAG, and which contain approximately 3$% of the edges of the original tree. As far as we know, this gives rise to the smallest existing pointer-based tree representation. We show that some basic algorithms can be computed directly on the compressed trees, without prior decompression. Examples include the execution of different kinds of tree automata, and the real-time traversal of the original tree. It is even possible to evaluate simple XPath queries directly on the SLT grammars, using deterministic node-selecting tree automata. In this way, impressive speed-ups are achieved over existing XPath evaluators, while at the same time the memory requirement is slashed to only a few percent. For more complex XPath queries that require nondeterministic node-selecting tree automata, efficient evaluation over SLT grammars remains a difficult challenge

Edinburgh Research Explorer

Dagstuhl Research Online Publication Server

Constructing Small Tree Grammars and Small Circuits for Formulas

Author: Hucke Danny
Lohrey Markus
Noeth Eric
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th International Conference on Foundation of Software Technology and Theoretical Computer Science (FSTTCS 2014)
Publication date: 01/01/2014
Field of study

It is shown that every tree of size n over a fixed set of sigma different ranked symbols can be decomposed into O(n/log_sigma(n)) = O((n * log(sigma))/ log(n)) many hierarchically defined pieces. Formally, such a hierarchical decomposition has the form of a straight-line linear context-free tree grammar of size O(n/log_sigma(n)), which can be used as a compressed representation of the input tree. This generalizes an analogous result for strings. Previous grammar-based tree compressors were not analyzed for the worst-case size of the computed grammar, except for the top dag of Bille et al., for which only the weaker upper bound of O(n/log^{0.19}(n)) for unranked and unlabelled trees has been derived. The main result is used to show that every arithmetical formula of size n, in which only m <= n different variables occur, can be transformed (in time O(n * log(n)) into an arithmetical circuit of size O((n * log(m))/log(n)) and depth O(log(n)). This refines a classical result of Brent, according to which an arithmetical formula of size n can be transformed into a logarithmic depth circuit of size O(n)

CiteSeerX

Dagstuhl Research Online Publication Server

Compression vs Queryability - A Case Study

Author: Anantharaman Siva
Publication venue: Dagstuhl Seminar Proceedings. 08261 - Structure-Based Compression of Complex Massive Data
Publication date: 01/01/2008
Field of study

International audienceSome compromise on compression is known to be necessary, if the relative positions of the information stored by semi-structured documents are to remain accessible under queries. With this in view, we compare, on an example, the `query-friendliness' of XML documents, when compressed into straightline tree grammars which are either regular or context-free. The queries considered are in a limited fragment of XPath, corresponding to a type of patterns; each such query defines naturally a non-deterministic, bottom-up `query automaton' that runs just as well on a tree as on its compressed dag

Dagstuhl Research Online Publication Server

XQuery Streaming by Forest Transducers

Author: Hakuta Shizuya
Iwasaki Hideya
Maneth Sebastian
Nakano Keisuke
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/12/2013
Field of study

Streaming of XML transformations is a challenging task and only very few systems support streaming. Research approaches generally define custom fragments of XQuery and XPath that are amenable to streaming, and then design custom algorithms for each fragment. These languages have several shortcomings. Here we take a more principles approach to the problem of streaming XQuery-based transformations. We start with an elegant transducer model for which many static analysis problems are well-understood: the Macro Forest Transducer (MFT). We show that a large fragment of XQuery can be translated into MFTs --- indeed, a fragment of XQuery, that can express important features that are missing from other XQuery stream engines, such as GCX: our fragment of XQuery supports XPath predicates and let-statements. We then rely on a streaming execution engine for MFTs, one which uses a well-founded set of optimizations from functional programming, such as strictness analysis and deforestation. Our prototype achieves time and memory efficiency comparable to the fastest known engine for XQuery streaming, GCX. This is surprising because our engine relies on the OCaml built in garbage collector and does not use any specialized buffer management, while GCX's efficiency is due to clever and explicit buffer management.Comment: Full version of the paper in the Proceedings of the 30th IEEE International Conference on Data Engineering (ICDE 2014

arXiv.org e-Print Archive

CiteSeerX

XML Compression via Directed Acyclic Graphs

Author: Bousquet-Mélou Mireille
Lohrey Markus
Maneth Sebastian
Noeth Eric
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

A short version of this paper appeared in the Proceedings of ICDT 2013International audienceUnranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to their repetitive mark up. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags, than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotical behavior

Edinburgh Research Explorer