Re-pair for Trees
We introduce a new linear-time compression algorithm, called 'Re-pair for Trees', which compresses ordered trees over a ranked alphabet using linear straight-line context-free tree grammars. Such grammars generalize straight-line context-free string grammars and allow basic tree operations, such as traversal along edges, to be executed without prior decompression. Our algorithm can be seen as a generalization of the 'Re-pair' algorithm developed by N. Jesper Larsson and Alistair Moffat in 2000, a dictionary-based compression algorithm for strings.
We also introduce a succinct coding specialized for further compressing the grammars generated by our algorithm. This is accomplished without losing the ability to directly execute queries on this compressed representation of the input tree. Finally, we compare the grammars and output files generated by a prototype of the Re-pair for Trees algorithm with those of similar compression algorithms. The results show that our algorithm outperforms its competitors in terms of compression ratio, runtime, and memory usage.
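The string version of Re-pair that this abstract generalizes can be illustrated with a minimal sketch. It is a naive quadratic-time variant written for illustration only, not the linear-time algorithm of Larsson and Moffat and not the tree algorithm of the paper; the function names `repair` and `expand` and the rule names `R0`, `R1`, ... are assumptions made for this example.

```python
from collections import Counter

def repair(text):
    """Naive string Re-pair: repeatedly replace the most frequent
    adjacent pair by a fresh nonterminal. Returns (sequence, rules)."""
    seq = list(text)
    rules = {}            # nonterminal name -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Count adjacent pairs. Overlapping occurrences (as in "aaa") are
        # over-counted here; that only affects compression quality.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        name = f"R{next_id}"   # hypothetical naming scheme for new symbols
        next_id += 1
        rules[name] = pair
        # Replace occurrences of the pair, scanning left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Decompress a single symbol back into the string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)
```

Calling `repair("abababab")` yields a short sequence plus a dictionary of rules, and joining `expand(s, rules)` over the sequence reproduces the input. The tree variant described in the abstract replaces frequent patterns of adjacent nodes rather than adjacent characters.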
XML Compression via DAGs
Unranked trees can be represented using their minimal dag (directed acyclic
graph). For XML this achieves high compression ratios due to its repetitive
markup. Unranked trees are often represented through first child/next sibling
(fcns) encoded binary trees. We study the difference in size (= number of
edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One
main finding is that the size of the dag of the binary tree can never be
smaller than the square root of the size of the minimal dag, and that there are
examples that match this bound. We introduce a new combined structure, the
hybrid dag, which is guaranteed to be smaller than (or equal in size to) both
dags. Interestingly, we find through experiments that last child/previous
sibling encodings are much better for XML compression via dags than fcns
encodings. We determine the average sizes of unranked and binary dags over a
given set of labels (under uniform distribution) in terms of their exact
generating functions, and in terms of their asymptotic behavior.
Comment: A short version of this paper appeared in the Proceedings of ICDT
201
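The two sizes compared in this abstract can be computed with a short sketch. It assumes a tree is given as a pair `(label, children)`, builds the minimal dag by mapping each distinct subtree to one node, and counts edges. The names `dag_size` and `fcns` are chosen for this example, and the choice not to count edges to empty children in the binary encoding is an assumption that may differ from the paper's convention.

```python
def dag_size(tree):
    """Edge count of the minimal dag of `tree`.
    A tree is (label, children); a child may be None (empty)."""
    ids = {}  # (label, tuple of child ids) -> id of the shared dag node

    def visit(node):
        if node is None:
            return None
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:
            ids[key] = len(ids)   # first time this subtree is seen
        return ids[key]

    visit(tree)
    # Each distinct subtree contributes its outgoing edges exactly once;
    # edges to empty children are skipped (an assumption of this sketch).
    return sum(1 for _, kids in ids for k in kids if k is not None)

def fcns(forest):
    """First child/next sibling encoding of a sequence of sibling trees.
    The left child is the first child, the right child the next sibling."""
    if not forest:
        return None
    label, children = forest[0]
    return (label, (fcns(children), fcns(forest[1:])))

# Example: a root with two identical subtrees b(c).
example = ("r", [("b", [("c", [])]), ("b", [("c", [])])])
unranked_edges = dag_size(example)          # dag of the unranked tree
binary_edges = dag_size(fcns([example]))    # dag of its fcns encoding
```

In this small example the unranked dag shares the repeated subtree `b(c)`, while the fcns encoding breaks that repetition because the first `b` has a next sibling and the second does not.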
Finger Search in Grammar-Compressed Strings
Grammar-based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. Given a grammar, the
random access problem is to compactly represent the grammar while supporting
random access, that is, given a position in the original uncompressed string
report the character at that position. In this paper we study the random access
problem with the finger search property, that is, the time for a random access
query should depend on the distance between a specified index $f$, called the
finger, and the query index $i$. We consider both a static variant,
where we first place a finger and subsequently access indices near the finger
efficiently, and a dynamic variant where moving the finger is also supported,
in time that depends on the distance moved.
Let $n$ be the size of the grammar, and let $N$ be the size of the string. For
the static variant we give a linear space representation that supports placing
the finger in $O(\log N)$ time and subsequently accessing in $O(\log D)$ time,
where $D$ is the distance between the finger and the accessed index. For the
dynamic variant we give a linear space representation that supports placing the
finger in $O(\log N)$ time and accessing and moving the finger in
$O(\log D + \log \log N)$ time. Compared to the best linear space solution to random
access, we improve an $O(\log N)$ query bound to $O(\log D)$ for the static
variant and to $O(\log D + \log \log N)$ for the dynamic variant, while
maintaining linear space. As an application of our results we obtain an
improved solution to the longest common extension problem in grammar compressed
strings. To obtain our results, we introduce several new techniques of
independent interest, including a novel van Emde Boas style decomposition of
grammars.
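For orientation, the plain random access problem can be sketched as follows. This is only the textbook baseline, whose query time is proportional to the height of the grammar; it is not the finger search structure of the paper. The grammar format (a dict mapping each nonterminal to a pair of symbols) and the names `expansion_lengths` and `access` are assumptions made for this example.

```python
def expansion_lengths(rules, start):
    """Length of the string derived by each nonterminal reachable from start.
    `rules` maps a nonterminal to a pair (left, right); any symbol that is
    not a key of `rules` is treated as a terminal of length 1."""
    memo = {}

    def size(sym):
        if sym not in rules:
            return 1
        if sym not in memo:
            left, right = rules[sym]
            memo[sym] = size(left) + size(right)
        return memo[sym]

    size(start)
    return memo

def access(rules, lengths, start, i):
    """Return the character at position i (0-based) of the string derived
    from `start`, without decompressing the whole string."""
    if not 0 <= i < lengths.get(start, 1):
        raise IndexError("position out of range")
    sym = start
    while sym in rules:
        left, right = rules[sym]
        left_len = lengths.get(left, 1)
        if i < left_len:
            sym = left          # position falls in the left part
        else:
            i -= left_len       # skip the left part
            sym = right
    return sym

# Example grammar deriving "abcabc": S -> A A, A -> B c, B -> a b.
rules = {"S": ("A", "A"), "A": ("B", "c"), "B": ("a", "b")}
lengths = expansion_lengths(rules, "S")
text = "".join(access(rules, lengths, "S", i) for i in range(lengths["S"]))
```

Each query here restarts from the start symbol. The finger search setting described in the abstract instead makes the cost depend on the distance from a previously placed finger.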
Linear Compressed Pattern Matching for Polynomial Rewriting (Extended Abstract)
This paper is an extended abstract of an analysis of term rewriting where the
terms in the rewrite rules as well as the term to be rewritten are compressed
by a singleton tree grammar (STG). This form of compression is more general
than node sharing or representing terms as dags, since partial trees
(contexts) can also be shared in the compression. In the first part, efficient
but complex algorithms for detecting applicability of a rewrite rule under
STG-compression are constructed and analyzed. The second part applies these
results to term rewriting sequences.
The main result for submatching is that finding a redex of a left-linear rule
can be performed in polynomial time under STG-compression.
The main implications for rewriting and (single-position or parallel)
rewriting steps are: (i) under STG-compression, n rewriting steps can be
performed in nondeterministic polynomial time; (ii) under STG-compression and
for left-linear rewrite rules, a sequence of n rewriting steps can be performed
in polynomial time; and (iii) for compressed rewrite rules where the left-hand
sides are either DAG-compressed or ground and STG-compressed, and an
STG-compressed target term, n rewriting steps can be performed in polynomial
time.
Comment: In Proceedings TERMGRAPH 2013, arXiv:1302.599