Fully Online Grammar Compression in Constant Space
We present novel variants of fully online LCA (FOLCA), a fully online grammar
compression that builds a straight-line program (SLP) and directly encodes it
into a succinct representation in an online manner. FOLCA enables a direct
encoding of an SLP into a succinct representation that is asymptotically
equivalent to an information theoretic lower bound for representing an SLP
(Maruyama et al., SPIRE'13). The compression of FOLCA takes linear time
proportional to the length of an input text and its working space depends only
on the size of the SLP, which enables us to apply FOLCA to large-scale
repetitive texts. Recent repetitive texts, however, include some noise. For
example, current sequencing technology has significant error rates, which
embeds noise into genome sequences. For such noisy repetitive texts, FOLCA,
whose working space grows with the SLP size, consumes a large amount of
memory. We present two variants of FOLCA that work in constant space by
leveraging the idea behind
stream mining techniques. Experiments using 100 human genomes corresponding to
about 300GB from the 1000 human genomes project revealed the applicability of
our method to large-scale, noisy repetitive texts.
Comment: This is an extended version of a paper accepted at the Data
Compression Conference (DCC), 201
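As background for the abstracts in this list: an SLP is a CFG in which every variable has exactly one rule and the grammar as a whole derives exactly one string, so repetitive inputs admit very small SLPs. A toy sketch of expanding such a grammar (illustrative only, not the FOLCA algorithm; the variable names are hypothetical):

```python
# Toy straight-line program (SLP): a CFG in Chomsky normal form where every
# variable has exactly one rule, so the grammar derives exactly one string.
# Illustrative sketch only, not the FOLCA algorithm itself.

def expand(symbol, rules):
    """Recursively expand an SLP symbol into the string it derives."""
    if symbol not in rules:          # terminal character
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

# An SLP for the repetitive string "abababab", built bottom-up:
rules = {
    "X1": ("a", "b"),    # X1 -> ab
    "X2": ("X1", "X1"),  # X2 -> abab
    "X3": ("X2", "X2"),  # X3 -> abababab
}
print(expand("X3", rules))  # abababab
```

Three rules suffice for eight characters here; on highly repetitive texts the gap between SLP size and text length is what grammar compressors exploit.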
A Space-Optimal Grammar Compression
A grammar compression is a context-free grammar (CFG) that deterministically derives a single string. For an input string of length N over an alphabet of size sigma, the smallest CFG is O(log N)-approximable in the offline setting and O(log N log^* N)-approximable in the online setting. In addition, an information-theoretic lower bound for representing a CFG in Chomsky normal form with n variables is log (n!/n^sigma) + n + o(n) bits. Although there is an online grammar compression algorithm that directly computes the succinct encoding of its output CFG with an O(log N log^* N) approximation guarantee, the problem of optimizing its working space has remained open. We propose a fully-online algorithm whose working space is asymptotically equal to the lower bound, i.e., the fewest bits possible, with O(N log log n) compression time. In addition, we propose several techniques to boost grammar compression and show their efficiency through computational experiments.
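The quoted lower bound, log(n!/n^sigma) + n + o(n) bits, can be evaluated numerically for concrete n and sigma. A small sketch (ignoring the o(n) term, and using `lgamma` so that log2(n!) stays numerically stable for large n):

```python
# Evaluate the information-theoretic lower bound quoted above,
# log2(n!/n^sigma) + n bits (the o(n) term is ignored).
import math

def cfg_lower_bound_bits(n, sigma):
    """Approximate bits needed to represent a CNF grammar of n variables
    over an alphabet of size sigma, per the bound stated in the abstract."""
    log2_fact = math.lgamma(n + 1) / math.log(2)   # log2(n!), stable
    return log2_fact - sigma * math.log2(n) + n

print(cfg_lower_bound_bits(1000, 4))
```

The log2(n!) term dominates, so the bound is roughly n log n bits, which is what succinct SLP encodings aim to match.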
Online Self-Indexed Grammar Compression
Although several grammar-based self-indexes have been proposed thus far,
their applicability is limited to offline settings where whole input texts are
prepared in advance, thus requiring index structures to be rebuilt whenever
additional inputs arrive, which is often the case in the big data era. In this
paper, we present the first online self-indexed grammar compression, named
OESP-index, which can gradually build the index structure by reading input
characters one by one. This property has the further advantage of saving
working space during construction, because input texts need not be stored in
memory. We
experimentally test OESP-index on its ability to build index structures and
search query texts, and we show its efficiency, especially its space
efficiency in building index structures.
Comment: To appear in the Proceedings of the 22nd edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2015)
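OESP-index construction itself is involved; as a much simpler illustration of the online property described above (the structure grows as characters arrive, so the input need not be kept in memory), here is LZ78-style incremental dictionary building. This is an analogy only, not the OESP-index construction:

```python
# LZ78-style online processing: consume one character at a time, keep only
# a phrase dictionary in memory, never the whole input. Analogy only; the
# OESP-index builds a grammar and self-index instead of an LZ78 dictionary.

def lz78_online(stream):
    """Consume characters one by one, emitting (phrase_id, char) pairs."""
    dictionary = {"": 0}   # phrase -> id
    phrase = ""
    output = []
    for ch in stream:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush the final, incomplete phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_online("abababab"))
```

Each character is processed exactly once and then discarded, which is the property the abstract highlights: working space depends on the dictionary (respectively, the grammar), not on the text length.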
Compressed Membership for NFA (DFA) with Compressed Labels is in NP (P)
In this paper, a compressed membership problem for finite automata, both
deterministic and non-deterministic, with compressed transition labels is
studied. The compression is represented by straight-line programs (SLPs), i.e.
context-free grammars generating exactly one string. A novel technique of
dealing with SLPs is introduced: the SLPs are recompressed, so that substrings
of the input text are encoded in SLPs labelling the transitions of the NFA
(DFA) in the same way as in the SLP representing the input text. To this end,
the SLPs are locally decompressed and then recompressed in a uniform way.
Furthermore, such recompression induces only small changes in the automaton, in
particular, the size of the automaton remains polynomial.
Using this technique, it is shown that the compressed membership for NFA with
compressed labels is in NP, thus confirming the conjecture of Plandowski and
Rytter and extending the partial result of Lohrey and Mathissen; as it is
already known that this problem is NP-hard, this settles its exact
computational complexity. Moreover, the same technique applied to the
compressed membership for DFA with compressed labels yields that this problem
is in P; for this problem, only the trivial PSPACE upper bound was previously
known.
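For intuition on why compressed inputs need not be decompressed: in the simpler, classical setting where only the input string (not the transition labels) is SLP-compressed, DFA membership is decided in polynomial time by composing per-variable transition functions bottom-up. A sketch of that well-known case (the paper's recompression technique addresses the harder setting of compressed labels):

```python
# DFA membership for an SLP-compressed input, without decompression:
# for each SLP variable, compute the function "state before -> state after
# reading the derived substring" by composing the children's functions.
# Classical observation; not the recompression technique of the paper.

def slp_dfa_membership(rules, order, start_var, dfa, q0, accepting):
    """dfa: dict (state, char) -> state. rules: var -> (left, right) or char.
    order: variables listed so each appears after the variables it uses."""
    states = {q for (q, _) in dfa} | set(dfa.values())
    funcs = {}                         # var -> {state: state}
    for var in order:
        rhs = rules[var]
        if isinstance(rhs, str):       # terminal rule, a single character
            funcs[var] = {q: dfa[(q, rhs)] for q in states}
        else:                          # binary rule: compose child functions
            left, right = rhs
            funcs[var] = {q: funcs[right][funcs[left][q]] for q in states}
    return funcs[start_var][q0] in accepting

# DFA over {a, b} accepting strings of even length:
dfa = {(0, 'a'): 1, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 0}
rules = {"A": "a", "B": "b", "X1": ("A", "B"), "X2": ("X1", "X1")}  # X2 -> abab
print(slp_dfa_membership(rules, ["A", "B", "X1", "X2"], "X2", dfa, 0, {0}))
```

The running time is polynomial in the grammar size and the number of states, even though the derived string may be exponentially long.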
Improved ESP-index: a practical self-index for highly repetitive texts
While several self-indexes for highly repetitive texts exist, developing a
practical self-index applicable to real world repetitive texts remains a
challenge. ESP-index is a grammar-based self-index built on the notion of
edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees
upper bounds on parsing discrepancies between different appearances of the
same subtexts in a text. Although ESP-index performs efficient top-down
searches of query texts, it has a serious bottleneck in the binary searches
used to find occurrences of the variables of a query text, which slow down the
query searches. We present an improved ESP-index (ESP-index-I) by leveraging
the idea behind succinct data structures for large alphabets. While
ESP-index-I retains the same efficiency as ESP-index for top-down searches, it
avoids the binary searches by using fast rank/select operations. We experimentally
test ESP-index-I on its ability to search query texts and extract subtexts
from real-world repetitive texts at large scale, and we show that ESP-index-I
performs better than other possible approaches.
Comment: This is the full version of a paper accepted at the 11th
International Symposium on Experimental Algorithms (SEA2014)
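Rank and select are the two primitive queries that replace the binary searches. A naive O(n)-time reference implementation shows what they compute; succinct data structures answer both in constant time using only o(n) extra bits:

```python
# Reference (slow) definitions of rank and select on a bitvector.
# Succinct structures answer these in O(1); this sketch only fixes semantics.

def rank(bits, c, i):
    """Number of occurrences of bit c in the prefix bits[0:i]."""
    return bits[:i].count(c)

def select(bits, c, k):
    """Position of the k-th (1-based) occurrence of bit c, or -1 if none."""
    seen = 0
    for pos, b in enumerate(bits):
        if b == c:
            seen += 1
            if seen == k:
                return pos
    return -1

bits = [1, 0, 1, 1, 0, 1]
print(rank(bits, 1, 4))    # how many 1s among the first four bits
print(select(bits, 1, 3))  # index of the third 1
```

Note the duality: select(bits, c, rank(bits, c, i)) returns the position of the last occurrence of c before index i, which is the pattern indexes use to jump between a variable's occurrences without binary search.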
Tree Compression with Top Trees Revisited
We revisit tree compression with top trees (Bille et al, ICALP'13) and
present several improvements to the compressor and its analysis. By
significantly reducing the amount of information stored and guiding the
compression step using a RePair-inspired heuristic, we obtain a fast compressor
achieving good compression ratios, addressing an open problem posed by Bille et
al. We show how, with relatively small overhead, the compressed file can be
converted into an in-memory representation that supports basic navigation
operations in worst-case logarithmic time without decompression. We also show a
much improved worst-case bound on the size of the output of top-tree
compression (answering an open question posed in a talk on this algorithm by
Weimann in 2012).
Comment: SEA 201
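The RePair heuristic mentioned above repeatedly replaces the most frequent adjacent pair of symbols with a fresh nonterminal. A minimal string version of the idea (the compressor applies it to tree clusters; this sketch is illustrative only):

```python
# Classic RePair on a sequence: find the most frequent adjacent pair,
# replace all its (non-overlapping) occurrences with a new symbol, repeat
# until no pair occurs twice. Illustrative sketch of the heuristic only.

def repair(seq):
    seq = list(seq)
    rules = {}
    next_sym = 0
    while True:
        pairs = {}
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
        if not pairs:
            break
        best, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            break                      # no pair repeats; stop
        new = f"R{next_sym}"           # fresh nonterminal
        next_sym += 1
        rules[new] = best
        out, i = [], 0                 # rewrite seq, left to right
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

seq, rules = repair("abababab")
print(seq, rules)
```

On "abababab" this yields the two rules R0 -> ab and R1 -> R0 R0 with the residual sequence R1 R1, mirroring how repeated structure collapses into a small grammar.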
Fast and Tiny Structural Self-Indexes for XML
XML document markup is highly repetitive and therefore well compressible
using dictionary-based methods such as DAGs or grammars. In the context of
selectivity estimation, grammar-compressed trees were used before as synopsis
for structural XPath queries. Here a fully-fledged index over such grammars is
presented. The index allows arbitrary tree algorithms to be executed with a
slow-down comparable to the space improvement. More interestingly,
certain algorithms execute much faster over the index (because no decompression
occurs). E.g., for structural XPath count queries, evaluating over the index is
faster than previous XPath implementations, often by two orders of magnitude.
The index also allows XML results (including texts) to be serialized faster
than previous systems, by a factor of ca. 2-3. This is due to efficient copy
handling of grammar repetitions, and because materialization is totally
avoided. In order to compare with twig join implementations, we implemented a
materializer that writes out the pre-order numbers of result nodes, and we
show its competitiveness.
Comment: 13 page
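The sharing that makes XML markup so compressible can be seen with DAG compression: identical subtrees are stored once and referenced thereafter. A minimal sketch using hashing over a hypothetical tuple-based tree type (not the paper's grammar-based representation):

```python
# DAG compression of an ordered labeled tree: hash each subtree
# (label plus child ids) so identical subtrees share one table entry.
# Hypothetical tree encoding: node = (label, [children]).

def to_dag(node, table):
    """Return an id for `node`; identical subtrees receive the same id,
    so `table` ends up storing each distinct subtree exactly once."""
    label, children = node
    key = (label, tuple(to_dag(c, table) for c in children))
    if key not in table:
        table[key] = len(table)
    return table[key]

# <r><a><b/><b/></a><a><b/><b/></a></r> : 7 nodes, heavy repetition.
b = ("b", [])
a = ("a", [b, b])
root = ("r", [a, a])
table = {}
to_dag(root, table)
print(len(table))   # distinct subtrees actually stored
```

The seven-node tree needs only three table entries; grammar compression generalizes this by also sharing repeated tree patterns, not just whole subtrees.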