205 research outputs found
Online Self-Indexed Grammar Compression
Although several grammar-based self-indexes have been proposed thus far,
their applicability is limited to offline settings where whole input texts are
prepared, thus requiring to rebuild index structures for given additional
inputs, which is often the case in the big data era. In this paper, we present
the first online self-indexed grammar compression named OESP-index that can
gradually build the index structure by reading input characters one-by-one.
Such a property is another advantage which enables saving a working space for
construction, because we do not need to store input texts in memory. We
experimentally test OESP-index on the ability to build index structures and
search query texts, and we show OESP-index's efficiency, especially
space-efficiency for building index structures.Comment: To appear in the Proceedings of the 22nd edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2015
Fully Online Grammar Compression in Constant Space
We present novel variants of fully online LCA (FOLCA), a fully online grammar
compression that builds a straight line program (SLP) and directly encodes it
into a succinct representation in an online manner. FOLCA enables a direct
encoding of an SLP into a succinct representation that is asymptotically
equivalent to an information theoretic lower bound for representing an SLP
(Maruyama et al., SPIRE'13). The compression of FOLCA takes linear time
proportional to the length of an input text and its working space depends only
on the size of the SLP, which enables us to apply FOLCA to large-scale
repetitive texts. Recent repetitive texts, however, include some noise. For
example, current sequencing technology has significant error rates, which
embeds noise into genome sequences. For such noisy repetitive texts, FOLCA
working in the SLP size consumes a large amount of memory. We present two
variants of FOLCA working in constant space by leveraging the idea behind
stream mining techniques. Experiments using 100 human genomes corresponding to
about 300GB from the 1000 human genomes project revealed the applicability of
our method to large-scale, noisy repetitive texts.Comment: This is an extended version of a proceeding accepted to Data
Compression Conference (DCC), 201
A Grammar Compression Algorithm based on Induced Suffix Sorting
We introduce GCIS, a grammar compression algorithm based on the induced
suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution
builds on the factorization performed by SAIS during suffix sorting. We
construct a context-free grammar on the input string which can be further
reduced into a shorter string by substituting each substring by its
correspondent factor. The resulting grammar is encoded by exploring some
redundancies, such as common prefixes between suffix rules, which are sorted
according to SAIS framework. When compared to well-known compression tools such
as Re-Pair and 7-zip, our algorithm is competitive and very effective at
handling repetitive string regarding compression ratio, compression and
decompression running time
A Space-Optimal Grammar Compression
A grammar compression is a context-free grammar (CFG) deriving a single string deterministically. For an input string of length N over an alphabet of size sigma, the smallest CFG is O(log N)-approximable in the offline setting and O(log N log^* N)-approximable in the online setting. In addition, an information-theoretic lower bound for representing a CFG in Chomsky normal form of n variables is log (n!/n^sigma) + n + o(n) bits. Although there is an online grammar compression algorithm that directly computes the succinct encoding of its output CFG with O(log N log^* N) approximation guarantee, the problem of optimizing its working space has remained open. We propose a fully-online algorithm that requires the fewest bits of working space asymptotically equal to the lower bound in O(N log log n) compression time. In addition we propose several techniques to boost grammar compression and show their efficiency by computational experiments
- …