93 research outputs found
Computing LZ77 in Run-Compressed Space
In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n}
can be computed in O(R log n) bits of working space and O(n log R) time, R
being the number of runs in the Burrows-Wheeler transform of T reversed. For
extremely repetitive inputs, the working space can be as low as O(log n) bits:
exponentially smaller than the text itself. As a direct consequence of our
result, we show that a class of repetition-aware self-indexes based on a
combination of run-length encoded BWT and LZ77 can be built in asymptotically
optimal O(R + z) words of working space, z being the size of the LZ77 parsing
Comparison of LZ77-type Parsings
We investigate the relations between different variants of the LZ77 parsing
existing in the literature. All of them are defined as greedily constructed
parsings encoding each phrase by reference to a string occurring earlier in the
input. They differ by the phrase encodings: encoded by pairs (length + position
of an earlier occurrence) or by triples (length + position of an earlier
occurrence + the letter following the earlier occurring part); and they differ
by allowing or not allowing overlaps between the phrase and its earlier
occurrence. For a given string of length over an alphabet of size ,
denote the numbers of phrases in the parsings allowing (resp., not allowing)
overlaps by (resp., ) for "pairs", and by (resp.,
) for "triples". We prove the following bounds and provide series of
examples showing that these bounds are tight:
and
;
and .Comment: 6 page
Fully Online Grammar Compression in Constant Space
We present novel variants of fully online LCA (FOLCA), a fully online grammar
compression that builds a straight line program (SLP) and directly encodes it
into a succinct representation in an online manner. FOLCA enables a direct
encoding of an SLP into a succinct representation that is asymptotically
equivalent to an information theoretic lower bound for representing an SLP
(Maruyama et al., SPIRE'13). The compression of FOLCA takes linear time
proportional to the length of an input text and its working space depends only
on the size of the SLP, which enables us to apply FOLCA to large-scale
repetitive texts. Recent repetitive texts, however, include some noise. For
example, current sequencing technology has significant error rates, which
embeds noise into genome sequences. For such noisy repetitive texts, FOLCA
working in the SLP size consumes a large amount of memory. We present two
variants of FOLCA working in constant space by leveraging the idea behind
stream mining techniques. Experiments using 100 human genomes corresponding to
about 300GB from the 1000 human genomes project revealed the applicability of
our method to large-scale, noisy repetitive texts.Comment: This is an extended version of a proceeding accepted to Data
Compression Conference (DCC), 201
Online Self-Indexed Grammar Compression
Although several grammar-based self-indexes have been proposed thus far,
their applicability is limited to offline settings where whole input texts are
prepared, thus requiring to rebuild index structures for given additional
inputs, which is often the case in the big data era. In this paper, we present
the first online self-indexed grammar compression named OESP-index that can
gradually build the index structure by reading input characters one-by-one.
Such a property is another advantage which enables saving a working space for
construction, because we do not need to store input texts in memory. We
experimentally test OESP-index on the ability to build index structures and
search query texts, and we show OESP-index's efficiency, especially
space-efficiency for building index structures.Comment: To appear in the Proceedings of the 22nd edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2015
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small.Comment: SPIRE 201
Searching and Indexing Genomic Databases via Kernelization
The rapid advance of DNA sequencing technologies has yielded databases of
thousands of genomes. To search and index these databases effectively, it is
important that we take advantage of the similarity between those genomes.
Several authors have recently suggested searching or indexing only one
reference genome and the parts of the other genomes where they differ. In this
paper we survey the twenty-year history of this idea and discuss its relation
to kernelization in parameterized complexity
- …