20 research outputs found
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small.Comment: SPIRE 201
Speeding-up -gram mining on grammar-based compressed texts
We present an efficient algorithm for calculating -gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP of size that represents string , the
algorithm computes the occurrence frequencies of all -grams in , by
reducing the problem to the weighted -gram frequencies problem on a
trie-like structure of size , where
is a quantity that represents the amount of
redundancy that the SLP captures with respect to -grams. The reduced problem
can be solved in linear time. Since , the running time of our
algorithm is , improving our
previous algorithm when
Minimal Absent Words in Rooted and Unrooted Trees
We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n
Computing Runs on a Trie
A maximal repetition, or run, in a string, is a maximal periodic substring whose smallest period is at most half the length of the substring. In this paper, we consider runs that correspond to a path on a trie, or in other words, on a rooted edge-labeled tree where the endpoints of the path must be a descendant/ancestor of the other. For a trie with n edges, we show that the number of runs is less than n. We also show an O(n sqrt{log n}log log n) time and O(n) space algorithm for counting and finding the shallower endpoint of all runs. We further show an O(n log n) time and O(n) space algorithm for finding both endpoints of all runs. We also discuss how to improve the running time even more
Compact q-gram Profiling of Compressed Strings
We consider the problem of computing the q-gram profile of a string \str of
size compressed by a context-free grammar with production rules. We
present an algorithm that runs in expected time and uses
O(n+q+\kq) space, where is the exact number of characters
decompressed by the algorithm and \kq\leq N-\alpha is the number of distinct
q-grams in \str. This simultaneously matches the current best known time
bound and improves the best known space bound. Our space bound is
asymptotically optimal in the sense that any algorithm storing the grammar and
the q-gram profile must use \Omega(n+q+\kq) space. To achieve this we
introduce the q-gram graph that space-efficiently captures the structure of a
string with respect to its q-grams, and show how to construct it from a
grammar
Tying up the loose ends in fully LZW-compressed pattern matching
We consider a natural generalization of the classical pattern matching problem: given compressed representations of a pattern p[1..M] and a text t[1..N] of sizes m and n, respectively, does p occur in t? We develop an optimal linear time solution for the case when p and t are compressed using the LZW method. This improves the previously known O((n+m)log(n+m)) time solution of Gasieniec and Rytter, and essentially closes the line of research devoted to tudying LZW-compressed exact pattern matching