Lempel-Ziv Parsing in External Memory
For decades, computing the LZ factorization (or LZ77 parsing) of a string has
been a requisite and computationally intensive step in many diverse
applications, including text indexing and data compression. Many algorithms for
LZ77 parsing have been discovered over the years; however, despite the
increasing need to apply LZ77 to massive data sets, no algorithm to date scales
to inputs that exceed the size of internal memory. In this paper we describe
the first algorithm for computing the LZ77 parsing in external memory. Our
algorithm is fast in practice and will allow the next generation of text
indexes to be realised for massive strings and string collections. Comment: 10 pages
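For orientation, the LZ77 parsing mentioned in this abstract can be illustrated with a naive in-memory sketch. This is not the external-memory algorithm of the paper; it is a quadratic-time baseline that assumes the common definition in which each factor is either the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed) or a single new character. The function name and output format are hypothetical.

```python
def lz77_factorize(text):
    """Naive LZ77 parsing: returns a list of factors.

    Each factor is either a single character (first occurrence of a symbol)
    or a pair (source_position, length) that copies from earlier text.
    Self-overlapping copies are allowed.
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position as a copy source.
        for src in range(i):
            length = 0
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            factors.append(text[i])          # literal: new character
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


if __name__ == "__main__":
    print(lz77_factorize("abababab"))
```

The brute-force search over all earlier positions is what makes this sketch impractical for large inputs; it is only meant to make the definition of the parsing concrete.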
Lightweight Lempel-Ziv Parsing
We introduce a new approach to LZ77 factorization that uses O(n/d) words of
working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet
sizes). We also describe carefully engineered implementations of alternative
approaches to lightweight LZ77 factorization. Extensive experiments show that
the new algorithm is superior in most cases, particularly at the lowest memory
levels and for highly repetitive data. As a part of the algorithm, we describe
new methods for computing matching statistics which may be of independent
interest. Comment: 12 pages
EERTREE: An Efficient Data Structure for Processing Palindromes in Strings
We propose a new linear-size data structure which provides fast access to
all palindromic substrings of a string or a set of strings. This structure
inherits some ideas from the construction of both the suffix trie and suffix
tree. Using this structure, we present simple and efficient solutions for a
number of problems involving palindromes. Comment: 21 pages, 2 figures. Accepted to IWOCA 201
Measuring and Understanding Throughput of Network Topologies
High throughput is of particular interest in data center and HPC networks.
Although myriad network topologies have been proposed, a broad head-to-head
comparison across topologies and across traffic patterns is absent, and the
right way to compare worst-case throughput performance is a subtle problem.
In this paper, we develop a framework to benchmark the throughput of network
topologies, using a two-pronged approach. First, we study performance on a
variety of synthetic and experimentally-measured traffic matrices (TMs).
Second, we show how to measure worst-case throughput by generating a
near-worst-case TM for any given topology. We apply the framework to study the
performance of these TMs in a wide range of network topologies, revealing
insights into the performance of topologies with scaling, robustness of
performance across TMs, and the effect of scattered workload placement. Our
evaluation code is freely available.
On Maximal Unbordered Factors
Given a string S of length n, its maximal unbordered factor is the
longest factor which does not have a border. In this work we investigate the
relationship between n and the length of the maximal unbordered factor of
S. We prove that for the alphabet of size σ >= 5 the expected length
of the maximal unbordered factor of a string of length n is at least
0.99n (for sufficiently large values of n). As an application of this result, we
propose a new algorithm for computing the maximal unbordered factor of a
string. Comment: Accepted to the 26th Annual Symposium on Combinatorial Pattern
Matching (CPM 2015)
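To make the notion of a maximal unbordered factor concrete, here is a minimal brute-force sketch. It is not the algorithm proposed in the paper; it assumes the usual definition of a border (a non-empty proper prefix that is also a suffix) and uses the classical prefix function to test each candidate factor, giving quadratic time overall. The function names are hypothetical.

```python
def prefix_function(s):
    """pi[i] = length of the longest border of s[:i+1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def maximal_unbordered_factor(s):
    """Return one longest factor of s that has no border (quadratic time)."""
    best = ""
    for start in range(len(s)):
        suffix = s[start:]
        if len(suffix) <= len(best):
            break  # no later start can yield a longer factor
        pi = prefix_function(suffix)
        # Scan prefixes of this suffix from longest to shortest,
        # considering only those strictly longer than the current best.
        for end in range(len(suffix) - 1, len(best) - 1, -1):
            if pi[end] == 0:  # prefix suffix[:end+1] is unbordered
                best = suffix[: end + 1]
                break
    return best


if __name__ == "__main__":
    print(maximal_unbordered_factor("abcab"))
```

When several factors share the maximal length, this sketch returns the leftmost one.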
Internal Pattern Matching Queries in a Text and Applications
We consider several types of internal queries: questions about subwords of a
text. As the main tool we develop an optimal data structure for the problem
called here internal pattern matching. This data structure provides
constant-time answers to queries about occurrences of one subword x in
another subword y of a given text, assuming that |y| = O(|x|),
which allows for a constant-space representation of all occurrences. This
problem can be viewed as a natural extension of the well-studied pattern
matching problem. The data structure has linear size and admits a linear-time
construction algorithm.
Using the solution to the internal pattern matching problem, we obtain very
efficient data structures answering queries about: primitivity of subwords,
periods of subwords, general substring compression, and cyclic equivalence of
two subwords. All these results improve upon the best previously known
counterparts. The linear construction time of our data structure also allows us to
improve the algorithm for finding δ-subrepetitions in a text (a more
general version of maximal repetitions, also called runs). For any fixed
δ we obtain the first linear-time algorithm, which matches the linear
time complexity of the algorithm computing runs. Our data structure has already
been used as a part of the efficient solutions for subword suffix rank &
selection, as well as substring compression using Burrows-Wheeler transform
composed with run-length encoding. Comment: 31 pages, 9 figures; accepted to SODA 201
Faster Compact On-Line Lempel-Ziv Factorization
We present a new on-line algorithm for computing the Lempel-Ziv factorization
of a string that runs in O(N log N) time and uses only O(N log σ) bits
of working space, where N is the length of the string and σ is the
size of the alphabet. This is a notable improvement compared to the performance
of previous on-line algorithms using the same order of working space but
running in either O(N log^3 N) time (Okanohara & Sadakane 2009) or
O(N log^2 N) time (Starikovskaya 2012). The key to our new algorithm is in the
utilization of an elegant but less popular index structure called Directed
Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an
opportunistic variant of our algorithm, which, given the run length encoding of
size m of a string of length N, computes the Lempel-Ziv factorization
on-line, in O(m · min{(log log m)(log log N) / (log log log N), sqrt(log m / log log m)}) time
and O(m log N) bits of space, which is faster and more space efficient when
the string is run-length compressible.
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size n representing a text S of length N, our
algorithm computes the LZ78 factorization of S in O(n sqrt(N) + m log N) time
and O(n sqrt(N) + m) space, where m is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the n sqrt(N) term in the
time and space complexities becomes either nL, where L is the length of the
longest LZ78 factor, or (N - α) where α >= 0 is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of S of a certain length. Since m = O(N / log_σ N) where
σ is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when σ is
constant, and can be more efficient when the text is compressible, i.e. when
m and n are small. Comment: SPIRE 201
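As a point of reference for the abstract above, the following is a minimal sketch of LZ78 factorization on an uncompressed string, i.e. the baseline that runs on the plain text rather than on an SLP. It assumes the standard definition in which each factor is the longest previously produced factor extended by one character; the trie is stored as a dictionary keyed by (node id, character), and the function name is hypothetical.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of text as a list of strings.

    The phrase trie is a dict mapping (parent_node_id, char) -> node_id,
    with node 0 as the root (the empty phrase).
    """
    trie = {}
    factors = []
    node = 0        # current trie node while extending the match
    start = 0       # start position of the factor being built
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep following a known phrase
        else:
            trie[(node, ch)] = len(factors) + 1   # new phrase gets the next id
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        # The text ended inside a known phrase; emit it as the last factor.
        factors.append(text[start:])
    return factors


if __name__ == "__main__":
    print(lz78_factorize("abababab"))
```

With dictionary lookups taking expected constant time, this sketch does one trie step per input character, which corresponds to the linear-time uncompressed baseline the abstract compares against.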