Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
We study the approximate string matching and regular expression matching
problem for the case when the text to be searched is compressed with the
Ziv-Lempel adaptive dictionary compression schemes. We present a time-space
trade-off that leads to algorithms improving the previously known complexities
for both problems. In particular, we significantly improve the space bounds,
which in practical applications are likely to be a bottleneck.
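For orientation, the approximate string matching problem itself can be sketched on uncompressed text with the classic column-wise dynamic program (often attributed to Sellers). This is not the compressed-text algorithm of the abstract above; it only illustrates what is being computed. The function name `approx_occurrences` and its parameters are assumptions made for this sketch.

```python
def approx_occurrences(pattern, text, k):
    """Return 1-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Plain O(len(pattern) * len(text)) dynamic programming on the
    uncompressed text; illustrative only.
    """
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and a suffix of the
    # text read so far. A match may start anywhere, so col[0] stays 0.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text, 1):
        prev_diag = col[0]
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra character in the text
                         col[i - 1] + 1)    # pattern character left unmatched
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With `k = 0` the function reports exact occurrences only; larger `k` admits insertions, deletions, and substitutions.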
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding Burrows-Wheeler Transformed strings, resulting in
Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in time and bits of space, where
is the length of input string received so far and is the number of runs
in the BWT of the reversed . We improve the state-of-the-art algorithm for
online RLBWT in terms of empirical construction time. Adopting the dynamic list
for maintaining a total order, we can replace rank queries in a dynamic wavelet
tree on a run-length compressed string by the direct comparison of labels in a
dynamic list. The empirical results for various benchmarks show the efficiency
of our algorithm, especially for highly repetitive strings. Comment: In Proc. IWOCA201
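As a rough illustration of what an RLBWT is, the following sketch builds the BWT naively from sorted rotations and then run-length encodes it. It is offline and uses space quadratic in the input, so it is not the online, run-compressed construction described in the abstract above. The sentinel `$` and the function names are assumptions made for this sketch.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    other character. For illustration only.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rlbwt(text):
    """Run-length encoded BWT of `text` (naive, offline)."""
    return run_length_encode(bwt(text))
```

The number of pairs returned by `rlbwt` is the number of BWT runs, the repetitiveness measure that run-compressed space bounds are stated in.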
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small. Comment: SPIRE 201
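For reference, LZ78 factorization of a plain, uncompressed string can be sketched with a dictionary-backed trie, as below. This is the textbook procedure on the uncompressed string, not the SLP-based algorithm of the abstract above. The name `lz78_factorize` is an assumption made for this sketch.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of `text` as a list of substrings.

    Each factor is the longest previous factor that prefixes the
    remaining text, extended by one character. The final factor may
    repeat an earlier one if the text ends mid-match.
    """
    trie = {}      # (parent node id, char) -> child node id; root is 0
    factors = []
    node = 0       # current position in the trie
    next_id = 1
    start = 0      # start index of the factor being built
    for i, ch in enumerate(text):
        key = (node, ch)
        if key in trie:
            # Still inside a previously seen factor; keep extending.
            node = trie[key]
        else:
            # New factor: a known factor plus one fresh character.
            trie[key] = next_id
            next_id += 1
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        factors.append(text[start:])
    return factors
```

Concatenating the returned factors reproduces the input, which is a convenient sanity check when experimenting.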
Dynamic Fluctuation Phenomena in Double Membrane Films
Dynamics of double membrane films is investigated in the long-wavelength
limit including the overdamped squeezing mode. We demonstrate that thermal
fluctuations essentially modify the character of the mode due to its nonlinear
coupling to the transversal shear hydrodynamic mode. The corresponding Green
function acquires as a function of the frequency a cut along the imaginary
semi-axis. Fluctuations increase the attenuation of the squeezing
mode: it becomes larger than the `bare' value. Comment: 7 pages, Revte
Numerical Observation of a Tubular Phase in Anisotropic Membranes
We provide the first numerical evidence for the existence of a tubular phase,
predicted by Radzihovsky and Toner (RT), for anisotropic tethered membranes
without self-avoidance. Incorporating anisotropy into the bending rigidity of a
simple model of a tethered membrane with free boundary conditions, we show that
the model indeed has two phase transitions corresponding to the flat-to-tubular
and tubular-to-crumpled transitions. For the tubular phase we measure the Flory
exponent and the roughness exponent . We find
and , which are in reasonable agreement with the theoretical
predictions of RT --- and . Comment: 8 pages, LaTeX, REVTEX, final published versio
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT enables also a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal. Comment: (the name of the third co-author was inadvertently omitted from
previous version
Analysis of the uncertainty in the monetary valuation of ecosystem services - a case study at the river basin scale
Ecosystem services provide multiple benefits to human wellbeing and are increasingly considered by policy-makers in environmental management. However, the uncertainty related with the monetary valuation of these benefits is not yet adequately defined or integrated by policy-makers. Given this background, our aim was to quantify different sources of uncertainty when performing monetary valuation of ecosystem services, in order to provide a series of guidelines to reduce them. With an example of 4 ecosystem services (i.e., water provisioning, waste treatment, erosion protection, and habitat for species) provided at the river basin scale, we quantified the uncertainty associated with the following sources: (1) the number of services considered, (2) the number of benefits considered for each service, (3) the valuation metrics (i.e. valuation methods) used to value benefits, and (4) the uncertainty of the parameters included in the valuation metrics. Results indicate that the highest uncertainty was caused by the number of services considered, as well as by the number of benefits considered for each service, whereas the parametric uncertainty was similar to the one related to the selection of valuation metric, thus suggesting that the parametric uncertainty, which is the only uncertainty type commonly considered, was less critical than the structural uncertainty, which is in turn mainly dependent on the decision-making context. Given the uncertainty associated to the valuation structure, special attention should be given to the selection of services, benefits and metrics according to a given context.
Suffix Tree of Alignment: An Efficient Index for Similar Data
We consider an index data structure for similar strings. The generalized
suffix tree can be a solution for this. The generalized suffix tree of two
strings and is a compacted trie representing all suffixes in and
. It has leaves and can be constructed in time.
However, if the two strings are similar, the generalized suffix tree is not
efficient because it does not exploit the similarity which is usually
represented as an alignment of and .
In this paper we propose a space/time-efficient suffix tree of alignment
which wisely exploits the similarity in an alignment. Our suffix tree for an
alignment of and has leaves where is the sum of
the lengths of all parts of different from and is the sum of the
lengths of some common parts of and . We did not compromise the pattern
search to reduce the space. Our suffix tree can be searched for a pattern
in time where is the number of occurrences of in and
. We also present an efficient algorithm to construct the suffix tree of
alignment. When the suffix tree is constructed from scratch, the algorithm
requires time where is the sum of the lengths
of other common substrings of and . When the suffix tree of is
already given, it requires time. Comment: 12 page
Dictionary-based methods for information extraction
In this paper, we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called dictionary of a sequence. Dictionaries are intrinsically interesting, and a study of their features can be very useful for investigating the properties of the sequences they have been extracted from, e.g., DNA strings. We then describe a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts. We finally present some results on self-consistent classification problems.