1,106 research outputs found

    Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

    We study the approximate string matching and regular expression matching problems for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.

    A Faster Implementation of Online Run-Length Burrows-Wheeler Transform

    Run-length encoding Burrows-Wheeler Transformed strings, resulting in Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive strings. We propose a new algorithm for online RLBWT working in run-compressed space, which runs in $O(n\lg r)$ time and $O(r\lg n)$ bits of space, where $n$ is the length of the input string $S$ received so far and $r$ is the number of runs in the BWT of the reversed $S$. We improve the state-of-the-art algorithm for online RLBWT in terms of empirical construction time. Adopting the dynamic list for maintaining a total order, we can replace rank queries in a dynamic wavelet tree on a run-length compressed string by the direct comparison of labels in a dynamic list. The empirical results for various benchmarks show the efficiency of our algorithm, especially for highly repetitive strings. Comment: In Proc. IWOCA201
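
    As a rough illustration of the quantity $r$ used in the abstract above, the following Python sketch builds a BWT naively from sorted rotations and run-length encodes it. It is an offline, quadratic-time toy and not the online algorithm the paper proposes; the function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made only for this example, and the sketch transforms the string as given rather than its reverse.

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Illustrative only: this sorts all rotations explicitly, so it is far
    slower than the run-compressed online construction in the abstract.
    The sentinel is assumed to be absent from s and to sort before
    every character of s.
    """
    if sentinel in s:
        raise ValueError("sentinel must not occur in the input")
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse s into (character, run length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    transformed = bwt("banana")
    runs = run_length_encode(transformed)
    print(transformed, runs, len(runs))
```

    The length of the returned run list plays the role of $r$; on highly repetitive inputs it tends to be much smaller than the string length, which is what makes run-compressed space attractive.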

    Efficient LZ78 factorization of grammar compressed text

    We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - \alpha)$ where $\alpha \geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_\sigma N)$ where $\sigma$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $\sigma$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small. Comment: SPIRE 201
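
    For readers unfamiliar with LZ78, the sketch below computes the factorization of an ordinary, uncompressed Python string with a dictionary-based trie. It is the plain baseline on uncompressed text, not the SLP-based algorithm of the abstract above; the name `lz78_factorize` and the trie encoding are assumptions chosen for this example.

```python
def lz78_factorize(text: str) -> list[str]:
    """Return the LZ78 factors of text, in order.

    Each factor is the longest previously seen factor extended by one
    new character. The trie is stored as a dict keyed by
    (parent node id, character); node 0 is the root.
    """
    trie: dict[tuple[int, str], int] = {}
    factors: list[str] = []
    node = 0   # current trie node while extending a match
    start = 0  # start position of the factor being built
    for i, ch in enumerate(text):
        key = (node, ch)
        if key in trie:
            # The current prefix is a known factor: keep extending it.
            node = trie[key]
        else:
            # New factor = known factor + one fresh character.
            trie[key] = len(factors) + 1
            factors.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):
        # The text ended in the middle of a known factor.
        factors.append(text[start:])
    return factors


if __name__ == "__main__":
    print(lz78_factorize("aaaaaa"))
    print(lz78_factorize("abab"))
```

    The number of factors returned corresponds to $m$ in the abstract; concatenating the factors always reproduces the input.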

    Dynamic Fluctuation Phenomena in Double Membrane Films

    Dynamics of double membrane films is investigated in the long-wavelength limit including the overdamped squeezing mode. We demonstrate that thermal fluctuations essentially modify the character of the mode due to its nonlinear coupling to the transversal shear hydrodynamic mode. The corresponding Green function acquires, as a function of the frequency, a cut along the imaginary semi-axis. Fluctuations increase the attenuation of the squeezing mode, which becomes larger than the 'bare' value. Comment: 7 pages, Revte

    Numerical Observation of a Tubular Phase in Anisotropic Membranes

    We provide the first numerical evidence for the existence of a tubular phase, predicted by Radzihovsky and Toner (RT), for anisotropic tethered membranes without self-avoidance. Incorporating anisotropy into the bending rigidity of a simple model of a tethered membrane with free boundary conditions, we show that the model indeed has two phase transitions corresponding to the flat-to-tubular and tubular-to-crumpled transitions. For the tubular phase we measure the Flory exponent $\nu_F$ and the roughness exponent $\zeta$. We find $\nu_F=0.305(14)$ and $\zeta=0.895(60)$, which are in reasonable agreement with the theoretical predictions of RT, namely $\nu_F=1/4$ and $\zeta=1$. Comment: 8 pages, LaTeX, REVTEX, final published versio

    Composite repetition-aware data structures

    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT also enables a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal. Comment: (the name of the third co-author was inadvertently omitted from previous version

    Analysis of the uncertainty in the monetary valuation of ecosystem services - a case study at the river basin scale

    Ecosystem services provide multiple benefits to human wellbeing and are increasingly considered by policy-makers in environmental management. However, the uncertainty related to the monetary valuation of these benefits is not yet adequately defined or integrated by policy-makers. Given this background, our aim was to quantify different sources of uncertainty when performing monetary valuation of ecosystem services, in order to provide a series of guidelines to reduce them. With an example of 4 ecosystem services (i.e., water provisioning, waste treatment, erosion protection, and habitat for species) provided at the river basin scale, we quantified the uncertainty associated with the following sources: (1) the number of services considered, (2) the number of benefits considered for each service, (3) the valuation metrics (i.e. valuation methods) used to value benefits, and (4) the uncertainty of the parameters included in the valuation metrics. Results indicate that the highest uncertainty was caused by the number of services considered, as well as by the number of benefits considered for each service, whereas the parametric uncertainty was similar to the one related to the selection of valuation metric, thus suggesting that the parametric uncertainty, which is the only uncertainty type commonly considered, was less critical than the structural uncertainty, which is in turn mainly dependent on the decision-making context. Given the uncertainty associated with the valuation structure, special attention should be given to the selection of services, benefits and metrics according to a given context.

    Suffix Tree of Alignment: An Efficient Index for Similar Data

    We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires $O(|A| + l_d + l_1 + l_2)$ time where $l_2$ is the sum of the lengths of other common substrings of $A$ and $B$. When the suffix tree of $A$ is already given, it requires $O(l_d + l_1 + l_2)$ time. Comment: 12 page

    Dictionary-based methods for information extraction

    In this paper, we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called dictionary of a sequence. Dictionaries are intrinsically interesting, and a study of their features can be very useful for investigating the properties of the sequences they have been extracted from, e.g. DNA strings. We then describe a procedure of string comparison between dictionary-created sequences (or artificial texts) that gives very good results in several contexts. We finally present some results on self-consistent classification problems.