834 research outputs found

    Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

    Full text link
    This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g.~records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure of is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, online suffix tree can be constructed in worst case time O(logn)O(\log n) per input symbol (as opposed to amortized O(logn)O(\log n) time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves O(logn)O(\log n) worst case time per input symbol. Searching for a pattern of length mm in the resulting suffix tree takes O(min(mlogΣ,m+logn)+tocc)O(\min(m\log |\Sigma|, m + \log n) + tocc) time, where tocctocc is the number of occurrences of the pattern. The paper also describes more applications and show how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance

    Space-efficient computation of the LCP array from the Burrows-Wheeler transform

    Get PDF
    We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1, \u3c3] can be computed from the Burrows-Wheeler transformed collection in O(n log \u3c3) time using o(n log \u3c3) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM

    Wheeler graphs: A framework for BWT-based data structures

    Get PDF
    The famous Burrows\u2013Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs

    A new class of string transformations for compressed text indexing

    Get PDF
    Introduced about thirty years ago in the field of data compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides being a booster of the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties of BWT has been a challenge for many researchers for a long time. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of BWT. As a further result, we show that such new string transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string

    Coding Solutions for the Secure Biometric Storage Problem

    Full text link
    The paper studies the problem of securely storing biometric passwords, such as fingerprints and irises. With the help of coding theory Juels and Wattenberg derived in 1999 a scheme where similar input strings will be accepted as the same biometric. In the same time nothing could be learned from the stored data. They called their scheme a "fuzzy commitment scheme". In this paper we will revisit the solution of Juels and Wattenberg and we will provide answers to two important questions: What type of error-correcting codes should be used and what happens if biometric templates are not uniformly distributed, i.e. the biometric data come with redundancy. Answering the first question will lead us to the search for low-rate large-minimum distance error-correcting codes which come with efficient decoding algorithms up to the designed distance. In order to answer the second question we relate the rate required with a quantity connected to the "entropy" of the string, trying to estimate a sort of "capacity", if we want to see a flavor of the converse of Shannon's noisy coding theorem. Finally we deal with side-problems arising in a practical implementation and we propose a possible solution to the main one that seems to have so far prevented real life applications of the fuzzy scheme, as far as we know.Comment: the final version appeared in Proceedings Information Theory Workshop (ITW) 2010, IEEE copyrigh

    Parameterized Strings: Algorithms and Data Structures

    Get PDF
    A parameterized string (p-string) T = T[1] T[2]...T[n] is a sophisticated string of length n composed of symbols from a constant alphabet Sigma and a parameter alphabet pi. Given a pair of p-strings S and T, the parameterized pattern matching (p-match) problem is to verify whether the individual constant symbols match and whether there exists a bijection between the parameter symbols of S and T. If the two conditions are met, S is said to be a p-match of T. A significant breakthrough in the p-match area is the prev encoding, which is proven to identify a p-match between S and T if and only if prev(S) == prev(T). In order to utilize suffix data structures in terms of p-matching, we must account for the dynamic nature of the parameterized suffixes (p-suffixes) of T, namely prev(T[ i...n]) ∀ i, 1 ≤ i ≤ n.;In this work, we propose transformative approaches to the direct parameterized suffix sorting (p-suffix sorting) problem by generating and sorting lexicographically numeric fingerprints and arithmetic codes that correspond to individual p-suffixes. Our algorithm to p-suffix sort via fingerprints is the first theoretical linear time algorithm for p-suffix sorting for non-binary parameter alphabets, which assumes that each code is represented by a practical integer. We eliminate the key problems of fingerprints by introducing an algorithm that exploits the ordering of arithmetic codes to sort p-suffixes in linear time on average.;The longest previous factor (LPF) problem is defined for traditional strings exclusively from the constant alphabet Sigma. We generalize the LPF problem to the parameterized longest previous factor (pLPF) problem defined for p-strings. Subsequently, we present a linear time solution to construct the pLPF array. Given our pLPF algorithm, we show how to construct the pLCP (parameterized longest common prefix) array in linear time. Our algorithm is further exploited to construct the standard LPF and LCP arrays all in linear time.;We then study the structural string (s-string), a variant of the p-string that extends the p-string alphabets to include complementary parameters that correspond to one another. The s-string problem involves the new encoding schemes sencode and compl in order to identify a structural match (s-match). Current s-match solutions use a structural suffix tree (s-suffix tree) to study structural matches in RNA sequences. We introduce the suffix array, LCP, and LPF data structures for the s-string encoding schemes. Using our new data structures, we identify the first suffix array solution to the s-match problem. Our algorithms and data structures are shown to apply to s-strings and also p-strings and traditional strings

    External Batched Range Minimum Queries and LCP Construction

    Get PDF
    This work deals with I/O-optimal suffix array (SA) and longest common prefix (LCP) array construction in external memory. For this purpose, the I/O-optimale DC3 algorithm is enhanced by LCP construction and adapted accordingly to the external memory model. In this context we present a method to construct the required range minimum queries (RMQs) efficiently in external memory. The core of this work is a description and implementation of the resulting external DC3-LCP algorithm using the Stxxl - the C++ Standard Template Library for Extra Large Data Sets. Experimental results based on realistic, real-world instances rounds off this work
    corecore