613 research outputs found

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    Full text link
    In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet ofconstant size. Each new symbol can be prepended to TT in O(1) worst-case time. At any moment, we can report all occurrences of a pattern PP in the current text in O(P+k)O(|P|+k) time, where P|P| is the length of PP and kk is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08})

    Finger Search in Grammar-Compressed Strings

    Get PDF
    Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index ff, called the \emph{finger}, and the query index ii. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant where also moving the finger such that the time depends on the distance moved is supported. Let nn be the size the grammar, and let NN be the size of the string. For the static variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and subsequently accessing in O(logD)O(\log D) time, where DD is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and accessing and moving the finger in O(logD+loglogN)O(\log D + \log \log N) time. Compared to the best linear space solution to random access, we improve a O(logN)O(\log N) query bound to O(logD)O(\log D) for the static variant and to O(logD+loglogN)O(\log D + \log \log N) for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars

    Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

    Full text link
    This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g.~records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure of is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, online suffix tree can be constructed in worst case time O(logn)O(\log n) per input symbol (as opposed to amortized O(logn)O(\log n) time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves O(logn)O(\log n) worst case time per input symbol. Searching for a pattern of length mm in the resulting suffix tree takes O(min(mlogΣ,m+logn)+tocc)O(\min(m\log |\Sigma|, m + \log n) + tocc) time, where tocctocc is the number of occurrences of the pattern. The paper also describes more applications and show how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance

    Fully-online Construction of Suffix Trees for Multiple Texts

    Get PDF
    We consider fully-online construction of indexing data structures for multiple texts. Let T = {T_1, ..., T_K} be a collection of texts. By fully-online, we mean that a new character can be appended to any text in T at any time. This is a natural generalization of semi-online construction of indexing data structures for multiple texts in which, after a new character is appended to the kth text T_k, then its previous texts T_1, ..., T_k-1 will remain static. Our fully-online scenario arises when we maintain dynamic indexes for multi-sensor data. Let N and sigma denote the total length of texts in T and the alphabet size, respectively. We first show that the algorithm by Blumer et al. [Theoretical Computer Science, 40:31-55, 1985] to construct the directed acyclic word graph (DAWG) for T can readily be extended to our fully-online setting, retaining O(N log sigma)-time and O(N)-space complexities. Then, we give a sophisticated fully-online algorithm which constructs the suffix tree for T in O(N log sigma) time and O(N) space. A key idea of this algorithm is synchronized maintenance of the DAWG and the suffix tree

    String Searching with Ranking Constraints and Uncertainty

    Get PDF
    Strings play an important role in many areas of computer science. Searching pattern in a string or string collection is one of the most classic problems. Different variations of this problem such as document retrieval, ranked document retrieval, dictionary matching has been well studied. Enormous growth of internet, large genomic projects, sensor networks, digital libraries necessitates not just efficient algorithms and data structures for the general string indexing, but indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is document retrieval query for included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contains the included patterns and does not contain the excluded patterns. We continue the previous work done on this problem and propose more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario when the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear space (in words) solutions are unlikely to yield a solution better than O(root(n/occ)) per document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern.We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on PageRank relevance metric. This problem finds motivation from search applications. It also holds theoretical interest as we show that the hardness of forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation. We also propose succinct indexes for both these problems. Position restricted pattern matching considers the scenario where only part of the text is searched. We propose succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating(resp. generic) words is to report all minimal(resp. maximal) extensions of a query pattern which are contained in at most(resp. at least) a given number of documents. These problems are motivated from applications in computational biology, text mining and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in increasingly many applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters with associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well known problems for uncertain strings paradigm and propose exact and approximate solution for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside a orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching which naturally arises in constrained pattern matching applications. Shared constraint range searching is a special four sided range reporting query problem where two constraints has sharing among them, effectively reducing the number of independent constraints. For this problem, we propose a linear space index that can match the best known bound for three dimensional dominance reporting problem. We extend our data structure in the external memory model
    corecore