69 research outputs found

    Data Structure Lower Bounds for Document Indexing Problems

    Get PDF
    We study data structure problems related to document indexing and pattern matching queries and our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with the current techniques. Often our lower bounds match the known space-query time trade-off curve and in fact for all the problems considered, there is a very good and reasonable match between the our lower bounds and the known upper bounds, at least for some choice of input parameters. The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, or forbidden- pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at the query time.Comment: Full version of the conference version that appeared at ICALP 2016, 25 page

    More Time-Space Tradeoffs for Finding a Shortest Unique Substring

    Get PDF
    We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m)logL) time using random access to T, or in O(n(L/m)log2(L)loglogσ) time using O((L/m)log2L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n1+ϵL/m) time using O(nϵL/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(nτlogσlogn) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter

    Dynamic and Internal Longest Common Substring

    Get PDF
    Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n) -time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to the fully dynamic LCS problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in O~ (n2 / 3) time, after O~ (n) -time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, the authors presented an O~ (n) -sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in O~ (1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings; specifically, computing the longest palindrome and the Lyndon factorization of a string after a single edit operation. We develop dynamic sublinear-time algorithms for both of these problems as well. We also consider internal LCS queries, that is, queries in which we are to return an LCS of a pair of substrings of S and T. We show that answering such queries is hard in general and propose efficient data structures for several restricted cases

    Efficient data structures for range shortest unique substring queries†

    Get PDF
    Let T[1, n] be a string of length n and T[i, j] be the substring of T starting at position i and ending at position j. A substring T[i, j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α, β], return a shortest substring T[i, j] of T with exactly one occurrence in [α, β]. We present an O(n log n)-word data structure with O(logw n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing for us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√ n logɛ n) query time, where ɛ > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012]

    Efficient Data Structures for Text Processing Applications

    Get PDF
    This thesis is devoted to designing and analyzing efficient text indexing data structures and associated algorithms for processing text data. The general problem is to preprocess a given text or a collection of texts into a space-efficient index to quickly answer various queries on this data. Basic queries such as counting/reporting a given pattern\u27s occurrences as substrings of the original text are useful in modeling critical bioinformatics applications. This line of research has witnessed many breakthroughs, such as the suffix trees, suffix arrays, FM-index, etc. In this work, we revisit the following problems: 1. The Heaviest Induced Ancestors problem 2. Range Longest Common Prefix problem 3. Range Shortest Unique Substrings problem 4. Non-Overlapping Indexing problem For the first problem, we present two new space-time trade-offs that improve the space, query time, or both of the existing solutions by roughly a logarithmic factor. For the second problem, our solution takes linear space, which improves the previous result by a logarithmic factor. The techniques developed are then extended to obtain an efficient solution for our third problem, which is newly formulated. Finally, we present a new framework that yields efficient solutions for the last problem in both cache-aware and cache-oblivious models

    Space-Time Trade-Offs for the Shortest Unique Substring Problem

    Get PDF
    Given a string X[1, n] and a position k in [1, n], the Shortest Unique Substring of X covering k, denoted by S_k, is a substring X[i, j] of X which satisfies the following conditions: (i) i leq k leq j, (ii) i is the only position where there is an occurrence of X[i, j], and (iii) j - i is minimized. The best-known algorithm [Hon et al., ISAAC 2015] can find S k for all k in [1, n] in time O(n) using the string X and additional 2n words of working space. Let tau be a given parameter. We present the following new results. For any given k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n tau) time using X and additional O(n/tau) words of working space. For every k in [1, n], we can compute S_k via a deterministic algorithm in O(n tau^2 log n/tau) time using X and additional O(n/tau) words and 4n + o(n) bits of working space. For both problems above, we present an O(n tau log^{c+1} n)-time randomized algorithm that uses n/ log c n words in addition to that mentioned above, where c geq 0 is an arbitrary constant. In this case, the reported string is unique and covers k, but with probability at most n^{-O(1)}may not be the shortest. As a consequence of our techniques, we also obtain similar space-and-time tradeoffs for a related problem of finding Maximal Unique Matches of two strings [Delcher et al., Nucleic Acids Res. 1999]

    Efficient indexing of necklaces and irreducible polynomials over finite fields

    Full text link
    We study the problem of indexing irreducible polynomials over finite fields, and give the first efficient algorithm for this problem. Specifically, we show the existence of poly(n, log q)-size circuits that compute a bijection between {1, ... , |S|} and the set S of all irreducible, monic, univariate polynomials of degree n over a finite field F_q. This has applications in pseudorandomness, and answers an open question of Alon, Goldreich, H{\aa}stad and Peralta[AGHP]. Our approach uses a connection between irreducible polynomials and necklaces ( equivalence classes of strings under cyclic rotation). Along the way, we give the first efficient algorithm for indexing necklaces of a given length over a given alphabet, which may be of independent interest

    String Searching with Ranking Constraints and Uncertainty

    Get PDF
    Strings play an important role in many areas of computer science. Searching pattern in a string or string collection is one of the most classic problems. Different variations of this problem such as document retrieval, ranked document retrieval, dictionary matching has been well studied. Enormous growth of internet, large genomic projects, sensor networks, digital libraries necessitates not just efficient algorithms and data structures for the general string indexing, but indexes for texts with fuzzy information and support for queries with different constraints. This dissertation addresses some of these problems and proposes indexing solutions. One such variation is document retrieval query for included and excluded/forbidden patterns, where the objective is to retrieve all the relevant documents that contains the included patterns and does not contain the excluded patterns. We continue the previous work done on this problem and propose more efficient solution. We conjecture that any significant improvement over these results is highly unlikely. We also consider the scenario when the query consists of more than two patterns. The forbidden pattern problem suffers from the drawback that linear space (in words) solutions are unlikely to yield a solution better than O(root(n/occ)) per document reporting time, where n is the total length of the documents and occ is the number of output documents. Continuing this path, we introduce a new variation, namely document retrieval with forbidden extension query, where the forbidden pattern is an extension of the included pattern.We also address the more general top-k version of the problem, which retrieves the top k documents, where the ranking is based on PageRank relevance metric. This problem finds motivation from search applications. It also holds theoretical interest as we show that the hardness of forbidden pattern problem is alleviated in this problem. We achieve linear space and optimal query time for this variation. We also propose succinct indexes for both these problems. Position restricted pattern matching considers the scenario where only part of the text is searched. We propose succinct index for this problem with efficient query time. An important application for this problem stems from searching in genomic sequences, where only part of the gene sequence is searched for interesting patterns. The problem of computing discriminating(resp. generic) words is to report all minimal(resp. maximal) extensions of a query pattern which are contained in at most(resp. at least) a given number of documents. These problems are motivated from applications in computational biology, text mining and automated text classification. We propose succinct indexes for these problems. Strings with uncertainty and fuzzy information play an important role in increasingly many applications. We propose a general framework for indexing uncertain strings such that a deterministic query string can be searched efficiently. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable characters with associated probability of occurrence for each character. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We consider two basic problems of string searching, namely substring searching and string listing. We formulate these well known problems for uncertain strings paradigm and propose exact and approximate solution for them. We also discuss a constrained variation of orthogonal range searching. Given a set of points, the task of orthogonal range searching is to build a data structure such that all the points inside a orthogonal query region can be reported. We introduce a new variation, namely shared constraint range searching which naturally arises in constrained pattern matching applications. Shared constraint range searching is a special four sided range reporting query problem where two constraints has sharing among them, effectively reducing the number of independent constraints. For this problem, we propose a linear space index that can match the best known bound for three dimensional dominance reporting problem. We extend our data structure in the external memory model

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF
    corecore