Search CORE

4 research outputs found

Range Shortest Unique Substring queries

Author: Abedin P. (Paniz)
Ganguly A. (Arnab)
Pissis S. (Solon)
Thankachan S.V. (Sharma)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 07/10/2019
Field of study

Let be a string of length n and be the substring of starting at position i and ending at position j. A substring of is a repeat if it occurs more than once in; otherwise, it is a unique substring of. Repeats and unique substrings are of great interest in computational biology and in information retrieval. Given string as input, the Shortest Unique Substring problem is to find a shortest substring of that does not occur elsewhere in. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over answering the following type of online queries efficiently. Given a range, return a shortest substring of with exactly one occurrence in. We present an -word data structure with query time, where is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al. ICALP 2018]

CWI's Institutional Repository

Suffix-prefix queries on a dictionary

Author: Loukides G. (Grigorios)
Pissis S. (Solon)
Thankachan S.V. (Sharma)
Zuba W.P. (Wiktor)
Publication venue
Publication date: 21/06/2023
Field of study

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S1, . . ., Sk, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1, k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ nO(1), APSP can be solved in the optimal O(n + k2) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992]. In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k2 usually dominates n, and thus the k2 factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries: One-to-One(i, j): output SPLi,j. One-to-All(i): output SPLi,j for every j ∈ [1, k]. Report(i, ℓ): output all distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Count(i, ℓ): output the number of distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Top(i, K): output K distinct j ∈ [1, k] with the highest values of SPLi,j breaking ties arbitrarily. We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ nO(1). We show the following upper bounds: Query Space (words) Query time Note One-to-One(i, j) O(n) O(log log k) Theorem 11 One-to-All(i) O(n) O(k) Theorem 14 Report(i, ℓ) O(n) O(log n/log log n + output) Theorem 19(i) Count(i, ℓ) O(n) O(log n/log log n) Theorem 19(ii) Top(i, K) O(n) O(log2 n/log log n + K) Theorem 22 We also present efficient algorithms for constructing these data structures

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Efficient data structures for range shortest unique substring queries†

Author: Abedin P. (Paniz)
Ganguly A. (Arnab)
Pissis S. (Solon)
Thankachan S.V. (Sharma)
Publication venue: 'MDPI AG'
Publication date: 01/11/2020
Field of study

Let T[1, n] be a string of length n and T[i, j] be the substring of T starting at position i and ending at position j. A substring T[i, j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α, β], return a shortest substring T[i, j] of T with exactly one occurrence in [α, β]. We present an O(n log n)-word data structure with O(logw n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing for us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√ n logɛ n) query time, where ɛ > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012]

CWI's Institutional Repository

Parameterized text indexing with one wildcard

Author: Ganguly A. (Arnab)
Hon W.-K. (Wing-Kai)
Huang Y.-A. (Yu-An)
Pissis S. (Solon)
Shah R. (Rahul)
Thankachan S.V. (Sharma)
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/03/2019
Field of study

Two equal-length strings X and Y over an alphabet Σ of size σ are a parameterized match iff X can be transformed to Y by renaming the character X[i] to the character Y[i] for 1 ≤ i ≤ |X| using a one-to-one function from the set of characters in X to the set of characters in Y. The parameterized text indexing problem is defined as: Index a text T of n characters over an alphabet set Σ of size σ, such that whenever a pattern P[1, p] comes as a query, we can report all occ parameterized occurrences of P in T. A position i [1, n] is a parameterized occurrence of P in T, iff P and T[i,(i+p-1)] are a parameterized match. We study an interesting generalization of this problem, where the pattern contains one wildcard character φ ∉ Σ that matches with any other character in Σ. Therefore, for a pattern P[1, p] = P1φP2, our task is to report all positions i in T, such that the string P_1 P_2 and the string obtained by concatenating T[i,(i+|P1|-1)] and T[(i+|P1|+1),(i+p-1)] are a parameterized match. We show that such queries can be answered in optimal O(p+occ) time per query using an O(n log n) space index. We then show how to compress our index into O(n log σ) space but with a higher query cost of O(p(log log n+logσ)+occ logσ)

CWI's Institutional Repository