67 research outputs found

    Finding an Optimal Alphabet Ordering for Lyndon Factorization Is Hard

    Get PDF

    30th Annual Symposium on Combinatorial Pattern Matching

    No full text
    This volume contains the papers presented at the 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019) held on June 18-20, 2019 in Pisa, Italy. The conference programme includes 29 contributed papers and three invited talks. Contributions of the invited lectures are also included in this volume

    On the Complexity of BWT-Runs Minimization via Alphabet Reordering

    Get PDF
    The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to the entropy-based lower bound. Recently, there has been the development of compact suffix trees in space proportional to "rr", the number of runs in the BWT, as well as the appearance of rr in the time complexity of new algorithms. Unlike other popular measures of compression, the parameter rr is sensitive to the lexicographic ordering given to the text's alphabet. Despite several past attempts to exploit this, a provably efficient algorithm for finding, or approximating, an alphabet ordering which minimizes rr has been open for years. We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time 2o(σ+n)2^{o(\sigma + \sqrt{n})} unless the Exponential Time Hypothesis fails, where σ\sigma is the size of the alphabet and nn is the length of the text. We also show that the optimization problem is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path and the number of runs in the BWT of a text, providing a surprising connection between problems on graphs and text compression. Also, by relating recent results in the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides a O(log2n)O(\log^2 n)-approximation. We provide an optimal linear-time algorithm for the problem of finding a run minimizing ordering on a subset of symbols (occurring only once) under ordering constraints, and prove a generalization of this problem to a class of graphs with BWT like properties called Wheeler graphs is NP-complete

    Longest common substring made fully dynamic

    Get PDF
    Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n)-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to this problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in Õ(n2/3) time, after Õ(n)-time and space preprocessing. 1 This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, they presented an Õ(n)-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in Õ(1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings. We show that the techniques we develop can be applied to obtain fully dynamic algorithms for all of these variants. The only previously known sublinear-time dynamic algorithms for problems on strings were for maintaining a dynamic collection of strings for comparison queries and for pattern matching, with the most recent advances made by Gawrychowski et al. [SODA 2018] and by Clifford et al. [STACS 2018]. As an intermediate problem we consider computing the solution for a string with a given set of k edits, which leads us, in particular, to answering internal queries on a string. The input to such a query is specified by a substring (or substrings) of a given string. Data structures for answering internal string queries that were proposed by Kociumaka et al. [SODA 2015] and by Gagie et al. [CCCG 2013] are used, along with new ones, based on ingredients such as the suffix tree, heavy-path decomposition, orthogonal range queries, difference covers, and string periodicity

    Substring Complexity in Sublinear Space

    Full text link
    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size zz of the Lempel-Ziv parse or the number rr of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ\gamma of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ\gamma is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function STS_T counting the cardinalities of the sets of substrings of each length of TT, also known as the substring complexity. This new measure is defined as δ=sup{ST(k)/k,k1}\delta= \sup\{S_T(k)/k, k\geq 1\} and lower bounds all the measures previously considered. In particular, δγ\delta\leq \gamma always holds and δ\delta can be computed in O(n)\mathcal{O}(n) time using Ω(n)\Omega(n) working space. Kociumaka et al. showed that if δ\delta is given, one can construct an O(δlognδ)\mathcal{O}(\delta \log \frac{n}{\delta})-sized representation of TT supporting efficient direct access and efficient pattern matching queries on TT. Given that for highly compressible strings, δ\delta is significantly smaller than nn, it is natural to pose the following question: Can we compute δ\delta efficiently using sublinear working space? It is straightforward to show that any algorithm computing δ\delta using O(b)\mathcal{O}(b) space requires Ω(n2o(1)/b)\Omega(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an O(n3/b2)\mathcal{O}(n^3/b^2)-time and O(b)\mathcal{O}(b)-space algorithm to compute δ\delta, for any b[1,n]b\in[1,n]; and an O~(n2/b)\tilde{\mathcal{O}}(n^2/b)-time and O(b)\mathcal{O}(b)-space algorithm to compute δ\delta, for any b[n2/3,n]b\in[n^{2/3},n]

    Repetition Detection in a Dynamic String

    Get PDF
    A string UU for a non-empty string U is called a square. Squares have been well-studied both from a combinatorial and an algorithmic perspective. In this paper, we are the first to consider the problem of maintaining a representation of the squares in a dynamic string S of length at most n. We present an algorithm that updates this representation in n^o(1) time. This representation allows us to report a longest square-substring of S in O(1) time and all square-substrings of S in O(output) time. We achieve this by introducing a novel tool - maintaining prefix-suffix matches of two dynamic strings. We extend the above result to address the problem of maintaining a representation of all runs (maximal repetitions) of the string. Runs are known to capture the periodic structure of a string, and, as an application, we show that our representation of runs allows us to efficiently answer periodicity queries for substrings of a dynamic string. These queries have proven useful in static pattern matching problems and our techniques have the potential of offering solutions to these problems in a dynamic text setting

    Algorithms and Lower Bounds for Ordering Problems on Strings

    Get PDF
    This dissertation presents novel algorithms and conditional lower bounds for a collection of string and text-compression-related problems. These results are unified under the theme of ordering constraint satisfaction. Utilizing the connections to ordering constraint satisfaction, we provide hardness results and algorithms for the following: recognizing a type of labeled graph amenable to text-indexing known as Wheeler graphs, minimizing the number of maximal unary substrings occurring in the Burrows-Wheeler Transformation of a text, minimizing the number of factors occurring in the Lyndon factorization of a text, and finding an optimal reference string for relative Lempel-Ziv encoding

    Efficient Data Structures for Text Processing Applications

    Get PDF
    This thesis is devoted to designing and analyzing efficient text indexing data structures and associated algorithms for processing text data. The general problem is to preprocess a given text or a collection of texts into a space-efficient index to quickly answer various queries on this data. Basic queries such as counting/reporting a given pattern\u27s occurrences as substrings of the original text are useful in modeling critical bioinformatics applications. This line of research has witnessed many breakthroughs, such as the suffix trees, suffix arrays, FM-index, etc. In this work, we revisit the following problems: 1. The Heaviest Induced Ancestors problem 2. Range Longest Common Prefix problem 3. Range Shortest Unique Substrings problem 4. Non-Overlapping Indexing problem For the first problem, we present two new space-time trade-offs that improve the space, query time, or both of the existing solutions by roughly a logarithmic factor. For the second problem, our solution takes linear space, which improves the previous result by a logarithmic factor. The techniques developed are then extended to obtain an efficient solution for our third problem, which is newly formulated. Finally, we present a new framework that yields efficient solutions for the last problem in both cache-aware and cache-oblivious models

    Dynamic and Internal Longest Common Substring

    Get PDF
    Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n) -time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to the fully dynamic LCS problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in O~ (n2 / 3) time, after O~ (n) -time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, the authors presented an O~ (n) -sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in O~ (1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings; specifically, computing the longest palindrome and the Lyndon factorization of a string after a single edit operation. We develop dynamic sublinear-time algorithms for both of these problems as well. We also consider internal LCS queries, that is, queries in which we are to return an LCS of a pair of substrings of S and T. We show that answering such queries is hard in general and propose efficient data structures for several restricted cases

    More Time-Space Tradeoffs for Finding a Shortest Unique Substring

    Get PDF
    We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m)logL) time using random access to T, or in O(n(L/m)log2(L)loglogσ) time using O((L/m)log2L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n1+ϵL/m) time using O(nϵL/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(nτlogσlogn) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter
    corecore