    2-Dimensional String Problems: Data Structures and Quantum Algorithms

    The field of stringology studies algorithms and data structures for processing strings efficiently. The goal of this thesis is to investigate 2-dimensional (2D) variants of some fundamental string problems, including \textit{Exact Pattern Matching} and \textit{Longest Common Substring}. In the 2D pattern matching problem, we are given a matrix $M[1..n, 1..n]$ that consists of $N = n \times n$ symbols drawn from an alphabet $\Sigma$ of size $\sigma$. The query consists of an $m \times m$ square matrix $P[1..m, 1..m]$ drawn from the same alphabet, and the task is to find all the locations of $P$ in $M$. For such square patterns, data structures such as suffix trees and suffix arrays exist for efficient pattern matching. However, a suffix tree occupies $O(N \log N)$ bits, which is significantly more than the original text's size of $N \log \sigma$ bits. Therefore, the design of compressed data structures that support pattern matching queries efficiently and occupy space close to the original text's size is imperative. In this thesis, we show an interesting result by designing a compact text index of size $O(N \log\log N + N \log\sigma)$ bits that at least supports efficient inverse suffix array queries. Although the question of designing a compressed text index that would lead to efficient pattern matching remains elusive, this index gives hope for the existence of a full 2D compressed text index with all functionalities similar to those of the 1D case. On the other hand, the Longest Common 2D Substring problem takes two 2D strings (matrices) as input, and the task is to report the size of the longest common 2D substring (submatrix) of these 2D strings. It is interesting to know whether there exists a sublinear-time algorithm for solving this task. We answer this question positively by presenting a sublinear-time \textit{quantum} algorithm. In addition, we prove that any quantum algorithm requires at least $\tilde{\Omega}(N^{2/3})$ time to solve this problem.
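
To make the 2D pattern matching problem statement concrete, the following is a minimal brute-force sketch. It is not the indexed approach studied in the thesis. The function name `find_2d_occurrences` and the representation of matrices as lists of equal-length strings are assumptions made for illustration, and positions are reported 0-indexed rather than 1-indexed.

```python
def find_2d_occurrences(text, pattern):
    """Return all (row, col) positions where the m x m pattern occurs in
    the n x n text. Both arguments are lists of equal-length strings.

    Brute force: O(n^2 * m^2) character comparisons in the worst case,
    with no preprocessing of the text.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # candidate top row
        for j in range(n - m + 1):      # candidate left column
            # Compare the pattern row by row against the text window.
            if all(text[i + a][j:j + m] == pattern[a] for a in range(m)):
                hits.append((i, j))
    return hits


# Small illustrative input (a 4 x 4 checkerboard of 'a' and 'b').
example_text = ["abab", "baba", "abab", "baba"]
example_pattern = ["ab", "ba"]
example_hits = find_2d_occurrences(example_text, example_pattern)
# example_hits == [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]
```

A text index such as a 2D suffix tree replaces the scan over all candidate positions with a search whose cost depends mainly on the pattern size, at the price of the extra space discussed above.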

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w / log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma))
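
As a small illustration of the measure r, the sketch below computes the BWT of a short string by sorting its rotations and then counts the runs of equal symbols. This is a naive quadratic-space construction chosen for readability, not the index construction of the paper. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive).

    The sentinel is assumed to be absent from `text` and to sort before
    every other symbol, which holds for '$' against ASCII letters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")       # "annb$aa"
r = count_runs(transformed)       # 5 runs: a | nn | b | $ | aa
```

On highly repetitive inputs r tends to grow much more slowly than the text length n, which is why space bounds stated in terms of r are attractive.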

    String Indexing with Compressed Patterns

    Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear-space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
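
To show what a pattern "compressed with LZ77" looks like, here is a minimal sketch of a greedy LZ77-style factorization and its decoder. LZ77 has several variants; this sketch assumes the self-referential one, where each phrase is either a single new character or a (source position, length) pair copying the longest earlier-starting match, and the copy may overlap the phrase itself. The function names are hypothetical and the quadratic-time search is for illustration only; it is not the data structure from the paper.

```python
def lz77_factorize(s):
    """Greedy LZ77-style parse of s.

    Returns a list of phrases: a one-character string for a symbol with
    no earlier match, or a tuple (src, length) meaning "copy `length`
    characters starting at position `src` of the text".
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):                      # every earlier start
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1                     # match may overlap i
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append(s[i])                # fresh character
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Invert lz77_factorize, copying one character at a time so that
    overlapping sources are handled correctly."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            src, length = phrase
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


parsed = lz77_factorize("abababab")   # ['a', 'b', (0, 6)]
```

Here an 8-character pattern is described by 3 phrases; a query time stated in terms of the compressed size depends on the number of phrases rather than on the decompressed length.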

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes by up to 2500x, while incurring only a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201
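
One kind of query that n-gram language models rely on is counting how often a token sequence of arbitrary length occurs in the corpus. The sketch below answers such counts with a plain, uncompressed suffix array over tokens and two binary searches. It is a simplified stand-in for the compressed suffix tree used in the paper, and it does not implement any smoothing scheme. The names `build_suffix_array` and `count_ngram` are assumptions made for illustration.

```python
def build_suffix_array(tokens):
    """Suffix array over a token list, built by naive sorting."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count occurrences of `ngram` (a sequence of tokens of any length)
    by locating the range of suffixes that start with it."""
    ngram = list(ngram)
    k = len(ngram)

    def bound(strict):
        # First index whose k-token prefix is >= ngram (strict=False)
        # or > ngram (strict=True); prefixes are sorted along sa.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + k]
            if prefix < ngram or (strict and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


corpus = "a b a b a c".split()
sa = build_suffix_array(corpus)
count_ab = count_ngram(corpus, sa, ["a", "b"])   # 2
count_a = count_ngram(corpus, sa, ["a"])         # 3
```

Because the n-gram length is not fixed in advance, the same structure serves any Markov order, which is the sense in which such a model can be called infinite-order.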